van der Schaar Lab

Demonstrator: Synthetic Data


Synthetic Data


Machine learning has the potential to catalyse a complete transformation in healthcare, but both clinicians and researchers are still hamstrung by a lack of access to high-quality data, which is the result of perfectly valid concerns regarding privacy.

With the aim of overcoming these issues, our lab has devoted substantial resources to researching machine learning techniques for synthetic data generation and assessment. In this demonstrator, we bring to life the potential of synthetic data generation tools to revolutionise how we interact with healthcare datasets.

Watch the Video: Introduction to Synthetic Data and the Demonstrator by Dr Jem Rashbass

On 19 April 2022, we ran the sixteenth virtual Revolutionizing Healthcare engagement session for the van der Schaar Lab and its audience of practising clinicians.  Part of that session was Dr Jem Rashbass's presentation of the novel synthetic data demonstrator, which was developed and designed by members of the van der Schaar Lab and which you can watch here.

What are the Clinical Use Cases?

We envisage that our synthetic data tools will help you tackle the following challenges:

  • Data Privacy: you wish to share some medical data with your collaborators, but it is private.
  • Domain Adaptation: you have data on a certain topic from multiple healthcare centres, and you wish to transfer the properties of one dataset to another.
  • Fairness: your data has class imbalances in some protected characteristics, or you wish to ensure that some variable has no influence on a target or treatment variable.

Use Case: Data Privacy

In clinical practice, a tension often arises between preserving patient data privacy and being able to share the data with data scientists or machine learning researchers to get the best insights for the greater societal good.  A way to resolve this is through the use of synthetic data, as our demonstrator shows.

The demonstrator guides the user in making this trade-off by balancing a privacy metric (epsilon-identifiability [1]) against a faithfulness metric (Wasserstein distance): the former measures the risk that real patients can be re-identified from the synthetic data, and the latter captures how similar the synthetic data is to the real data.  Furthermore, depending on the privacy guarantees desired, the user is directed to the appropriate underlying algorithm (e.g. PATE-GAN [2] or ADS-GAN [1]).
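To make the two metrics concrete, here is a minimal sketch in plain Python. It is not the demonstrator's implementation. The function names are our own, the Wasserstein distance is computed per feature for equally sized samples only, and the identifiability score uses unweighted Euclidean distance, which is a simplification of the weighted distance described in [1].

```python
import math


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equally sized 1-D samples.

    For equal sample sizes this reduces to the mean absolute difference
    between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def _nearest(point, others):
    """Euclidean distance from `point` to its closest record in `others`."""
    return min(math.dist(point, other) for other in others)


def identifiability(real, synthetic):
    """Fraction of real records that lie closer to some synthetic record
    than to any other real record (lower means more private)."""
    at_risk = 0
    for i, record in enumerate(real):
        other_real = real[:i] + real[i + 1:]
        if _nearest(record, synthetic) < _nearest(record, other_real):
            at_risk += 1
    return at_risk / len(real)


real = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
copied = list(real)                                  # synthetic data that memorises the real data
distant = [(10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]  # private, but not faithful

print(identifiability(real, copied))    # 1.0
print(identifiability(real, distant))   # 0.0
print(wasserstein_1d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # 1.0
```

The two toy synthetic datasets sit at opposite ends of the trade-off: the copied data is perfectly faithful but fully identifiable, while the distant data is private but far from the real distribution.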

Use Case: Domain Adaptation

Our demonstrator also addresses the differences between datasets that fall under the same category.  For instance, two datasets of prostate cancer patients, one from the UK and one from the US, may differ in patient demographics (e.g. different proportions of ethnicities), in treatments (one favouring more intervention and the other active monitoring), or indeed in the features that are collected (some information recorded in one dataset may not be recorded in the other).

Our synthetic data demonstrator once again comes to the rescue.  A user may upload several “source” datasets, which may all have different features and statistical properties, and a “target” dataset.  An algorithm (such as RadialGAN [3]) will then learn from the source data and match the properties and features of the target data (i.e. perform class balancing, feature matching etc.) in the synthetic data that it generates.
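The sketch below illustrates only the class balancing and feature matching ideas, using importance-weighted resampling in plain Python. It is a deliberately simple stand-in, not RadialGAN [3]: it generates no new records and simply drops source records that lack a target feature. The function name `match_target`, its parameters, and the record layout (one dict per patient) are all assumptions made for illustration.

```python
import random
from collections import Counter


def match_target(sources, target, balance_on, n_samples, seed=0):
    """Resample pooled source records so that they carry only the target's
    features and follow the target's proportions for `balance_on`."""
    rng = random.Random(seed)
    target_features = list(target[0].keys())

    # Feature matching (crude): keep records that have every target feature,
    # restricted to exactly those features.
    pool = [
        {name: record[name] for name in target_features}
        for dataset in sources
        for record in dataset
        if all(name in record for name in target_features)
    ]

    # Class balancing: weight each record by target count / pool count for its
    # category, so each category's total weight equals its target count.
    target_counts = Counter(record[balance_on] for record in target)
    pool_counts = Counter(record[balance_on] for record in pool)
    weights = [
        target_counts[record[balance_on]] / pool_counts[record[balance_on]]
        for record in pool
    ]
    return rng.choices(pool, weights=weights, k=n_samples)


# Hypothetical toy data: two source centres with different features.
sources = [
    [{"age": 60, "ethnicity": "A", "psa": 4.1}, {"age": 70, "ethnicity": "B", "psa": 6.0}],
    [{"age": 65, "ethnicity": "B"}, {"age": 55, "ethnicity": "A"}],
]
target = [{"age": 62, "ethnicity": "B"}, {"age": 68, "ethnicity": "B"}]

resampled = match_target(sources, target, balance_on="ethnicity", n_samples=20)
```

A generative approach goes further than this: rather than discarding incomplete records, it can learn to translate between the feature spaces of the different datasets.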

Use Case: Fairness

Another important concept is, of course, fairness.  It is well known that ML models may underperform for classes that are underrepresented in the data, which presents clear ethical issues when those classes correspond to characteristics such as ethnicity or gender.  Our demonstrator illustrates how this can be tackled through class balancing as part of the synthetic data generation process (see the Domain Adaptation use case above).

Fairness entails more than simply balanced class representation: in some cases it is not ethically acceptable for certain characteristics to causally influence an outcome in a model, for instance, ethnicity affecting treatment, or private healthcare status affecting patient outcomes.  For this, our demonstrator offers a tool where the user can define which variables are allowed to causally influence a target variable.  The data is then generated with a sampling-time fairness guarantee (based on this causal graph), using an algorithm such as our lab’s DECAF [4].
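The idea of removing disallowed causal edges at sampling time can be sketched with a toy structural causal model. This is only loosely in the spirit of DECAF [4]: the graph, the variable names, and the hand-set linear weights below are hypothetical stand-ins for the per-variable generator networks that would normally be learned from data.

```python
import random

# Hypothetical causal graph, listed in topological order: child -> parents.
GRAPH = {
    "ethnicity": [],
    "severity": [],
    "treatment": ["ethnicity", "severity"],
}

# Hypothetical linear mechanisms (assumed, not learned).
WEIGHTS = {("ethnicity", "treatment"): 2.0, ("severity", "treatment"): 1.0}


def sample_record(rng, removed_edges=frozenset()):
    """Generate one record variable by variable, ignoring any removed edge."""
    record = {}
    for child, parents in GRAPH.items():
        if not parents:
            record[child] = rng.choice([0, 1])  # binary root variable
            continue
        allowed = [p for p in parents if (p, child) not in removed_edges]
        signal = sum(WEIGHTS[(p, child)] * record[p] for p in allowed)
        record[child] = signal + rng.gauss(0.0, 0.1)
    return record


def group_gap(records, protected, target):
    """Difference in mean `target` between the protected groups 1 and 0."""
    means = []
    for group in (0, 1):
        values = [r[target] for r in records if r[protected] == group]
        means.append(sum(values) / len(values))
    return means[1] - means[0]


rng = random.Random(0)
biased = [sample_record(rng) for _ in range(2000)]
fair = [
    sample_record(rng, removed_edges={("ethnicity", "treatment")})
    for _ in range(2000)
]

print(group_gap(biased, "ethnicity", "treatment"))  # close to the edge weight 2.0
print(group_gap(fair, "ethnicity", "treatment"))    # close to 0
```

Because the edge is removed only when sampling, the same fitted mechanisms can serve several fairness definitions: the user changes the set of removed edges rather than retraining.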

Synthetic Data Metrics

The demonstrator encourages the user to consider a range of possible synthetic data quality metrics, including measures of downstream predictive model performance.  For instance, how well would a regression model perform when Trained on the Synthetic data, but Tested on the Real data (TS-TR)?  Or, how well is the ranking of different predictive models preserved when they are trained on the synthetic rather than the real data (Synthetic Ranking Agreement, SRA [5])?
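Both downstream metrics can be sketched in a few lines of plain Python. For brevity this example uses two toy classifiers scored by accuracy rather than a regression model, and the ranking agreement follows our reading of the pairwise definition in [5] (ties count as disagreement). All function names and the toy data are assumptions made for illustration.

```python
from itertools import permutations


def accuracy(predict, data):
    """Share of (x, y) pairs for which predict(x) equals y."""
    return sum(predict(x) == y for x, y in data) / len(data)


def fit_threshold(train):
    """Toy 1-D classifier: cut halfway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    mean0, mean1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    cut = (mean0 + mean1) / 2
    return lambda x: int((x > cut) == (mean1 > mean0))


def fit_majority(train):
    """Toy baseline: always predict the majority class (ties go to 1)."""
    majority = int(2 * sum(y for _, y in train) >= len(train))
    return lambda x: majority


def tstr(fit, synthetic_train, real_test):
    """Train on Synthetic, Test on Real."""
    return accuracy(fit(synthetic_train), real_test)


def ranking_agreement(real_scores, synth_scores):
    """Fraction of ordered model pairs ranked the same way by both score lists."""
    pairs = list(permutations(range(len(real_scores)), 2))
    agree = sum(
        (real_scores[i] - real_scores[j]) * (synth_scores[i] - synth_scores[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)


real_train = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]
real_test = [(0.5, 0), (3.5, 1), (0.8, 0), (3.2, 1)]
synthetic = [(0.2, 0), (1.1, 0), (2.9, 1), (4.2, 1)]

models = [fit_threshold, fit_majority]
real_scores = [accuracy(fit(real_train), real_test) for fit in models]
synth_scores = [tstr(fit, synthetic, real_test) for fit in models]

print(synth_scores)                                  # [1.0, 0.5]
print(ranking_agreement(real_scores, synth_scores))  # 1.0
```

A ranking agreement of 1.0 means that a researcher who picks the best model using only the synthetic data would make the same choice as one with access to the real data.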

In fact, since the demonstrator platform is not limited to a single metric, multidimensional metrics could be incorporated and visualised, allowing the user a more sophisticated view into the nature of the synthetic data they generate.  A great example of such a set of metrics is the Alpha-Precision, Beta-Recall, and Authenticity trio proposed by our lab [6], measuring, in turn, the fidelity, diversity, and generalisation performance of any synthetic data generation model.


[1] J. Yoon, L. N. Drumright, M. van der Schaar, “Anonymization through Data Synthesis using Generative Adversarial Networks (ADS-GAN)”, IEEE Journal of Biomedical and Health Informatics 24(8), 2020.

[2] J. Yoon, J. Jordon, M. van der Schaar, “PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees”, International Conference on Learning Representations (ICLR), 2019.

[3] J. Yoon, J. Jordon, M. van der Schaar, “RadialGAN: Leveraging multiple datasets to improve target-specific predictive models using Generative Adversarial Networks”, International Conference on Machine Learning (ICML), 2018.

[4] T. Kyono*, B. van Breugel*, J. Berrevoets, M. van der Schaar, “DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks”, Advances in Neural Information Processing Systems 34, 2021.

[5] J. Jordon, J. Yoon, M. van der Schaar, “Measuring the quality of Synthetic data for use in competitions”, KDD Workshop on Machine Learning for Medicine and Healthcare, 2018.

[6] A. M. Alaa, B. van Breugel, E. S. Saveliev, M. van der Schaar, “How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models”, International Conference on Machine Learning (ICML), 2022.