This post on synthetic data accompanies Mihaela van der Schaar’s invited talk, “Synthetic Data Generation and Assessment: Challenges, Methods, Impact,” given on December 14, 2021, as part of the Deep Generative Models and Downstream Applications Workshop held alongside NeurIPS 2021.
If you’d like to learn more about our lab’s research in the area of synthetic data generation and evaluation, you can find a full overview here.
Also, consider watching our Inspiration Exchange engagement series and registering to join upcoming sessions.
Other useful links:
– Our lab’s publications
– Mihaela van der Schaar on Twitter and LinkedIn
In this short post, we will explain the importance of synthetic data, and outline our lab’s ongoing work to create adaptable and logical frameworks and processes for synthetic data generation and evaluation.
Why do we need synthetic data?
Our purpose as a lab is to create new and powerful machine learning techniques and methods that can revolutionize healthcare. To catalyze such a revolution, we need high-quality data resources in a multitude of forms, including electronic health records, biobanks, and disease registries.
Access to such data, however, is complicated due to strict regulatory constraints (under frameworks such as HIPAA and GDPR), which are the result of perfectly valid concerns regarding the privacy of such data. As we have pointed out in the past, the lack of access to high-quality healthcare data represents a logjam that impedes machine learning research.
Several approaches, chiefly anonymization and deidentification, have been developed with the aim of rendering such datasets shareable without compromising privacy. Unfortunately, such approaches tend either to remain highly disclosive or to yield low-quality datasets because too many fields must be removed.
Our lab has invested heavily in synthetic data research, which we see as the only way to break the data logjam in machine learning for healthcare. Using synthetic data approaches, a proximal version of the data can be shared that resembles real data, but contains no real samples for any specific individual.
As explained below, our research agenda has two sides: one exploring how synthetic data can be generated, and one seeking to establish standards and methods for evaluating synthetic datasets.
Towards a common “recipe” for synthetic data
Synthetic data has a broad range of potential uses. In healthcare, key applications include, but are not limited to:
– developing analytics (such as risk predictors or treatment effect estimators);
– facilitating reproducibility of clinical studies and analyses (since reproducing such work requires sharing the underlying data);
– augmenting small-sample datasets (such as for rare diseases or underrepresented patient subgroups; see RadialGAN);
– increasing the robustness and adaptability of machine learning models (for instance, transferring across hospitals); and
– simulating forward-looking data (such as for testing new policies).
All of these potential uses of synthetic data come with their own requirements and criteria for suitability. Naturally, there are also many different methods and approaches to generating synthetic data. To try to standardize this into a common framework, our lab has created a common “recipe” for synthetic data generation. This recipe comprises three steps, as outlined below.
Step 1: determine which generative model class to use
First, we must determine which generative modeling class to use. For instance, depending on the purpose of the synthetic dataset, we may use GANs, variational autoencoders, normalizing flows, or any number of other methods.
Step 2: construct an appropriate representation/structure for the type of data
Next, we need to construct an appropriate representation (for example, recurrent neural networks vs. convolutional neural networks, and so forth) for the type of data under consideration, which may vary from time-series datasets to images, notes, biomarkers, and beyond.
Step 3: incorporate required notions of privacy
Finally, it is important to ensure that the required type of privacy is incorporated, given the source dataset and the intended purpose of the synthetic dataset. This is relatively open to interpretation, since GDPR and HIPAA do not specify rigorous mathematical formulations of privacy requirements. This is why, when working on ADS-GAN in 2020, our lab proposed a new formalism for privacy using k-anonymity, informed by GDPR and HIPAA.
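The three steps of this recipe can be sketched end-to-end. The example below is a deliberately minimal, hypothetical pipeline for tabular data: a simple Gaussian model stands in for a real generative model class (step 1), plain feature vectors serve as the representation suited to tabular data (step 2), and additive noise acts as a crude stand-in for a formal privacy mechanism (step 3). All function names and parameters here are illustrative assumptions, not any lab method.

```python
import numpy as np

# Hypothetical sketch of the three-step recipe for tabular data.
# Step 1 (model class): a Gaussian model stands in for a GAN / VAE / flow.
# Step 2 (representation): plain feature vectors suit tabular data; time
#   series or images would call for recurrent or convolutional architectures.
# Step 3 (privacy): additive noise is a crude stand-in for a formal privacy
#   mechanism such as the identifiability constraint used in ADS-GAN.

def generate_synthetic(real_data, n_samples, noise_scale=0.1, seed=0):
    """Fit a Gaussian to real tabular data and sample a synthetic dataset."""
    rng = np.random.default_rng(seed)
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    synthetic = rng.multivariate_normal(mean, cov, size=n_samples)
    # Crude privacy knob: extra noise pushes samples away from real records.
    synthetic += rng.normal(scale=noise_scale, size=synthetic.shape)
    return synthetic

rng = np.random.default_rng(42)
real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(500, 2))
synth = generate_synthetic(real, n_samples=500)
print(synth.shape)  # (500, 2)
```

A real pipeline would swap each of the three stand-ins independently, which is precisely the point of treating generation as a recipe rather than a monolithic method.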
Evaluating synthetic data: the other side of the coin
As we hinted at the start of this post, generating synthetic data is only half of the challenge. We also need to be able to determine whether synthetic datasets are actually any good. Yet again, this is particularly challenging given the diversity of potential purposes for synthetic datasets.
A three-dimensional scale for evaluating synthetic data
In recent years, our lab has committed a great deal of time to exploring different approaches to evaluating synthetic datasets. One such project was our hide-and-seek privacy challenge, which ran as part of the NeurIPS 2020 competition track. Along the way, we have learned a number of important lessons—chief among which is the fact that a single-dimensional metric for evaluation is not enough.
Instead, we need to evaluate model performance as a point in a space spanned by the following three dimensions:
– fidelity (how “good” are the synthetic samples?);
– diversity (how much of the real data is covered, and how representative is this?); and
– generalization (how often does the model copy the training data?).
For this, we have developed new probabilistic, interpretable, and multidimensional quantities for assessing synthetic data. Further details can be found in a paper published in early 2021, or in Mihaela van der Schaar’s ICML 2021 tutorial on synthetic data.
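As a rough illustration of what evaluating along these three axes might look like, the toy sketch below scores each dimension with simple nearest-neighbour proxies. These are not the probabilistic metrics from the paper; the distance threshold and all names are illustrative assumptions.

```python
import numpy as np

def evaluate_synthetic(real, synth):
    """Toy nearest-neighbour proxies for fidelity, diversity, generalization."""
    # Pairwise distances between every synthetic and every real sample.
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1)
    # Reference radius: typical real-to-real nearest-neighbour distance.
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    np.fill_diagonal(d_rr, np.inf)
    radius = np.median(d_rr.min(axis=1))
    # Fidelity: fraction of synthetic points close to some real point.
    fidelity = (d.min(axis=1) <= radius).mean()
    # Diversity: fraction of real points covered by some synthetic point.
    diversity = (d.min(axis=0) <= radius).mean()
    # Generalization: fraction of synthetic points that are NOT near-copies
    # of a real record (here: farther than 10% of the typical radius).
    generalization = (d.min(axis=1) > 0.1 * radius).mean()
    return fidelity, diversity, generalization

rng = np.random.default_rng(1)
real = rng.normal(size=(200, 2))
synth = rng.normal(size=(200, 2))  # stand-in for a generator's output
fidelity, diversity, generalization = evaluate_synthetic(real, synth)
```

Reporting all three numbers, rather than collapsing them into one score, is exactly the point: a generator that memorizes the training set can score perfectly on fidelity while failing on generalization, and one that produces a single plausible sample fails on diversity.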
New frontiers for synthetic data
We see synthetic data as an exciting, diverse, and highly promising area with many unexplored frontiers. Some of these are listed below.
Looking forward using synthetic data
As mentioned above, one particularly intriguing application of synthetic data is to simulate forward-looking data. We can, for example, create a simulation ecosystem that allows us to test a variety of new healthcare policies using synthetic data based on real observational datasets.
One noteworthy example of this is Medkit-Learn, a publicly available Python package providing simple and easy access to high-fidelity synthetic medical data, which we introduced at NeurIPS 2021. Medkit-Learn is more than “just” synthetic data: it offers a full benchmarking suite designed specifically for medical sequential decision modelling. It provides a standardized way to compare algorithms in a realistic medical setting, employing a generation process that disentangles the policy and environment dynamics. This allows for a range of customizations, thereby enabling systematic evaluation of algorithms’ robustness against specific challenges prevalent in healthcare.
The central object in Medkit is the scenario, made up of a domain, environment, and policy, which fully defines the synthetic setting. By disentangling the environment and policy dynamics, Medkit enables us to simulate decision-making behaviors with various tunable parameters. One example scenario is ICU patient trajectories with customized environment dynamics and a clinical policy. The output from Medkit is a batch dataset that can be used for training and evaluating methods for modelling human decision-making.
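The disentangled scenario structure can be illustrated with a toy rollout. The sketch below is not the actual Medkit API; the dynamics, policy, and names are hypothetical stand-ins showing how separating environment and policy lets either be swapped or tuned independently.

```python
import numpy as np

# Hypothetical illustration of a (domain, environment, policy) scenario;
# the dynamics and names below are toy stand-ins, not the Medkit API.

def environment_step(state, action, rng):
    """Toy environment dynamics: a risk marker drifts and responds to treatment."""
    drift = 0.05 * state
    effect = -0.5 * action
    return state + drift + effect + rng.normal(scale=0.1)

def clinical_policy(state, threshold=1.0):
    """Toy decision-making policy: treat (action=1) above a risk threshold."""
    return 1.0 if state > threshold else 0.0

def simulate_trajectory(horizon=20, seed=0):
    """Roll out one synthetic patient trajectory from the (env, policy) pair."""
    rng = np.random.default_rng(seed)
    state, trajectory = rng.normal(), []
    for _ in range(horizon):
        action = clinical_policy(state)
        trajectory.append((state, action))
        state = environment_step(state, action, rng)
    return trajectory

# A batch dataset of 100 trajectories, the kind of output described above.
batch = [simulate_trajectory(seed=s) for s in range(100)]
```

Because the environment and the policy are separate functions, either can be replaced, say a more aggressive treatment threshold or noisier dynamics, without touching the other; this is the disentanglement that makes systematic benchmarking possible.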
Turning unfair real-world data into fair synthetic data
A key concern of ours is the fairness of synthetic data. This is a particularly important problem since unfair data can lead to unfair downstream predictions. This is why our lab has been exploring approaches to creating fair synthetic data, which can be used to create fair predictive models.
This is a very challenging problem, since:
– there are many different notions of fairness;
– removing protected attributes (such as ethnicity) is generally insufficient;
– fairness of the downstream model must be guaranteed at the data level; and
– data utility must be preserved.
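A tiny simulation makes the second point concrete: a seemingly neutral proxy feature correlated with a protected attribute can leak that attribute even after the protected column itself is dropped. All data and names below are synthetic and purely illustrative.

```python
import numpy as np

# Why dropping a protected attribute is insufficient: a correlated proxy
# (think postcode or occupation) still reveals it. Illustrative data only.
rng = np.random.default_rng(0)
n = 10_000
protected = rng.integers(0, 2, size=n)             # protected group label
proxy = protected + rng.normal(scale=0.3, size=n)  # correlated "neutral" feature

# Even with the protected column removed from the dataset, a trivial
# threshold on the proxy recovers the protected attribute:
guess = (proxy > 0.5).astype(int)
recovery_rate = (guess == protected).mean()
print(f"protected attribute recovered from proxy alone: {recovery_rate:.0%}")
```

Any downstream model trained on such data can reconstruct and exploit the protected attribute implicitly, which is why fairness must be addressed in the data-generating process itself rather than by column removal.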
A prime example of our work to date is DECAF, which was first introduced in a paper published at NeurIPS 2021. DECAF generates fair synthetic data using causally-aware generative networks; this causal perspective provides an intuitive guideline for achieving different notions of fairness, with fairness guarantees given for the downstream setting. We have found DECAF to be a very effective approach to generating unbiased synthetic data.
In addition to the new frontiers described above, our lab is currently working on a range of other future directions, including:
– synthetic multi-modal data (genetic, images, time-series, etc.);
– generative models for asynchronous or sparse follow-up clinic visits; and
– domain and task-specific evaluation metrics.
If you’d like to learn more about our work on synthetic data, you can:
– watch Mihaela van der Schaar’s invited talk, given on December 14 at the Deep Generative Models and Downstream Applications Workshop, held alongside NeurIPS 2021;
– read an overview of our work to date on synthetic data; and
– watch Mihaela van der Schaar’s ICML 2021 tutorial on synthetic data generation and assessment.
For a full list of the van der Schaar Lab’s publications, click here.