van der Schaar Lab

Generating and evaluating synthetic data: a two-sided research agenda

This post on synthetic data was created to accompany Mihaela van der Schaar’s invited talk, “Synthetic Data Generation and Assessment: Challenges, Methods, Impact,” given on December 14, 2021, at the Deep Generative Models and Downstream Applications Workshop, held alongside NeurIPS 2021.

If you’d like to learn more about our lab’s research in the area of synthetic data generation and evaluation, you can find a full overview here.

Also, consider watching our Inspiration Exchange engagement series and registering to join upcoming sessions.

Other useful links:
– Our lab’s publications
– Mihaela van der Schaar on Twitter and LinkedIn

In this short post, we will explain the importance of synthetic data, and outline our lab’s ongoing work to create adaptable and logical frameworks and processes for synthetic data generation and evaluation.

Why do we need synthetic data?

Our purpose as a lab is to create new and powerful machine learning techniques and methods that can revolutionize healthcare. To catalyze such a revolution, we need high-quality data resources in a multitude of forms, including electronic health records, biobanks, and disease registries.

Access to such data, however, is complicated due to strict regulatory constraints (under frameworks such as HIPAA and GDPR), which are the result of perfectly valid concerns regarding the privacy of such data. As we have pointed out in the past, the lack of access to high-quality healthcare data represents a logjam that impedes machine learning research.

Several approaches, chiefly anonymization and deidentification, have been developed with the aim of rendering such datasets shareable without compromising privacy. Unfortunately, such approaches tend either to be highly disclosive or to yield low-quality datasets because too many fields are removed.

Our lab has invested heavily in synthetic data research, which we see as the only way to break the data logjam in machine learning for healthcare. Using synthetic data approaches, a proximal version of the data can be shared that resembles real data, but contains no real samples for any specific individual.

As explained below, our research agenda has two sides: one exploring how synthetic data can be generated, and one seeking to establish standards and methods for evaluating synthetic datasets.

Towards a common “recipe” for synthetic data

Synthetic data has a broad range of potential uses. In healthcare, key applications include, but are not limited to:
– developing analytics (such as risk predictors or treatment effect estimators);
– facilitating reproducibility of clinical studies and analyses (due to the need to share the basis for such studies and analyses);
– augmenting small-sample datasets (such as for rare diseases or underrepresented patient subgroups; see RadialGAN);
– increasing the robustness and adaptability of machine learning models (for instance, transferring across hospitals); and
– simulating forward-looking data (for example, to test new policies).

All of these potential uses of synthetic data come with their own requirements and criteria for suitability. Naturally, there are also many different methods and approaches to generating synthetic data. To try to standardize this into a common framework, our lab has created a common “recipe” for synthetic data generation. This recipe comprises three steps, as outlined below.

Step 1: determine which generative model class to use

First, we must determine which generative modeling class to use. For instance, depending on the purpose of the synthetic dataset, we may use GANs, variational autoencoders, normalizing flows, or any number of other methods.

Step 2: construct an appropriate representation/structure for the type of data

Next, we need to construct an appropriate representation (for example, recurrent neural networks vs. convolutional neural networks, and so forth) for the type of data under consideration, which may vary from time-series datasets to images, notes, biomarkers, and beyond.

Step 3: incorporate required notions of privacy

Finally, it is important to ensure that the required type of privacy is incorporated, given the source dataset and the intended purpose of the synthetic dataset. This is relatively open to interpretation, since GDPR and HIPAA do not specify rigorous mathematical formulations of privacy requirements. This is why our lab proposed a new formalism for privacy using k-anonymity, based on GDPR and HIPAA, when working on ADS-GAN in 2020.
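To make the idea of a mathematical privacy criterion concrete, here is a minimal sketch of the classical k-anonymity check: a table is k-anonymous if every combination of quasi-identifier values is shared by at least k records. This is the textbook definition rather than the ADS-GAN formalism itself, and the column names and toy records are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which the table is k-anonymous.

    Every combination of quasi-identifier values must appear in at least
    k records, so k is the size of the smallest such group.
    """
    groups = Counter(
        tuple(record[col] for col in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Hypothetical toy records: age band and postcode prefix act as
# quasi-identifiers, while the diagnosis is the sensitive attribute.
records = [
    {"age_band": "30-39", "postcode": "CB1", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode": "CB1", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postcode": "CB2", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode": "CB2", "diagnosis": "flu"},
    {"age_band": "40-49", "postcode": "CB2", "diagnosis": "asthma"},
]

k = k_anonymity(records, ["age_band", "postcode"])
print(k)
```

In this toy table the smallest group of records sharing the same age band and postcode has two members, so the table is 2-anonymous with respect to those two columns.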

Evaluating synthetic data: the other side of the coin

As we hinted at the start of this post, generating synthetic data is only half of the challenge. We also need to be able to determine whether synthetic datasets are actually any good, and once again this is particularly challenging given the diversity of potential purposes for synthetic datasets.

A three-dimensional scale for evaluating synthetic data

In recent years, our lab has committed a great deal of time to exploring different approaches to evaluating synthetic datasets. One such project was our hide-and-seek privacy challenge, which ran as part of the NeurIPS 2020 competition track. Along the way, we have learned a number of important lessons—chief among which is the fact that a single-dimensional metric for evaluation is not enough.

What we need is to evaluate model performance as a point in a space that contains and assesses the following three dimensions:
– fidelity (how “good” are the synthetic samples?);
– diversity (how much of the real data is covered, and how representative is this?); and
– generalization (how often does the model copy the training data?).

For this, we have developed new probabilistic, interpretable, and multidimensional quantities for assessing synthetic data. Further details can be found in a paper published in early 2021, or in Mihaela van der Schaar’s ICML 2021 tutorial on synthetic data.
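As a rough illustration of why three dimensions are needed, the sketch below scores a synthetic sample with simple nearest-neighbor proxies. These are simplified stand-ins chosen for illustration, not the quantities defined in the paper; the function names, the choice of k, and the copy tolerance are all assumptions.

```python
import math


def knn_radius(point, others, k):
    """Distance from `point` to its k-th nearest neighbor among `others`."""
    return sorted(math.dist(point, o) for o in others)[k - 1]


def coverage(queries, anchors, k):
    """Fraction of `queries` inside at least one anchor's k-NN ball."""
    radii = [
        knn_radius(a, anchors[:i] + anchors[i + 1:], k)
        for i, a in enumerate(anchors)
    ]
    hits = sum(
        any(math.dist(q, a) <= r for a, r in zip(anchors, radii))
        for q in queries
    )
    return hits / len(queries)


def evaluate(real, synthetic, k=1, copy_tol=1e-9):
    """Return (fidelity, diversity, generalization) proxies in [0, 1]."""
    # Fidelity proxy: synthetic points that land near the real data.
    fidelity = coverage(synthetic, real, k)
    # Diversity proxy: real points that are covered by the synthetic data.
    diversity = coverage(real, synthetic, k)
    # Generalization proxy: synthetic points that are NOT copies of a
    # real training point (nearest real point farther than copy_tol).
    non_copies = sum(
        min(math.dist(s, r) for r in real) > copy_tol for s in synthetic
    )
    generalization = non_copies / len(synthetic)
    return fidelity, diversity, generalization


real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
memorized = list(real)                    # a "generator" that copies its training set
off_target = [(10.0, 10.0), (11.0, 10.0)]  # novel samples, but far from the real data

print(evaluate(real, memorized))
print(evaluate(real, off_target))
```

The memorized sample looks perfect on fidelity and diversity but scores zero on generalization, while the off-target sample is entirely novel yet scores zero on the other two. Neither failure would be visible from a single number.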

New frontiers for synthetic data

We see synthetic data as an exciting, diverse, and highly promising area with many unexplored frontiers. Some of these are listed below.

Looking forward using synthetic data

As mentioned above, one particularly intriguing application of synthetic data is to simulate forward-looking data. We can, for example, create a simulation ecosystem that allows us to test a variety of new healthcare policies using synthetic data based on real observational datasets.

One noteworthy example of this is Medkit-Learn, a publicly available Python package providing simple and easy access to high-fidelity synthetic medical data, which we introduced at NeurIPS 2021. Medkit-Learn is more than “just” synthetic data: it offers a full benchmarking suite designed specifically for medical sequential decision modelling. It provides a standardized way to compare algorithms in a realistic medical setting, employing a generation process that disentangles the policy and environment dynamics. This allows for a range of customizations, thereby enabling systematic evaluation of algorithms’ robustness against specific challenges prevalent in healthcare.

The central object in Medkit is the scenario, made up of a domain, environment, and policy, which fully defines the synthetic setting. By disentangling the environment and policy dynamics, Medkit enables us to simulate decision-making behaviors with various tunable parameters. One example scenario is ICU patient trajectories with customized environment dynamics and clinical policy. The output from Medkit is a batch dataset that can be used for training and evaluating methods for modelling human decision-making.
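The idea of a scenario that keeps the environment and the policy separate can be sketched in a few lines of plain Python. This is a toy illustration only and does not use the Medkit-Learn API: the `Scenario` class, the severity-score dynamics, and the threshold policy are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """A domain name plus independently swappable environment and policy."""
    domain: str
    environment: Callable[[float, int, random.Random], float]
    policy: Callable[[float, random.Random], int]


def toy_environment(state, action, rng):
    """Next severity score: treatment (action 1) lowers it, noise perturbs it."""
    return state - 0.5 * action + rng.gauss(0.0, 0.1)


def threshold_policy(threshold):
    """Build a policy that treats whenever severity exceeds `threshold`."""
    def policy(state, rng):
        return 1 if state > threshold else 0
    return policy


def generate_batch(scenario, n_trajectories, horizon, seed=0):
    """Roll out the scenario and return a list of (state, action) trajectories."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n_trajectories):
        state = rng.uniform(0.0, 2.0)  # initial severity score
        trajectory = []
        for _ in range(horizon):
            action = scenario.policy(state, rng)
            trajectory.append((state, action))
            state = scenario.environment(state, action, rng)
        batch.append(trajectory)
    return batch


# Same environment, two different clinical policies.
cautious = Scenario("ICU", toy_environment, threshold_policy(1.5))
aggressive = Scenario("ICU", toy_environment, threshold_policy(0.5))

batch_cautious = generate_batch(cautious, n_trajectories=100, horizon=10)
batch_aggressive = generate_batch(aggressive, n_trajectories=100, horizon=10)
```

Because the environment and policy are separate components, either can be changed while the other is held fixed, which is what makes it possible to probe how a method responds to one specific change in the setting.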

Turning unfair real-world data into fair synthetic data

A key concern of ours is the fairness of synthetic data. This is a particularly important problem since unfair data can lead to unfair downstream predictions. This is why our lab has been exploring approaches to creating fair synthetic data, which can be used to create fair predictive models.

This is a very challenging problem, since:
– there are many different notions of fairness;
– removing protected attributes (such as ethnicity) is generally insufficient;
– fairness of the downstream model must be guaranteed at the data level; and
– data utility must be preserved.
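The second point in the list can be illustrated with a small sketch. It uses demographic parity, which is only one of the many fairness notions mentioned above, and it is not DECAF; the groups, the proxy feature, and the decision rule are hypothetical toy data.

```python
def positive_rate(predictions, groups, group):
    """Share of positive predictions within one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between groups."""
    rates = [positive_rate(predictions, groups, g) for g in sorted(set(groups))]
    return max(rates) - min(rates)


# Hypothetical toy data: (protected group, proxy feature such as neighborhood).
# The proxy is strongly, but not perfectly, correlated with the group.
data = [
    ("A", "north"), ("A", "north"), ("A", "north"), ("A", "south"),
    ("B", "south"), ("B", "south"), ("B", "south"), ("B", "north"),
]
groups = [g for g, _ in data]

# A "blind" decision rule that never sees the protected attribute,
# only the proxy feature.
predictions = [1 if proxy == "north" else 0 for _, proxy in data]

gap = demographic_parity_gap(predictions, groups)
print(gap)
```

Even though the rule never looks at the protected attribute, group A receives positive predictions three times as often as group B, because the proxy feature carries the same information.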

A prime example of our work to date is DECAF, which was first introduced in a paper published at NeurIPS 2021. DECAF aims to generate fair synthetic data using causally-aware generative networks, using this causal perspective to provide an intuitive guideline to achieve different notions of fairness—with fairness guarantees given for the downstream setting. We have found DECAF to be a very effective approach to generating unbiased synthetic data.

In addition to the new frontiers described above, our lab is currently working on a range of other future directions, including:
– synthetic multi-modal data (genetic, images, time-series, etc.);
– generative models for asynchronous or sparse follow-up clinic visits; and
– domain and task-specific evaluation metrics.

If you’d like to learn more about our work on synthetic data, you can:
– Join us for Mihaela van der Schaar’s invited talk on December 14 at the Deep Generative Models and Downstream Applications Workshop, held alongside NeurIPS 2021.
– Read an overview of our work to date on synthetic data.
– Watch Mihaela van der Schaar’s ICML 2021 tutorial on synthetic data generation and assessment.

For a full list of the van der Schaar Lab’s publications, click here.

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.

Alex Chan

Alex Chan graduated with a BSc in Statistics at University College London before moving to Cambridge for an MPhil in Machine Learning and Machine Intelligence.

Having started early in research, he won an EPSRC funding grant in the second year of his undergraduate degree for a project on Markov chain Monte Carlo mixing times, and earlier this year had his work on uncertainty calibration presented at ICML.

Much of Alex’s research will focus on understanding and building latent representations of human behavior, with a specific emphasis on understanding clinical decision-making (an important new area of focus for the lab’s research) through imitation, representation learning, and generative modeling. In Alex’s own words, replicating and understanding decision-making at a higher level is, in itself, incredibly interesting, but “also being able to apply it to healthcare is hugely important, and promises to actually make a difference to people’s lives in the near future.”

He is particularly interested in developing approximate Bayesian methods to appropriately handle the associated uncertainty that naturally arises in this setting and which is vital to understand.

Drawn to the lab’s special focus on healthcare, Alex notes that “No other area promises the same kind of potential for really having an impact with your research, and the lab benefits from the wide diversity of work being done alongside connections everywhere in both academia and industry.”

Alex’s studentship is sponsored by Microsoft Research.

Outside of machine learning, Alex captains and trains the novices of the Wolfson College Boat Club and occasionally keeps up with Krav Maga as a trainee instructor.

Nick Maxfield

From 2020 to 2022, Nick oversaw the van der Schaar Lab’s communications, including media relations, content creation, and maintenance of the lab’s online presence.