About synthetic data
Machine learning has the potential to catalyze a complete transformation in healthcare, but researchers in our field are still hamstrung by a lack of access to high-quality data, a shortage rooted in perfectly valid privacy concerns.
Synthetic data techniques could offer a powerful solution to this problem by revolutionizing how we access and interact with healthcare datasets. Our lab is one of a small handful of groups cutting a path through this largely uncharted territory. This is a complicated but uniquely important endeavor, combining conceptual and technical challenges, and it has required considerable open-mindedness: we have even needed to develop new ways of understanding notions such as data quality.
Our approach to synthetic data
Synthetic data is one of our lab’s key research pillars, as we explain in an overview of how we’re tackling this exciting new area.
Introducing the hide-and-seek privacy challenge
Pitting “hiders” and “seekers” against one another, this challenge is a novel two-tracked contest to explore the meaning and limitations of data privacy, and first ran as part of the NeurIPS 2020 competition track from July through November 2020. We are now preparing to relaunch the challenge with several design and implementation reworks, and run it on a longer-term basis from 2021.
The original Hide-and-seek Privacy Challenge was administered by the van der Schaar Lab with support from the University of Cambridge, Microsoft Research, and Amsterdam UMC. When the challenge is relaunched in 2021, we expect to retain the same support structure.

Challenge overview
Coupled with advances in machine learning, the vast quantities of clinical data now stored in machine-readable form have the potential to revolutionize healthcare. At the same time, this enterprise is threatened by the fact that patient data are inherently highly sensitive, and privacy concerns have recently been thrown into sharp relief by several high-profile data breaches that have greatly undermined public confidence.
In particular, the recent COVID-19 pandemic has shed light on the critical need for high-quality datasets to be made readily available to the research community without requiring individual agreements between each research group and each entity holding clinical data. At the same time, the pandemic has also highlighted the sheer scale of the organizational and interdisciplinary barriers that continue to prevent this from happening; simply put, existing data anonymization methods are clearly not sufficient to reassure the clinical community at large that machine learning researchers cannot misuse or abuse patient datasets.
We seek novel methods capable of bridging the gap between data-hungry techniques in machine learning and privacy-conscious applications in healthcare settings.
The clinical time-series setting poses a unique combination of challenges to data modeling and sharing. Because clinical time series are so high-dimensional, common de-identification techniques struggle to preserve privacy while retaining data utility.
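To see why, consider a toy illustration (ours, not drawn from any challenge data): even when each feature is coarsened into just a few bins, the fraction of records that are unique, and therefore re-identifiable in principle, climbs rapidly with dimensionality.

```python
# Toy illustration of why common de-identification struggles in high
# dimensions: with even a handful of coarsely binned features, almost
# every record becomes unique, so generalization/suppression destroys
# utility long before it guarantees anonymity. (Synthetic values only.)
import numpy as np

rng = np.random.default_rng(0)
n_patients = 10_000

for n_features in [2, 5, 10, 20, 50]:
    # Each feature coarsened into 4 bins (think: quartiles of a vital sign).
    records = rng.integers(0, 4, size=(n_patients, n_features))
    _, counts = np.unique(records, axis=0, return_counts=True)
    unique_fraction = (counts == 1).sum() / n_patients
    print(f"{n_features:2d} binned features -> {unique_fraction:.1%} unique records")
```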
An innovative approach to this problem is synthetic data generation. From a technical perspective, a good generative model for time-series data should preserve temporal dynamics, in the sense that new sequences respect the original relationships between high dimensional variables across time. From the privacy perspective, the model should prevent patient re-identification by limiting vulnerability to membership inference attacks.
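For a flavor of what preserving temporal dynamics entails, here is a minimal autoregressive generator sketch in PyTorch. This is our own illustration rather than a challenge baseline: a GRU learns to predict each next step of a real sequence and is then rolled forward to emit synthetic trajectories. (A competitive generator would also inject noise, e.g. GAN- or VAE-style, so that samples are stochastic rather than a single deterministic rollout.)

```python
# Minimal sketch of an autoregressive time-series generator (illustration
# only, not a challenge baseline). Shapes: (batch, time, features).
import torch
import torch.nn as nn

class ARGenerator(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.head(h)  # next-step prediction at every time step

    @torch.no_grad()
    def sample(self, x0, steps: int):
        """Roll the model forward from an initial observation x0: (B, 1, F)."""
        xs, h = [x0], None
        for _ in range(steps):
            out, h = self.rnn(xs[-1], h)
            xs.append(self.head(out))
        return torch.cat(xs, dim=1)

# Teacher-forced training: minimize next-step error on real sequences.
model = ARGenerator(n_features=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
real = torch.randn(32, 24, 8)  # placeholder for real (B, T, F) data
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(real[:, :-1]), real[:, 1:])
    loss.backward()
    opt.step()

synthetic = model.sample(real[:, :1], steps=23)  # same length as `real`
```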
The Hide-and-Seek Privacy Challenge is a novel two-tracked competition to simultaneously accelerate progress in tackling both problems. In our head-to-head format, participants in the synthetic data generation track (i.e. “hiders”) and the patient re-identification track (i.e. “seekers”) are directly pitted against each other by way of a new, high-quality intensive care time-series dataset: the AmsterdamUMCdb dataset. Ultimately, we seek to advance generative techniques for dense and high-dimensional temporal data streams that are (1) clinically meaningful in terms of fidelity and predictivity, as well as (2) capable of minimizing membership privacy risks in terms of the concrete notion of patient re-identification.
Importantly, rather than falling back on fixed theoretical notions of anonymity, we allow participants on both sides to uncover the best approaches in practice for launching or defending against privacy attacks.
This competition provides a two-sided platform for synthetic data generation and patient re-identification methods to compete among and against each other. Our aim is to understand—through the practical task of membership inference attacks—the strengths and weaknesses of machine learning techniques on both sides of the privacy battle, in particular to organically uncover what existing (and potentially novel) notions of privacy and anonymity end up being the most meaningful in practice. We therefore invite participants to compete in either or both of two submission tracks of the interactive challenge: (1) the hider (i.e. synthetic data generation) track, and (2) the seeker (i.e. patient re-identification) track.

Original NeurIPS 2020 proposal
This is our original successful proposal for consideration in the NeurIPS 2020 competition track. It serves to explain the broad strokes of the challenge, but certain details have changed since it was accepted (for example, classification is no longer a task).
Please only use this document to get a sense of why we initially decided to hold this competition. For specifics regarding implementation as part of the NeurIPS 2020 competition track, please see the documentation in our Bitbucket repository.
Competitor tasks
Following the relaunch in 2021, the competition will run on an ongoing basis without a fixed end point, and participants will be free to submit entries (i.e. algorithms) to either the hider or seeker track (or both). All submitted entries will be evaluated at regular intervals, and an evolving leaderboard will be maintained for each track, ranking submissions in order of performance.
Tasks for hiders
In the synthetic data generation track, participants are tasked with developing an algorithm that generates synthetic data on the basis of real data.
Their submission must be an algorithm (i.e. not just a trained model), whose input will be random subsets of an unseen subset of the dataset, and whose output is a synthetic dataset that contains entries from the same space as entries in the original dataset. At competition launch, participants will be given a subset of the dataset. They will be free to use this data to develop their algorithm and perform preliminary hyper-parameter selection, but may not use it to pre-train/initialise a model’s weights.
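In code, a hider submission might therefore look something like the sketch below. The function name, signature, and array layout are our assumptions for illustration, not the competition's actual API, and the strategy inside is deliberately naive:

```python
import numpy as np

def hider(real_data: np.ndarray, seed: int = 0) -> np.ndarray:
    """Hypothetical hider entry point (illustrative signature only).

    real_data: shape (n_patients, n_timesteps, n_features); a random
               subset of the non-public data, supplied at evaluation
               time and never seen during development.
    returns:   a synthetic dataset from the same space as the input.
    """
    rng = np.random.default_rng(seed)
    # Naive placeholder: resample patients and perturb their values.
    # A competitive hider would instead fit a generative model to
    # real_data here; training happens inside this call, since only
    # the algorithm (not pretrained weights) may depend on it.
    idx = rng.integers(0, real_data.shape[0], size=real_data.shape[0])
    noise = rng.normal(scale=0.1, size=real_data.shape)
    return real_data[idx] + noise
```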
The synthetic data generated by each model will be evaluated in two ways: (1) similarity to the real data; and (2) resistance to re-identification. For each model this will be done on 10 random subsets of the non-public data.
Tasks for seekers
In the patient re-identification track, participants are tasked with developing an algorithm that performs membership inference (a.k.a. patient re-identification) against synthetic data generation algorithms. Their submission must be an algorithm, which may include models pre-trained on the public data.
At competition launch, participants will be given the same public dataset or datasets as the generation track. In addition, as each generation algorithm is submitted it will be made publicly available alongside 10 synthetic datasets generated using 10 random (known) subsets of the public data (so that re-identification track participants do not need to – but are still welcome to – run the generation models themselves).
Re-identification algorithms will be evaluated according to their classification accuracy on 10 synthetic datasets generated by each generation algorithm. The synthetic datasets used for evaluation will be generated on the basis of random subsets of the unseen data.
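For intuition, a deliberately naive seeker is sketched below. The signature and the balanced-membership threshold are our assumptions, not part of the competition specification; the attack exploits the fact that an overfitted generator tends to place synthetic records suspiciously close to the real records it saw.

```python
import numpy as np

def seeker(synthetic: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Hypothetical seeker entry point (illustrative signature only).

    synthetic:  (n_syn, T, F) dataset produced by a hider.
    candidates: (n_cand, T, F) records, some of which were in the real
                subset the hider saw; returns one membership guess each.
    """
    syn = synthetic.reshape(len(synthetic), -1)
    cand = candidates.reshape(len(candidates), -1)
    # Distance from each candidate to its nearest synthetic record:
    # members of the hider's training subset tend to sit closer.
    d = np.linalg.norm(cand[:, None, :] - syn[None, :, :], axis=-1).min(axis=1)
    # Guess that the closest half are members (assumes a balanced
    # evaluation set, which is a simplification of this sketch).
    return d <= np.median(d)
```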
Schematics and descriptions of the mechanics of submissions and evaluations
Evaluation and scoring
Briefly, each head-to-head matchup is a zero-sum game.
Hiders
Hiders will be scored according to how well their generation algorithms hold up to membership inference attacks.
In addition, hider submissions are required to adequately capture the feature and temporal correlations in the original data; accordingly, they must also first pass a minimum quality bar (in terms of fidelity and predictivity) in order to qualify for competition. (Although the trade-off between quality and privacy is very interesting in its own right, for purposes of fair comparison we fix the former to allow ranking in terms of the latter).
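One common way to probe such a quality bar is a "train on synthetic, test on real" check: fit a simple model on the synthetic data and measure how well it predicts on held-out real data. The sketch below illustrates the idea with a toy task and model of our own choosing; it is not the competition's actual fidelity/predictivity metric.

```python
# Illustrative "train on synthetic, test on real" predictivity check
# (our sketch; the competition's actual metric may differ).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def predictivity_score(synthetic: np.ndarray, real: np.ndarray) -> float:
    """AUC on real data for a model trained only on synthetic data.
    Toy task: predict whether feature 0 ends above its median, using
    all earlier time steps as input. Arrays are (n, T, F)."""
    def split(data):
        X = data[:, :-1, :].reshape(len(data), -1)
        y = (data[:, -1, 0] > np.median(data[:, -1, 0])).astype(int)
        return X, y
    X_syn, y_syn = split(synthetic)
    X_real, y_real = split(real)
    clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    return roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1])
```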
Seekers
Seekers will be scored according to their accuracy at the membership inference task over each hider submission: that is, how often they correctly identify whether a given instance was used in generating a given synthetic dataset.
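Read together with the hider scoring above, one natural formalization of the zero-sum matchup, our reading rather than the official scoring code, is that a seeker earns its membership-inference accuracy against a hider while the hider receives the complement:

```python
import numpy as np

def matchup(seeker_guesses: np.ndarray, true_membership: np.ndarray):
    """Score one hider-vs-seeker matchup (assumed zero-sum formalization)."""
    seeker_score = float(np.mean(seeker_guesses == true_membership))
    return seeker_score, 1.0 - seeker_score  # every point gained is lost

# A hider that clears the quality bar could then be ranked by its worst
# case, e.g. 1 minus the best accuracy any seeker achieves against it.
```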
Dataset
The dataset used in the original NeurIPS 2020 competition was AmsterdamUMCdb, developed and released by Amsterdam UMC in the Netherlands together with the European Society of Intensive Care Medicine (ESICM). It is the first freely accessible comprehensive and high-resolution European intensive care database, and the first to have addressed compliance with the General Data Protection Regulation (GDPR, EU 2016/679) through an extensive risk-based de-identification approach.
ICU admissions represent some of the most data-dense patient episodes in healthcare, and these patients represent some of the sickest. Unlike other healthcare domains, ICU data is characterized by its granular, sequential nature, its high-dimensionality and variety of data types, as well as heterogeneous sampling patterns and frequencies. This combination of challenges poses distinctive complexities for modeling; at the same time, it offers huge potential for improving patient care in real-time settings of life-and-death decision-making—where patients are often at risk of deterioration over the span of hours or minutes. Crucially, while a range of diverse models have been investigated in medical literature, they are largely based on a small number of publicly available datasets. To date, the availability of alternative high-quality, dense, and high-dimensional datasets for verifying model generalizability has been limited—precisely due to concerns of privacy.
AmsterdamUMCdb contains approximately 1 billion clinical data points related to 23,106 admissions of 20,109 unique patients between 2003 and 2016. The released data points include patient monitor and life support device data, laboratory measurements, clinical observations and scores, medical procedures and tasks, medication, fluid balance, diagnosis groups and clinical patient outcomes. Data granularity depends on the type of data and admission year, but is up to 1 value every minute for data from patient monitor and life support devices. The data are much richer and more granular than those in other well-known freely available intensive care databases such as MIMIC, and cover patients with higher illness acuity than is found in US datasets.
We will consider adding new datasets when we relaunch the challenge in 2021.
Prizes (for NeurIPS 2020 competition)
For the original competition, Microsoft generously provided two $5,000 cash prizes, one for the winning team in each track.
We do not currently plan to offer prizes for the 2021 relaunch; while competition will remain an important factor in participation, the emphasis will be on learning and discovery related to synthetic data usage and techniques, and on exploring concepts of privacy.
Schedule and 2021 challenge relaunch
As part of the NeurIPS 2020 competition track, the original challenge ran from July through November 2020, with results announced in December.
As mentioned above, we plan to relaunch the challenge in 2021 with some design and implementation reworks and, we hope, additional datasets. Further details will be provided soon.
Eligibility and general restrictions
Please note that further details regarding eligibility and restrictions will be provided ahead of the 2021 relaunch of the challenge. We do, however, expect the following aspects to be carried over from the original challenge.
- Participants who have access to the underlying AmsterdamUMCdb dataset will be required to declare this to the organizers.
- To be eligible for scoring, participants are required to release the code of their submissions as open source.
- Generation algorithms may only use the public data to define and tune hyperparameters of their algorithm but may not use the public data to initialise/pre-train a model.
- Each generative and re-identification algorithm will be required to run within a specified time limit on a given GPU.
Organizing team
James Jordon
Lead coordinator // competition design // evaluation design // evaluation // baseline method provision
James Jordon is an Engineering Science PhD student at the University of Oxford. His primary research focus has been on generative models and their use for various tasks such as synthetic data generation, treatment-effect estimation and feature selection. He has published papers in several leading machine learning conferences including NeurIPS, ICML and ICLR.
Daniel Jarrett
Lead coordinator // competition design // evaluation design // evaluation // platform design and engineering
Daniel Jarrett is a Mathematics PhD student at the University of Cambridge. His primary research focus has been on representation learning for predictive, generative, and decision-making problems over time with a focus on healthcare. He has published in various journals and conferences including ICLR, NeurIPS, AISTATS, and The British Journal of Radiology.
Jinsung Yoon
Baseline method provision // data analysis // competition design advice // evaluation
Jinsung Yoon is a research scientist at Google Cloud AI. His main research interests include data imputation, model interpretation, transfer learning, and synthetic data generation using adversarial learning and reinforcement learning frameworks. He has published various papers and served as a reviewer in top-tier machine learning conferences (NeurIPS, ICML, ICLR, AAAI).
Paul Elbers
Domain expertise // data provision // competition design advice // evaluation design advice
Paul Elbers, MD, PhD, EDIC is a medical specialist in intensive care medicine at Amsterdam UMC, Amsterdam, The Netherlands. He also leads the Right Data Right Now research group at Amsterdam UMC that specifically aims to bring machine learning to the bedside of critically ill patients to improve their outcome. He is the deputy chair of the Data Science Section of the European Society of Intensive Care Medicine and co-chair of Amsterdam Medical Data Science, home of AmsterdamUMCdb, the first freely accessible European Intensive Care database.
Patrick Thoral
Domain expertise // data provision // competition design advice // evaluation design advice
Patrick Thoral, MD, EDIC works as an intensivist (medical specialist for intensive care) at Amsterdam UMC, Amsterdam, The Netherlands. With a background in medicine as well as medical informatics, he is currently responsible for the implementation of the electronic health record system in the ICU. To accelerate the use of healthcare data to improve patient outcomes, he played a major role in releasing AmsterdamUMCdb, the first freely accessible European intensive care database.
Ari Ercole
Domain expertise // data provision // competition design advice // evaluation design advice
Ari Ercole, MD, PhD, FICM, FRCA, FCI is a research-active intensive care attending physician at Cambridge University Hospitals NHS Foundation Trust with a PhD in physics and extensive experience in computing and ICU data modelling. He is chair of the European Society of Intensive Care Medicine Data Science Section and a founding Fellow of the Faculty of Clinical Informatics. He has authored numerous peer-reviewed publications on the re-use of routinely collected ICU time-series data to improve predictions and care for intensive care patients, and has been involved in a number of big-data projects such as the development of the Critical Care Health Informatics Collaborative database and the recent DAQCORD data curation guidelines.
Cheng Zhang
ML expertise // competition design advice // evaluation design advice
Cheng Zhang, PhD is a senior researcher at Microsoft Research Cambridge, UK. She leads the Data Efficient Decision Making (Project Azua) team at Microsoft. Before joining Microsoft, she was with the statistical machine learning group of Disney Research Pittsburgh, located at Carnegie Mellon University. She is interested in machine learning theory, including variational inference, deep generative models and sequential decision making under uncertainty, as well as various machine learning applications with social impact such as education and healthcare. She has published many papers in top machine learning venues including NeurIPS, ICML, ICLR, and UAI. She co-organized the Symposium on Advances in Approximate Bayesian Inference from 2017 to 2019.
Danielle Belgrave
ML expertise // competition design advice // evaluation design advice
Danielle Belgrave, PhD is a principal researcher at Microsoft Research Cambridge, working at the intersection of machine learning and healthcare. The primary focus of her work is on developing probabilistic models to understand personalised healthcare strategies. She has published extensively on this intersection in high-impact medical journals. She was tutorial chair of NeurIPS 2019 and 2020, diversity and inclusion chair of AISTATS 2020, a board member of the Deep Learning Indaba, co-organiser of the first Khipu in 2019, a board member of Women in Machine Learning, and program chair of WiML 2017, and has organised several other conferences and workshops.
Mihaela van der Schaar
General coordination and management // ML expertise // competition design advice // evaluation design advice
Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge, a Fellow at The Alan Turing Institute in London, and a Chancellor's Professor at UCLA. Mihaela was elected IEEE Fellow in 2009. She has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award. Mihaela's work has also led to 35 USA patents (many widely cited and adopted in standards) and 45+ contributions to international standards, for which she received 3 International ISO (International Organization for Standardization) Awards. In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 "Star in Computer Networking and Communications" by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.
Related reading
Our lab has published a number of papers on the topic of synthetic data and machine learning for privacy. To learn more, visit our publications page.