van der Schaar Lab

Announcing the NeurIPS 2020 hide-and-seek privacy challenge

The van der Schaar Lab is teaming up with Microsoft Research, the University of Cambridge and Amsterdam UMC to host a novel two-tracked competition to explore the meaning and limitations of data privacy.

The challenge has been accepted as part of the NeurIPS 2020 competition track, with $10,000 cash prizes to be provided to the winning teams.

Competition format

The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition, pitting synthetic data generation and patient re-identification methods against each other.

Participants in the synthetic data generation track (i.e. “hiders”) and the patient re-identification track (i.e. “seekers”) will compete by way of a new, high-quality intensive care time-series dataset: the AmsterdamUMCdb dataset.

AmsterdamUMCdb is the first freely accessible, comprehensive and high-resolution European intensive care database. It is also the first to have addressed compliance with the General Data Protection Regulation using an extensive risk-based de-identification approach. The data is much richer and more granular than that in other well-known freely available intensive care databases.

– Paul Elbers, MD, Ph.D., EDIC (co-chair, Amsterdam Medical Data Science)

Rather than falling back on fixed theoretical notions of anonymity, the competition allows participants on both sides to uncover the best approaches in practice for launching or defending against privacy attacks.

Background

The vast quantities of clinical data now stored in machine-readable form have the potential to revolutionize healthcare. At the same time, patient datasets are inherently highly sensitive, and privacy concerns have recently been thrown into sharp relief by several high-profile data breaches. Naturally, these issues extend beyond healthcare.

Meanwhile, the global legal framework protecting sensitive data (such as healthcare data) is a hodge-podge of ideas: for example, the US Health Insurance Portability and Accountability Act (HIPAA) and the corresponding European General Data Protection Regulation (GDPR) impose differing, and often ambiguous, constraints to limit data sharing. Both are deliberately vague in their specifications of anonymity, relying only on intuitive notions of “low probability of re-identification”. However, neither “low probability” nor “re-identification” is a well-defined concept.

In practice, the risk of patient re-identification is a pressing concern: Consider a rogue insurance company discriminating against high-risk patients per financial incentive. Such concerns caution medical institutions against releasing data for public research, hampering progress in the validation of novel computational models for real-world clinical applications.

– James Jordon (Ph.D. student, van der Schaar Lab)

Due to the high dimensionality of clinical time-series data, de-identification to preserve privacy while retaining data utility is difficult to achieve using common techniques.

An innovative approach to this problem is synthetic data generation. From a technical perspective, a good generative model for time-series data should preserve temporal dynamics, in the sense that new sequences respect the original relationships between high-dimensional variables across time. From the privacy perspective, the model should prevent patient re-identification by limiting vulnerability to membership inference attacks.
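To make the idea of a membership inference attack concrete, here is a minimal sketch of one simple seeker strategy: score each candidate record by its distance to the nearest synthetic record, and guess that the closest half were in the hider's training set. This is an illustration only, not the competition's evaluation protocol. The function names (`membership_scores`, `seek`, `leaky_hider`), the toy random data and the noise level are all assumptions made for the example, and each "patient" is reduced to a short flattened sequence of numbers.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two equal-length flattened sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def membership_scores(candidates, synthetic):
    """Score each candidate by closeness to its nearest synthetic record.
    A higher score means the candidate looks more like a training member."""
    return [-min(distance(c, s) for s in synthetic) for c in candidates]

def seek(candidates, synthetic):
    """Label the better-scoring half of the candidates as members (True)."""
    scores = membership_scores(candidates, synthetic)
    threshold = sorted(scores)[len(scores) // 2]
    return [score >= threshold for score in scores]

def random_record(rng, length=8):
    """Toy stand-in for one patient's flattened time series."""
    return [rng.random() for _ in range(length)]

def leaky_hider(train, rng, noise=0.01):
    """A deliberately weak hider (hypothetical): it copies each training
    record and adds only a small amount of Gaussian noise."""
    return [[x + rng.gauss(0.0, noise) for x in rec] for rec in train]

rng = random.Random(0)
members = [random_record(rng) for _ in range(20)]      # used by the hider
non_members = [random_record(rng) for _ in range(20)]  # never seen by the hider
synthetic = leaky_hider(members, rng)

candidates = members + non_members
truth = [True] * 20 + [False] * 20
guesses = seek(candidates, synthetic)
accuracy = sum(g == t for g, t in zip(guesses, truth)) / len(truth)
print(f"seeker accuracy: {accuracy:.2f}")
```

Against a hider that barely perturbs its training data, this seeker should do far better than the 50% expected from random guessing. A stronger hider would aim to push that figure back towards chance while still preserving the temporal structure of the data.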

Objectives

The competition aims to understand, through the practical task of membership inference attacks, the strengths and weaknesses of machine learning techniques on both sides of the privacy battle. In particular, it aims to uncover organically which existing (and potentially novel) notions of privacy and anonymity prove most meaningful in practice.

The ultimate goal is to advance generative techniques for dense and high-dimensional temporal data streams that are (1) clinically meaningful in terms of fidelity and predictivity, and (2) capable of minimizing membership privacy risks in terms of the concrete notion of patient re-identification.

Schedule & competition entry

The NeurIPS 2020 hide-and-seek privacy challenge will run from July 1, with October 1 being the deadline for final submissions.

The organizing team will evaluate submissions between October 1 and November 16, with results to be announced shortly thereafter.

James Jordon

James is a third-year DPhil student at the University of Oxford.

His research focuses on the use of generative adversarial networks in solving supervised, unsupervised and private learning problems including: estimation of individualised treatment effects, feature selection, private synthetic data generation, data imputation and transfer learning.

Of particular interest is the use of generative modelling in creating private synthetic data to allow easier data sharing and therefore more rapid advancement in specialised machine learning technologies.

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.

Nick Maxfield

From 2020 to 2022, Nick oversaw the van der Schaar Lab’s communications, including media relations, content creation, and maintenance of the lab’s online presence.