The van der Schaar Lab is teaming up with Microsoft Research, the University of Cambridge and Amsterdam UMC to host a novel two-tracked competition to explore the meaning and limitations of data privacy.
The challenge has been accepted as part of the NeurIPS 2020 competition track, with $10,000 in cash prizes to be awarded to the winning teams.
Competition format
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition, pitting synthetic data generation and patient re-identification methods against each other.
Participants in the synthetic data generation track (i.e. “hiders”) and the patient re-identification track (i.e. “seekers”) will compete by way of a new, high-quality intensive care time-series dataset: the AmsterdamUMCdb dataset.
AmsterdamUMCdb is the first freely accessible, comprehensive, and high-resolution European intensive care database. It is also the first to have addressed compliance with the General Data Protection Regulation using an extensive risk-based de-identification approach. The data is much richer and more granular than that in other well-known freely available intensive care databases.
– Paul Elbers, MD, Ph.D., EDIC (co-chair, Amsterdam Medical Data Science)
Rather than falling back on fixed theoretical notions of anonymity, the competition allows participants on both sides to uncover the best approaches in practice for launching or defending against privacy attacks.
Background
The vast quantities of clinical data now stored in machine-readable form have the potential to revolutionize healthcare. At the same time, patient datasets are inherently highly sensitive, and privacy concerns have recently been thrown into sharp relief by several high-profile data breaches. Naturally, these issues extend beyond healthcare.
Meanwhile, the global legal framework protecting sensitive data (such as healthcare data) is a hodge-podge of ideas: for example, the US Health Insurance Portability and Accountability Act (HIPAA) and the corresponding European General Data Protection Regulation (GDPR) impose differing, and often ambiguous, constraints to limit data sharing. Both are deliberately vague in their specifications of anonymity, relying only on intuitive notions of “low probability of re-identification”. However, neither “low probability” nor “re-identification” is a well-defined concept.
In practice, the risk of patient re-identification is a pressing concern: consider a rogue insurance company with a financial incentive to discriminate against high-risk patients. Such concerns make medical institutions wary of releasing data for public research, hampering progress in the validation of novel computational models for real-world clinical applications.
– James Jordon (Ph.D. student, van der Schaar Lab)
Due to the high dimensionality of clinical time-series data, de-identification to preserve privacy while retaining data utility is difficult to achieve using common techniques.
An innovative approach to this problem is synthetic data generation. From a technical perspective, a good generative model for time-series data should preserve temporal dynamics, in the sense that new sequences respect the original relationships between high dimensional variables across time. From the privacy perspective, the model should prevent patient re-identification by limiting vulnerability to membership inference attacks.
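To make the “seeker” side of this concrete, the following is a minimal illustrative sketch (not part of any official competition kit) of a simple nearest-neighbour membership inference attack on synthetic time-series data: candidates that lie unusually close to some synthetic record are flagged as likely members of the original training set. The function names, array shapes, and the flattening step are assumptions made for this example.

```python
# Illustrative sketch of a nearest-neighbour membership inference attack
# on synthetic time-series data. All names and shapes are assumptions.
import numpy as np

def membership_scores(synthetic, candidates):
    """Score each candidate record by its distance to the closest synthetic record.

    synthetic:  array of shape (n_synth, seq_len, n_features)
    candidates: array of shape (n_cand,  seq_len, n_features)
    Lower scores suggest the candidate is more likely to have been in the training set.
    """
    # Flatten each time series into a single vector (a crude but common baseline).
    synth_flat = synthetic.reshape(len(synthetic), -1)
    cand_flat = candidates.reshape(len(candidates), -1)

    scores = np.empty(len(cand_flat))
    for i, record in enumerate(cand_flat):
        # Euclidean distance to the nearest synthetic record.
        scores[i] = np.min(np.linalg.norm(synth_flat - record, axis=1))
    return scores

def infer_members(synthetic, candidates, n_members):
    """Flag the n_members candidates closest to the synthetic data as suspected members."""
    scores = membership_scores(synthetic, candidates)
    return np.argsort(scores)[:n_members]
```

A hider’s generative model, conversely, should produce sequences that are useful for downstream modelling without sitting suspiciously close to any individual training record.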
Objectives
The competition’s aim is to understand—through the practical task of membership inference attacks—the strengths and weaknesses of machine learning techniques on both sides of the privacy battle, in particular to organically uncover which existing (and potentially novel) notions of privacy and anonymity end up being the most meaningful in practice.
The ultimate goal is to advance generative techniques for dense and high-dimensional temporal data streams that are (1) clinically meaningful in terms of fidelity and predictivity, as well as (2) capable of minimizing membership privacy risks in terms of the concrete notion of patient re-identification.
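As an illustration of what “predictivity” can mean in practice, the sketch below uses the common “train on synthetic, test on real” (TSTR) idea: a model fitted only on synthetic data is evaluated on held-out real data. The binary prediction task, function names, and flattening step are assumptions for this example, not the competition’s official evaluation pipeline.

```python
# Illustrative "train on synthetic, test on real" (TSTR) check of predictivity.
# The task and all names are assumptions made for this example.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tstr_auc(synth_X, synth_y, real_X, real_y):
    """Train a classifier on synthetic data and report its AUROC on real data.

    synth_X, real_X: arrays of shape (n, seq_len, n_features)
    synth_y, real_y: binary labels of shape (n,)
    """
    # Flatten sequences into fixed-length feature vectors (a simple baseline).
    clf = LogisticRegression(max_iter=1000)
    clf.fit(synth_X.reshape(len(synth_X), -1), synth_y)
    preds = clf.predict_proba(real_X.reshape(len(real_X), -1))[:, 1]
    return roc_auc_score(real_y, preds)
```

The closer the TSTR score is to the score obtained by training on the real data itself, the more faithfully the synthetic data preserves the relationships that matter for downstream prediction.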
Schedule & competition entry
The NeurIPS 2020 Hide-and-Seek Privacy Challenge will run from July 1, with final submissions due on October 1.
The organizing team will evaluate submissions between October 1 and November 16, with results to be announced shortly thereafter.