van der Schaar Lab

SyntheticData4ML Workshop at NeurIPS 2022 – Summary

2nd December 2022

Advances in machine learning owe much to the public availability of high-quality benchmark datasets and the well-defined problem settings they encapsulate. Examples are abundant: CIFAR-10 for image classification, COCO for object detection, SQuAD for question answering, BookCorpus for language modelling, and so on. There is a general belief that access to high-quality benchmark datasets is central to the continued progress of the community.

However, three prominent issues affect benchmark datasets: data scarcity, privacy, and bias. They already manifest in many existing benchmarks, and also make the curation and publication of new benchmarks difficult (if not impossible) in numerous high-stakes domains, including healthcare, finance, and education. Hence, although ML holds strong promise in these domains, the lack of high-quality benchmark datasets creates a significant hurdle for the development of methodology and algorithms and leads to missed opportunities.

Synthetic data is a promising solution to these key issues in benchmark dataset curation and publication. Specifically, high-quality synthetic data can be generated in ways that address the following major issues.

  1. Data Scarcity. The training and evaluation of ML algorithms require datasets with a sufficient sample size. Even if an algorithm can learn from very few samples, sufficient validation data is still needed for model evaluation. However, it is often challenging to obtain the desired number of samples, either because the data are inherently scarce (e.g. people with unique characteristics, patients with rare diseases) or because collection is costly or infeasible. There has been very active research in cross-domain and out-of-domain data generation, as well as generation from a few samples. Once the generator is trained, one can sample arbitrarily large synthetic datasets (see the first sketch after this list).
  2. Privacy. In many key applications, ML algorithms rely on record-level data collected from human subjects, which raises privacy concerns and legal risks. As a result, data owners are often hesitant to publish datasets for the research community, and even when they are willing, accessing the datasets often requires significant time and effort from researchers. Synthetic data is regarded as one potential way to promote privacy. The NeurIPS 2020 “Hide-and-Seek Privacy Challenge” demonstrated the difficulty of performing privacy attacks on synthetic data, and many recent works look further into the theoretical and practical aspects of synthetic data and privacy (see the second sketch after this list for a minimal empirical privacy check).
  3. Bias and under-representation. A benchmark dataset may be subject to data-collection bias and under-represent certain groups (e.g. people with less-privileged access to technology). Using such datasets as benchmarks would (implicitly) encourage the community to build algorithms that reflect or even exploit the existing bias, which is likely to hamper the adoption of ML in high-stakes applications that require fairness, such as finance and justice. Synthetic data provides a way to curate less biased benchmark data: (conditional) generative models can be used to augment any under-represented group in the original dataset, as illustrated in the first sketch after this list. Recent works have shown that training on synthetically augmented data leads to consistent improvements in robustness and generalisation.
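
To make the augmentation idea in point 3 (and the “sample as much as you need” idea in point 1) concrete, here is a minimal sketch. It uses scikit-learn's GaussianMixture purely as a stand-in for a real (conditional) tabular generator such as a conditional GAN or VAE; the data, group sizes, and column count are illustrative assumptions rather than anything from the workshop.

```python
# Minimal sketch: rebalance an under-represented group with a per-group
# generative model. GaussianMixture is only a stand-in for a real
# (conditional) generator for tabular data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Illustrative data: 1,000 records from a majority group, 50 from a minority group.
X_major = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))
X_minor = rng.normal(loc=2.0, scale=1.5, size=(50, 5))

# Fit a generative model on the under-represented group
# (conditioning "by hand" by fitting per group).
gen_minor = GaussianMixture(n_components=3, random_state=0).fit(X_minor)

# Once the generator is trained, we can draw as many synthetic records
# as needed to bring the minority group up to the majority group's size.
n_needed = len(X_major) - len(X_minor)
X_synth, _ = gen_minor.sample(n_needed)

# Augmented training set: real majority + real minority + synthetic minority.
X_balanced = np.vstack([X_major, X_minor, X_synth])
group = np.concatenate([
    np.zeros(len(X_major)),            # 0: majority (real)
    np.ones(len(X_minor) + n_needed),  # 1: minority (real + synthetic)
])
print(X_balanced.shape, group.shape)   # (2000, 5) (2000,)
```

The same pattern applies to the data-scarcity point: once a generator has been fit (possibly with transfer from a related domain), the size of the released synthetic benchmark is no longer limited by the number of real records.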
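
On the privacy side (point 2), empirical audits of synthetic data often begin with simple checks such as the distance to the closest real record: synthetic rows that sit unusually close to training rows may be memorised copies. The sketch below is a generic illustration of that idea, not the attack used in the Hide-and-Seek challenge or any method from the accepted papers; the data and the flagging threshold are illustrative assumptions.

```python
# Minimal sketch: distance-to-closest-record (DCR) check for synthetic data.
# Synthetic rows lying unusually close to real training rows can indicate
# memorisation and hence privacy leakage. Data and threshold are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_real = rng.normal(size=(500, 5))    # real (training) records
X_synth = rng.normal(size=(500, 5))   # records drawn from some generator

# Distance from each synthetic record to its nearest real record.
nn = NearestNeighbors(n_neighbors=1).fit(X_real)
dcr, _ = nn.kneighbors(X_synth)
dcr = dcr.ravel()

# Reference distribution: nearest-neighbour distances within the real data
# itself (2nd neighbour, since the 1st is the point itself).
d_real, _ = NearestNeighbors(n_neighbors=2).fit(X_real).kneighbors(X_real)
baseline = d_real[:, 1]

# Crude flag: synthetic records closer to the training data than the
# 5th percentile of real-to-real distances deserve a closer look.
threshold = np.percentile(baseline, 5)
n_flagged = int((dcr < threshold).sum())
print(f"median DCR: {np.median(dcr):.3f}, flagged records: {n_flagged}")
```

Real audits (e.g. membership- or attribute-inference attacks) go well beyond this, but the check conveys why releasing generated records rather than raw ones can help, and why it is not a guarantee on its own.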

Why do we need this workshop? Despite the growing interest in using synthetic data to empower ML, this agenda remains challenging because it spans multiple research fields and involves various industry stakeholders. Specifically, it calls for collaboration among researchers in generative modelling, privacy, and fairness. Existing research in generative models focuses on generating high-fidelity data, often neglecting the privacy and fairness aspects; conversely, existing research in privacy and fairness tends to focus on the discriminative setting rather than the generative setting. Finally, while generative modelling of images and tabular data has matured, the generation of time series and multi-modal data is still a vibrant area of research, especially in complex domains such as healthcare and finance. Data modalities and characteristics differ significantly across application domains and industries, so input from industry experts is essential if benchmarks are to reflect reality.

The goal of this workshop is to provide a platform for vigorous discussion between researchers across various fields of ML and industry experts, in the hope of advancing the idea of using synthetic data to empower ML research. The workshop also provides a forum for constructive debate and for identifying strengths and weaknesses relative to alternative approaches, e.g. federated learning.

Invited Speakers

Mehryar Mohri (NYU CIMS) – Preserving Privacy for Data Publishing

Kalyan Veeramachaneni (MIT CSAIL) – Generative Models beyond Images

Bo Li (UIUC) – Fairness in Synthetic Data

Max Welling (University of Amsterdam) – Synthetic Data: The Fifth Paradigm

Carsten Utoft Niemann (Rigshospitalet) – Teaching with Synthetic Data: A Case Study in Medical Education

Invited Panellists

Dino Oglic (AstraZeneca)

Katrina Ligett (Hebrew University)

Freedom Gumedze (University of Cape Town)

Rachel Cummings (Columbia University)

Bo Li (UIUC)

Best Paper Awards

PrivE: Empirical Privacy Evaluation of Synthetic Data Generators

Mitigating Health Data Poverty: Generative Approaches versus Resampling for Time-series Clinical Data

Accepted Papers

Contributed talk session 1 (9:20 am – 10:00 am)

  1. Generating Synthetic Datasets by Interpolating along Generalized Geodesics
  2. Synthetic Clinical Trial Data while Preserving Subject-Level Privacy
  3. Stutter-TTS: Synthetic Generation of Diverse Stuttered Voice Profiles
  4. ReSPack: A Large-Scale Rectilinear Steiner Tree Packing Data Generator and Benchmark
  5. Visual Pre-training for Navigation: What Can We Learn from Noise?
  6. HAPNEST: An efficient tool for generating large-scale genetics datasets from limited training data
  7. Improving dermatology classifiers across populations using images generated by large diffusion models
  8. Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding
  9. Leading by example: Guiding knowledge transfer with adversarial data augmentation
  10. Exploring Biases in Facial Expression Analysis
  11. Noise-Aware Statistical Inference with Differentially Private Synthetic Data
  12. A source data privacy framework for synthetic clinical trial data
  13. Approaches to Optimizing Medical Treatment Policy using Temporal Causal Model-Based Simulation
  14. Entity-Controlled Synthetic Text Generation using Contextual Question and Answering with Pre-trained Language Models
  15. Distributional Privacy for Data Sharing
  16. Vine Copula Based Data Generation for Machine Learning With an Application to Industrial Processes
  17. Systematic review of effect of data augmentation using paraphrasing on Named entity recognition
  18. PRISIM: Privacy Preserving Synthetic Data Simulator
  19. Fair Synthetic Data Does not Necessarily Lead to Fair Models
  20. FARE: Provably Fair Representation Learning

Contributed talk session 2 (10:30 am – 11:20 am)

  1. SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data
  2. MAQA: A Multimodal QA Benchmark for Negation
  3. Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders
  4. Mitigating Health Data Poverty: Generative Approaches versus Resampling for Time-series Clinical Data
  5. Private GANs, Revisited
  6. Contrastive Learning on Synthetic Videos for GAN Latent Disentangling
  7. HandsOff: Labeled Dataset Generation with No Additional Human Annotations
  8. Secure Multiparty Computation for Synthetic Data Generation from Distributed Data
  9. Counterfactual Fairness in Synthetic Data Generation
  10. HyperTime: Implicit Neural Representations for Time Series
  11. Generic and Privacy-free Synthetic Data Generation for Pretraining GANs
  12. Importance of Synthesizing High-quality Data for Text-to-SQL Parsing
  13. Fast Learning of Multidimensional Hawkes Processes via Frank-Wolfe
  14. TAPAS: a Toolbox for Adversarial Privacy Auditing of Synthetic Data
  15. On the legal nature of synthetic data
  16. Synthesizing Informative Training Samples with GAN
  17. Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings
  18. C-GATS: Conditional Generation of Anomalous Time Series
  19. Medical Scientific Table-to-Text Generation with Synthetic Data under Data Sparsity Constraint
  20. Mind Your Step: Continuous Conditional GANs with Generator Regularization
  21. Federated Learning on Patient Data for Privacy-Protecting Polycystic Ovary Syndrome Treatment
  22. Random Walk based Conditional Generative Model for Temporal Networks with Attributes
  23. Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization
  24. Conditional Progressive Generative Adversarial Network for satellite image generation
  25. Multi-Modal Conditional GAN: Data Synthesis in the Medical Domain
  26. Hypothesis Testing using Causal and Causal Variational Generative Models