van der Schaar Lab

SyntheticData4ML Workshop

This workshop was held at NeurIPS on 16 December 2023

This workshop brought together research communities in generative models, privacy, and fairness, as well as industry leaders, to provide a platform for vigorous discussion across these perspectives, with the aim of advancing the use of synthetic data for better and more trustworthy ML training.

About

Advances in machine learning owe much to access to high-quality training datasets and the well-defined problem settings that they encapsulate. However, access to rich, diverse, and clean datasets may not always be possible. Moreover, three prominent issues (data scarcity, privacy, and bias and fairness) make building trustworthy ML models even more challenging. These challenges already manifest in numerous high-stakes domains, including healthcare, finance and education.

Hence, although ML holds strong promise in these domains, the lack of high-quality training datasets creates a significant hurdle for the development of methodology and algorithms, and leads to missed opportunities.

Synthetic data is a promising solution to the key problem of access to high-quality training datasets. Specifically, high-quality synthetic data generation can address the following major issues.

  1. Data Scarcity. The training and evaluation of ML algorithms require datasets with a sufficient sample size. Note that even if the algorithm can learn from very few samples, we still need sufficient validation data for model evaluation. However, it is often challenging to obtain the desired number of samples, either because the data are inherently scarce (e.g. people with unique characteristics, patients with rare diseases) or because collecting them is costly or infeasible. There has been very active research in cross-domain and out-of-domain data generation, as well as generation from a few samples. Once the generator is trained, one can obtain arbitrarily large synthetic datasets.
  2. Privacy. In many key applications, ML algorithms rely on record-level data collected from human subjects, which leads to privacy concerns and legal risks. As a result, data owners are often hesitant to publish datasets for the research community. Even if they are willing to, accessing the datasets often requires significant time and effort from the researchers. Synthetic data is regarded as one potential way to promote privacy. The 2020 NeurIPS Competition “Synthetic data hide and seek challenge” demonstrated the difficulty of performing privacy attacks on synthetic data. Many recent works look further into the theoretical and practical aspects of synthetic data and privacy.
  3. Bias and under-representation. Benchmark datasets may be subject to data collection bias and may under-represent certain groups (e.g. people with less-privileged access to technology). Using these datasets as benchmarks would (implicitly) encourage the community to build algorithms that reflect or even exploit the existing bias. This is likely to hamper the adoption of ML in high-stakes applications that require fairness, such as finance and justice. Synthetic data provides a way to curate less biased benchmark data. Specifically, (conditional) generative models can be used to augment any under-represented group in the original dataset. Recent works have shown that training on synthetically augmented data leads to consistent improvements in robustness and generalisation.
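
As a rough illustration of the augmentation idea in point 3, the sketch below balances group sizes by sampling from a generator conditioned on group membership. It is a toy under stated assumptions, not a method from the workshop: the function name `augment_minority` and the data are hypothetical, and a per-group Gaussian stands in for a real conditional generative model. It offers no privacy guarantee and assumes every group has at least two samples.

```python
import numpy as np


def augment_minority(X, groups, rng):
    """Top up every group to the size of the largest one.

    Assumption: a Gaussian fitted per group is used as a stand-in for a
    conditional generative model; real tabular data needs a richer model.
    """
    labels, counts = np.unique(groups, return_counts=True)
    target = int(counts.max())
    X_parts, g_parts = [X], [groups]
    for label, count in zip(labels, counts):
        deficit = target - int(count)
        if deficit == 0:
            continue  # already the largest group, nothing to add
        X_g = X[groups == label]
        mean = X_g.mean(axis=0)
        # Small ridge keeps the covariance positive definite.
        cov = np.cov(X_g, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        X_parts.append(rng.multivariate_normal(mean, cov, size=deficit))
        g_parts.append(np.full(deficit, label))
    return np.vstack(X_parts), np.concatenate(g_parts)


# Hypothetical data: 200 majority records and 20 minority records.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(3.0, 1.0, size=(20, 2)),
])
groups = np.array([0] * 200 + [1] * 20)

X_aug, groups_aug = augment_minority(X, groups, rng)
```

After augmentation both groups contain the same number of records, with the original records kept unchanged at the start of `X_aug`. Whether such augmentation actually improves fairness or robustness depends on how faithfully the generator captures the minority group.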

Why do we need this workshop? Despite the growing interest in using synthetic data, this agenda is still challenging because existing research in generative models focuses on generating high-fidelity data, often neglecting privacy and fairness. On the other hand, existing research in privacy and fairness often focuses on the discriminative setting rather than the generative setting. The field also lacks consistent benchmarking across these different perspectives. It is therefore important to bring researchers on this topic together to clarify gaps and challenges in the field.

We will further discuss how recent advances in Large Language Models can be used to generate high-quality synthetic data in various domains, with a focus on different modalities such as tabular and time series datasets. The aim is to generate high-quality datasets for ML training with privacy and fairness in mind.

The goal of this workshop is to provide a platform for vigorous discussion among researchers in various fields of ML and industry experts, in the hope of advancing the ideal of using synthetic data to empower trustworthy ML training. The workshop also provides a forum for constructive debate and for identifying the strengths and weaknesses of alternative approaches.