van der Schaar Lab

AAAI-23: Synthetic Data Tutorial

This AAAI tutorial will be presented by Mihaela van der Schaar and Zhaozhi Qian on Wednesday, 8 February 2023, from 2 to 6 pm EST. This is a hybrid event (in person/online); you can register for it here.


Innovative Uses of Synthetic Data Tutorial


One of the biggest barriers to AI adoption is the difficulty of accessing high-quality training data. Synthetic data has been widely recognised as a viable solution to this problem. It allows sharing, augmenting and de-biasing data for building performant and socially responsible AI algorithms. However, despite significant progress in theory and algorithms, the community still lacks a unified software package that enables practical data sharing and access with synthetic data.

This lab aims to bridge this gap by introducing synthcity, an open-source Python library that implements an array of cutting-edge synthetic data generators for data generation problems that are common to many applications.

Goals of the lab

The primary objectives of this lab are:

• Presenting synthetic data as a viable solution to the common problems of data scarcity, privacy-preserving data sharing, and bias through case studies in healthcare, education and finance.
• Familiarising the participants with synthcity, an open-source Python library that offers an array of cutting-edge synthetic data generators designed to solve the use cases discussed above.
• Familiarising the participants with the best practices in synthetic data generation, e.g. pre-processing, initialisation and hyper-parameter tuning.
• Encouraging the participants to use synthetic data to build analytics for a variety of real applications.
• Building interest in the community to contribute to the future development of synthcity and synthetic data methodologies in general.
• Providing a set of well-implemented SOTA benchmarks for future research and competitions in synthetic data.

To achieve these goals, the lab will feature three case studies, each focusing on one innovative use of synthetic data (privacy, fairness, or multi-source learning). We believe that a lab is the right format because it allows us to showcase the synthcity library via real-world case studies with hands-on components.

Synthetic data has been the topic of several workshops and tutorials at top AI/ML conferences recently (e.g. our ICML 2021 tutorial, our workshop at NeurIPS 2022), as well as competitions (e.g. NeurIPS 2020 Hide-and-Seek competition). However, this AAAI lab will be the first to introduce an open-source software ecosystem for validating, building, and using synthetic data. This will allow participants to gain experience by experimenting in a large variety of settings and tasks.

What you can expect to learn

Participants will gain hands-on experience in using synthcity to address common challenges in generating synthetic data, and in using the generated synthetic data to train various machine learning models. They will also gain a deeper knowledge of the theory, algorithms, best practices and limitations of synthetic data generation.

We will aim for minimal required prerequisite knowledge. However, we will assume basic knowledge of generative models (e.g., GANs, VAEs) and basic Python skills.

We hope that this lab will prepare AI researchers and practitioners to use synthetic data tools in real applications. It will also facilitate research in this area by providing a suite of strong baseline methods.

Start | End | Session Title | Description
02:00 pm | 02:30 pm | Opening and Intro | We go through the promise of synthetic data in empowering AI development and the associated challenges.
02:30 pm | 03:15 pm | Data Modality | We demonstrate how synthcity can generate tabular data with diverse modalities, including static data, regular and irregular time series, data with censoring, multi-source data, and composite data.
03:15 pm | 03:30 pm | Q&A |
03:30 pm | 04:00 pm | Break |
04:00 pm | 04:30 pm | Fairness | We show how synthetic data can promote ML fairness by (1) augmenting minority classes with conditional generation and (2) removing bias via causal generation.
04:30 pm | 05:00 pm | Privacy | We introduce privacy-preserving synthetic data generators that facilitate sharing of sensitive data. We will cover differential-privacy-based methods as well as methods that defend against specific threat models.
05:00 pm | 05:30 pm | Transfer | We show how to alleviate data scarcity by augmenting a small dataset using information learned from other related datasets in a transfer learning style.
05:30 pm | 05:45 pm | Q&A |
05:45 pm | 06:00 pm | Further Engagement | We discuss ways of further engaging with the application and development of synthcity.

If you’d like to learn more about our lab’s research in the area of synthetic data generation and evaluation, you can find a full overview here.

You can have a look at our Revolutionizing Healthcare session on the topic of synthetic data in healthcare here.

Also, consider registering and joining us for our Inspiration Exchange session on 1 February 2023; this session will be focused on Synthetic Data and we will introduce our new open-source software – synthcity.

You can find our previous Inspiration Exchange sessions on synthetic data here and here.

Other useful links:
– Our lab’s publications
– Mihaela van der Schaar on Twitter and LinkedIn

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.

Zhaozhi Qian

After obtaining an MSc in Machine Learning at UCL, Zhaozhi Qian started his career as a data scientist at the largest mobile gaming company in Europe. Three years later, he found it might be more fulfilling to apply AI to cure cancer than to make gamers hit the purchase button 1% more often.

He thus joined the group in 2019 as a PhD student focusing on robust and interpretable learning for longitudinal data. So far, his work has included inferring latent disease interaction networks from Electronic Health Records, uncovering the causal structure between events that unfold over time, and calibrating the predictive uncertainty under domain shift.

Zhaozhi also worked as a contractor in the NHS during the COVID-19 pandemic, contributing his analytical skills to the national response.