Innovative Uses of Synthetic Data Tutorial
One of the biggest barriers to AI adoption is the difficulty of accessing high-quality training data. Synthetic data has been widely recognised as a viable solution to this problem. It allows sharing, augmenting and de-biasing data for building performant and socially responsible AI algorithms. However, despite significant progress in theory and algorithms, the community still lacks a unified software library that enables practical data sharing and access with synthetic data.
This lab aims to bridge this gap by introducing synthcity, an open-source Python library that implements an array of cutting-edge synthetic data generators for the data generation problems that recur across many applications.
Goals of the lab
The primary objectives of this lab are:
• Presenting synthetic data as a viable solution to the common problems of data scarcity, privacy-preserving data sharing, and bias through case studies in healthcare, education and finance.
• Familiarising the participants with synthcity, an open-source Python library that offers an array of cutting-edge synthetic data generators designed to solve the use cases discussed above.
• Familiarising the participants with the best practices in synthetic data generation, e.g. pre-processing, initialisation and hyper-parameter tuning.
• Encouraging the participants to use synthetic data to build analytics for a variety of real applications.
• Building interest in the community to contribute to the future development of synthcity and synthetic data methodologies in general.
• Providing a set of well-implemented SOTA benchmarks for future research and competitions in synthetic data.
To achieve these goals, the lab will feature three case studies, each focusing on one innovative use of synthetic data (among privacy, fairness, and multi-source learning). We believe that the lab is the right format because it allows us to showcase the synthcity library via real-world case studies with hands-on components.
Synthetic data has been the topic of several workshops and tutorials at top AI/ML conferences recently (e.g. our ICML 2021 tutorial, our workshop at NeurIPS 2022), as well as competitions (e.g. NeurIPS 2020 Hide-and-Seek competition). However, this AAAI lab will be the first to introduce an open-source software ecosystem for validating, building, and using synthetic data. This will allow participants to gain experience by experimenting in a large variety of settings and tasks.
What you can expect to learn
The participants will gather hands-on experience in using synthcity to address common challenges associated with generating synthetic data as well as using the generated synthetic data for training various machine learning models. They will also gain a deeper knowledge of the theory, algorithms, best practices as well as limitations of synthetic data generation.
We aim to keep the prerequisites minimal. However, we will assume basic knowledge of generative models (e.g., GANs, VAEs) and basic Python skills.
We hope that this lab will prepare AI researchers and practitioners to use synthetic data tools in real applications. It will also facilitate research in this area by providing a suite of strong baseline methods.
• Opening and Intro: We go through the promise of synthetic data in empowering AI development and the associated challenges.
• We demonstrate how synthcity can generate tabular data with diverse modalities, including static data, regular and irregular time series, data with censoring, multi-source data, and composite data.
• We show how synthetic data can promote ML fairness by (1) augmenting minority classes with conditional generation and (2) removing bias via causal generation.
• We introduce privacy-preserving synthetic data generators that facilitate the sharing of sensitive data. We will cover differential-privacy-based methods as well as methods that defend against specific threat models.
• We show how to alleviate data scarcity by augmenting a small dataset using information learned from other related datasets, in a transfer learning style.
• We discuss ways of further engaging with the application and development of synthcity.
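To make the first fairness idea in the agenda concrete (augmenting minority classes with conditional generation), here is a minimal toy sketch. It does not use synthcity or reproduce any of its generators: a simple per-class Gaussian, fitted independently per feature, stands in for a learned conditional generator, and all function names (`fit_class_conditional`, `sample_conditional`, `balance_with_synthetic`) are hypothetical names chosen for this illustration.

```python
import random
import statistics
from collections import defaultdict


def fit_class_conditional(rows, labels):
    """Fit an independent Gaussian (mean, std) per feature for each class.

    Assumption: this toy model stands in for a learned conditional generator.
    """
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    params = {}
    for label, members in by_class.items():
        columns = list(zip(*members))  # transpose rows into feature columns
        params[label] = [
            (statistics.fmean(col), statistics.pstdev(col)) for col in columns
        ]
    return params


def sample_conditional(params, label, count, rng):
    """Draw `count` synthetic rows conditioned on the given class label."""
    return [
        [rng.gauss(mu, sigma) for mu, sigma in params[label]]
        for _ in range(count)
    ]


def balance_with_synthetic(rows, labels, rng):
    """Top up every class with synthetic rows until all classes are equal in size."""
    params = fit_class_conditional(rows, labels)
    counts = {label: labels.count(label) for label in params}
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, n in counts.items():
        extra = sample_conditional(params, label, target - n, rng)
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy imbalanced dataset: 90 majority rows (label 0), 10 minority rows (label 1).
rng = random.Random(0)
majority = [[rng.gauss(0.0, 1.0), rng.gauss(5.0, 1.0)] for _ in range(90)]
minority = [[rng.gauss(3.0, 0.5), rng.gauss(1.0, 0.5)] for _ in range(10)]
rows = majority + minority
labels = [0] * 90 + [1] * 10

balanced_rows, balanced_labels = balance_with_synthetic(rows, labels, rng)
print(balanced_labels.count(0), balanced_labels.count(1))
```

The real rows are kept untouched and only the under-represented class receives synthetic samples. In practice a far more expressive conditional generator would replace the per-feature Gaussian, which ignores correlations between features.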
If you’d like to learn more about our lab’s research in the area of synthetic data generation and evaluation, you can find a full overview here.
You can have a look at our Revolutionizing Healthcare session on the topic of synthetic data in healthcare here.
Also, consider registering and joining us for our Inspiration Exchange session on 1 February 2023; this session will be focused on Synthetic Data and we will introduce our new open-source software – synthcity.