van der Schaar Lab
Data-Centric AI

On this page we introduce “Data-Centric AI”.

We first explain what we mean by data-centric AI. We then provide a practical framework for engaging with it, followed by examples of how our lab has innovated in this area so far.

We believe that data-centric AI is a major new area of research, which is why it is a research pillar of our lab. What we present here is only the beginning of this important area in machine learning.

We are excited to continue working in this area, and we are keen to encourage researchers outside our lab to take notice and join us.

This page is one of several introductions to areas that we see as “research pillars” for our lab. It is a living document, and the content here will evolve as we continue to reach out to the machine learning and healthcare communities, building a shared vision for the future of healthcare.

Our primary means of building this shared vision is through two groups of online engagement sessions: Inspiration Exchange (for machine learning students) and Revolutionizing Healthcare (for the healthcare community). If you would like to get involved, please visit the page below.

This page is authored and maintained by Nabeel Seedat and Mihaela van der Schaar.

Introducing Data-Centric AI (DCAI)

AI-powered applications are becoming increasingly widespread across many industries, including e-commerce, finance, manufacturing, and medicine. However, developing robust and reliable ML systems requires many considerations that are often overlooked.

We aim to change that with a data-centric AI lens. But what does that mean?

The current paradigm in machine learning is model-centric AI. The data is considered a fixed and static asset (e.g. tabular data in a .csv file, a database, a language corpus or an image repository). The data is often considered somewhat external to the machine learning process. The focus is then on model iteration, whether it is new model architectures, novel loss functions or optimizers – with the goal of improving predictive performance for a fixed benchmark.

Of course, this agenda is important — but we need more reliable ML systems. We believe that the current focus on models and architectures as a panacea in the ML community is often a source of brittleness in real-world applications. We believe that the data work, often undervalued as merely operational, is key to unlocking reliable ML systems.

In data-centric AI, we seek to give data center stage. Data-centric AI views model or algorithmic refinement as less important (in certain settings, algorithmic development is even considered a solved problem) and instead seeks to systematically improve the data used by ML systems.

We go further and call for an expanded definition of data-centric AI such that a data-centric lens is applicable to end-to-end pipelines.

DEFINITION

Data-centric AI encompasses methods and tools to systematically characterize, evaluate, and generate the underlying data used to train and evaluate models.

At the ML pipeline level, this means that the considerations at each stage should be informed in a data-driven manner. We term this a data-centric lens. Since data is the fuel for any ML system, we should keep a sharp focus on the data. Yet rather than ignoring the model, we should use data-driven insights as feedback to systematically improve it.

DC-Check: an actionable guide for practitioners and researchers to practically engage with data-centric AI

Data-centric AI is important, but currently, there is no standardized process to communicate the design of data-centric ML pipelines. More specifically, there is no guide to the necessary considerations for data-centric AI systems, making the agenda hard to engage with practically.

DC-Check addresses this by providing an actionable checklist for all stages of the ML pipeline: Data, Training, Testing, and Deployment.
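To make the idea of a stage-by-stage checklist concrete, here is a minimal sketch of how a project team might track such questions in code. It is an illustration only: the class names, the `unanswered` helper, and the example questions are hypothetical and are not taken from DC-Check itself. Only the four stage names come from the text above.

```python
from dataclasses import dataclass, field

# The four pipeline stages named in the text above.
STAGES = ("Data", "Training", "Testing", "Deployment")


@dataclass
class ChecklistItem:
    stage: str
    question: str      # hypothetical wording, not quoted from DC-Check
    answer: str = ""   # filled in by the project team

    def is_answered(self) -> bool:
        return bool(self.answer.strip())


@dataclass
class Checklist:
    items: list = field(default_factory=list)

    def add(self, stage: str, question: str) -> ChecklistItem:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        item = ChecklistItem(stage, question)
        self.items.append(item)
        return item

    def unanswered(self) -> dict:
        """Map each stage to the questions that still lack an answer."""
        pending = {stage: [] for stage in STAGES}
        for item in self.items:
            if not item.is_answered():
                pending[item.stage].append(item.question)
        return pending


# Illustrative usage with made-up questions.
checklist = Checklist()
q1 = checklist.add("Data", "How was the quality of the training data evaluated?")
checklist.add("Deployment", "How is data shift handled after release?")
q1.answer = "Per-column missingness and duplicates were reviewed before training."
pending = checklist.unanswered()
```

Keeping the stage on every item makes it easy to report which parts of the pipeline still have open data-centric questions.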

For practitioners & researchers

DC-Check is aimed at both practitioners and researchers. Each component of DC-Check includes a set of data-centric questions to guide developers in day-to-day development. We also suggest concrete data-centric tools and modeling approaches based on these considerations. In addition to the checklist that guides projects, we also include research opportunities necessary to advance the research area of data-centric AI.

Beyond a documentation tool

DC-Check supports practitioners and researchers in achieving greater transparency and accountability with regard to data-centric considerations for ML pipelines. We believe that this transparency and accountability can help a range of stakeholders, including developers, researchers, policymakers, regulators, and organizational decision-makers, to understand the design considerations at each stage of the ML pipeline.

Our work so far across the pipeline

Data

Firstly, we develop methods to quantify the quality of the input (training) data, as well as to improve it. We have looked at this along three dimensions: (1) Data Quality Evaluation, (2) Data Pre-processing and Cleaning, and (3) Proactive Dataset Curation.
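As a simple illustration of what "quantifying and improving" data quality can look like in practice, the sketch below computes two basic indicators on a toy table (missing-value rate per column and number of duplicate rows) and then applies one cleaning step. This is a generic baseline, not one of the methods listed below; the function names and the toy records are assumptions made for the example.

```python
def quality_report(rows, columns):
    """Return missing-value rates per column and the number of duplicate rows.

    `rows` is a list of dicts; a value of None counts as missing.
    """
    n = len(rows)
    missing_rate = {
        col: (sum(1 for r in rows if r.get(col) is None) / n if n else 0.0)
        for col in columns
    }
    seen = set()
    duplicates = 0
    for r in rows:
        key = tuple(r.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"n_rows": n, "missing_rate": missing_rate, "duplicate_rows": duplicates}


def drop_duplicates(rows, columns):
    """Keep the first occurrence of each distinct row (a simple cleaning step)."""
    seen = set()
    kept = []
    for r in rows:
        key = tuple(r.get(col) for col in columns)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


# Toy records, invented for illustration.
columns = ["age", "bp"]
rows = [
    {"age": 54, "bp": 130},
    {"age": None, "bp": 120},
    {"age": 54, "bp": 130},   # exact duplicate of the first row
    {"age": 61, "bp": None},
]
report = quality_report(rows, columns)
cleaned = drop_duplicates(rows, columns)
```

Here the report quantifies the quality of the data, and the cleaning step improves it. The methods below address the same two goals with far more sophisticated tools.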

1. Data Quality Evaluation

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data (NeurIPS 2022)

Nabeel Seedat, Jonathan Crabbé, Ioana Bica, Mihaela van der Schaar
NeurIPS 2022

2. Data Pre-Processing & Cleaning

HyperImpute: Generalized Iterative Imputation with Automatic Model Selection

Daniel Jarrett*, Bogdan Cebere*, Tennison Liu, Alicia Curth, Mihaela van der Schaar
ICML 2022

More on imputation can be found here

3. Proactive dataset curation

MARS: Assisting Human with Information Processing Tasks Using Machine Learning

Cong Shen, Zhaozhi Qian, Alihan Huyuk, Mihaela van der Schaar
ACM Transactions on Computing in Healthcare

Deep Sensing: Active Sensing using Multi-directional Recurrent Neural Networks

Jinsung Yoon, William Zame, Mihaela van der Schaar
ICLR 2018

ASAC: Active Sensing using Actor-Critic models

Jinsung Yoon, James Jordon, Mihaela van der Schaar
ML4HC 2019

Training

Secondly, a careful understanding of the data can guide the development and selection of models during the training phase.

Lifelong Bayesian Optimization

Yao Zhang, James Jordon, Ahmed Alaa, Mihaela van der Schaar

Deployment

Thirdly, we analyze the output of models: both which samples we can trust predictions for with respect to the training set, and how performance can be improved under data shift.
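One very simple way to think about "which samples to trust with respect to a training set" is to flag test samples whose features lie far from what the model saw during training. The sketch below does this with per-feature z-scores. It is a deliberately naive baseline under stated assumptions (numeric features, an arbitrary threshold of 3, invented toy data) and is not the Data-SUITE method or any other method listed below.

```python
from statistics import mean, pstdev


def fit_feature_stats(train_rows):
    """Compute (mean, standard deviation) for each feature column."""
    columns = list(zip(*train_rows))
    return [(mean(col), pstdev(col)) for col in columns]


def flag_untrusted(test_rows, stats, z_threshold=3.0):
    """Return one boolean per test row: True if any feature is far from training."""
    flags = []
    for row in test_rows:
        flagged = False
        for x, (mu, sigma) in zip(row, stats):
            if sigma == 0:
                # Constant feature in training: any deviation is suspicious.
                z = 0.0 if x == mu else float("inf")
            else:
                z = abs(x - mu) / sigma
            if z > z_threshold:
                flagged = True
                break
        flags.append(flagged)
    return flags


# Toy data, invented for illustration: (age, blood pressure).
train = [(50, 120), (55, 125), (60, 130), (65, 135), (70, 140)]
test = [(58, 128), (62, 200)]

stats = fit_feature_stats(train)
flags = flag_untrusted(test, stats)  # flags == [False, True]
```

Predictions for flagged samples would be treated with more caution, for example by deferring to a human. A check like this only catches samples outside the training range. It cannot catch unreliable samples that look in-distribution, which is the harder case.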

Data-SUITE: Data-centric identification of in-distribution incongruous examples (ICML 2022)

Nabeel Seedat, Jonathan Crabbé, Mihaela van der Schaar
ICML 2022

Data-SUITE Video

Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift

Alex Chan, Ahmed M. Alaa, Zhaozhi Qian, Mihaela van der Schaar
ICML 2020

More resources

Website & Blog

To learn more about data-centric AI and our work, check out our blog post and dedicated website.

DCAI Inspiration Exchange

For an overview of data-centric AI and the work done in our lab, we held an Inspiration Exchange with more than 120 participants, exploring data-centric AI with Prof Isabelle Guyon (Google Brain), the NeurIPS 2022 keynote speaker on the data-centric era. We encourage you to take a look at the session, where we explored a framework for data-centric ML as well as specific methods for better understanding training and test data.