van der Schaar Lab
Data-Centric AI

On this page we introduce “Data-Centric AI”.

We will first explain exactly what we mean by data-centric AI, then provide a practical framework for engaging with it, and finally give some examples of how our lab has innovated in this area so far.

We believe that data-centric AI is a major new area of research (hence its status as a research pillar of our lab). Indeed, what we describe here is only the beginning of this important area of machine learning.

Not only are we excited to continue working on this area in the future, but we are also keen to motivate researchers outside our lab to take notice and join us in this exciting topic.

This page is one of several introductions to areas that we see as “research pillars” for our lab. It is a living document, and the content here will evolve as we continue to reach out to the machine learning and healthcare communities, building a shared vision for the future of healthcare.

Our primary means of building this shared vision is through two groups of online engagement sessions: Inspiration Exchange (for machine learning students) and Revolutionizing Healthcare (for the healthcare community). If you would like to get involved, please visit the page below.

This page is authored and maintained by Nabeel Seedat and Mihaela van der Schaar.

Introducing Data-Centric AI (DCAI)

AI-powered applications are becoming increasingly widespread across many areas and industries, including e-commerce, finance, manufacturing, medicine, and many more. However, developing robust and reliable ML systems requires many considerations that are often overlooked.

We’re aiming to change that with a data-centric AI lens. But what does it mean?

The current paradigm in machine learning is model-centric AI. The data is considered a fixed and static asset (e.g. tabular data in a .csv file, a database, a language corpus, or an image repository), often treated as somewhat external to the machine learning process. The focus is then on model iteration – whether new model architectures, novel loss functions, or optimizers – with the goal of improving predictive performance on a fixed benchmark.

Of course, this agenda is important — but we need more reliable ML systems. We believe that the current focus on models and architectures as a panacea in the ML community is often a source of brittleness in real-world applications. We believe that the data work, often undervalued as merely operational, is key to unlocking reliable ML systems.

In data-centric AI, we seek to give data center stage. Data-centric AI views model or algorithmic refinement as less important (in certain settings, algorithmic development is even considered a solved problem), and instead seeks to systematically improve the data used by ML systems.

We go further and call for an expanded definition of data-centric AI such that a data-centric lens is applicable to end-to-end pipelines.


Data-centric AI encompasses methods and tools to systematically characterize, evaluate, and generate the underlying data used to train and evaluate models.

At the ML pipeline level, this means that the considerations at each stage should be informed in a data-driven manner.
We term this a data-centric lens. Since data is the fuel for any ML system, we should keep a sharp focus on the data; yet rather than ignoring the model, we should leverage data-driven insights as feedback to systematically improve it.

DC-Check: an actionable guide for practitioners and researchers to practically engage with data-centric AI

Data-centric AI is important, but currently, there is no standardized process to communicate the design of data-centric ML pipelines. More specifically, there is no guide to the necessary considerations for data-centric AI systems, making the agenda hard to engage with practically.

DC-Check solves this by providing an actionable checklist for all stages of the ML pipeline: Data, Training, Testing, and Deployment.

For practitioners & researchers

DC-Check is aimed at both practitioners and researchers. Each component of DC-Check includes a set of data-centric questions to guide developers in day-to-day development. We also suggest concrete data-centric tools and modeling approaches based on these considerations. In addition to the checklist that guides projects, we also include research opportunities necessary to advance the research area of data-centric AI.

Beyond a documentation tool

DC-Check supports practitioners and researchers in achieving greater transparency and accountability with regard to data-centric considerations for ML pipelines. We believe that this type of transparency and accountability provided by DC-Check can be useful to a range of stakeholders, including developers, researchers, policymakers, regulators, and organization decision-makers, to understand the design considerations at each stage of the ML pipeline.

Our work so far across the pipeline


Firstly, we develop methods to quantify the quality of the input (training) data, as well as to improve it. We have looked at this along three dimensions: (1) Data Quality Evaluation, (2) Data Pre-Processing and Cleaning, and (3) Proactive Dataset Curation.

1. Data Quality Evaluation

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data

Nabeel Seedat, Jonathan Crabbé, Ioana Bica, Mihaela van der Schaar
NeurIPS 2022

High average performance can hide the fact that models systematically underperform on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity – prevalent in areas such as healthcare, where patients with similar features can have different outcomes, making reliable predictions challenging.

To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and then subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets. We show that Data-IQ’s characterization of examples is most robust to variation across similarly performant (yet different) models, compared to baselines.

Since Data-IQ can be used with any ML model (including neural networks, gradient boosting etc.), this property ensures consistency of data characterization, while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable us to construct new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
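
To make the characterization concrete, here is a minimal sketch (not the lab's implementation) of Data-IQ-style stratification using only NumPy: we track each example's predicted probability for its true label across training checkpoints, estimate aleatoric uncertainty as the average p(1 − p), and split examples into Easy, Ambiguous, and Hard subgroups. The thresholds are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def data_iq_stratify(checkpoint_probs, conf_thresh=0.75, aleatoric_thresh=0.2):
    """Stratify examples into Easy / Ambiguous / Hard subgroups.

    checkpoint_probs: shape (n_checkpoints, n_examples), each example's
    predicted probability for its *true* label at successive training
    checkpoints. The thresholds are illustrative, not the paper's values.
    """
    probs = np.asarray(checkpoint_probs)
    confidence = probs.mean(axis=0)                    # average predictive confidence
    aleatoric = (probs * (1.0 - probs)).mean(axis=0)   # average p(1 - p): data uncertainty

    groups = np.full(probs.shape[1], "Ambiguous", dtype=object)
    groups[(confidence >= conf_thresh) & (aleatoric < aleatoric_thresh)] = "Easy"
    groups[(confidence <= 1 - conf_thresh) & (aleatoric < aleatoric_thresh)] = "Hard"
    return groups

# Three toy examples tracked over four checkpoints: consistently confident,
# consistently wrong, and oscillating around 0.5.
probs = np.array([
    [0.95, 0.10, 0.40],
    [0.97, 0.05, 0.60],
    [0.96, 0.08, 0.45],
    [0.98, 0.06, 0.55],
])
groups = data_iq_stratify(probs)
print(groups)  # ['Easy' 'Hard' 'Ambiguous']
```

Note how the oscillating example ends up Ambiguous: its confidence hovers near 0.5 and its aleatoric uncertainty p(1 − p) stays high, which is exactly the subgroup the paper flags as most consequential for generalization.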

2. Data Pre-Processing & Cleaning

Generalized Iterative Imputation with Automatic Model Selection

Daniel Jarrett*, Bogdan Cebere*, Tennison Liu, Alicia Curth, Mihaela van der Schaar
ICML 2022

Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: we propose HyperImpute, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
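
The column-wise model-selection idea can be illustrated with a toy sketch (not the HyperImpute library itself): each incompletely observed column is imputed by whichever of two candidate models, a column-mean baseline or a least-squares linear model on the other columns, better reconstructs the observed entries. HyperImpute searches a much richer space of learners and hyperparameters, but the control flow is the same.

```python
import numpy as np

def iterative_impute(X, n_iters=5):
    """Toy iterative imputation with per-column model selection.

    For each column containing missing values, fit two candidate imputers
    (a column-mean baseline and a least-squares linear model on the other
    columns) and keep whichever reconstructs the observed entries better.
    A simplified sketch of the idea, not the HyperImpute library.
    """
    X = np.array(X, dtype=float)
    mask = np.isnan(X)
    X_filled = np.where(mask, np.nanmean(X, axis=0), X)  # initial fill: column means

    for _ in range(n_iters):
        for j in range(X.shape[1]):
            miss = mask[:, j]
            if not miss.any():
                continue
            obs = ~miss
            others = np.delete(X_filled, j, axis=1)
            A = np.c_[others, np.ones(len(X_filled))]    # design matrix with bias term
            coef, *_ = np.linalg.lstsq(A[obs], X_filled[obs, j], rcond=None)
            linear_pred = A @ coef

            # Model selection: in-sample squared error on the observed entries.
            err_linear = np.mean((linear_pred[obs] - X_filled[obs, j]) ** 2)
            err_mean = np.var(X_filled[obs, j])          # error of the mean baseline
            if err_linear <= err_mean:
                X_filled[miss, j] = linear_pred[miss]
            else:
                X_filled[miss, j] = X_filled[obs, j].mean()
    return X_filled

# Column 1 is exactly twice column 0, so the linear candidate wins model
# selection and the missing entry is recovered.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
X_imputed = iterative_impute(X)
print(round(X_imputed[2, 1], 6))  # 6.0
```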

More on imputation can be found here

3. Proactive dataset curation

MARS: Assisting Human with Information Processing Tasks Using Machine Learning

Cong Shen, Zhaozhi Qian, Alihan Huyuk, Mihaela van der Schaar
ACM Transactions on Computing in Healthcare

This article studies the problem of automated information processing from large volumes of unstructured, heterogeneous, and sometimes untrustworthy data sources. The main contribution is a novel framework called Machine Assisted Record Selection (MARS). Instead of today’s standard practice of relying on human experts to manually decide the order of records for processing, MARS learns the optimal record selection via an online learning algorithm. It further integrates algorithm-based record selection and processing with human-based error resolution to achieve a balanced task allocation between machine and human. Both fixed and adaptive MARS algorithms are proposed, leveraging different statistical knowledge about the existence, quality, and cost associated with the records. Experiments using semi-synthetic data generated from real-world patient record processing in the UK national cancer registry demonstrate a significant (3- to 4-fold) performance gain over fixed-order processing. MARS represents one of the few examples demonstrating that machine learning can assist humans with complex jobs by automating complex triaging tasks.

Deep Sensing: Active Sensing using Multi-directional Recurrent Neural Networks

Jinsung Yoon, William Zame, Mihaela van der Schaar
ICLR 2018

For every prediction we might wish to make, we must decide what to observe (what source of information) and when to observe it. Because making observations is costly, this decision must trade off the value of information against the cost of observation. Making observations (sensing) should be an active choice. To solve the problem of active sensing we develop a novel deep learning architecture: Deep Sensing. At training time, Deep Sensing learns how to issue predictions at various cost-performance points. To do this, it creates multiple representations at various performance levels associated with different measurement rates (costs). This requires learning how to estimate the value of real measurements vs. inferred measurements, which in turn requires learning how to infer missing (unobserved) measurements. To infer missing measurements, we develop a Multi-directional Recurrent Neural Network (M-RNN). An M-RNN differs from a bi-directional RNN in that it sequentially operates across streams in addition to within streams, and because the timing of inputs into the hidden layers is both lagged and advanced. At runtime, the operator prescribes a performance level or a cost constraint, and Deep Sensing determines what measurements to take and what to infer from those measurements, and then issues predictions. To demonstrate the power of our method, we apply it to two real-world medical datasets with significantly improved performance.
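
The cost-versus-value trade-off at the heart of active sensing can be illustrated with a deliberately simple toy rule (this is not the Deep Sensing architecture, which learns the trade-off end-to-end with an M-RNN): for a linear predictor with independent features, observing a feature rather than imputing its mean removes a known amount of prediction variance, and we keep observing features while that value exceeds the observation cost.

```python
import numpy as np

def greedy_active_sensing(coef, feature_var, costs):
    """Toy illustration of the value-vs-cost trade-off in active sensing.

    For a linear predictor y = coef @ x with independent features, observing
    feature j (instead of imputing its mean) removes coef[j]**2 * var[j]
    from the prediction variance. Observe features most-informative-first,
    keeping those whose value exceeds their observation cost.
    """
    value = np.asarray(coef) ** 2 * np.asarray(feature_var)
    order = np.argsort(-value)                       # most informative first
    return [int(j) for j in order if value[j] > costs[j]]

# Feature 0 is highly informative, feature 1 barely matters, feature 2 is
# moderately informative; with a uniform cost of 0.5 per observation we
# choose to observe features 0 and 2 and impute feature 1.
selected = greedy_active_sensing(coef=[3.0, 0.1, 1.0],
                                 feature_var=[1.0, 1.0, 1.0],
                                 costs=[0.5, 0.5, 0.5])
print(selected)  # [0, 2]
```

Varying the cost vector traces out exactly the kind of cost-performance curve that Deep Sensing learns to issue predictions along.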

ASAC: Active Sensing using Actor-Critic models

Jinsung Yoon, James Jordan, Mihaela van der Schaar
ML4HC 2019

Deciding what and when to observe is critical when making observations is costly. In a medical setting where observations can be made sequentially, making these observations (or not) should be an active choice. We refer to this as the active sensing problem. In this paper, we propose a novel deep learning framework, which we call ASAC (Active Sensing using Actor-Critic models), to address this problem. ASAC consists of two networks: a selector network and a predictor network. The selector network uses previously selected observations to determine what should be observed in the future. The predictor network uses the observations selected by the selector network to predict a label, providing feedback to the selector network (well-selected variables should be predictive of the label). The goal of the selector network is then to select variables that balance the cost of observing the selected variables with their predictive power; we wish to preserve the conditional label distribution. During training, we use the actor-critic models to allow the loss of the selector to be “back-propagated” through the sampling process. The selector network “acts” by selecting future observations to make. The predictor network acts as a “critic” by feeding predictive errors for the selected variables back to the selector network. In our experiments, we show that ASAC significantly outperforms state-of-the-art methods on two real-world medical datasets.


Secondly, a careful understanding of the data can guide the development and selection of models during the training phase.

Lifelong Bayesian Optimization

Yao Zhang, James Jordan, Ahmed Alaa, Mihaela van der Schaar

Automatic Machine Learning (Auto-ML) systems tackle the problem of automating the design of prediction models or pipelines for data science. In this paper, we present Lifelong Bayesian Optimization (LBO), an online, multitask Bayesian optimization (BO) algorithm designed to solve the problem of model selection for datasets arriving and evolving over time. To be suitable for “lifelong” Bayesian optimization, an algorithm needs to scale with the ever-increasing number of acquisitions and should be able to leverage past optimizations in learning the current best model. In LBO, we exploit the correlation between black-box functions by using components of previously learned functions to speed up the learning process for newly arriving datasets. Experiments on real and synthetic data show that LBO outperforms standard BO algorithms applied repeatedly on the data.


Thirdly, we analyze the outputs of models: both identifying the samples whose predictions can be trusted with respect to a training set, and understanding how performance can be improved under data shift.

Data-SUITE: Data-centric identification of in-distribution incongruous examples

Nabeel Seedat, Jonathan Crabbé, Mihaela van der Schaar
ICML 2022

Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle an understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a data-centric framework to identify these regions, independent of a task-specific model.

Data-SUITE leverages copula modeling, representation learning, and conformal prediction to build feature-wise confidence interval estimators based on a set of training instances. These estimators can be used to evaluate the congruence of test instances with respect to the training set, to answer two practically useful questions: (1) which test instances will be reliably predicted by a model trained with the training instances? and (2) can we identify incongruous regions of the feature space so that data owners understand the data’s limitations or guide future data collection?

We empirically validate Data-SUITE’s performance and coverage guarantees and demonstrate on cross-site medical data, biased data, and data with concept drift, that Data-SUITE best identifies ID regions where a downstream model may be reliable (independent of said model). We also illustrate how these identified regions can provide insights into datasets and highlight their limitations.
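
A minimal sketch of the feature-wise conformal idea follows (a simplification of Data-SUITE, with plain least squares standing in for its copula and representation-learning components): each feature is predicted from the others, absolute residuals on a held-out calibration set give a conformal interval half-width per feature, and a test instance is flagged as incongruous if any feature falls outside its interval.

```python
import numpy as np

def fit_feature_intervals(X_train, X_calib, alpha=0.1):
    """Split-conformal, feature-wise interval estimators (simplified sketch).

    Each feature is predicted from the remaining features with least squares
    (standing in for Data-SUITE's copula / representation-learning steps);
    absolute residuals on a held-out calibration set give a (1 - alpha)
    conformal half-width per feature.
    """
    models, widths = [], []
    for j in range(X_train.shape[1]):
        A = np.c_[np.delete(X_train, j, axis=1), np.ones(len(X_train))]
        coef, *_ = np.linalg.lstsq(A, X_train[:, j], rcond=None)
        A_cal = np.c_[np.delete(X_calib, j, axis=1), np.ones(len(X_calib))]
        resid = np.abs(A_cal @ coef - X_calib[:, j])
        models.append(coef)
        widths.append(np.quantile(resid, 1 - alpha))  # conformal residual quantile
    return models, widths

def is_congruent(x, models, widths):
    """Congruent iff every feature lies inside its conformal interval."""
    for j, (coef, w) in enumerate(zip(models, widths)):
        pred = np.r_[np.delete(x, j), 1.0] @ coef
        if abs(pred - x[j]) > w:
            return False
    return True

# Toy data in which feature 1 is (noisily) twice feature 0.
rng = np.random.default_rng(0)
x0 = rng.normal(size=400)
X = np.c_[x0, 2 * x0 + 0.05 * rng.normal(size=400)]
models, widths = fit_feature_intervals(X[:200], X[200:])

print(is_congruent(np.array([1.0, 2.0]), models, widths))   # follows the relation
print(is_congruent(np.array([1.0, 10.0]), models, widths))  # incongruous
```

Crucially, both the fitting and the congruence check happen without reference to any downstream task model, which is the point of the framework.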

Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift

Alex Chan, Ahmed M. Alaa, Zhaozhi Qian, Mihaela van der Schaar

ICML 2020

Modern neural networks have proven to be powerful function approximators, providing state-of-the-art performance in a multitude of applications. However, they fall short in their ability to quantify confidence in their predictions, which is crucial in high-stakes applications that involve critical decision-making.

Bayesian neural networks (BNNs) aim at solving this problem by placing a prior distribution over the network’s parameters, thereby inducing a posterior distribution that encapsulates predictive uncertainty. While existing variants of BNNs based on Monte Carlo dropout produce reliable (albeit approximate) uncertainty estimates over in-distribution data, they tend to exhibit over-confidence in predictions made on target data whose feature distribution differs from the training data, i.e., the covariate shift setup.

In this paper, we develop an approximate Bayesian inference scheme based on posterior regularisation, wherein unlabelled target data are used as “pseudo-labels” of model confidence that are used to regularise the model’s loss on labelled source data. We show that this approach significantly improves the accuracy of uncertainty quantification on covariate-shifted data sets, with minimal modification to the underlying model architecture.

We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.

More resources

Website & Blog

To learn more about data-centric AI and our work, check out our blog post and dedicated website.

DCAI Inspiration Exchange

For an overview of data-centric AI and the work done in our lab, we held an Inspiration Exchange session with more than 120 participants, exploring data-centric AI with Prof Isabelle Guyon (Google Brain), the NeurIPS 2022 keynote speaker on the data-centric era. We encourage you to take a look at the session, where we explored a framework for data-centric ML as well as specific methods for better understanding training and test data.