van der Schaar Lab
Data-Centric AI

On this page we introduce “Data-Centric AI”.

We will first introduce what we mean exactly when talking about data-centric AI. We then provide a practical framework to engage with data-centric AI, and then provide some examples of how our lab has innovated in this area so far.

We believe that data-centric AI is a major new area of research, which is why it is a research pillar of our lab. What we present here is only the beginning of this important area in machine learning. We are excited to continue working on it, and we are keen to motivate researchers outside our lab to take notice and join us.

This page is one of several introductions to areas that we see as “research pillars” for our lab. It is a living document, and the content here will evolve as we continue to reach out to the machine learning and healthcare communities, building a shared vision for the future of healthcare.

Our primary means of building this shared vision is through two groups of online engagement sessions: Inspiration Exchange (for machine learning students) and Revolutionizing Healthcare (for the healthcare community). If you would like to get involved, please visit the page below.

This page is authored and maintained by Nabeel Seedat and Mihaela van der Schaar.

Introducing Data-Centric AI

AI-powered applications are becoming increasingly widespread across many areas, including e-commerce, finance, manufacturing, and medicine. However, successfully developing robust and reliable ML systems requires many considerations that are often overlooked.

We’re aiming to change that with a data-centric AI lens. But what does that mean?

The current paradigm in machine learning is model-centric AI. The data is considered a fixed and static asset (e.g. tabular data in a .csv file, a database, a language corpus or image repository). The data is often considered somewhat external to the machine learning process. The focus is then on model iteration, whether it is new model architectures, novel loss functions or optimizers – with the goal of improving predictive performance for a fixed benchmark.

Of course, this agenda is important — but we need more reliable ML systems. We believe that the current focus on models and architectures as a panacea in the ML community is often a source of brittleness in real-world applications. We believe that the data work, often undervalued as merely operational, is key to unlocking reliable ML systems.

In data-centric AI, we seek to give data center stage. Data-centric AI views model or algorithmic refinement as less important (and in certain settings algorithmic development is even considered as a solved problem), and instead seeks to systematically improve the data used by ML systems.

We go further and call for an expanded definition of data-centric AI such that a data-centric lens is applicable for end-to-end pipelines.


Data-centric AI encompasses methods and tools to systematically characterize, evaluate and generate the underlying data used to train and evaluate models.

At the ML pipeline level, this means that the considerations at each stage should be informed in a data-driven manner.
We term this a data-centric lens. Since data is the fuel for any ML system, we should keep a sharp focus on the data, yet rather than ignoring the model, we should leverage the data-driven insights as feedback to systematically improve the model.

DC-Check: an actionable guide for practitioners and researchers to practically engage with data-centric AI

Data-centric AI is important, but currently, there is no standardized process to communicate the design of data-centric ML pipelines. More specifically, there is no guide to the necessary considerations for data-centric AI systems, making the agenda hard to engage with practically.

DC-Check solves this by providing an actionable checklist for all stages of the ML pipeline: Data, Training, Testing, and Deployment.

For practitioners & researchers: DC-Check is aimed at both practitioners and researchers. Each component of DC-Check includes a set of data-centric questions to guide developers in day-to-day development. We also suggest concrete data-centric tools and modeling approaches based on these considerations. In addition to the checklist that guides projects, we also include research opportunities necessary to advance the research area of data-centric AI.

Beyond a documentation tool: DC-Check supports practitioners and researchers in achieving greater transparency and accountability with regard to data-centric considerations for ML pipelines. We believe that this type of transparency and accountability provided by DC-Check can be useful to a range of stakeholders, including developers, researchers, policymakers, regulators, and organization decision-makers to understand the design considerations at each stage of the ML pipeline.

Our work so far

We have primarily focused on both ends of the pipeline: Data and Deployment.


Firstly, developing methods to quantify the quality of the input (training) data, as well as to improve it.

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data (NeurIPS 2022)

Nabeel Seedat, Jonathan Crabbé, Ioana Bica, Mihaela van der Schaar

High model performance, on average, can hide that models may systematically underperform on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity – this is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, thus making reliable predictions challenging.

To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and then subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets. We show that Data-IQ’s characterization of examples is most robust to variation across similarly performant (yet different) models, compared to baselines.

Since Data-IQ can be used with any ML model (including neural networks, gradient boosting etc.), this property ensures consistency of data characterization, while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable us to construct new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
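The stratification described above can be sketched in a few lines. This is a minimal illustration, not the Data-IQ implementation: it assumes we have already recorded, for each example, the model's probability for the true label at several training checkpoints, and that aleatoric uncertainty is estimated as the average of p(1 - p) over those checkpoints. The function name `data_iq_groups` and all threshold values are hypothetical choices made for this example.

```python
from statistics import mean


def data_iq_groups(true_label_probs, conf_low=0.25, conf_high=0.75,
                   aleatoric_cutoff=0.2):
    """Assign each example to Easy, Ambiguous or Hard.

    true_label_probs[i][e] is the probability the model gave to the true
    label of example i at checkpoint e. Thresholds are illustrative
    assumptions, not values taken from the paper.
    """
    groups = []
    for probs in true_label_probs:
        # Average confidence in the true label over training.
        confidence = mean(probs)
        # Average data (aleatoric) uncertainty over training.
        aleatoric = mean(p * (1.0 - p) for p in probs)
        if confidence >= conf_high and aleatoric < aleatoric_cutoff:
            groups.append("Easy")
        elif confidence <= conf_low and aleatoric < aleatoric_cutoff:
            groups.append("Hard")
        else:
            groups.append("Ambiguous")
    return groups


# Three examples tracked over three checkpoints (made-up numbers).
example_probs = [
    [0.90, 0.95, 0.97],  # consistently confident and correct
    [0.50, 0.55, 0.45],  # hovering around chance
    [0.10, 0.05, 0.08],  # consistently confident and wrong
]
print(data_iq_groups(example_probs))  # ['Easy', 'Ambiguous', 'Hard']
```

Because the function only needs per-checkpoint probabilities, the same sketch applies to any model that is trained iteratively, which mirrors the model-agnostic property noted above.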

Generalized Iterative Imputation with Automatic Model Selection

Daniel Jarrett*, Bogdan Cebere*, Tennison Liu, Alicia Curth, Mihaela van der Schaar
ICML 2022

Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: we propose HyperImpute, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
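To make the idea of iterative imputation with per-column model selection concrete, here is a toy sketch. It is not the HyperImpute library and does not use its API. The two candidate models (a column mean and a one-predictor least-squares line), the leave-one-out selection rule, and every function name are assumptions chosen to keep the example short and free of dependencies.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Candidate 1: ignore the predictor and always return the mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate 2: least-squares line y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var > 0 else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def loo_error(fit, xs, ys):
    """Leave-one-out mean squared error, used to compare candidates."""
    total = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (model(xs[i]) - ys[i]) ** 2
    return total / len(xs)


def impute(rows, n_iter=3):
    """Fill None entries, re-selecting a model for each column per sweep."""
    n_rows, n_cols = len(rows), len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    # Initial guess: the mean of each column's observed values.
    col_means = [mean(r[j] for r in rows if r[j] is not None)
                 for j in range(n_cols)]
    data = [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]
    for _ in range(n_iter):
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if not missing[i][j]]
            mis = [i for i in range(n_rows) if missing[i][j]]
            if not mis or len(obs) < 3:
                continue
            ys = [data[i][j] for i in obs]
            best = None  # (error, fit function, predictor column)
            for k in range(n_cols):
                if k == j:
                    continue
                xs = [data[i][k] for i in obs]
                for fit in (fit_mean, fit_line):
                    err = loo_error(fit, xs, ys)
                    if best is None or err < best[0]:
                        best = (err, fit, k)
            if best is None:
                continue  # a single-column table has no predictors
            _, fit, k = best
            model = fit([data[i][k] for i in obs], ys)
            for i in mis:
                data[i][j] = model(data[i][k])
    return data


# Second column is twice the first; one entry is missing.
table = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0], [10.0, 20.0]]
print(impute(table)[2])
```

On this toy table the line model beats the mean under leave-one-out error, so the missing entry is filled from the linear fit rather than left at the column mean. A realistic system would search over far richer learners and their hyperparameters, as the abstract describes.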


Secondly, analyzing the outputs of models: both identifying the samples whose predictions can be trusted with respect to a training set, and improving performance under data shift.

Data-SUITE: Data-centric identification of in-distribution incongruous examples (ICML 2022)

Nabeel Seedat, Jonathan Crabbé, Mihaela van der Schaar

Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle an understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a data-centric framework to identify these regions, independent of a task-specific model.

Data-SUITE leverages copula modeling, representation learning, and conformal prediction to build feature-wise confidence interval estimators based on a set of training instances. These estimators can be used to evaluate the congruence of test instances with respect to the training set, to answer two practically useful questions: (1) which test instances will be reliably predicted by a model trained with the training instances? and (2) can we identify incongruous regions of the feature space so that data owners understand the data’s limitations or guide future data collection?

We empirically validate Data-SUITE’s performance and coverage guarantees and demonstrate on cross-site medical data, biased data, and data with concept drift, that Data-SUITE best identifies ID regions where a downstream model may be reliable (independent of said model). We also illustrate how these identified regions can provide insights into datasets and highlight their limitations.
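The conformal prediction ingredient can be illustrated in isolation. The sketch below shows plain split conformal prediction with absolute residuals for a single feature. It is not the Data-SUITE pipeline, and it omits the copula modeling and representation learning steps entirely. It assumes some feature predictor has already produced estimates on a held-out calibration set. The function names and the numbers are hypothetical.

```python
import math


def conformal_halfwidth(calib_true, calib_pred, alpha=0.1):
    """Interval half-width from calibration residuals (split conformal).

    Returns q such that [pred - q, pred + q] covers a new exchangeable
    feature value with probability at least 1 - alpha.
    """
    scores = sorted(abs(t - p) for t, p in zip(calib_true, calib_pred))
    n = len(scores)
    # Finite-sample corrected rank of the residual to use as the width.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def is_congruent(observed, predicted, halfwidth):
    """A test value is congruent if it falls inside its interval."""
    return abs(observed - predicted) <= halfwidth


# Made-up calibration set: the predictor always outputs 0.0 and the
# observed feature values are 1.0 to 10.0.
calib_true = [float(v) for v in range(1, 11)]
calib_pred = [0.0] * 10
q = conformal_halfwidth(calib_true, calib_pred, alpha=0.2)

print(q)
print(is_congruent(5.0, 0.0, q), is_congruent(12.0, 0.0, q))
```

Repeating this per feature yields one interval per feature for every test instance, and instances whose observed values fall outside their intervals are the candidates for incongruous regions.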

Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift

Alex Chan, Ahmed M. Alaa, Zhaozhi Qian, Mihaela van der Schaar

ICML 2020

Modern neural networks have proven to be powerful function approximators, providing state-of-the-art performance in a multitude of applications. They however fall short in their ability to quantify confidence in their predictions — this is crucial in high-stakes applications that involve critical decision-making.

Bayesian neural networks (BNNs) aim at solving this problem by placing a prior distribution over the network’s parameters, thereby inducing a posterior distribution that encapsulates predictive uncertainty. While existing variants of BNNs based on Monte Carlo dropout produce reliable (albeit approximate) uncertainty estimates over in-distribution data, they tend to exhibit over-confidence in predictions made on target data whose feature distribution differs from the training data, i.e., the covariate shift setup.

In this paper, we develop an approximate Bayesian inference scheme based on posterior regularisation, wherein unlabelled target data are used as “pseudo-labels” of model confidence that are used to regularise the model’s loss on labelled source data. We show that this approach significantly improves the accuracy of uncertainty quantification on covariate-shifted data sets, with minimal modification to the underlying model architecture.

We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.

More resources

To learn more about data-centric AI and our work, check out our blog post and dedicated website.