van der Schaar Lab

What is Data-Centric AI?

The current paradigm in machine learning is model-centric AI. Data is treated as a fixed, static asset (e.g. tabular data in a .csv file, a database, a language corpus, or an image repository), often considered somewhat external to the machine learning process. The focus is then on model iteration, whether new model architectures, novel loss functions, or optimizers, with the goal of improving predictive performance on a fixed benchmark.

Of course this agenda is important, but we need more reliable ML systems. We believe that the ML community's current focus on models and architectures as a panacea is often a source of brittleness in real-world applications. In DC-Check, we outline why data work, often undervalued as merely operational, is key to unlocking reliable ML systems in the wild.

In data-centric AI, we seek to give data center stage. Data-centric AI views model or algorithmic refinement as less important (in certain settings, algorithmic development is even considered a solved problem) and instead aims to systematically improve the data used by ML systems.

In DC-Check, though, we go further and call for an expanded definition of data-centric AI, one in which a data-centric lens applies to end-to-end ML pipelines.

Definition

Data-centric AI encompasses methods and tools to systematically characterise, evaluate, and monitor the underlying data used to train and evaluate models. At the ML pipeline level, this means that the considerations at each stage should be informed in a data-driven manner.
We term this a data-centric lens. Since data is the fuel for any ML system, we should keep a sharp focus on the data; yet rather than ignoring the model, we should use data-driven insights as feedback to systematically improve it.

Would you like to learn more about data-centric AI? Have a look at these two papers from our lab:

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data (NeurIPS 2022)

Nabeel Seedat, Jonathan Crabbé, Ioana Bica, Mihaela van der Schaar

High average model performance can hide the fact that a model systematically underperforms on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity: this is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, making reliable prediction challenging.

To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets. We show that, compared to baselines, Data-IQ's characterization of examples is the most robust to variation across similarly performant (yet different) models.
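To make the idea concrete, here is a minimal sketch of Data-IQ-style stratification, assuming you have recorded the model's predicted probability of each example's true label at a number of training checkpoints. The function name and the thresholds are illustrative assumptions for this sketch, not the paper's exact values or its released implementation.

import numpy as np

def data_iq_stratify(true_label_probs, conf_upper=0.75, conf_lower=0.25, unc_thresh=0.2):
    # true_label_probs: array of shape (n_checkpoints, n_examples) holding the
    # model's predicted probability of each example's true label, recorded at
    # training checkpoints. (Hypothetical input format for this sketch.)
    confidence = true_label_probs.mean(axis=0)  # average confidence over training
    aleatoric = (true_label_probs * (1.0 - true_label_probs)).mean(axis=0)  # average p(1 - p): data uncertainty

    groups = np.full(confidence.shape, "Ambiguous", dtype=object)
    groups[(confidence >= conf_upper) & (aleatoric < unc_thresh)] = "Easy"
    groups[(confidence <= conf_lower) & (aleatoric < unc_thresh)] = "Hard"
    return groups

Examples that are confidently and consistently correct land in Easy, confidently incorrect ones in Hard, and examples whose aleatoric uncertainty stays high throughout training fall into Ambiguous, the subgroup the paper highlights as most consequential for generalization.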

Data-IQ can be used with any ML model (including neural networks, gradient boosting, etc.); this model-agnostic property ensures consistent data characterization while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.

Data-SUITE: Data-centric identification of in-distribution incongruous examples (ICML 2022)

Nabeel Seedat, Jonathan Crabbé, Mihaela van der Schaar

Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle the understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a data-centric framework to identify these regions, independent of a task-specific model.

Data-SUITE leverages copula modeling, representation learning, and conformal prediction to build feature-wise confidence interval estimators based on a set of training instances. These estimators can be used to evaluate the congruence of test instances with respect to the training set, to answer two practically useful questions: (1) which test instances will be reliably predicted by a model trained on the training instances? and (2) can we identify incongruous regions of the feature space so that data owners understand the data's limitations or guide future data collection?
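To illustrate the congruence check, here is a minimal sketch in the spirit of Data-SUITE that builds feature-wise intervals with split-conformal calibration only; the full framework's copula modeling and representation-learning steps are omitted, and the function name and the choice of Ridge regressors are assumptions made for this sketch.

import numpy as np
from sklearn.linear_model import Ridge

def featurewise_conformal_flags(X_train, X_calib, X_test, alpha=0.1):
    # For each feature j: regress feature j on the remaining features,
    # calibrate the absolute residuals on held-out data (split conformal),
    # and flag test values that fall outside the resulting interval.
    n_features = X_train.shape[1]
    flags = np.zeros(X_test.shape, dtype=bool)

    for j in range(n_features):
        rest = [k for k in range(n_features) if k != j]
        model = Ridge().fit(X_train[:, rest], X_train[:, j])

        calib_resid = np.abs(X_calib[:, j] - model.predict(X_calib[:, rest]))
        q = np.quantile(calib_resid, 1 - alpha)  # interval half-width at level 1 - alpha

        pred = model.predict(X_test[:, rest])
        flags[:, j] = np.abs(X_test[:, j] - pred) > q

    return flags  # True where a test feature value is incongruous with the training data

A test instance with many flagged features lies in an incongruous region of the feature space, i.e. a region where a downstream model trained on these data is less likely to be reliable.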

We empirically validate Data-SUITE's performance and coverage guarantees and demonstrate, on cross-site medical data, biased data, and data with concept drift, that Data-SUITE best identifies ID regions where a downstream model may be reliable (independent of said model). We also illustrate how these identified regions can provide insights into datasets and highlight their limitations.