van der Schaar Lab

What is Data-Centric AI?

The current paradigm in machine learning is model-centric AI. Data is treated as a fixed, static asset (e.g. tabular data in a .csv file, a database, a language corpus, or an image repository) and is often considered somewhat external to the machine learning process. The focus is then on model iteration, whether through new model architectures, novel loss functions, or optimizers, with the goal of improving predictive performance on a fixed benchmark.

This agenda is important, but we also need more reliable ML systems. We believe that the ML community's current focus on models and architectures as a panacea is often a source of brittleness in real-world applications. In DC-Check, we outline why data work, often undervalued as merely operational, is key to unlocking reliable ML systems in the wild.

In data-centric AI, we seek to give data center stage. Data-centric AI views model or algorithmic refinement as less important (in certain settings, algorithmic development is even considered a solved problem) and instead seeks to systematically improve the data used by ML systems.

In DC-Check, though, we go further and call for an expanded definition of data-centric AI, so that a data-centric lens applies to end-to-end pipelines.

Definition

Data-centric AI encompasses methods and tools to systematically characterise, evaluate, and monitor the underlying data used to train and evaluate models. At the ML pipeline level, this means that the considerations at each stage should be informed in a data-driven manner.
We term this a data-centric lens. Since data is the fuel for any ML system, we should keep a sharp focus on the data. Yet rather than ignoring the model, we should leverage data-driven insights as feedback to systematically improve it.
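
To make the three verbs in the definition concrete, here is a minimal sketch in Python of what characterising, evaluating, and monitoring data might look like. It is an illustration under stated assumptions, not the method used in DC-Check, Data-IQ, or Data-SUITE: the function names (missing_fraction, subgroup_accuracy, mean_shift), the row format (a list of dictionaries), and the column names are all hypothetical.

```python
from statistics import mean, pstdev


def missing_fraction(rows, column):
    """Characterise: share of rows where `column` is missing (None)."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def subgroup_accuracy(rows, group_key, label_key="label", pred_key="pred"):
    """Evaluate: accuracy per subgroup rather than one aggregate number."""
    totals, correct = {}, {}
    for r in rows:
        g = r[group_key]
        totals[g] = totals.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + int(r[label_key] == r[pred_key])
    return {g: correct[g] / totals[g] for g in totals}


def mean_shift(reference, incoming):
    """Monitor: shift of the incoming mean, in reference standard deviations."""
    spread = pstdev(reference)
    if spread == 0:
        return 0.0 if mean(incoming) == mean(reference) else float("inf")
    return abs(mean(incoming) - mean(reference)) / spread


# Hypothetical toy records; "site" plays the role of a subgroup.
rows = [
    {"age": 34, "site": "A", "label": 1, "pred": 1},
    {"age": None, "site": "A", "label": 0, "pred": 0},
    {"age": 51, "site": "B", "label": 1, "pred": 0},
    {"age": 47, "site": "B", "label": 0, "pred": 0},
]
```

In this sketch, a large missing fraction, a subgroup whose accuracy lags the others, or a large mean shift in incoming data would each be a data-driven signal to feed back into the pipeline, for example by collecting more data for that subgroup or revisiting the model.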

Would you like to learn more about data-centric AI? Have a look at these two papers from our lab:

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data (NeurIPS 2022)

Nabeel Seedat, Jonathan Crabbé, Ioana Bica, Mihaela van der Schaar

Data-SUITE: Data-centric identification of in-distribution incongruous examples (ICML 2022)

Nabeel Seedat, Jonathan Crabbé, Mihaela van der Schaar
