van der Schaar Lab

DC-Check: Data-Centric AI Checklist

DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems.

DC-Check is an actionable checklist-style framework to elicit data-centric considerations through the different stages of the ML pipeline: Data, Training, Testing, and Deployment. This data-centric lens on ML development aims to promote thoughtfulness and transparency before a system is built.

The DC-Check paper is a reference guide for both practitioners and researchers, highlighting specific data-centric AI challenges and research opportunities.

A data-centric lens on ML

AI-powered applications are becoming ever more widespread across e-commerce, finance, manufacturing, and medicine. The question is how to go from the ‘making an ML model work’ phase to the ‘making real-world ML systems’ phase. Thinking through all the considerations needed to develop robust and reliable ML systems can be difficult. We’re aiming to change that with DC-Check and a data-centric AI lens.

DC-Check: the actionable Data-centric AI guide

Data-centric AI is important, but there is currently no standardised process for communicating the design of data-centric ML pipelines. More specifically, there is no guide to the necessary considerations for data-centric AI systems, which makes the agenda hard to engage with in practice. DC-Check addresses this by providing an actionable checklist for all stages of the ML pipeline.

For practitioners & researchers

DC-Check is aimed at both practitioners and researchers. Each component of DC-Check includes a set of data-centric questions to guide developers in day-to-day development. We also suggest concrete data-centric tools and modelling approaches based on these considerations. In addition to the checklist that guides projects, we also include research opportunities necessary to advance the research area of data-centric AI.

Beyond a documentation tool

DC-Check supports practitioners and researchers in achieving greater transparency and accountability with regard to data-centric considerations for ML pipelines. We believe this transparency and accountability can help a range of stakeholders, including developers, researchers, policymakers, regulators, and organisational decision-makers, understand the design considerations at each stage of the ML pipeline.


Data: Considerations to improve the quality of data used for model training, such as proactive data selection, data curation, and data cleaning.

Training: Considerations based on understanding the data to improve model training, such as data-informed model design, domain adaptation, and robust training.

Testing: Considerations around novel data-centric methods to test ML models, such as informed data splits, targeted metrics, stress tests, and evaluation on subgroups.

Deployment: Considerations based on data post-deployment, such as data and model monitoring, model adaptation and retraining, and uncertainty quantification.
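
As an illustration of one testing consideration named above, evaluation on subgroups, here is a minimal Python sketch. It is not part of DC-Check itself: the function name accuracy_by_subgroup, the group labels "A" and "B", and the toy labels are all hypothetical and chosen only to show why a single aggregate metric can hide poor performance on a subgroup.

```python
from collections import defaultdict


def accuracy_by_subgroup(y_true, y_pred, groups):
    """Return (overall accuracy, {subgroup: accuracy}) for paired labels.

    Hypothetical helper for illustration; not an API from DC-Check.
    """
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred and groups must have the same length")

    correct = defaultdict(int)  # correct predictions per subgroup
    total = defaultdict(int)    # examples per subgroup
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)

    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / len(y_true) if y_true else float("nan")
    return overall, per_group


# Toy, made-up data: subgroup "B" is small and poorly served by the model.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]

overall, per_group = accuracy_by_subgroup(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.2f}")
for name, acc in sorted(per_group.items()):
    print(f"subgroup {name}: {acc:.2f}")
```

On this toy data the overall accuracy is 0.75, while subgroup A is classified perfectly and subgroup B is correct on only one of its three examples. Reporting the per-subgroup breakdown alongside the aggregate is the kind of check the testing stage asks developers to consider.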

Using DC-Check? Let us know, we’d love to hear from you!