DC-Check is an actionable checklist-style framework for practically engaging with data-centric AI. It provides the first standardized framework to communicate the design and necessary considerations for data-centric AI/ML pipelines.
A Data-Centric AI checklist to guide the development of reliable AI systems
AI-powered applications are becoming increasingly widespread across many areas, including e-commerce, finance, manufacturing, and medicine. However, successfully developing robust and reliable ML systems requires many considerations that are often overlooked.
We’re aiming to change that with DC-Check and a data-centric AI lens.
We believe in empowering researchers and practitioners with systematic tools and frameworks to develop reliable and robust AI systems which can make a real-world impact in medicine and beyond.
Specifically, we believe that a paradigm shift with data-centric AI is key to unlocking this impact, with data given center stage as compared to model-centric approaches to development.
DC-Check puts a data-centric lens on AI
Figure 1: Overview of DC-Check across the different stages of the ML pipeline: Data, Training, Testing, Deployment
DC-Check is an actionable data-centric AI checklist to guide the development of reliable machine learning systems.
Data-centric AI has been raised as an important concept to improve AI systems [1,2,3,4]. However, there is currently no standardized process to communicate the design of data-centric ML pipelines, nor a guide to the necessary considerations for data-centric AI systems, making the agenda hard to practically engage with.
DC-Check solves this by providing an actionable checklist for each stage of the ML pipeline: Data, Training, Testing, and Deployment.
For BOTH practitioners and researchers
DC-Check is aimed at both industry practitioners and researchers. Each component of DC-Check includes a set of data-centric questions and considerations to guide developers in day-to-day development.
We also suggest concrete data-centric tools and modeling approaches based on these considerations.
In addition to the checklist that guides projects, we also include research opportunities necessary to advance the research area of data-centric AI.
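As a rough illustration of how a checklist organized by pipeline stage could be tracked in a project, here is a minimal sketch. The four stage names come from DC-Check; everything else (the `ChecklistItem` and `Checklist` classes, the `unanswered_by_stage` helper, and the placeholder question text) is hypothetical and is not the official template or web tool.

```python
from dataclasses import dataclass, field

# The four pipeline stages named by DC-Check.
STAGES = ("Data", "Training", "Testing", "Deployment")


@dataclass
class ChecklistItem:
    """One question tied to a pipeline stage (hypothetical structure)."""
    stage: str
    question: str      # placeholder wording, not taken from the paper
    answer: str = ""   # free-text response; empty means unanswered

    def __post_init__(self):
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")


@dataclass
class Checklist:
    """A flat collection of checklist items (hypothetical structure)."""
    items: list = field(default_factory=list)

    def add(self, stage, question):
        item = ChecklistItem(stage, question)
        self.items.append(item)
        return item

    def unanswered_by_stage(self):
        """Count unanswered items per stage, in pipeline order."""
        counts = {stage: 0 for stage in STAGES}
        for item in self.items:
            if not item.answer.strip():
                counts[item.stage] += 1
        return counts


checklist = Checklist()
q1 = checklist.add("Data", "<placeholder question about data curation>")
q2 = checklist.add("Testing", "<placeholder question about evaluation data>")
q1.answer = "Documented in the project data sheet."

remaining = checklist.unanswered_by_stage()
print(remaining)
```

Keeping the stage on each item, rather than nesting items under stages, makes it straightforward to report which parts of the pipeline still have open questions.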
Beyond a documentation tool
This data-centric lens on development of ML systems aims to promote thoughtfulness and transparency prior to system development. We believe that this type of transparency and accountability provided by DC-Check can be useful to a range of stakeholders, including developers, researchers, policymakers, regulators, and organization decision-makers to understand the design considerations at each stage of the ML pipeline.
Engage with DC-Check!
Our inaugural version of DC-Check comprises an actionable DC-Checklist to get started with data-centric AI in your projects, with the paper serving as the go-to reference manual when using the checklist. The checklist is provided both as an offline template and as an online web tool. Example case studies are also provided as a reference for how to use DC-Check.
To make it easy to engage with and use DC-Check and its associated resources, we provide a DC-Check companion website.
DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems
While there have been a number of remarkable breakthroughs in machine learning (ML), much of the focus has been placed on model development. However, to truly realize the potential of machine learning in real-world settings, additional aspects must be considered across the ML pipeline.
Data-centric AI is emerging as a unifying paradigm that could enable such reliable end-to-end pipelines. However, this remains a nascent area with no standardized framework to guide practitioners to the necessary data-centric considerations or to communicate the design of data-centric driven ML systems.
To address this gap, we propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations at different stages of the ML pipeline: Data, Training, Testing, and Deployment.
This data-centric lens on development aims to promote thoughtfulness and transparency prior to system development. Additionally, we highlight specific data-centric AI challenges and research opportunities.
DC-Check is aimed at both practitioners and researchers to guide day-to-day development. As such, to easily engage with and use DC-Check and associated resources, we provide a DC-Check companion website (https://www.vanderschaar-lab.com/dc-check/). The website will also serve as an updated resource as methods and tooling evolve over time.
Want to learn more about data-centric AI?
Check out this link or have a look at these two papers from our lab:
N. Seedat, J. Crabbé, and M. van der Schaar. “Data-SUITE: Data-centric identification of in-distribution incongruous examples.” International Conference on Machine Learning (ICML), 2022.
N. Seedat, J. Crabbé, I. Bica, and M. van der Schaar. “Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data.” Advances in Neural Information Processing Systems (NeurIPS), 2022.