van der Schaar Lab

DC-Check, a new framework to practically engage with Data-centric AI

DC-Check is an actionable checklist-style framework for practically engaging with data-centric AI. It provides the first standardized way to communicate the design of, and the necessary considerations for, data-centric AI/ML pipelines.

A Data-Centric AI checklist to guide the development of reliable AI systems

AI-powered applications are becoming increasingly widespread across many areas, including e-commerce, finance, manufacturing, and medicine. However, developing robust and reliable ML systems requires many considerations that are often overlooked.

We’re aiming to change that with DC-Check and a data-centric AI lens.

We believe in empowering researchers and practitioners with systematic tools and frameworks to develop reliable and robust AI systems which can make a real-world impact in medicine and beyond.

Specifically, we believe that a paradigm shift with data-centric AI is key to unlocking this impact, with data given center stage as compared to model-centric approaches to development.

DC-Check puts a data-centric lens on AI

Figure 1: Overview of DC-Check across the different stages of the ML pipeline: Data, Training, Testing, Deployment

DC-Check is an actionable data-centric AI checklist to guide the development of reliable machine learning systems.

Data-centric AI has been raised as an important concept to improve AI systems [1,2,3,4]. However, there is currently no standardized process to communicate the design of data-centric ML pipelines, nor a guide to the necessary considerations for data-centric AI systems, making the agenda hard to practically engage with.

DC-Check solves this by providing an actionable checklist for each stage of the ML pipeline: Data, Training, Testing, and Deployment.
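As a rough illustration of what "a checklist for each stage" could look like inside a project, here is a minimal Python sketch. It is not the DC-Check template or web tool: the class names, the `make_checklist` helper and the placeholder question strings are all hypothetical and are not taken from the paper. Only the four stage names come from DC-Check.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The four pipeline stages named by DC-Check.
STAGES = ("Data", "Training", "Testing", "Deployment")


@dataclass
class ChecklistItem:
    question: str      # placeholder wording, NOT an actual DC-Check question
    answer: str = ""   # free-text response filled in by the project team

    @property
    def answered(self) -> bool:
        return bool(self.answer.strip())


@dataclass
class StageChecklist:
    stage: str
    items: list[ChecklistItem] = field(default_factory=list)

    def unanswered(self) -> list[str]:
        """Return the questions for this stage that still have no answer."""
        return [item.question for item in self.items if not item.answered]


def make_checklist(questions_by_stage: dict[str, list[str]]) -> dict[str, StageChecklist]:
    """Build one StageChecklist per pipeline stage (hypothetical helper)."""
    unknown = set(questions_by_stage) - set(STAGES)
    if unknown:
        raise ValueError(f"Unknown stage(s): {sorted(unknown)}")
    return {
        stage: StageChecklist(
            stage, [ChecklistItem(q) for q in questions_by_stage.get(stage, [])]
        )
        for stage in STAGES
    }


# Usage: the questions are placeholders to be replaced with real checklist items.
checklist = make_checklist({
    "Data": ["<placeholder: data question>"],
    "Deployment": ["<placeholder: deployment question>"],
})
checklist["Data"].items[0].answer = "Documented in the project README."

for stage, stage_checklist in checklist.items():
    print(stage, "open questions:", stage_checklist.unanswered())
```

Keeping the answers next to the questions, grouped by stage, is one simple way to make the considerations for a pipeline easy to review and share. The actual questions should come from the DC-Check paper and companion website.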

For BOTH practitioners and researchers

DC-Check is aimed at both industry practitioners and researchers. Each component of DC-Check includes a set of data-centric questions and considerations to guide developers in day-to-day development.

We also suggest concrete data-centric tools and modeling approaches based on these considerations.

In addition to the checklist that guides projects, we also include research opportunities necessary to advance the research area of data-centric AI.

Beyond a documentation tool

This data-centric lens on development of ML systems aims to promote thoughtfulness and transparency prior to system development. We believe that this type of transparency and accountability provided by DC-Check can be useful to a range of stakeholders, including developers, researchers, policymakers, regulators, and organization decision-makers to understand the design considerations at each stage of the ML pipeline.

Engage with DC-Check!

Our inaugural version of DC-Check comprises an actionable checklist to get started with data-centric AI in your projects, with the paper serving as the go-to reference manual when using the checklist. The checklist is provided both as an offline template and as an online web tool. Example case studies are also provided as a reference for how to use DC-Check.

To easily engage with and use DC-Check and associated resources, we provide a DC-Check companion website (https://www.vanderschaar-lab.com/dc-check/).

DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems

Nabeel Seedat, Fergus Imrie, Mihaela van der Schaar

While there have been a number of remarkable breakthroughs in machine learning (ML), much of the focus has been placed on model development. However, to truly realize the potential of machine learning in real-world settings, additional aspects must be considered across the ML pipeline.

Data-centric AI is emerging as a unifying paradigm that could enable such reliable end-to-end pipelines. However, this remains a nascent area with no standardized framework to guide practitioners to the necessary data-centric considerations or to communicate the design of data-centric driven ML systems.

To address this gap, we propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations at different stages of the ML pipeline: Data, Training, Testing, and Deployment.

This data-centric lens on development aims to promote thoughtfulness and transparency prior to system development. Additionally, we highlight specific data-centric AI challenges and research opportunities.

DC-Check is aimed at both practitioners and researchers to guide day-to-day development. As such, to easily engage with and use DC-Check and associated resources, we provide a DC-Check companion website (https://www.vanderschaar-lab.com/dc-check/). The website will also serve as an updated resource as methods and tooling evolve over time.

Want to learn more about data-centric AI?

Have a look at these two papers from our lab:

N. Seedat, J. Crabbé, and M. van der Schaar. “Data-SUITE: Data-centric identification of in-distribution incongruous examples.” International Conference on Machine Learning (ICML), 2022.

N. Seedat, J. Crabbé, I. Bica, and M. van der Schaar. “Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data.” Advances in Neural Information Processing Systems (NeurIPS), 2022.

References

[1] Andrew Ng. 2021. MLOps: from model-centric to data-centric AI. Available online at https://www.deeplearning.ai/wp-content/uploads/2021/06/MLOps-From-Model-centric-to-Data-centricAI.pdf (2021).

[2] Weixin Liang, Girmaw Abebe Tadesse, Daniel Ho, Li Fei-Fei, Matei Zaharia, Ce Zhang, and James Zou. 2022. Advances, challenges and opportunities in creating data for trustworthy AI. Nature Machine Intelligence 4, 8 (2022), 669–677.

[3] Neoklis Polyzotis and Matei Zaharia. 2021. What can Data-Centric AI Learn from Data and ML Engineering? arXiv preprint arXiv:2112.06439 (2021).

[4] Nabeel Seedat, Jonathan Crabbé, and Mihaela van der Schaar. 2022. Data-SUITE: Data-centric identification of in-distribution incongruous examples. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 19467–19496.

Nabeel Seedat

Before joining the van der Schaar Lab, Nabeel received a merit scholarship for a master’s degree at Cornell University, researching Bayesian deep learning and uncertainty estimation for high-stakes applications. In addition, he holds a master’s degree from the University of the Witwatersrand (South Africa), where he was awarded a National Research Foundation grant for his work applying signal processing and machine learning to Parkinson’s disease diagnostics in low-resource settings.

Professionally, Nabeel has worked as a machine learning engineer in the United States and South Africa. The computer vision and natural language processing models he worked on are currently deployed and serving millions of customers on a daily basis.

Nabeel is keenly aware that taking methods from the lab to the bedside “requires a unique focus beyond just high-performance predictive models; it requires the development of a toolkit of methods for transfer learning across domains and locations, learning on smaller datasets, understanding model biases and quantifying model reliability and uncertainty are fundamentally needed to bridge this divide.”

Nabeel’s research is supported by funding from the Cystic Fibrosis Trust.