van der Schaar Lab

DC-Check: about the Tool

An actionable checklist with a data-centric lens across the ML pipeline.

Note

We believe that a data-centric lens is key to unlocking reliable ML systems in the wild. DC-Check is an actionable checklist to help make this a reality.

Data-centric AI has emerged as an important concept for improving ML systems [1,2,3,4]. However, there is currently no standardised process for communicating the design of data-centric ML pipelines.

Furthermore, there is no guide to the necessary considerations for data-centric AI systems, making the agenda hard to practically engage with.

DC-Check addresses this by providing an actionable checklist that covers every stage of the ML pipeline.


Who is DC-Check for?

DC-Check is aimed at both practitioners and researchers.

  • Both: Each component of DC-Check includes a set of data-centric questions to guide developers in day-to-day development.
  • Practitioners: We suggest concrete data-centric tools and modeling approaches based on these considerations.
  • Researchers: We include research opportunities necessary to advance the research area of data-centric AI.

DC-Check isn’t just a documentation tool

DC-Check goes beyond a documentation tool, supporting practitioners and researchers in achieving greater transparency and accountability around data-centric considerations for ML pipelines. We believe this transparency and accountability can help a range of stakeholders, including developers, researchers, policymakers, regulators, and organizational decision-makers, understand the design considerations at each stage of the ML pipeline.


DC-Check covers the end-to-end ML pipeline

DC-Check is an actionable checklist that advocates for a data-centric lens encompassing the following stages of the ML pipeline:

Data: considerations to improve the quality of data used for model training.

Training: considerations based on understanding the data that affect model training.

Testing: considerations around data-centric approaches to test ML models.

Deployment: considerations related to post-deployment data.
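To make the four-stage structure concrete, a checklist like this can be represented as a simple mapping from pipeline stage to guiding questions. This is a minimal, hypothetical sketch: the stage names mirror DC-Check, but the questions below are illustrative placeholders, not the official DC-Check checklist items.

```python
# Hypothetical sketch of a stage-by-stage pipeline checklist.
# Stage names follow DC-Check; the questions are illustrative placeholders only.
PIPELINE_CHECKLIST = {
    "Data": ["Are data sources and collection methods documented?"],
    "Training": ["Have data characteristics informed the choice of model?"],
    "Testing": ["Are models evaluated on informative data subgroups?"],
    "Deployment": ["Is post-deployment data monitored for distribution shift?"],
}


def unanswered(responses):
    """Return (stage, question) pairs that have no recorded response yet."""
    return [
        (stage, question)
        for stage, questions in PIPELINE_CHECKLIST.items()
        for question in questions
        if not responses.get(question)
    ]
```

A team could then record answers as free-text responses keyed by question and call `unanswered(responses)` to see which data-centric considerations still need attention before deployment.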


References

[1] Andrew Ng. 2021. MLOps: From model-centric to data-centric AI. Online at https://www.deeplearning.ai/wp-content/uploads/2021/06/MLOps-From-Model-centric-to-Data-centricAI.pdf (2021).

[2] Weixin Liang, Girmaw Abebe Tadesse, Daniel Ho, Li Fei-Fei, Matei Zaharia, Ce Zhang, and James Zou. 2022. Advances, challenges and opportunities in creating data for trustworthy AI. Nature Machine Intelligence 4, 8 (2022), 669–677.

[3] Neoklis Polyzotis and Matei Zaharia. 2021. What can Data-Centric AI Learn from Data and ML Engineering? arXiv preprint arXiv:2112.06439 (2021).

[4] Nabeel Seedat, Jonathan Crabbé, and Mihaela van der Schaar. 2022. Data-SUITE: Data-centric identification of in-distribution incongruous examples. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 19467–19496.