van der Schaar Lab

Prescription for Perfect Data: Four Machine Learning Antidotes for Improving Clinical Data

This piece is part of our wider agenda to advance a Reality-Centric AI approach in the machine learning community. Learn more about this and especially the connection of Reality-Centric AI to real-world data (the 2nd Pillar) here.

Clinical data is subject to an array of challenges that affect its usability:

Data quality: data may be noisy, corrupted, or labels may be ambiguous or mislabelled

Biased data: collection process introduces bias or specific groups have coverage imbalances

Missing data: specific features are missing – missing at random vs informatively missing

Limited data: in numerous clinical scenarios only a small amount of data (possibly unlabelled or partially labelled) will be available

All methods (not only ML) suffer from these challenges!

These factors directly influence the performance and robustness of machine learning systems. Data with imperfections and limitations can lead to suboptimal model performance, which has serious consequences in the clinical domain.

Four antidotes for improving clinical data

  1. Data-centric AI/ML
  2. Synthetic data
  3. Self- & semi-supervised learning
  4. Using existing/expert knowledge

Antidote 1: Data-centric AI/ML

Data-centric AI aims to systematically improve the data used by ML systems. However, we go a step further and define data-centric AI/ML so that it encompasses end-to-end ML pipelines:

Data-centric AI/ML encompasses methods and tools to systematically characterise, evaluate, and generate the underlying data used to train and evaluate models.

This means that, at the ML pipeline level, the considerations at each stage should be informed in a data-driven manner.


To make the data-centric lens accessible and usable for clinicians, we introduced DC-Check: an actionable checklist-style framework to elicit data-centric considerations through the different stages of the ML pipeline: Data, Training, Testing, and Deployment.

DC-Check and the related models and tools (as mentioned in the graphic above) offer new considerations for improving the quality of data used for model training. Training can then build on a better understanding of the data, enabling informed model design, domain adaptation, and robust training. DC-Check also includes novel data-centric methods for testing ML models, such as informed data splits, targeted metrics, stress tests, and evaluation on subgroups, together with deployment considerations: data and model monitoring, model adaptation and retraining, and uncertainty quantification.

For more information on data-centric AI and helpful resources, take a look at our dedicated research pillar and the DC-Check tool.

Data Imputation: An essential yet overlooked problem

Missing data is a problem that is often overlooked, especially by ML researchers who assume access to complete input datasets when training their models. A plethora of methods exists for imputing missing values in a dataset. Our lab has created a package, called HyperImpute, that selects the best method for you.
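To make the idea of imputation concrete, here is a minimal sketch using scikit-learn's generic iterative imputer. This is illustrative only and is not the HyperImpute API; HyperImpute's contribution is to automatically select among many such imputation methods for a given dataset.

```python
# Illustrative only: a generic imputation baseline with scikit-learn,
# not the HyperImpute API (which selects among many such methods automatically).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.5 * X[:, 0] + 0.1 * rng.normal(size=200)  # correlated feature

# Knock out 20% of entries completely at random
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

# Iterative (round-robin regression) imputation: each feature with missing
# values is modelled as a function of the others
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X_missing)
```

Because feature 1 is correlated with feature 0, an iterative imputer can exploit that structure, whereas simple mean imputation cannot; choosing the right method per dataset is exactly the problem HyperImpute addresses.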

For more information on data imputation, have a look at our Big Idea piece on the topic.

Antidote 2: Creating better-than-real data using synthetic data

Our lab has been at the forefront of research into synthetic data since 2017. We have created a research pillar as a comprehensive introduction for a variety of audiences. There, we outline the importance of synthetic data, explore and summarise recent cutting-edge approaches and methods, and link to a wide range of additional resources.

You can find the research pillar here. Furthermore, please have a look at our Revolutionizing Healthcare session on the topic.

Our lab is proud to have run comprehensive tutorials on synthetic data at ICML 2021 and AAAI 2023.

You can find our ICML tutorial “Synthetic Healthcare Data Generation and Assessment: Challenges, Methods, and Impact on Machine Learning” here.

All information about our AAAI tutorial, “Innovative Uses of Synthetic Data”, which focuses on fixing the issues with real data, can be found here.

Synthetic data is a unified way to address various issues with real data

Issues with real data:

  • Private and sensitive
  • Contains pre-existing bias
  • Small sample size
  • Collected from different domains

Corresponding synthetic data capabilities:

  • Creating privacy-preserving synthetic data
  • Creating de-biased synthetic data
  • Augmenting with synthetic data
  • Transfer and domain adaptation with synthetic data
  • Forward-looking data and future scenarios

To facilitate the use of synthetic data in an accessible manner, we introduced Synthcity, a single access point for cutting-edge methods.

Synthcity is the most comprehensive open-source synthetic data generation and evaluation library with a focus on tabular clinical data.

Find out more about Synthcity, the original publications, and how to access the library here.
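To illustrate the fit-then-sample workflow that synthetic data generators follow, here is a deliberately naive sketch that fits a multivariate Gaussian to "real" tabular data and samples from it. This is not Synthcity code; libraries like Synthcity provide far stronger generators (GANs, VAEs, diffusion models) plus privacy and fidelity evaluation, but the basic contract is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "real" tabular data: two correlated clinical-style features
# (e.g. systolic and diastolic blood pressure)
real = rng.multivariate_normal([120.0, 80.0], [[100.0, 60.0], [60.0, 64.0]], size=500)

# Deliberately simple generator: fit a multivariate Gaussian, then sample.
# Real generators replace this fit/sample pair with a learned deep model.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=500)

# Minimal fidelity check: synthetic marginals should track the real ones
marginal_gap = np.abs(synthetic.mean(axis=0) - mean)
```

Evaluation is as important as generation: a serious workflow would also check higher-order statistics, downstream model utility, and privacy leakage, which is what Synthcity's evaluation suite is for.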

Antidote 3: Self- & semi-supervised learning

Self-supervised, semi-supervised, and multi-view learning are valuable ways to deal with the lack of labelled data.

We have created a dedicated research pillar on the topic that you can find here.

Self- and semi-supervised learning for tabular data

VIME (NeurIPS 2020)

First self- and semi-supervised method for tabular data

Self-supervised pretext tasks:

  • Feature vector estimation
  • Mask vector estimation

Masking process can be used for semi-supervised learning
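The corruption step behind VIME's two pretext tasks can be sketched in a few lines of numpy. Following the paper's setup, a Bernoulli mask selects entries to corrupt, and corrupted entries are resampled from each feature's empirical marginal (here, by shuffling within columns); the encoder is then trained to recover both the mask and the original features. The network heads themselves are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # a small batch of unlabelled tabular data

# 1. Sample a Bernoulli mask vector per instance
p_mask = 0.3
m = (rng.random(X.shape) < p_mask).astype(float)

# 2. Build corrupted values by shuffling each column, i.e. resampling each
#    feature from its empirical marginal distribution
X_bar = np.stack([rng.permutation(X[:, j]) for j in range(X.shape[1])], axis=1)

# 3. Corrupted input: keep unmasked entries, replace masked ones
X_tilde = (1 - m) * X + m * X_bar

# Pretext targets for the encoder trained on X_tilde:
#   mask vector estimation    -> predict m
#   feature vector estimation -> reconstruct X
```

Because the corrupted values still come from realistic per-feature marginals, distinguishing them from the originals forces the encoder to learn the correlation structure between features.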

SEFS (ICLR 2022)

Improvements to masking process

→ Independent vs. Correlated masking

Extension to feature selection

Self-Supervised Learning for Conformal Prediction (AISTATS 2023)

Conformal prediction is a powerful tool for uncertainty quantification, establishing valid prediction intervals with finite-sample guarantees. To produce valid intervals that are also adaptive to the difficulty of each instance, a common approach is to compute normalised non-conformity scores on a separate calibration set. In this recent AISTATS paper, we investigate how unlabelled data and self-supervised pretext tasks can improve the quality of conformal regressors, specifically by improving the adaptability of conformal intervals. We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate non-conformity scores. Once again, we use whatever data is available to improve ML models.
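A minimal numpy sketch of split conformal regression with normalised scores shows the mechanism the paper builds on. Here `f` is a stand-in fitted predictor and `s` a stand-in per-instance difficulty estimate; in SSCP, the difficulty signal would additionally draw on the self-supervised auxiliary model's error, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a fitted point predictor and a difficulty estimator
def f(x):
    return 2.0 * x

def s(x):
    return 1.0 + np.abs(x)  # larger -> "harder" instance (assumed, for illustration)

# Calibration data whose noise actually scales with the difficulty estimate
x_cal = rng.normal(size=500)
y_cal = 2.0 * x_cal + (1.0 + np.abs(x_cal)) * rng.normal(size=500)

# Normalised non-conformity scores on the held-out calibration set
scores = np.abs(y_cal - f(x_cal)) / s(x_cal)

# Finite-sample-corrected quantile for miscoverage level alpha
alpha = 0.1
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Adaptive interval for a new point: wide where s is large, narrow where small
x_new = 1.5
lo, hi = f(x_new) - q * s(x_new), f(x_new) + q * s(x_new)
```

The guarantee (roughly 90% coverage at alpha = 0.1) holds for any choice of `s`; what a good difficulty estimate buys is adaptivity, with intervals that widen only on genuinely hard instances, and that is where the self-supervised signal helps.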

You can read more about SSCP in our AISTATS 2023 paper “Improving Adaptive Conformal Prediction using Self-Supervised Learning” here.

Antidote 4: Use existing/expert knowledge

What do we do if we do not have enough useable data for our ML tools? One solution is to include existing/expert knowledge. However, such knowledge-based models can be incorrect or incomplete, can involve complex dynamics and high dimensionality, or may be only partially observable.

We propose a novel hybrid modelling framework, the Latent Hybridisation Model, that embeds a given pharmacological model (a collection of expert variables and the ODEs that describe the evolution of these variables) into a larger latent variable ML model (a system of Neural ODEs).
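The coupled structure can be sketched with a toy Euler integration: an expert pharmacological ODE evolves its own variable, while a latent state evolves under a learned vector field that also sees the expert variable. This is only an illustration of the hybrid idea, not the Latent Hybridisation Model itself; the "neural" weights below are random placeholders standing in for a trained Neural ODE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Expert knowledge: a one-compartment pharmacokinetic ODE, dc/dt = -k * c
k = 0.5
def expert_ode(c):
    return -k * c

# Learned part: a tiny vector field over a latent state z that also reads c.
# In the Latent Hybridisation Model this is a trained Neural ODE; here the
# weights are random placeholders purely to show the coupled structure.
W1 = rng.normal(size=(8, 3)) * 0.1
W2 = rng.normal(size=(2, 8)) * 0.1
def latent_ode(z, c):
    h = np.tanh(W1 @ np.concatenate([z, [c]]))
    return W2 @ h

# Joint Euler integration of the expert variable c and the latent state z
dt, c, z = 0.01, 1.0, np.zeros(2)
for _ in range(1000):
    c = c + dt * expert_ode(c)
    z = z + dt * latent_ode(z, c)
```

The payoff of this structure is that the expert ODE constrains the dynamics of the variables clinicians understand, while the latent component absorbs whatever the expert model leaves unexplained.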

You can read more about our model in our NeurIPS 2021 paper “Integrating Expert ODEs into Neural ODEs: Pharmacology and Disease Progression” here.

Another approach is to use Causal Deep Learning models to incorporate existing knowledge.

You can read more about Causal Deep Learning in our dedicated research pillar here.

What is next?

If you would like to learn more about the importance of data in a clinical setting and how ML tools can help clinicians to unlock the power behind it, we are running two Revolutionizing Healthcare sessions focused on “What data do I need?”.

A panel of internationally renowned and experienced Revolutionary Clinicians discuss the need for reliable data and what data-centric AI and ML tools can do for clinicians.

You can find the recording of the first session here, a summary blog post about what was discussed here, and sign up for the second session (19 April) here.

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.