This piece is part of our wider agenda to advance a Reality-Centric AI approach in the machine learning community. Learn more about this agenda, and in particular the connection between Reality-Centric AI and real-world data (the second pillar), here.
Clinical data is subject to an array of challenges that affect its usability:
- Data quality: data may be noisy or corrupted, and labels may be ambiguous or mislabelled
- Biased data: the collection process introduces bias, or specific groups have coverage imbalances
- Missing data: specific features are missing – missing at random vs. informatively missing
- Limited data: in numerous clinical scenarios, only a small amount of data (possibly unlabelled or only partially labelled) will be available
All methods (not only ML) suffer from these challenges!
These factors directly influence the performance and robustness of machine learning systems. Imperfect or limited data can lead to suboptimal model performance, which has serious consequences in the clinical domain.
Four antidotes for improving clinical data
- Data-centric AI/ML
- Synthetic data
- Self- & semi-supervised learning
- Using existing/expert knowledge
Antidote 1: Data-centric AI/ML
Data-centric AI aims to systematically improve the data used by ML systems. However, we go a step further and have defined data-centric AI/ML so that it encompasses end-to-end ML pipelines:
Data-centric AI/ML encompasses methods and tools to systematically characterise, evaluate, and generate the underlying data used to train and evaluate models.
This means that, at the ML pipeline level, the considerations at each stage should be informed in a data-driven manner.
DC-Check
To make the data-centric lens accessible and usable for clinicians, we introduced DC-Check: an actionable, checklist-style framework to elicit data-centric considerations throughout the different stages of the ML pipeline: Data, Training, Testing, and Deployment.

DC-Check and the related models and tools (as mentioned in the graphic above) offer new considerations for improving the quality of the data used for model training; training can then draw on a better understanding of the data for informed model design, domain adaptation, and robust training. DC-Check also includes novel data-centric methods for testing ML models, such as informed data splits, targeted metrics, stress tests, and evaluation on subgroups, together with considerations for deployment: data and model monitoring, model adaptation and retraining, and uncertainty quantification.
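As one concrete illustration of these testing considerations, the minimal sketch below evaluates a classifier on subgroups rather than only in aggregate. The column names and groups are hypothetical stand-ins, not part of DC-Check itself.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_report(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """AUROC per subgroup; large gaps between groups flag under-served subgroups."""
    rows = []
    for group, part in df.groupby(group_col):
        rows.append({
            group_col: group,
            "n": len(part),
            "auroc": roc_auc_score(part["label"], part["score"]),
        })
    return pd.DataFrame(rows)

# Toy data standing in for held-out labels, model scores, and a subgroup column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "label": rng.integers(0, 2, 500),
    "score": rng.random(500),
    "age_band": rng.choice(["<40", "40-65", ">65"], 500),
})
print(subgroup_report(df, "age_band"))
```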

For more information on data-centric AI and helpful resources, take a look at our dedicated research pillar and the DC-Check tool.
Data Imputation: An essential yet overlooked problem
Missing data is a problem that is often overlooked, especially by ML researchers who assume access to complete input datasets to train their models. There is a plethora of methods one can use to impute the missing values in a dataset. Our lab has created a package – called HyperImpute – that selects the best method for you.
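A minimal usage sketch of HyperImpute might look as follows; the exact plugin names and API may differ across versions, so treat the details as assumptions and check the package documentation.

```python
import numpy as np
import pandas as pd
from hyperimpute.plugins.imputers import Imputers

# Toy dataset with ~20% of entries missing completely at random.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
X = X.mask(rng.random(X.shape) < 0.2)

# The "hyperimpute" plugin searches over candidate imputers and selects
# a well-performing method per column, rather than a single fixed imputer.
plugin = Imputers().get("hyperimpute")
X_imputed = plugin.fit_transform(X)
```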

For more information on data imputation, have a look at our Big Idea piece on the topic.
Antidote 2: Creating better-than-real data using synthetic data
Our lab has been at the forefront of research into synthetic data since 2017. We have created a research pillar as a comprehensive introduction for a variety of audiences. There, we outline the importance of synthetic data, explore and summarise recent cutting-edge approaches and methods, and link to a wide range of additional resources.

You can find the research pillar here. Furthermore, please have a look at our Revolutionizing Healthcare session on the topic.
Our lab is proud to have run comprehensive tutorials on synthetic data at ICML 2021 and AAAI 2023.


You can find our ICML tutorial, “Synthetic Healthcare Data Generation and Assessment: Challenges, Methods, and Impact on Machine Learning”, here.
All information about our AAAI tutorial, “Innovative Uses of Synthetic Data”, which focuses on fixing the issues with real data, can be found here.
Synthetic data is a unified way to address various issues with real data:
- Private and sensitive data → creating privacy-preserving synthetic data
- Pre-existing bias → creating de-biased synthetic data
- Small sample size → augmenting with synthetic data
- Data collected from different domains → transfer and domain adaptation with synthetic data
- Future-proofing → forward-looking data and future scenarios
To facilitate the use of synthetic data in an accessible manner, we introduce Synthcity, a single access point for cutting-edge methods.

Synthcity is the most comprehensive open-source synthetic data generation and evaluation library with a focus on tabular clinical data.
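A minimal usage sketch, following the library's quickstart pattern, might look as follows; the plugin names and exact API may vary between versions, so treat the details as assumptions and consult the documentation.

```python
from sklearn.datasets import load_diabetes
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import GenericDataLoader

# Any tabular DataFrame works; the diabetes dataset stands in for clinical data.
X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y
loader = GenericDataLoader(X, target_column="target")

model = Plugins().get("ctgan")       # one of many available generators
model.fit(loader)
synthetic = model.generate(count=500).dataframe()
```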

Find out more about Synthcity, the original publications, and how to access the library here.
Antidote 3: Self-supervised learning
Self-supervised, semi-supervised, and multi-view learning are valuable ways to deal with the lack of labelled data.

We have created a dedicated research pillar on the topic that you can find here.
Self- and semi-supervised learning for tabular data
- The first self- and semi-supervised learning method designed for tabular data
- Self-supervised pretext tasks: feature vector estimation and mask vector estimation (see the sketch below)
- The masking process can also be used for semi-supervised learning
- Improvements to the masking process: independent vs. correlated masking
- Extension to feature selection
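To make the pretext tasks concrete, here is a minimal NumPy sketch of the masking process in the style described above; the function names are illustrative, not our actual implementation.

```python
import numpy as np

def corrupt(X: np.ndarray, p_mask: float, rng: np.random.Generator):
    """Mask each entry independently with probability p_mask and replace masked
    entries with values drawn from the empirical marginal of that column."""
    n, d = X.shape
    mask = rng.binomial(1, p_mask, size=(n, d))
    # Column-wise shuffling approximates sampling from each feature's marginal.
    X_shuffled = np.stack([rng.permutation(X[:, j]) for j in range(d)], axis=1)
    X_tilde = (1 - mask) * X + mask * X_shuffled
    return X_tilde, mask

# Pretext tasks: given X_tilde, an encoder is trained so that
#   (1) a mask-estimator head predicts `mask` (binary cross-entropy), and
#   (2) a feature-estimator head reconstructs the original X (MSE).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))          # stand-in for unlabelled tabular data
X_tilde, mask = corrupt(X, p_mask=0.3, rng=rng)
```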

Self-Supervised Learning for Conformal Prediction (AISTATS 2023)
Conformal prediction is a powerful tool for uncertainty quantification, establishing valid prediction intervals with finite-sample guarantees. To produce valid intervals that are also adaptive to the difficulty of each instance, a common approach is to compute normalised non-conformity scores on a separate calibration set. In this recent AISTATS paper, we investigate how unlabelled data and self-supervised pretext tasks can improve the quality of conformal regressors, specifically by improving the adaptiveness of conformal intervals. We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate non-conformity scores. Once again, we use whatever data is available to improve ML models.
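To make the idea concrete, here is a minimal sketch of normalised conformal regression in which a self-supervised reconstruction error feeds the difficulty estimate. The models and dataset are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.5 + np.abs(X[:, 1]), size=3000)
train, cal, test = np.split(rng.permutation(3000), [1500, 2250])

model = GradientBoostingRegressor().fit(X[train], y[train])

# Self-supervised pretext task: reconstruct a held-out feature from the rest;
# its per-instance error acts as an extra signal of how "hard" an input is.
ssl = GradientBoostingRegressor().fit(X[train][:, 1:], X[train][:, 0])
ssl_err = np.abs(ssl.predict(X[:, 1:]) - X[:, 0])

# Difficulty model: predict |residual| from the inputs plus the SSL error.
resid = np.abs(y[train] - model.predict(X[train]))
diff = GradientBoostingRegressor().fit(
    np.column_stack([X[train], ssl_err[train]]), resid)

def sigma(idx):
    return np.maximum(diff.predict(np.column_stack([X[idx], ssl_err[idx]])), 1e-3)

# Calibration: normalised non-conformity scores and their corrected quantile.
alpha = 0.1
scores = np.abs(y[cal] - model.predict(X[cal])) / sigma(cal)
q = np.quantile(scores, np.ceil((1 - alpha) * (len(cal) + 1)) / len(cal))

# Adaptive intervals on the test set: wider where the difficulty model is high.
pred = model.predict(X[test])
lower, upper = pred - q * sigma(test), pred + q * sigma(test)
```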

You can read more about SSCP in our AISTATS 2023 paper “Improving Adaptive Conformal Prediction using Self-Supervised Learning” here.
Antidote 4: Use existing/expert knowledge
What do we do if we do not have enough usable data for our ML tools? One solution is to include existing/expert knowledge. However, such expert models can be incorrect or incomplete, can involve complex dynamics and high dimensionality, or may be only partially observable.

We propose a novel hybrid modelling framework, the Latent Hybridisation Model, that embeds a given pharmacological model (a collection of expert variables and the ODEs that describe the evolution of these variables) into a larger latent variable ML model (a system of Neural ODEs).
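As a loose conceptual sketch (not the paper's implementation), the snippet below couples a hand-written stand-in for an expert ODE with a learned latent dynamics network, assuming the torchdiffeq solver; the expert equations and network shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

class HybridODE(nn.Module):
    def __init__(self, expert_dim: int = 2, latent_dim: int = 4):
        super().__init__()
        self.expert_dim = expert_dim
        # Learned dynamics for latent variables, conditioned on the expert state.
        self.latent_net = nn.Sequential(
            nn.Linear(expert_dim + latent_dim, 32), nn.Tanh(),
            nn.Linear(32, latent_dim))

    def expert_ode(self, z_expert):
        # Toy stand-in for a pharmacological model: two-compartment clearance.
        k01, k12 = 0.5, 0.2
        c1, c2 = z_expert[..., :1], z_expert[..., 1:]
        return torch.cat([-k01 * c1, k12 * c1 - 0.1 * c2], dim=-1)

    def forward(self, t, z):
        z_expert = z[..., :self.expert_dim]
        z_latent = z[..., self.expert_dim:]
        dz_expert = self.expert_ode(z_expert)
        dz_latent = self.latent_net(torch.cat([z_expert, z_latent], dim=-1))
        return torch.cat([dz_expert, dz_latent], dim=-1)

func = HybridODE()
z0 = torch.randn(8, 6)                 # batch of initial expert + latent states
t = torch.linspace(0.0, 5.0, 20)
trajectory = odeint(func, z0, t)       # (20, 8, 6) solution paths
```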
You can read more about our model in our NeurIPS 2021 paper “Integrating Expert ODEs into Neural ODEs: Pharmacology and Disease Progression” here.
Another approach would be the use of Causal Deep Learning models to process existing knowledge.

You can read more about Causal Deep Learning in our dedicated research pillar here.
What is next?
If you would like to learn more about the importance of data in a clinical setting and how ML tools can help clinicians to unlock the power behind it, we are running two Revolutionizing Healthcare sessions focused on “What data do I need?”.
A panel of internationally renowned and experienced Revolutionary Clinicians discuss the need for reliable data and what data-centric AI and ML tools can do for clinicians.

You can find the recording of the first session here, a summary blog about what was discussed here, and sign up for the second session (19 April) here.