Developing and using machine learning tools in a clinical setting brings with it several challenges, both for the developers of such tools and for the clinicians who use them.
One significant topic is data.
Common questions and key concepts to consider include:
- How much data do I need to do machine learning in a clinical setting?
- What should the quality of that data be?
- How do I test the quality of my data?
- Can machine learning improve the quality of my data?
- What happens if I do not have enough high-quality or labelled data?
- Can I share my data without privacy fears and what role can synthetic data play?
- What are the differences between cross-sectional, treatment, and time-series data?
These questions immediately bring with them challenges associated with data quality in clinical machine learning, including noisy data, errors in features and labels, biased data collection, and missing data. It must be emphasised that systematic data curation is crucial to ensuring the reliability of machine learning models.
However, these challenges are not unique to machine learning; they are equally relevant for statistical and epidemiological models. Dealing with data quality is a difficult task across clinical analytics. Machine learning offers tools for addressing these challenges, and even those who have no interest in building machine learning models can benefit from the resulting improvements in data accuracy and quality. To rigorously test clinical analytics, it is important to move beyond benchmarking and stress-test these models to identify the settings in which they can be applied effectively.
At the time of deployment, it is crucial to consider what kind of data will be available and how to prepare for the errors, biases, and uncertainties associated with data curation. Data curation matters for all clinical models, not just machine learning models. Four key concepts for dealing with these issues are: data-centric AI (or machine learning), synthetic data, self- and semi-supervised learning, and the hybridisation of expert models and machine learning models.
Data-centric AI
In this short introduction, Prof Mihaela van der Schaar briefly describes these four key concepts and talks about the potential of data-centric AI:
Data-centric AI is a new paradigm that puts the data itself centre stage, rather than simply using whatever data is available to train machine learning models. The goal is to systematically improve the quality of the data using machine learning techniques. This broader perspective encourages holistic thinking about building end-to-end machine learning pipelines.
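To make this concrete, here is a minimal, illustrative sketch of one data-centric step: using a model's out-of-fold predictions to flag records whose labels look suspicious, so that they can be reviewed before any downstream model is trained. This is generic scikit-learn code written for this post, not the lab's own tooling.

```python
# A minimal, illustrative sketch of one data-centric step: flagging records
# whose labels look suspicious by using out-of-fold predicted probabilities,
# so they can be reviewed before any downstream model is trained.
# Generic scikit-learn code for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each record is scored by a model that never saw it during training.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Records where the model assigns very low probability to the recorded label
# are candidates for label noise and worth a manual check.
confidence_in_label = proba[np.arange(len(y)), y]
suspect_idx = np.where(confidence_in_label < 0.2)[0]
print(f"{len(suspect_idx)} records flagged for label review")
```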
In this section of our Revolutionizing Healthcare session, we explore the impact of data on many of the areas shown in the graphic below:

If you are interested in learning more about what machine learning can do with a data-centric focus, you can look through our dedicated Research Pillar on data-centric AI. This pillar explores how machine learning can improve efficiency in various areas, such as healthcare, finance, retail, and manufacturing.
Data Imputation
In the realm of machine learning, several techniques can be employed to improve data quality and accuracy. One such technique is data imputation, which deals with missing values in datasets.
Those interested in learning more should refer to our Big Idea piece on data imputation, which also introduces an open-source package called HyperImpute. This package represents the state of the art in machine learning-based data imputation and can be used either as part of AutoPrognosis or as a standalone tool.
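As a rough illustration of what model-based imputation does, the sketch below fills in missing values by modelling each column as a function of the others. It uses generic scikit-learn tools rather than HyperImpute itself; please refer to the HyperImpute repository for its actual API.

```python
# A minimal sketch of model-based imputation using generic scikit-learn tools.
# HyperImpute has its own API; see its repository for actual usage.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy clinical-style table with missing values.
df = pd.DataFrame({
    "age": [64, 71, np.nan, 58, 49],
    "sbp": [142, np.nan, 131, 128, np.nan],
    "bmi": [27.1, 31.4, 29.8, np.nan, 24.5],
})

# Iteratively model each column as a function of the others and fill the
# missing entries with the model's predictions.
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed.round(1))
```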
Synthetic Data
Synthetic data is not simply anonymised data; rather, it is data created from scratch to mirror the properties of real data. This technique has several advantages, such as providing privacy-preserving properties and improving data quality. Synthetic data is generated using cutting-edge machine learning models and is useful in several scenarios, such as sharing data with colleagues and improving data quality internally.
One key advantage of synthetic data is that it can address many of the issues associated with real data, such as bias. It can also be used to augment real data for populations of interest that are underrepresented in the dataset.
If you are interested in learning more about synthetic data, please have a look at our Research Pillar and one of Prof van der Schaar’s introductions given as part of an earlier Revolutionizing Healthcare session. We have also recently introduced an open-source software package called Synthcity, which can be used to generate synthetic data, improve data quality, and support a range of other innovative uses of data.
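To illustrate the basic workflow, the toy sketch below fits a simple generative model to (simulated) real records and samples brand-new records from it. In practice you would use a purpose-built generator such as those provided by Synthcity; the Gaussian mixture here is only a stand-in to make the fit-then-sample pattern concrete.

```python
# A toy sketch of the synthetic-data workflow: fit a generative model to real
# records and sample brand-new records from it. A purpose-built generator
# (e.g. from Synthcity) would be used in practice; the Gaussian mixture below
# is only a stand-in to make the fit-then-sample pattern concrete.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real = pd.DataFrame({
    "age": rng.normal(65, 10, 1000),   # simulated "real" records
    "sbp": rng.normal(135, 15, 1000),
})

# Fit a generative model to the real data ...
generator = GaussianMixture(n_components=3, random_state=0).fit(real)

# ... and draw synthetic records that follow a similar joint distribution
# without corresponding to any real patient.
synthetic, _ = generator.sample(500)
synthetic = pd.DataFrame(synthetic, columns=real.columns)
print(synthetic.describe().round(1))
```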
Self- & semi-supervised learning
Self- & semi-supervised learning are crucial in scenarios where labelled data is limited and expensive. In these cases, utilizing unlabelled data sets through self-supervised learning can provide useful representations for building better predictive analytics or identifying causal relationships between variables of interest.
Self-supervised learning has been an impactful paradigm in imaging, and our lab has introduced technology for self-supervised learning in tabular clinical data, which has shown significant success, for example, in building polygenic risk scores for genomic data.
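As a small illustration of the semi-supervised setting, the sketch below trains a classifier when only a small fraction of records carry labels (the rest are marked as unknown), using generic scikit-learn self-training. It is not the lab's self-supervised method for tabular data; it simply makes the limited-labels scenario concrete.

```python
# A minimal semi-supervised sketch: only a small fraction of records carry
# labels (the rest are marked -1), and self-training propagates labels from
# confident predictions. Generic scikit-learn code for illustration, not the
# lab's self-supervised methods for tabular data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Pretend only ~5% of the records are labelled; the rest are unknown (-1).
y_partial = y.copy()
unlabelled = np.random.default_rng(0).random(len(y)) > 0.05
y_partial[unlabelled] = -1

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y_partial)
print(f"accuracy on all records: {accuracy_score(y, clf.predict(X)):.3f}")
```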
To learn more about these concepts, you can refer to our Research Pillar.
Hybridisation of expert models and machine learning models
Machine learning can address issues associated with expert modelling by combining the two kinds of model through a concept called hybridisation, even when only limited data is available. This empowers machine learning models with existing knowledge from expert models in fields such as pharmacology. How much data is needed for machine learning in the medical setting depends on the problem at hand and on the accuracy and trustworthiness required: the more data available, the more confident the predictions.
Synthetic data augmentation and self-supervision can also be used to enrich a dataset and improve predictive analytics. Additionally, expert models from fields such as pharmacology and epidemiology can be hybridised with machine learning models to personalise them to the population of interest and to address model specification issues.
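One simple way to hybridise an expert model with a machine learning model, sketched below for illustration, is to keep the expert model's prediction as a prior and learn only a data-driven correction (a residual) from the limited data. The expert_model formula in the sketch is a hypothetical placeholder, not a real pharmacological model, and residual learning is just one of several possible hybridisation strategies.

```python
# A sketch of residual-style hybridisation: keep the expert model's prediction
# as a prior and train an ML model only on the correction it needs. The
# "expert_model" formula below is a hypothetical placeholder, not a real
# pharmacological model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                  # e.g. dose, body weight
y = 3.0 * X[:, 0] / (1 + X[:, 1] / 5) + rng.normal(0, 0.5, 200)

def expert_model(X):
    # Hypothetical mechanistic approximation supplied by domain experts.
    return 2.5 * X[:, 0] / (1 + X[:, 1] / 4)

# The machine learning component is trained only on what the expert gets wrong.
correction = GradientBoostingRegressor(random_state=0).fit(X, y - expert_model(X))

def hybrid_predict(X):
    return expert_model(X) + correction.predict(X)

print(f"expert-only mean squared error:  {np.mean((y - expert_model(X)) ** 2):.3f}")
print(f"hybrid model mean squared error: {np.mean((y - hybrid_predict(X)) ** 2):.3f}")
```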
Assessment of data quality
The quality of the data needed will depend on the problem being addressed. To help with this, the van der Schaar Lab has introduced DC-Check, a comprehensive checklist-style framework for assessing and improving data quality throughout the machine learning pipeline.
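As a very rough starting point, a first-pass data-quality audit can be scripted in a few lines of pandas, as sketched below: checking missingness, duplicates, outcome balance, and implausible values. This snippet is written purely for illustration; DC-Check itself provides a far more systematic set of checks.

```python
# A first-pass data-quality audit sketched with pandas for illustration.
# DC-Check itself provides a far more systematic set of checks; this snippet
# only shows what an initial audit might look like.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [64, 71, np.nan, 58, 58, 130],   # note the implausible value 130
    "sbp": [142, np.nan, 131, 128, 128, 119],
    "outcome": [1, 0, 0, 1, 1, 0],
})

report = {
    "missingness per column": df.isna().mean().round(2).to_dict(),
    "duplicate rows": int(df.duplicated().sum()),
    "outcome balance": df["outcome"].value_counts(normalize=True).round(2).to_dict(),
    "implausible ages (>110)": int((df["age"] > 110).sum()),
}
for check, result in report.items():
    print(f"{check}: {result}")
```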
What’s next?
In summary, machine learning can help improve the quality of data in the medical setting by assessing the data’s quality, imputing missing data, and using synthetic data to augment datasets and address privacy concerns.
To get a more thorough insight into what machine learning can do in a clinical setting, and the role data plays, we highly recommend watching the recording of our last Revolutionizing Healthcare session.
On 19 April, we will hold a follow-up session in which we dive back into data and discuss the topic with a panel of experts. If you have any questions or thoughts on the topic, especially after reading this blog post, we wholeheartedly invite you to join us. You can sign up for our Revolutionizing Healthcare sessions here.