van der Schaar Lab

Living up to the Aspiration: how the van der Schaar Lab is revolutionising healthcare

Recently, Wade Shen, US Deputy Chief Technology Officer and Director of the National AI Initiative Office, outlined an ambitious vision for revolutionising healthcare using advanced AI. Here, we present a comprehensive response to his aspirations, showing that the van der Schaar Lab is up for the challenge and, in many cases, ahead of the curve.

When Mr Shen speaks about an AI aspiration for health, he means improving the current drug development process – a slow and costly procedure with a high failure rate. AI holds the potential to transform this landscape by leveraging large-scale models and vast datasets to predict drug interactions, optimise trials, and repurpose existing drugs. However, achieving this requires overcoming significant challenges in data access, privacy, safety, and model validation.

The van der Schaar Lab reads this not only as a call to action but also as reassurance that we have been on the right track for the past 10 years. With our research we have anticipated, identified, and tackled many of the issues that healthcare faces and that AI can potentially solve. While we don’t have ready solutions for everything, the lab offers transformative ideas and practical tools that already push what reality-centric AI can do to change the healthcare environment.

Revolutionising Clinical Trials using Machine Learning

Randomised controlled trials (RCTs), the gold standard for evidence-based medicine, are expensive, difficult to run, and often exclude over 75% of potential patients due to restrictive criteria. This limits their generalisability and introduces bias. To address these issues, clinical trials must fundamentally change and leverage AI and machine learning.

The challenges facing drug development (as outlined by Wade Shen and presented more precisely in the image above) are not new to us. We have long been in conversation with clinicians and industry experts on the topic. With Revolutionizing Healthcare and Inspiration Exchange, we created engaging platforms that helped us identify what is relevant at each stage of a new form of truly AI-powered adaptive trials.

Where are we?

We have not only anticipated many challenges but also worked on and proposed practical solutions. Our work contributes to a new generation of clinical trials that, unlike traditional RCTs, use accumulated results to dynamically adjust trials for better efficiency and ethics, while maintaining study integrity. Adaptive designs, powered by AI and machine learning, use interim analyses to reconfigure patient recruitment criteria, assignment rules, and treatment options. The image below offers a quick overview of the AI domains we identified as answers to the challenges that arise along a clinical trial.

The planning stage involves synthesising information from diverse sources like observational or pre-clinical data. Conducting the trial includes decisions on recruitment and drug dosages. Once we get to the analysis, we tackle complex inference problems related to risks and outcomes. Commercialisation relies on intelligent modelling of processes, from disease progression to clinician prescribing behaviour. A key, often overlooked challenge is using one trial’s results to inform the planning, conduct, and analysis of subsequent trials, such as suggesting broader applications for a treatment or optimising recruitment for specific conditions.

But let’s answer the challenges as outlined in the AI aspirations – what do we offer?

How to deal with data

Wade Shen emphasizes that vast amounts of data and computational power drive progress in AI and its application in healthcare. However, he notes the challenges posed by the size and complexity of data, which come from various sources and in multiple forms.

At the van der Schaar Lab, we have long realised the importance of data and given it centre stage. Our novel paradigm of data-centric AI views model or algorithmic refinement as less important (in certain settings, algorithmic development is even considered a solved problem) and instead seeks to systematically improve the data used by ML systems.

But what does this mean in practice? We have produced a variety of approaches and tools to improve data quality, to transfer knowledge from one model to another, and to train models appropriately on the complex data common in healthcare, genomics, and other omics settings. First, we have to accept that clinical data is subject to an array of challenges that affect its usability.

These challenges directly influence the performance and robustness of machine learning systems. Data with imperfections and limitations can lead to suboptimal model performance, which has serious consequences in the clinical domain.

For this purpose, the van der Schaar Lab introduced 4 antidotes for improving clinical data, a comprehensive assessment of clinical data needs that summarises many of the frameworks below.

Data Imputation

Data imputation involves dealing with missing data in datasets. Those interested in learning more about this should refer to our Big Idea piece that focuses on data imputation, which also includes an open-source package called HyperImpute. This package represents the state-of-the-art in machine learning data imputation and can be used either as part of AutoPrognosis or as a standalone paradigm.
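To make the idea of filling in missing values concrete, here is a minimal sketch of the simplest possible baseline, column-mean imputation. It is not HyperImpute and does not use its API; the function name `impute_column_means` and the toy records are hypothetical and chosen only for illustration. Learned imputers aim to do considerably better than this by modelling relationships between columns.

```python
from statistics import mean


def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    A deliberately simple baseline for illustration, not the HyperImpute method.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 for a fully missing column (an arbitrary choice for this sketch).
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy records: [age, systolic blood pressure], with None marking missing values.
records = [[60, 120.0], [70, None], [None, 140.0]]
completed = impute_column_means(records)
```

Mean imputation ignores how variables relate to one another (for example, blood pressure tending to rise with age), which is one reason model-based imputation is preferred for clinical data.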

Self- & semi-supervised learning

Self- & semi-supervised learning are crucial in scenarios where labelled data is limited and expensive. In these cases, utilising unlabelled data sets through self-supervised learning can provide useful representations for building better predictive analytics or identifying causal relationships between variables of interest.

Self-supervised learning has been an impactful paradigm in imaging, and our lab has introduced technology for self-supervised learning in tabular clinical data, which has shown significant success, for example, in building polygenic risk scores for genomic data.

Synthetic Data

As outlined in the aspirations, researchers are still hamstrung by a lack of access to high-quality data, which is the result of perfectly valid concerns regarding privacy.

If the goal is to develop and validate machine learning methods (e.g., prognostic risk scoring), synthetic data can replace real data. Creating synthetic patient records that are sufficiently close to the real information provides researchers with the needed data while protecting sensitive patient information. This approach balances risks and benefits in favour of the latter.

Prof Mihaela van der Schaar, “inventor” of synthetic data, has long established that synthetic data can provide researchers with datasets that have been tailored to specific needs, while still based on real data. Varying types of synthetic datasets could, for instance, be created specifically for clinical trials and solve the problem of data availability and sharing.

However, synthetic data has advantages that go beyond privacy-preserving properties – it can improve data quality. One key advantage of synthetic data is that it can fix many of the issues associated with real data, such as biases. It can also be used to augment real data for populations of interest that are underrepresented in the dataset.

Aggregate Datasets

How do we deal with clinical datasets that are heterogeneous and come from different sources? Although Wade Shen mentions that “to date, no one has built large-scale models with aggregate data”, our lab has been pioneering this aspect of data-centric AI for some time.

A useful approach is “clustering” of data, e.g., patients. Here, we are interested in a type of clustering of phenotypes, in which patients are grouped based on similarity of future outcomes, rather than solely on similarity of observations. One of our lab’s first projects to implement this was the “tree of predictors” (ToPs), an ensemble method first published in 2018.

Working in the supervised setting, ToPs captures the heterogeneity of the populations by learning automatically on the basis of the data which features have the most predictive power and which features have the most discriminative power for each time horizon. ToPs uses this knowledge to create clusters of patients and specific predictive models for each cluster. The clusters that are identified and the predictive models that are applied to each cluster are readily interpretable.

Our newest approach in leveraging large datasets from related but different sources is RadialGAN. It allows for related datasets to be jointly used for modelling – a breakthrough – especially for settings in which high quality data is rare and fragmented. By solving feature and distribution mismatch, RadialGANs open the door to effective transfer learning. The practical utility of this approach was demonstrated using 14 different heart failure datasets for improved predictive modelling.

Accelerating drug development with AI

Reinforcement Learning/Multi-Armed Bandits

Next-generation clinical trial design and implementation is one of our lab’s key research priorities, and our lab has developed an array of novel approaches. Much of this work builds on a solid foundation of roughly 10 years of expertise with multi-armed bandits (related publications can be found here).

The multi-armed bandits framework addresses the exploration-exploitation trade-off in clinical trials by balancing clinical research (discovering treatment knowledge) with clinical practice (benefiting participants). It assigns new patients to treatment arms based on information from previous patients. These methods accelerate learning and identify subgroups with different treatment responses. They are easy to implement when trial logistics allow. Additionally, their Bayesian nature enables the smooth incorporation of prior observational evidence.
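The exploration-exploitation trade-off described above can be sketched with Thompson sampling on a Beta-Bernoulli model, a standard textbook bandit algorithm and not necessarily the method used in the lab's own publications. All names here (`thompson_assign`, `run_trial`) and the response rates are hypothetical. The prior counts show one way prior observational evidence could be folded in, as pseudo-observations per arm.

```python
import random


def thompson_assign(successes, failures, rng):
    """Pick the arm whose sampled response rate is highest (Beta-Bernoulli Thompson sampling)."""
    # Beta(s + 1, f + 1) is the posterior over each arm's response rate under a uniform prior.
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_trial(true_response_rates, n_patients, prior_successes, prior_failures, seed=0):
    """Simulate sequential patient assignment; priors act as pseudo-counts from earlier evidence."""
    rng = random.Random(seed)
    successes = list(prior_successes)
    failures = list(prior_failures)
    assignments = [0] * len(true_response_rates)
    for _ in range(n_patients):
        arm = thompson_assign(successes, failures, rng)
        assignments[arm] += 1
        # Simulated binary outcome; in a real trial this would be the observed response.
        if rng.random() < true_response_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return assignments, successes, failures


# Hypothetical two-arm trial: arm 1 is truly better (80% vs 20% response), no prior evidence.
assignments, successes, failures = run_trial([0.2, 0.8], 1000, [0, 0], [0, 0])
```

As evidence accumulates, more patients are routed to the arm that appears to work better, while the weaker arm is still sampled occasionally so that the estimate of its effect keeps improving.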

Understanding Treatment Effects

Using data from randomized clinical trials to justify treatment decisions for real-world patients is current practice, assuming that average treatment effects can apply broadly. However, patients vary widely in personal and disease characteristics. Leveraging machine learning to estimate an expected conditional average treatment effect (CATE) from diverse observational datasets offers the potential for more accurate treatment effect estimations tailored to individual patient characteristics.
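For readers who prefer a formula, the distinction between the average effect and the CATE can be written in standard potential-outcomes notation. The notation is an assumption here, since the text does not fix one: Y(1) and Y(0) denote a patient's outcomes with and without treatment, and X denotes patient characteristics.

```latex
\tau(x) = \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right],
\qquad
\mathrm{ATE} = \mathbb{E}\left[\, Y(1) - Y(0) \,\right] = \mathbb{E}_{X}\left[\, \tau(X) \,\right]
```

The average treatment effect is a single number for the whole population, whereas the CATE is a function of patient characteristics, which is what makes tailored treatment decisions possible.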

Conventional machine learning methods designed for standard prediction tasks often do not include specific features for forecasting treatment effectiveness, especially in deciding whom to treat, when, and with which intervention. To address the question of modifying current treatment policies and predicting outcomes, new approaches are needed.

This research is closely linked to a more precise individualised treatment effect (ITE) inference – rather than having the “average” patient in mind, we follow the evidence that different treatments result in different effects and outcomes from one individual to another.

Our goal is to support a shift from a focus on average treatment effects to individualised treatment effects by optimising the use of observational datasets and clinical trial design.

Our lab pioneered the first AutoML approach for causal inference using influence functions, showcased at ICML 2019. Explore our research pillar on automated machine learning (AutoML) for healthcare to learn more. AutoML holds promise for personalised screening, monitoring, forecasting biomarkers, disease trajectories, and clinical outcomes, presenting untapped potential for drug development in clinical trials.


In addition to the topics already discussed here, the van der Schaar Lab has done significant work on the time-series domain – the backbone of personalised medicine. We have, for example, used time series datasets to produce new discoveries and develop an understanding of progression and clinical trajectories across a wide range of diseases, including cancer, cystic fibrosis, Alzheimer’s, cardiovascular disease and COVID-19, as well as within specific settings such as intensive care.

Personalised Medicine

This all leads to better drug development and to our focus on personalised medicine. We have recently published the results of an international team of researchers in Nature Medicine, exploring how causal machine learning can improve medical treatments, making them safer, more efficient, and more personalised. You can find out more here.

Societal Risks – How to deal with bias?

Despite all the advances in our AI approaches, we need to be mindful not to exacerbate inequities and bias. Our lab is conscious of these responsibilities, and we are at the forefront of building AI that clinicians and patients can trust. Our researchers, in cooperation with GSK, have recently explored how data-centric and model-centric approaches can help with the Generalization Challenge: how can AI systems apply their knowledge to new data outside their original training pool?

The researchers highlight a tremendously important approach – responsible AI. Their main criterion for responsible use of ML is whether we can trust the predictions of a model. Our paper, published in npj Digital Medicine, suggests a number of possible solutions to the Generalization Challenge – mainly data-centric and model-centric methods (or a combination thereof).

Another approach is Synthetic Model Combination (SMC), a game-changing machine learning method for constructing new model ensembles and our newest answer to the question of predicting treatment outcomes in new patients based on existing models. SMC builds a new ensemble by weighting existing models according to how likely each is to accurately represent a novel case. Based on our results, SMC is more robust and gives more accurate predictions than existing models, especially when there is no new data to judge which existing model is best. SMC can be reverse engineered to the needs of the situation and find applicable data for individuals who do not fit any of the existing models – this is ground-breaking, especially for under-represented groups and special cases. You can find our paper here.
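The core weighting idea can be illustrated with a heavily simplified sketch. This is not the published SMC algorithm: here each existing model is summarised by a hypothetical one-dimensional Gaussian description of its training population (mean and standard deviation of age), and a new patient's prediction weights each model by how plausible that patient is under the model's population. The names `gaussian_density` and `combine`, and all numbers, are assumptions made for illustration.

```python
import math


def gaussian_density(x, mean, std):
    """Density of a normal distribution, used as a stand-in for 'how typical is x for this model'."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def combine(models, x):
    """Weight each model by the density of x under its training-population summary."""
    densities = [gaussian_density(x, m["mean"], m["std"]) for m in models]
    total = sum(densities)
    if total == 0.0:
        # x is far from every population: fall back to equal weights (a choice made for this sketch).
        weights = [1.0 / len(models)] * len(models)
    else:
        weights = [d / total for d in densities]
    prediction = sum(w * m["predict"](x) for w, m in zip(weights, models))
    return prediction, weights


# Hypothetical pre-trained risk models, each built on a different age group.
models = [
    {"mean": 40.0, "std": 10.0, "predict": lambda age: 0.1},  # trained on a younger cohort
    {"mean": 70.0, "std": 10.0, "predict": lambda age: 0.5},  # trained on an older cohort
]

prediction_old, weights_old = combine(models, 70.0)  # patient resembling the older cohort
prediction_mid, weights_mid = combine(models, 55.0)  # patient halfway between the cohorts
```

No new labelled data is needed to set the weights; only information about where each existing model was trained. That is the property the paragraph above highlights.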

“Building in” safety

One of the very valid challenges raised by Wade Shen is the need for safe and reliable AI, the responsible handling of data, and control over powerful AI tools. Our lab has a strong focus on AI that not only empowers humans, but that is also safe and trustworthy.

We have already talked about the potential of synthetic data to provide solutions for data privacy and data sharing. In addition, our lab has worked tirelessly in the field of trustworthy AI, that is, AI that is interpretable and addresses uncertainty. Both of these are key to unlocking an AI that not only fosters a new human-machine partnership, but can also be kept in check by clinicians and researchers.

However, addressing regulatory issues is something we cannot deal with at the grassroots research level – here politicians have to take responsibility. What we can do, though, is inform decision makers about the potential benefits and dangers of AI, and steer them toward better-regulated and more effective policy. Prof Mihaela van der Schaar has taken a prominent role in advising the British parliament on AI governance – and as an advocate for reality-centric AI.

Work to be done

Even though we’ve worked hard on existing challenges in AI for healthcare, we don’t have all the answers, and we welcome aspirational new initiatives. They motivate us to develop bigger networks and intensify worldwide cooperation in our ongoing drive to revolutionise drug development and healthcare delivery using cutting-edge machine learning. With the power of AI, we can empower researchers, clinicians, and patients.

Andreas Bedorf