van der Schaar Lab

Spotlight on cancer research projects

To confront cancer is to encounter a parallel species,
one perhaps more adapted to survival than even we are.

– Siddhartha Mukherjee, The Emperor of All Maladies

The term cancer embraces a wide variety of related disorders and conditions that share many similarities but differ in just as many ways. This daunting complexity becomes more apparent with every breakthrough in our quest to understand it. It is manifold, ranging from the bewildering array of disease subtypes (and subtypes of subtypes) to variations in cause and presentation, to the lengthy and unpredictable pathways inflicted on patients.

While the notion of developing a single “magic bullet” to cure cancer is outdated, ongoing research advancements have at least allowed us to develop a substantial arsenal in areas such as prevention, prediction, detection, diagnosis, treatment, and care. Truly revolutionizing our ability to combat cancer, however, requires an altogether deeper understanding of its disease pathways, and I believe this can only be achieved through the adoption of machine learning methods.

The potential of machine learning in combating cancer is a topic I addressed in our most recent Revolutionizing Healthcare engagement session. To view that session (in which I also explain many of the methods and approaches detailed later in this post), or to sign up for the Revolutionizing Healthcare series, please click the links below.

This post will highlight and summarize some of our lab’s key projects related to cancer. Our summary will follow a slightly simplified chronological representation of the standard cancer patient’s pathway: at each stage along the pathway, we introduce specific projects and provide resources for further reading.

Genetic risk

Polygenic risk scores play an important role in determining an individual’s risk of developing cancer during their lifetime. To date, only linear models have been successfully applied to crafting genomic risk scores using genomics data. This raises the question: can machine learning improve on linear models in crafting polygenic risk scores?

To address this, our lab recently created VIME, a machine learning framework for crafting polygenic risk scores combining self- and semi-supervised learning.

VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

Jinsung Yoon, Yao Zhang, James Jordon, Mihaela van der Schaar
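
As a point of reference for the linear baseline discussed above, a linear polygenic risk score is commonly written as a weighted sum of risk-allele counts. The sketch below illustrates only that baseline, not VIME; the function name `linear_prs`, the weights, and the genotype are hypothetical values chosen for illustration.

```python
def linear_prs(dosages, weights):
    """Return the weighted sum of risk-allele dosages (0, 1, or 2 per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


# Hypothetical per-variant effect sizes (illustrative values, not from any study).
weights = [0.10, -0.05, 0.20]
# Hypothetical genotype: copies of the risk allele carried at each variant.
person = [2, 1, 0]

score = linear_prs(person, weights)
print(round(score, 2))
```

Because each variant contributes independently to the sum, a score of this form cannot represent interactions between variants, which is one reason to ask whether more flexible models can help.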



In addition to genetic factors, lifestyle (including socio-demographics) plays a major role in determining an individual’s cancer risk.

While current statistical risk scoring models only use a handful of factors that have been identified as potentially important, we know that there are other factors that may be just as important (or more important).

In this context, we can apply machine learning in the service of two main objectives:
1) identifying, out of a large number of potentially informative risk factors (including socio-demographic information), which factors are most relevant for issuing an accurate prediction, in essence determining the value of information for a particular individual or class of individuals; and
2) understanding when non-linear interactions between identified factors are important and moving beyond linear models to non-linear models.

For this, we developed AutoPrognosis, our machine learning tool for crafting clinical scores. You can learn more about AutoPrognosis here, or by reading the paper below.

AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning

Ahmed Alaa, Mihaela van der Schaar
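
The second objective listed above, recognizing when non-linear interactions between factors matter, can be illustrated with a toy calculation. This is a minimal sketch and not part of AutoPrognosis: the risk table and the helper `additive_fit` are hypothetical. It compares a two-factor risk table with its best additive approximation (equal-weight least squares); large residuals indicate that a purely additive model misses an interaction.

```python
# Hypothetical risk table for two binary risk factors (illustrative numbers only).
# Keys are (factor_a, factor_b); values are the probability of the outcome.
risk = {(0, 0): 0.05, (1, 0): 0.10, (0, 1): 0.10, (1, 1): 0.60}


def additive_fit(table):
    """Equal-weight least-squares additive approximation of a 2x2 table."""
    grand = sum(table.values()) / len(table)

    def marginal(axis, level):
        values = [v for key, v in table.items() if key[axis] == level]
        return sum(values) / len(values)

    # Row mean + column mean - grand mean is the additive fit for a balanced layout.
    return {key: marginal(0, key[0]) + marginal(1, key[1]) - grand for key in table}


fitted = additive_fit(risk)
residual = {key: risk[key] - fitted[key] for key in risk}
# Interaction contrast: zero exactly when the two factors act additively.
interaction = risk[(1, 1)] - risk[(1, 0)] - risk[(0, 1)] + risk[(0, 0)]

for key in sorted(risk):
    print(key, round(fitted[key], 4), round(residual[key], 4))
print("interaction contrast:", round(interaction, 2))
```

In this made-up table the additive fit even assigns a negative risk to the group with neither factor, a sign that the additive form is the wrong shape for the data.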


An extensive study published in Nature Machine Intelligence in June 2021 showcased the capabilities of Adjutorium, a machine learning system for prognostication and treatment benefit prediction developed by our lab. The study, which made unprecedented use of complex, high-quality cancer datasets from the U.K. and U.S., demonstrated that Adjutorium could recommend therapies for breast cancer patients more reliably than methods considered international clinical best practice.

Machine learning to guide the use of adjuvant therapies for breast cancer

Ahmed M. Alaa, Deepti Gurdasani, Adrian L. Harris, Jem Rashbass, Mihaela van der Schaar

Nature Machine Intelligence, 2021



Early diagnosis

For early diagnosis, we use available healthcare data to understand the progression of health and disease trajectories, which is essential for identifying cancer early in a patient.

We use machine learning and the wealth of input data available about the patient (symptoms, clinical findings, imaging results, lab tests, possible treatments given, and the timing of all of these) to issue predictions and forecasts, including early diagnosis of the onset of cancer and, upon diagnosis, the likely severity of disease progression.

For this, we need to use machine learning to build data-driven dynamic forecasting models that are personalized, accurate, and interpretable. These can be used for early diagnosis, as well as (if cancer has been identified) personalized monitoring and forecasting of disease progression.

A hidden absorbing semi-Markov model for informatively censored temporal data: learning and inference

Ahmed Alaa, Mihaela van der Schaar


Attentive State-Space Modeling of Disease Progression

Ahmed Alaa, Mihaela van der Schaar


Dynamic personalized screening

Screening is another critical part of the presentation stage of the cancer pathway. Medicine has been moving from a one-size-fits-all approach towards dynamic personalized screening.

This is the approach we took in our work on DPSCREEN, a technology we developed a few years ago. DPSCREEN takes into account both the features (unique characteristics) of an individual and their past clinical and screening history.

DPSCREEN: Dynamic Personalized Screening

Kartik Ahuja, William Zame, Mihaela van der Schaar


Using (co)morbidities to prevent or identify cancer

At present, morbidities and comorbidities are modeled in a one-size-fits-all, static fashion. This is often based on networks of relationships between these different morbidities.

Using machine learning, we are able to predict the likelihood of an individual developing a new morbidity, such as cancer, in the future. This can be done through the use of morbidity networks that are both personalized (i.e. they depend on the unique characteristics, such as genetic information, of each specific individual) and dynamic (i.e. they depend on the order in which morbidities occur).

Deep diffusion processes (DDP), developed by our lab last year, allow us to model the relationships between comorbid disease onsets expressed through a dynamic graph, meaning we can predict the onset of a new disease.

Learning Dynamic and Personalized Comorbidity Networks from Event Data using Deep Diffusion Processes

Zhaozhi Qian, Ahmed Alaa, Alexis Bellot, Mihaela van der Schaar, Jem Rashbass



Machine learning’s ability to assist with diagnosis has been particularly well-documented—especially with regard to areas such as imaging. In this post, I would like to move beyond those impressive but well-trodden paths, and consider how machine learning can improve a range of overall diagnostic processes and empower the human professionals behind those processes.

Triaging in the diagnosis process

A key priority in cancer diagnosis is managing the workload of radiologists to optimize accuracy, efficiency, and costs. Our challenge here is to ensure that radiologists can devote the right amount of time to viewing scans that actually need their attention, meaning such scans must be separated out from others which can simply be read using machine learning or similar technologies.

MAMMO is a framework for cooperation between radiologists and machine learning. The focus of MAMMO is to triage mammograms between machine learning systems and radiologists.

Improving Workflow Efficiency for Mammography using Machine Learning

Trent Kyono, Fiona J Gilbert, Mihaela van der Schaar
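
The general idea of triaging scans between a model and radiologists can be sketched with a simple confidence threshold. This is an illustrative assumption and not the MAMMO method itself: the function `triage`, the threshold value, and the scores are all hypothetical.

```python
def triage(malignancy_scores, threshold=0.02):
    """Split scans into those left to the model and those sent to a radiologist.

    malignancy_scores maps a scan id to a model-estimated probability of malignancy.
    Scans scoring below the threshold are treated as confidently normal.
    """
    routed = {"machine": [], "radiologist": []}
    for scan_id, score in sorted(malignancy_scores.items()):
        target = "machine" if score < threshold else "radiologist"
        routed[target].append(scan_id)
    return routed


# Hypothetical model outputs for four scans (illustrative values only).
scores = {"scan-1": 0.004, "scan-2": 0.31, "scan-3": 0.015, "scan-4": 0.92}
result = triage(scores)
print(result)
```

In this sketch only confidently normal scans bypass the radiologist; anything uncertain or suspicious is still read by a human. Choosing the threshold is a trade-off between workload saved and cancers missed, and would need clinical validation.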


Determining personalized screening modality

Our lab has also developed a system called ConfidentCare, which, like MAMMO, aims to improve accuracy and efficiency of resource usage within the overall diagnostic process.

ConfidentCare is a machine learning clinical decision support system that identifies what type of screening modality (e.g. mammogram, ultrasound, MRI) should be used for specific individuals, given their unique characteristics such as genomic information or past screening history.

ConfidentCare: A Clinical Decision Support System for Personalized Breast Cancer Screening

Ahmed Alaa, Kyeong H. Moon, William Hsu, Mihaela van der Schaar


Referral and composition of multidisciplinary teams

Determining the composition of multidisciplinary teams (MDTs) can be one of the most complex parts of the diagnosis and treatment process.

This is a process that can be made substantially more efficient and effective through the use of machine learning-enabled recommender systems. These systems can identify which clinicians should come together to best decide on treatment options for a cancer patient, based on particular patient and clinician characteristics.

A few years ago, we built a recommender system that can “discover the experts” by assessing the context of the patient and determining the characteristics required of individual clinicians within the MDT—as well as determining the kind of machine learning decision support tools that should be used by this specific MDT for this specific patient.

Discover the Expert: Context-Adaptive Expert Selection for Medical Diagnosis

Cem Tekin, Onur Atan, Mihaela van der Schaar


Competing risks

We can also use machine learning to analyze competing risks; this can be done not only for one particular type of cancer but also other related cancers (for example, breast cancer and ovarian cancer), or different types of diseases (for example, cancer and cardiovascular disease). This lets us better determine screening profiles by adjusting cause-specific predictions, while also managing and prioritizing preventative treatments further down the line.
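
To make cause-specific prediction concrete, the sketch below computes a classical nonparametric cumulative incidence function (the Aalen-Johansen estimator) for one cause in the presence of another. It is a textbook baseline, not one of our lab's methods; the function name `cumulative_incidence` and the toy data are hypothetical.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Aalen-Johansen estimate of the cumulative incidence of one cause.

    causes uses 0 for censoring and positive integers for event types.
    Returns a list of (event_time, cumulative_incidence) pairs.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0  # overall event-free survival just before the current time
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [c for tt, c in data if tt == t]
        d_all = sum(1 for c in tied if c != 0)
        d_k = sum(1 for c in tied if c == cause_of_interest)
        if d_all > 0:
            cif += surv * d_k / n_at_risk      # mass assigned to this cause
            surv *= 1 - d_all / n_at_risk      # Kaplan-Meier step for any event
            curve.append((t, cif))
        n_at_risk -= len(tied)
        i += len(tied)
    return curve


# Hypothetical follow-up data: 1 and 2 are two competing causes, 0 is censoring.
times = [2, 3, 3, 5, 7, 8]
causes = [1, 2, 0, 1, 0, 2]
cif_1 = cumulative_incidence(times, causes, 1)
cif_2 = cumulative_incidence(times, causes, 2)
print(cif_1[-1], cif_2[-1])
```

Treating the competing cause as ordinary censoring would overstate the incidence of the cause of interest, which is why the overall survival term appears in each increment.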

For this, we have developed a range of methods, some of which are shown below.

Deep Multi-task Gaussian Processes for Survival Analysis with Competing Risks

Ahmed Alaa, Mihaela van der Schaar


DeepHit: A Deep Learning Approach to Survival Analysis With Competing Risks

Changhee Lee, William R. Zame, Jinsung Yoon, Mihaela van der Schaar


Multitask Boosting for Survival Analysis with Competing Risks

Alexis Bellot, Mihaela van der Schaar


Temporal Quilting for Survival Analysis

Changhee Lee, William Zame, Ahmed Alaa, Mihaela van der Schaar


Application of a novel machine learning framework for predicting non-metastatic prostate cancer-specific mortality in men using the Surveillance, Epidemiology, and End Results (SEER) database

Changhee Lee, Alexander Light, Ahmed Alaa, David Thurtle, Mihaela van der Schaar, Vincent J Gnanapragasam



Truly personalized healthcare (which we refer to as “bespoke medicine”) goes far beyond providing predictions for individual patients: we also need to understand the effect of specific treatments on specific patients at specific times. This is what we call individualized treatment effect inference. It is a substantially more complex undertaking than prediction, and every bit as important, particularly in treating a disease like cancer, since no two patients will have the same cancer pathway.

When deciding on a treatment for a given form of cancer, clinical decisions are often made on the basis of results from randomized controlled trials of treatments involving that cancer. This approach assumes a response to treatment based on the response of the “average patient,” rather than taking into account the health history and specific features of the individual.

Rather than making treatment decisions based on such blanket assumptions, the goal of clinical decision-makers has shifted to determining the optimal treatment course for any given patient at any given time. Methods for doing so in a quantitative fashion based on insights from machine learning are in the formative stages of development, and our lab has built a position of leadership in this area. We have defined the research agenda by outlining and addressing key complexities and challenges, and by laying the theoretical groundwork for model development.
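
The gap between the “average patient” and the individual can be illustrated with a toy randomized trial. This is a minimal sketch with made-up numbers, and it stops at subgroup-level effects rather than truly individualized ones; the records, the subgroup labels, and the helper `effect` are hypothetical.

```python
from statistics import mean

# Hypothetical randomized-trial records: (subgroup, treated, outcome).
# Outcomes are illustrative response scores, not real data.
records = [
    ("A", 1, 0.9), ("A", 1, 0.8), ("A", 0, 0.5), ("A", 0, 0.4),
    ("B", 1, 0.3), ("B", 1, 0.2), ("B", 0, 0.5), ("B", 0, 0.4),
]


def effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


average_effect = effect(records)
subgroup_effect = {
    group: effect([r for r in records if r[0] == group]) for group in ("A", "B")
}

print(round(average_effect, 2))
print({group: round(value, 2) for group, value in subgroup_effect.items()})
```

Here the trial-wide average suggests a modest benefit, while one subgroup benefits strongly and the other is harmed. Real data rarely come with the relevant subgroups labeled in advance, and observational data add confounding on top, which is where dedicated methods are needed.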

To read more about individualized treatment effect inference, visit our dedicated page on the topic.

GANITE: Estimation of Individualized Treatment Effects using Generative Adversarial Nets

Jinsung Yoon, James Jordon, Mihaela van der Schaar


Limits of Estimating Heterogeneous Treatment Effects: Guidelines for Practical Algorithm Design

Ahmed Alaa, Mihaela van der Schaar


Forecasting Treatment Responses Over Time Using Recurrent Marginal Structural Networks

Bryan Lim, Ahmed Alaa, Mihaela van der Schaar


Estimating counterfactual treatment outcomes over time through adversarially balanced representations

Ioana Bica, Ahmed Alaa, James Jordon, Mihaela van der Schaar



The lengthy trajectory and complex evolution of cancer over time mean that follow-up care is an especially important part of the patient pathway. Machine learning is well-positioned to predict and help prevent recurrence and relapse, and to support decision making around them.

One of our key projects in this area, temporal phenotyping of disease progression, is outlined below.

Outcome-Oriented Deep Temporal Phenotyping of Disease Progression

Changhee Lee, Jem Rashbass, Mihaela van der Schaar


Further resources

The post above has introduced and explained a range of methods our lab has developed to provide actionable, accurate, and interpretable information at various points along the cancer pathway. Some of these have been integrated into a live demonstrator system based on breast cancer, fed by anonymized real-world data. More details on this project are available in the video below, taken from a presentation given by Mihaela van der Schaar and Dr. Jem Rashbass at the Royal College of Physicians in 2019.

This demonstrator and several related pieces of work are also introduced in an Impact Story published by The Alan Turing Institute.

For a full list of the van der Schaar Lab’s publications, click here.

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.

Nick Maxfield

From 2020 to 2022, Nick oversaw the van der Schaar Lab’s communications, including media relations, content creation, and maintenance of the lab’s online presence.