van der Schaar Lab

Spotlight on cystic fibrosis research projects

Thanks to support from the UK Cystic Fibrosis Trust and its pioneering patient registry, our lab has developed a range of powerful machine learning tools for diagnosis, prognosis, phenotyping, and treatment related to cystic fibrosis.

The most common genetic disease in Caucasian populations, cystic fibrosis is defined by a unique mix of complexities that make the lives of its patients and the task of healthcare professionals particularly unpredictable. As a chronic condition, its progression at times appears almost random due to the potential presence of a variety of (often competing) complications. These can be hard to disentangle, and usually require targeted prevention or mitigation when identified.

While significant progress has been made in understanding this disease and improving the lives of sufferers in recent decades, there is much yet to be done: for example, only about half of those born in the UK with cystic fibrosis (as of 2019) are likely to live to the age of 50. Clinical insights gained through machine learning could reduce the burden of this disease and increase longevity through increasingly personalized treatment and intervention choices, accurate clinical predictions, and accelerated medical discovery.

Cystic fibrosis is fertile ground for exploring machine learning methods, due in part to the creation of the UK Cystic Fibrosis Registry, an extensive database covering 99% of the UK’s cystic fibrosis population, which is managed by the UK Cystic Fibrosis Trust. The Registry holds both static and time-series data for each patient, including demographic information, CFTR genotype, disease-related measures including infection data, comorbidities and complications, lung function, weight, intravenous antibiotics usage, medications, transplantations and deaths.

Turning such rich datasets into medical understanding is a key priority for the future of personalized healthcare. Through our own lab’s ongoing partnership with, and support from, the UK Cystic Fibrosis Trust, we have been able to take the Registry’s data to a completely new level.

This post will highlight and summarize some of our key projects related to cystic fibrosis, including (but not limited to) those in which we have leveraged our extensive partnership with the UK Cystic Fibrosis Trust. Each project targets a number of clinical problem types related to cystic fibrosis; these are detailed below.

Risk assessment and diagnosis

Whether diagnosing cystic fibrosis in the first instance or determining the likelihood of any number of potential risks facing patients, common statistical risk evaluation methods are unable to fully integrate the wealth of information available about each individual. By contrast, machine learning methods are able to handle many more features (offering significant informational gains) and can make better use of feature information by better capturing the potentially complex interactions between features (resulting in modeling gains). This can result in more accurate predictions, and hence better treatment guidance, for the patient at hand.


Cystic fibrosis evolves slowly, allowing for development of comorbidities and bacterial infections, and creating distinct responses to therapeutic interventions. This results in great heterogeneity in terms of potential disease pathways and potential interactions between different comorbidities, often resulting in very diverse patient outcomes, even in narrow patient subgroups. Machine learning techniques for patient phenotyping (supported by sufficient data) can help anticipate patients’ prognoses by identifying “similar” patients, and designing treatment guidelines that are tailored to homogeneous patient subgroups.

Forecasting disease trajectories

Due to the wide availability of modern electronic health records, patient care data is now often stored in the form of time-series data. This is particularly relevant to cystic fibrosis, given the slow evolution of the disease (for example, annual follow ups over multi-year horizons are commonplace). Since biomarkers and other risk factors of cystic fibrosis patients are measured repeatedly over time, prognostic tools powered by machine learning can process the longitudinal trajectory of these biomarkers and help clinical decision-makers better understand the disease and predict multiple events or outcomes over time.

Competing risks and comorbidities

Cystic fibrosis patients suffer from, or are at risk of, multiple diseases or conditions; these risks increase as the patient ages. Machine learning methods can help monitor and treat such patients by predicting which diseases or conditions are likely to occur and at what point, and how the risks for various diseases or conditions change over time. By comparison with commonly used statistical models, machine learning is extremely well-suited to analyses involving multiple competing risks where more than one type of event plays a role in the survival setting.
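To make the idea of competing risks concrete, the sketch below computes a cumulative incidence curve with the classical Aalen-Johansen estimator, a standard statistical baseline rather than one of the lab's machine learning models. The follow-up times and event labels are made up purely for illustration.

```python
from collections import Counter


def cumulative_incidence(times, causes, cause_of_interest):
    """Aalen-Johansen estimate of the cumulative incidence of one cause.

    causes[i] == 0 means patient i was censored; any other value labels
    the type of event observed at times[i].
    Returns a list of (time, cumulative incidence) pairs.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    event_free = 1.0  # probability of having had no event of any type so far
    incidence = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # Count everything recorded at this time point, by type.
        counts = Counter()
        j = i
        while j < len(data) and data[j][0] == t:
            counts[data[j][1]] += 1
            j += 1
        events_any = sum(c for cause, c in counts.items() if cause != 0)
        events_k = counts.get(cause_of_interest, 0)
        # Only patients still event-free can experience the cause of interest.
        incidence += event_free * events_k / n_at_risk
        event_free *= 1.0 - events_any / n_at_risk
        curve.append((t, incidence))
        n_at_risk -= j - i
        i = j
    return curve


# Made-up toy data: years to event, and event type
# (0 = censored, 1 = death, 2 = lung transplant).
years = [1, 2, 3, 4, 5]
event_type = [1, 2, 0, 1, 2]

death_curve = cumulative_incidence(years, event_type, cause_of_interest=1)
transplant_curve = cumulative_incidence(years, event_type, cause_of_interest=2)
```

Because each event type removes patients from the risk set for the others, the two curves together account for everyone who has had an event, which is what distinguishes this setting from ordinary single-outcome survival analysis.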

Personalized monitoring and early warning systems

Cystic fibrosis patients must currently make routine clinical visits even when well, which is inefficient and can adversely impact their lives. With the help of remote monitoring, machine learning can transform this model of care by enabling the provision of comprehensive and high-quality care. Based on integration of all data relevant to an individual, machine learning-enabled systems can offer assessment of (and feedback regarding) patient progress, predictions regarding likely health development or changes, and alerts related to the need for further action or consultation.


A major challenge across the domain of healthcare is ascertaining whether a given intervention influences or determines an outcome. For cystic fibrosis patients, such decisions may commonly involve determining whether there is a survival benefit to prescribing a certain medication, or waitlisting a patient for a lung transplant. In addition to providing accurate predictions and granular risk scores that can quantify the severity of future outcomes, machine learning tools can be used for treatment planning, individualized treatment effect inference, follow-up scheduling, or estimating the time at which a transplant would be needed in the future.

Scientific discovery

Cystic fibrosis is a complex disease that is not yet close to being fully understood. The application of machine learning models can yield new insights into the nature of cystic fibrosis: for example, integrating many features and capturing complex patterns can teach us about the clinical significance of specific features that were not previously believed to be important.

The figure above is a conceptual rendering outlining the process of developing, validating, and deploying tailored machine learning tools that support bespoke medicine and scientific discovery in healthcare.

For a succinct, accessible, and high-level overview of the many opportunities for machine learning to transform care for people with cystic fibrosis, please take a look at a recent article published in the Journal of Cystic Fibrosis by our lab and collaborators.

Prognostication and Risk Factors for Cystic Fibrosis via Automated Machine Learning

Ahmed Alaa, Mihaela van der Schaar
Published in Nature Scientific Reports, 2018

Accurate prediction of survival for cystic fibrosis patients is instrumental in establishing the optimal timing for referring patients with terminal respiratory failure for lung transplantation. Current practice considers referring patients for lung transplantation evaluation once the forced expiratory volume in one second (FEV1) drops below 30% of its predicted nominal value. While FEV1 is indeed a strong predictor of cystic fibrosis-related mortality, we hypothesized that the survival behavior of cystic fibrosis patients exhibits a lot more heterogeneity.

To this end, we developed an algorithmic framework, which we call AutoPrognosis, that leverages the power of machine learning to automate the process of constructing clinical prognostic models, and used it to build a prognostic model for cystic fibrosis using data from a contemporary cohort that involved 99% of the cystic fibrosis population in the UK. AutoPrognosis uses Bayesian optimization techniques to automate the process of configuring ensembles of machine learning pipelines, which involve imputation, feature processing, classification and calibration algorithms. Because it is automated, it can be used by clinical researchers to build prognostic models without the need for in-depth knowledge of machine learning.
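As a rough illustration of what "configuring a pipeline" means, the sketch below searches over a tiny configuration space made of an imputation step and a one-parameter classifier. It is not AutoPrognosis and does not use its API: exhaustive search stands in for Bayesian optimization, the data are made up, and accuracy is measured on the same toy data used for the search.

```python
import itertools
import statistics

# Made-up toy data: one lung-function-like feature with gaps (None)
# and a binary outcome. Not real patient data.
X = [82.0, None, 35.0, 28.0, None, 90.0, 55.0, 25.0]
y = [0, 0, 1, 1, 0, 0, 0, 1]

# Hypothetical configuration space: how to impute, and where to threshold.
IMPUTERS = {"mean": statistics.mean, "median": statistics.median}
THRESHOLDS = [30.0, 40.0, 50.0]


def run_pipeline(imputer_name, threshold):
    """Impute missing values, classify by threshold, return in-sample accuracy."""
    observed = [v for v in X if v is not None]
    fill = IMPUTERS[imputer_name](observed)
    filled = [fill if v is None else v for v in X]
    preds = [1 if v < threshold else 0 for v in filled]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


# Exhaustive search over all (imputer, threshold) configurations.
results = {
    config: run_pipeline(*config)
    for config in itertools.product(IMPUTERS, THRESHOLDS)
}
best_config = max(results, key=results.get)
```

Even in this toy setting the stages interact: the same threshold of 50.0 scores differently depending on whether gaps are filled with the mean or the median, which is why the stages are configured jointly rather than one at a time.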

Our experiments revealed that the accuracy of the model learned by AutoPrognosis is superior to that of existing guidelines and other competing models.

Dynamic-DeepHit: a Deep Learning Approach for Dynamic Survival Analysis with Competing Risks based on Longitudinal Data

Changhee Lee, Jinsung Yoon, Mihaela van der Schaar
Published in IEEE Transactions on Biomedical Engineering, 2020

Currently available risk prediction methods are limited in their ability to deal with complex, heterogeneous, and longitudinal data such as that available in primary care records, or in their ability to deal with multiple competing risks.

This paper develops a novel deep learning approach that is able to successfully address current limitations of standard statistical approaches such as landmarking and joint modeling. Our approach, which we call Dynamic-DeepHit, flexibly incorporates the available longitudinal data comprising various repeated measurements (rather than only the last available measurements) in order to issue dynamically updated survival predictions for one or multiple competing risk(s).

Dynamic-DeepHit learns the time-to-event distributions without the need to make any assumptions about the underlying stochastic models for the longitudinal and the time-to-event processes. Thus, unlike existing works in statistics, our method is able to learn data-driven associations between the longitudinal data and the various associated risks without underlying model specifications.
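The sketch below shows one way a predicted discrete time-to-event distribution can be turned into a dynamically updated, cause-specific risk: the probability mass after the last measurement is renormalized by the probability of having had no event up to that point. The function name, the event names, and the probabilities are all hypothetical, and in a real model the distribution itself would be re-predicted each time new measurements arrive.

```python
def conditional_cif(pmf, cause, last_obs_time, horizon):
    """Risk of `cause` in (last_obs_time, horizon], given no event so far.

    pmf[cause][t] is the predicted probability that the first event is of
    type `cause` and occurs at time step t. Any mass not listed corresponds
    to no event within the prediction window.
    """
    # Probability that some event had already happened by last_obs_time.
    already = sum(
        p
        for probs in pmf.values()
        for t, p in enumerate(probs)
        if t <= last_obs_time
    )
    # Probability that this cause occurs in the window of interest.
    in_window = sum(
        p for t, p in enumerate(pmf[cause]) if last_obs_time < t <= horizon
    )
    return in_window / (1.0 - already)


# Made-up predicted distribution over years 0..4 for two competing events.
pmf = {
    "respiratory_failure": [0.00, 0.02, 0.03, 0.05, 0.10],
    "other_causes": [0.00, 0.01, 0.01, 0.02, 0.02],
}

# Risk of respiratory failure by year 4, for a patient last seen
# event-free at year 2, and again once they are seen event-free at year 3.
risk_after_year_2 = conditional_cif(pmf, "respiratory_failure", 2, 4)
risk_after_year_3 = conditional_cif(pmf, "respiratory_failure", 3, 4)
```

Nothing here assumes a parametric form for the hazard: the distribution is just a table of probabilities, which is what lets a flexible model shape it freely.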

We demonstrate the power of our approach by applying it to a real-world longitudinal dataset from the UK Cystic Fibrosis Registry, which includes a heterogeneous cohort of 5883 adult patients with annual follow-ups between 2009 and 2015. The results show that Dynamic-DeepHit provides a drastic improvement in discriminating individual risks of different forms of failures due to cystic fibrosis.

Furthermore, our analysis utilizes post-processing statistics that provide clinical insight by measuring the influence of each covariate on risk predictions and the temporal importance of longitudinal measurements, thereby enabling us to identify covariates that are influential for different competing risks.

Attentive State-Space Modeling of Disease Progression

Ahmed Alaa, Mihaela van der Schaar
NeurIPS 2019

Models of disease progression are instrumental for predicting patient outcomes and understanding disease dynamics. Existing models provide the patient with pragmatic (supervised) predictions of risk, but do not provide the clinician with intelligible (unsupervised) representations of disease pathology.

In this paper, we develop the attentive state-space model, a deep probabilistic model that learns accurate and interpretable structured representations for disease trajectories. Unlike Markovian state-space models, in which state dynamics are memoryless, our model uses an attention mechanism to create “memoryful” dynamics, whereby attention weights determine the dependence of future disease states on past medical history. To learn the model parameters from medical records, we develop an inference algorithm that jointly learns a compiled inference network and the model parameters, leveraging the attentive representation to construct a variational approximation of the posterior state distribution.
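A minimal sketch of the "memoryful" idea follows, assuming a simplified reading in which the next-state distribution is an attention-weighted mixture of baseline transition rows, one for each past state. The three-state transition matrix and the attention scores are made up; in the model described above the weights are learned from the patient's history.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical baseline transition matrix over three disease states
# (each row sums to 1).
P = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.00, 0.10, 0.90],
]


def next_state_distribution(past_states, attention_scores):
    """Mix the transition rows of all past states using attention weights."""
    weights = softmax(attention_scores)
    dist = [0.0] * len(P)
    for weight, state in zip(weights, past_states):
        for k, prob in enumerate(P[state]):
            dist[k] += weight * prob
    return dist


# A patient who visited states 0, 1, 2 in order, with most attention
# placed on the most recent visit.
memoryful = next_state_distribution([0, 1, 2], [0.1, 0.5, 2.0])

# With only the latest state in view, the model is an ordinary Markov chain.
markov = next_state_distribution([2], [0.0])
```

Putting all of the attention on the most recent state recovers memoryless Markov dynamics as a special case, so the attention weights can be read as a measure of how much each past visit still matters.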

Experiments on data from the UK Cystic Fibrosis Registry show that our model demonstrates superior predictive accuracy, in addition to providing insights into disease progression dynamics.

Disease-Atlas: Navigating Disease Trajectories with Deep Learning

Bryan Lim, Mihaela van der Schaar
MLHC 2018

Note: The UK Cystic Fibrosis Registry was one of the two real-world medical datasets on which we conducted experiments to investigate the performance of our approach. In our investigations, we considered a joint model for 2 continuous lung function scores (FEV1 and Predicted FEV1), 20 comorbidity and infection risks (treated as binary longitudinal observations) as well as death as the event of interest, simultaneously forecasting them all at each time step.

Joint models for longitudinal and time-to-event data are commonly used in longitudinal studies to forecast disease trajectories over time. While there are many advantages to joint modeling, the standard forms suffer from limitations that arise from a fixed model specification and computational difficulties when applied to high-dimensional datasets.

In this paper, we propose a deep learning approach to address these limitations, enhancing existing methods with the inherent flexibility and scalability of deep neural networks while retaining the benefits of joint modeling.

Using longitudinal data from two real-world medical datasets, we demonstrate improvements in performance and scalability, as well as robustness in the presence of irregularly sampled data.

Temporal Phenotyping using Deep Predictive Clustering of Disease Progression

Changhee Lee, Mihaela van der Schaar
ICML 2020

Note: The UK Cystic Fibrosis Registry was one of the two real-world medical datasets on which we conducted experiments to investigate the performance of our model. The Registry was included due to the richness of the data and inclusion of comorbidity diagnosis information: at each time stamp, we set the development of different comorbidities in the subsequent year as the label of interest.

Due to the wider availability of modern electronic health records, patient care data is now often stored in the form of time-series. Clustering such time-series data is crucial for patient phenotyping, anticipating patients’ prognoses by identifying “similar” patients, and designing treatment guidelines that are tailored to homogeneous patient subgroups.

In this paper, we develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest (e.g., adverse events, the onset of comorbidities). To encourage each cluster to have homogeneous future outcomes, the clustering is carried out by learning discrete representations that best describe the future outcome distribution based on novel loss functions.

Experiments on two real-world datasets show that our model achieves superior clustering performance over state-of-the-art benchmarks and identifies meaningful clusters that can be translated into actionable information for clinical decision-making.

Application of Kernel Hypothesis Testing on Set-valued Data

Alexis Bellot, Mihaela van der Schaar

The rate of progression of lung function decline and response to treatments are heterogeneous across cystic fibrosis individuals. Robust methods that could identify subgroups with differing lung function trajectories or responses to treatments would deliver enormous insights into the pathophysiology of this disease, and provide opportunities for personalised therapy.

While standard hypothesis testing can be used to define such significant differences, most tests are restricted to static data and cannot test for differences between trajectories themselves. We have therefore developed an AI-based hypothesis test designed to compare irregularly sampled time series to solve this problem.

Our test is designed to encode the uncertainty between observations by interpreting the lung function trajectory of each patient as a probability distribution, and makes comparisons between these probability distributions to provably recover significant differences in the lung function of different groups.
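The sketch below illustrates the general idea with a simplified kernel two-sample test, not the exact method from this project. Each trajectory is treated as a set of (time, value) points and represented by its kernel mean embedding, two groups are compared with the maximum mean discrepancy (MMD), and a p-value is obtained by permutation. The Gaussian kernel, its bandwidth, the linear kernel between embeddings, and the simulated data are all assumptions made for illustration.

```python
import math
import random


def point_kernel(a, b, bandwidth=10.0):
    """Gaussian kernel between two (time, value) observations."""
    sq_dist = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def set_kernel(traj_a, traj_b):
    """Inner product of the kernel mean embeddings of two trajectories.

    Works for trajectories with different numbers of observations taken
    at different times, so no resampling onto a common grid is needed.
    """
    total = sum(point_kernel(a, b) for a in traj_a for b in traj_b)
    return total / (len(traj_a) * len(traj_b))


def mmd_squared(gram, idx_x, idx_y):
    """Biased estimate of squared MMD between two groups of trajectories."""
    def block_mean(rows, cols):
        return sum(gram[i][j] for i in rows for j in cols) / (len(rows) * len(cols))

    return (block_mean(idx_x, idx_x) + block_mean(idx_y, idx_y)
            - 2.0 * block_mean(idx_x, idx_y))


def permutation_test(group_x, group_y, n_perm=200, seed=0):
    """Return (observed MMD^2, permutation p-value) for two groups."""
    pooled = group_x + group_y
    n = len(pooled)
    gram = [[set_kernel(pooled[i], pooled[j]) for j in range(n)] for i in range(n)]
    idx = list(range(n))
    n_x = len(group_x)
    observed = mmd_squared(gram, idx[:n_x], idx[n_x:])
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)  # randomly reassign trajectories to the two groups
        if mmd_squared(gram, idx[:n_x], idx[n_x:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


def simulate_trajectory(rng, slope):
    """Synthetic lung-function-like trajectory with irregular visit times."""
    n_obs = rng.randint(3, 6)
    times = sorted(rng.uniform(0.0, 5.0) for _ in range(n_obs))
    return [(t, 80.0 + slope * t + rng.gauss(0.0, 2.0)) for t in times]


rng = random.Random(0)
slow_decline = [simulate_trajectory(rng, slope=-1.0) for _ in range(12)]
fast_decline = [simulate_trajectory(rng, slope=-5.0) for _ in range(12)]
mmd, p_value = permutation_test(slow_decline, fast_decline)
```

Because every trajectory is compared as a set of points rather than as a fixed-length vector, patients with different numbers of visits at different times can be compared directly.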

We then took retrospective longitudinal data from the UK Cystic Fibrosis Registry to systematically test for differences in lung function decline. The Registry contains regular measurements of lung function as well as clinical metadata for nearly all cystic fibrosis individuals in the UK. Our analysis partitioned the population into subgroups defined by these variables and compared their respective lung progression.

We demonstrate gains in power (i.e. the proportion of true differences between two synthetically generated populations that the test correctly detects) of up to 50% compared to tests that do not capture the uncertainty between observations, but rather consider each trajectory as a fixed vector of observations. We use synthetic data to quantitatively evaluate our test because the ground truth aetiology of cystic fibrosis is most often unknown.

Using real cystic fibrosis data, our method identifies several important subgroups (some already recognised) and detects significant differences in disease progression between these newly defined groupings.

Our new method provides a principled way to systematically analyse patient progression and to improve our understanding of the variable effect of new drugs, such as Ivacaftor, on future lung progression.

Clairvoyance: a Pipeline Toolkit for Medical Time Series

Daniel Jarrett, Jinsung Yoon, Ioana Bica, Zhaozhi Qian, Ari Ercole, Mihaela van der Schaar
ICLR 2021

Note: When validating Clairvoyance, we specifically sought to include experiments using datasets from time-series environments that reflect the heterogeneity of realistic use cases envisioned for Clairvoyance. The UK Cystic Fibrosis Registry was an obvious choice in this context, since individuals in the registry are chronic patients monitored over infrequent visits, and for whom long-term decline is generally expected.

Time-series learning is the bread and butter of data-driven clinical decision support, and the recent explosion in ML research has demonstrated great potential in various healthcare settings.

At the same time, medical time-series problems in the wild are challenging due to their highly composite nature: They entail design choices and interactions among components that preprocess data, impute missing values, select features, issue predictions, estimate uncertainty, and interpret models. Despite exponential growth in electronic patient data, there is a remarkable gap between the potential and realized utilization of ML for clinical research and decision support. In particular, orchestrating a real-world project lifecycle poses challenges in engineering (i.e. hard to build), evaluation (i.e. hard to assess), and efficiency (i.e. hard to optimize).

Designed to address these issues simultaneously, Clairvoyance proposes a unified, end-to-end, autoML-friendly pipeline that serves as a (i) software toolkit, (ii) empirical standard, and (iii) interface for optimization. Our ultimate goal lies in facilitating transparent and reproducible experimentation with complex inference workflows, providing integrated pathways for (1) personalized prediction, (2) treatment-effect estimation, and (3) information acquisition.

Through illustrative examples on real-world data in outpatient, general wards, and intensive-care settings, we illustrate the applicability of the pipeline paradigm on core tasks in the healthcare journey. To the best of our knowledge, Clairvoyance is the first to demonstrate viability of a comprehensive and automatable pipeline for clinical time-series ML.

The substantial body of work presented above would not have been possible without the generous support of the UK Cystic Fibrosis Trust, or without their pioneering work and vision in creating the UK Cystic Fibrosis Registry.

If you are a clinician and would like to learn more about how machine learning can be applied to real-world healthcare problems, please sign up for our Revolutionizing Healthcare online engagement sessions (no machine learning knowledge required).

For a full list of the van der Schaar Lab’s publications, click here.

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.

Nick Maxfield

From 2020 to 2022, Nick oversaw the van der Schaar Lab’s communications, including media relations, content creation, and maintenance of the lab’s online presence.