van der Schaar Lab


Please note: this page is a work in progress. Please treat it as a “stub” containing only basic information, rather than a full-fledged summary of our lab’s vision for clustering and our research to date.

Clustering patients (also referred to on this page as phenotyping and subgroup identification) is an important challenge that becomes particularly complicated in a dynamic setting where longitudinal datasets are in use. This page provides an overview of our lab’s work to date on clustering, with a special focus on our research on outcome-oriented clustering.

From unsupervised clustering to outcome-oriented clustering

The conventional notion of clustering seeks to group patients together in an unsupervised manner, based on their static or longitudinal features (covariates). However, unsupervised clustering does not account for patients’ observed outcomes (such as adverse events or the onset of comorbidities), and thus often leads to heterogeneous outcomes within a given cluster. This type of clustering therefore yields information that is of relatively limited use to clinicians and patients. After all, chronic diseases such as cancer, cystic fibrosis and dementia are heterogeneous in nature, with widely differing outcomes even when patients’ features seem relatively similar.

What clinicians and patients actually need to know is which types of events (including events related to competing risks) are likely to occur in the future, given the observations (features) recorded so far. We are, therefore, interested in a type of clustering or phenotyping in which patients are grouped based on similarity of future outcomes, rather than solely on similarity of observations.

One of our lab’s first projects to address this shortcoming was the “tree of predictors” (ToPs), an ensemble method first published in 2018.

Working in the supervised setting, ToPs captures the heterogeneity of the population by learning automatically from the data which features have the most predictive power and which have the most discriminative power for each time horizon. ToPs uses this knowledge to create clusters of patients and a specific predictive model for each cluster. Both the clusters that are identified and the predictive models that are applied to them are readily interpretable.

ToPs differs from existing methods in that it discovers clusters in a data-driven manner, and then constructs and applies different predictive models to the discovered clusters. While conventional tree-based approaches create successive clusters (splits) of the feature space in order to maximize the homogeneity of each cluster with respect to labels, ToPs creates successive splits in order to maximize the predictive accuracy of the predictive model constructed for each cluster. To this end, ToPs creates a tree of clusters (subsets of the feature space) and associates a predictive model with each such cluster (subset).
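The difference in split criterion can be illustrated with a minimal sketch. It is not the ToPs algorithm as published: it assumes a single numeric feature, a least-squares line as the only candidate predictive model, and squared error on a held-out validation set as the accuracy measure. The names fit_line, predict and best_split are hypothetical and chosen for illustration only.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept (constant if x has no spread)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

def best_split(train, valid, min_size=2):
    """Choose the threshold whose two per-cluster models predict best on `valid`.

    Returns (threshold, left_model, right_model, validation_loss).
    A threshold of None means no split beat the single root model.
    """
    tx, ty = zip(*train)
    root = fit_line(tx, ty)
    root_loss = sum((predict(root, x) - y) ** 2 for x, y in valid)
    best = (None, root, root, root_loss)
    for t in sorted(set(tx)):
        left = [(x, y) for x, y in train if x < t]
        right = [(x, y) for x, y in train if x >= t]
        if len(left) < min_size or len(right) < min_size:
            continue
        # Fit one predictive model per candidate cluster.
        left_model = fit_line(*zip(*left))
        right_model = fit_line(*zip(*right))
        # Score the split by predictive error, not by label homogeneity.
        loss = sum(
            (predict(left_model if x < t else right_model, x) - y) ** 2
            for x, y in valid
        )
        if loss < best[3]:
            best = (t, left_model, right_model, loss)
    return best

# Toy data (assumed): the outcome is |x|, so one line fits poorly
# but two lines, one per side of zero, fit exactly.
train = [(x, abs(x)) for x in range(-5, 6)]
valid = [(x + 0.5, abs(x + 0.5)) for x in range(-5, 5)]
threshold, left_model, right_model, loss = best_split(train, valid)
```

On this toy data neither side of the chosen split is homogeneous in its labels, yet each side is predicted exactly by its own model, which is the distinction the paragraph above draws. Applying best_split recursively to each side would yield a tree of clusters with one model per node.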

Reference: how ToPs clusters patients (cardiac transplantation survival example)

Abstracts and papers related to the lab’s work on ToPs (including an application of ToPs to the problem of cardiac transplantation) can be found below.

ToPs: Ensemble Learning with Trees of Predictors

Jinsung Yoon, William R. Zame, Mihaela van der Schaar

IEEE Transactions on Signal Processing, 2018


See also: Personalized survival predictions via Trees of Predictors: An application to cardiac transplantation
(Jinsung Yoon, William R. Zame, Amitava Banerjee, Martin Cadeiras, Ahmed M. Alaa, Mihaela van der Schaar; PLoS ONE, 2018)

Outcome-oriented clustering in the time series setting

Temporal clustering has recently been used as a data-driven framework to partition patients with time-series observations into subgroups. Recent research has typically focused either on finding fixed-length, low-dimensional representations or on modifying the similarity measure, in both cases in an attempt to apply existing clustering algorithms to time-series observations.

Identifying patient subgroups with similar progression patterns can be advantageous for understanding heterogeneous diseases. This allows clinicians to anticipate patients’ prognoses by comparing them to “similar” patients, and to design treatment guidelines tailored to homogeneous subgroups.

Our lab has developed a method for temporal phenotyping of this kind using deep predictive clustering of disease progression, as presented at ICML 2020. This provides a notion of temporal phenotyping that is predictive of similar future outcomes, on the basis of which doctors and patients can actively plan. The focus here is on learning discrete representations of past observations that best describe and predict future events and outcomes of interest.
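The idea of grouping patients by what is predicted to happen to them, rather than by how their observations look, can be sketched as follows. This is not the method from the ICML 2020 paper: it assumes that a predicted outcome distribution is already available for each patient (in practice it would come from a model of the patient's history), and it swaps in a simple k-means-style loop that uses KL divergence as the dissimilarity. The names kl and cluster_by_outcome, and all numbers, are hypothetical.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete outcome distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cluster_by_outcome(predicted, init_centroids, n_iter=10):
    """Alternate between assignment and centroid update in outcome space.

    predicted: one predicted outcome distribution per patient.
    Returns (assignments, centroids); each centroid is the average
    outcome distribution of the patients assigned to it.
    """
    centroids = [list(c) for c in init_centroids]  # do not mutate the input
    assignments = []
    for _ in range(n_iter):
        # Assign each patient to the centroid with the closest outcome profile.
        assignments = [
            min(range(len(centroids)), key=lambda k: kl(p, centroids[k]))
            for p in predicted
        ]
        # Re-estimate each centroid from its current members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(predicted, assignments) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Assumed toy predictions over two outcomes: (event, no event).
predicted = [
    [0.90, 0.10], [0.85, 0.15], [0.80, 0.20],
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
]
assignments, centroids = cluster_by_outcome(predicted, [[0.6, 0.4], [0.4, 0.6]])
```

Each resulting cluster index is a discrete label attached to a patient, and each centroid summarizes the future outcomes that the label stands for.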

Temporal phenotyping using deep predictive clustering of disease progression

Changhee Lee, Mihaela van der Schaar

ICML 2020


Further reference: presentation by Changhee Lee on temporal phenotyping (2021 van der Schaar Lab open house)

Further reference: Application by Changhee Lee et al.

Outcome-oriented deep temporal phenotyping of disease progression

Changhee Lee, Jem Rashbass, Mihaela van der Schaar

IEEE Transactions on Biomedical Engineering, 2020


Associating outcome-oriented subgroups with longitudinal patterns

Outcome-oriented clusters can capture transitions in disease progression and allow clinicians to investigate the associated longitudinal patterns in patient trajectories. Existing temporal clustering approaches generally focus on discovering patient subgroups based solely on their clinical status or outcome, which limits the prognostic value of the discovered clusters because it neglects the heterogeneity of longitudinal patterns within each subgroup.

To understand the full picture of disease progression, which manifests through heterogeneous temporal characteristics in disease trajectories, it is desirable to identify unique associations between longitudinal patterns and clinical outcomes. This provides greater diagnostic value to clinicians and enables tailored treatments with reference to “similar” patients who share both similar outcomes and homogeneous disease progression patterns over time.

To complement outcome-oriented clustering, our lab has developed a novel temporal clustering method, published at AISTATS 2023, that uncovers predictive temporal patterns descriptive of the underlying disease progression from labeled time-series data. This approach not only identifies clusters that have prognostic value but also offers interpretable information about disease progression patterns. This is achieved through constrained outcome-oriented clustering on a similarity graph which captures heterogeneities in the disease trajectories of individual patients.
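A toy illustration of combining trajectory similarity with an outcome constraint is given below. It is not the T-Phenotype algorithm: it assumes equal-length trajectories of a single biomarker, uses mean absolute difference as a stand-in distance, treats agreement of outcome labels as the constraint, and takes connected components of the resulting graph as clusters. The names trajectory_distance and graph_clusters, the threshold and the data are all hypothetical.

```python
def trajectory_distance(a, b):
    """Mean absolute difference between two equal-length trajectories."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def graph_clusters(trajectories, outcomes, max_dist):
    """Cluster patients as connected components of a constrained similarity graph.

    Two patients are linked only if their trajectories are close AND
    their outcome labels agree.
    """
    n = len(trajectories)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            same_outcome = outcomes[i] == outcomes[j]
            close = trajectory_distance(trajectories[i], trajectories[j]) <= max_dist
            if same_outcome and close:
                neighbors[i].append(j)
                neighbors[j].append(i)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        # Depth-first traversal labels everything reachable from `start`.
        labels[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for other in neighbors[node]:
                if labels[other] is None:
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels

# Assumed toy data: one biomarker measured at three visits, plus an outcome label.
trajectories = [
    [1.0, 1.1, 1.2],  # slow rise, outcome 1
    [1.0, 1.2, 1.3],  # slow rise, outcome 1
    [1.0, 2.0, 3.0],  # fast rise, outcome 1
    [1.0, 1.1, 1.1],  # slow rise, outcome 0
    [1.1, 2.1, 3.1],  # fast rise, outcome 1
]
outcomes = [1, 1, 1, 0, 1]
labels = graph_clusters(trajectories, outcomes, max_dist=0.2)
```

In this toy setting, patients who share an outcome but follow different temporal patterns end up in separate clusters, as do patients who follow a similar pattern but have different outcomes.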

T-Phenotype: Discovering Phenotypes of Predictive Temporal Patterns in Disease Progression

Yuchao Qin, Mihaela van der Schaar, Changhee Lee

AISTATS 2023



Other work on clustering and subtyping

New approaches to clustering and subtyping also feature in some of our lab’s earlier research. For example, the paper below introduces a personalized risk scoring method that learns a set of latent patient subtypes from offline electronic health record data, and trains a mixture of Gaussian process experts. Each expert models the physiological data streams associated with a specific patient subtype. Transfer learning techniques are used to learn the relationship between a patient’s latent subtype and static admission information (e.g., age, gender, transfer status and ICD-9 codes).
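The way subtype membership and per-subtype experts combine into a single score can be sketched with a generic mixture-of-experts weighting. This is only an illustration under strong assumptions: the experts below are simple placeholder functions rather than Gaussian processes, the gating is a softmax over a linear function of the static features, and every name (softmax, subtype_probs, mixture_risk) and number is hypothetical rather than taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def subtype_probs(static, gate_weights):
    """Probability of each latent subtype given static admission features."""
    scores = [sum(w * x for w, x in zip(row, static)) for row in gate_weights]
    return softmax(scores)

def mixture_risk(static, stream, gate_weights, experts):
    """Risk score as the subtype-weighted average of the experts' scores."""
    probs = subtype_probs(static, gate_weights)
    return sum(p * expert(stream) for p, expert in zip(probs, experts))

# Placeholder experts (assumed): each maps a physiological stream to a
# score in [0, 1], with a different sensitivity per subtype.
experts = [
    lambda stream: min(1.0, stream[-1] / 200.0),
    lambda stream: min(1.0, stream[-1] / 100.0),
]
gate_weights = [[2.0, 0.0], [0.0, 2.0]]  # one row of weights per subtype
static = [1.0, 0.0]                      # assumed encoded admission features
stream = [80.0, 90.0, 100.0]             # assumed vital-sign readings

probs = subtype_probs(static, gate_weights)
risk = mixture_risk(static, stream, gate_weights, experts)
```

Because the gating weights sum to one, the combined score always lies between the lowest and highest expert score for the same stream.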

Personalized Risk Scoring for Critical Care Prognosis using Mixtures of Gaussian Processes

Ahmed M Alaa, Jinsung Yoon, Scott Hu, Mihaela van der Schaar

IEEE Transactions on Biomedical Engineering, 2017


Learn more and get involved

Our research related to clustering is closely linked to a number of our lab’s other core areas of focus. If you’re interested in branching out from clustering, we’d recommend reviewing our summaries on time series in healthcare and survival analysis, competing risks, and comorbidities.

We would also encourage you to stay up-to-date with ongoing developments in this and other areas of machine learning for healthcare by signing up to take part in one of our two streams of online engagement sessions.

If you are a practicing clinician, please sign up for Revolutionizing Healthcare, which is a forum for members of the clinical community to share ideas and discuss topics that will define the future of machine learning in healthcare (no machine learning experience required).

If you are a machine learning student, you can join our Inspiration Exchange engagement sessions, in which we introduce and discuss new ideas and development of new methods, approaches, and techniques in machine learning for healthcare.

A full list of our papers on this and related topics can be found here.