van der Schaar Lab

Self-supervised, semi-supervised, and multi-view learning

This page offers a basic introduction to our lab’s work around self-supervised, semi-supervised and multi-view learning. This is an important and evolving research area that we are continuing to flesh out over time.

After explaining the difficulties encountered when working with tabular datasets containing largely unlabeled data, we will show how the genomics and healthcare settings differ from domains in which self-supervised learning (and machine learning more broadly) has seen success so far.

We will then introduce a range of our lab’s own self-supervised, semi-supervised and multi-view learning approaches, and will go on to show how such approaches can both enable personalized medicine and drive scientific discovery.

The content of this page is designed to be accessible and useful to a wide range of readers, from machine learning novices to experts.

You can find our publications on self-supervised learning, semi-supervised learning, and multi-view learning, as well as other research areas, here.

This page is one of several introductions to areas that we see as “research pillars” for our lab. It is a living document, and the content here will evolve as we continue to reach out to the machine learning and healthcare communities, building a shared vision for the future of healthcare.

Our primary means of building this shared vision is through two groups of online engagement sessions: Inspiration Exchange (for the machine learning community) and Revolutionizing Healthcare (for the healthcare community). If you would like to get involved, please visit the page below.

This page is authored and maintained by Mihaela van der Schaar, Nick Maxfield, and Fergus Imrie.

Why do we need self-supervised learning?

Fitting a model to a dataset is considerably easier when labeled data is plentiful. In many circumstances, however, access to labeled data is limited, which makes learning a performant model much harder and carries a high risk of overfitting the limited quantity of data. This is particularly common in healthcare, genomics, and other omics settings, and the problem is compounded by the fact that obtaining additional samples is either costly or altogether impossible.

In such situations, therefore, using unlabeled data effectively is of critical importance. This is where self-supervised learning comes into play. We can adopt a two-stage training process: first, a model is pre-trained using unlabeled data (self-supervised learning); next, the insights learned on the unlabeled data are transferred to a smaller labeled dataset, at which point the model is fine-tuned using the limited quantity of labeled data.

This is a particularly effective approach in domains such as genomics and healthcare at large, where datasets frequently contain a small amount of labeled data and a large amount of unlabeled data. The U.K.’s 100,000 Genomes Project, for instance, sequenced 100,000 genomes from around 85,000 NHS patients affected by a rare disease or cancer. By definition, a rare disease affects fewer than 1 in 2,000 people. Datasets like these present huge opportunities for self- and semi-supervised learning algorithms, which can leverage the unlabeled data to further improve the performance of a predictive model.
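The two-stage process described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not our lab's implementation: the toy data generator, the linear encoder and decoder, the 30% masking rate, and the learning rates are all assumptions chosen for brevity, and the "fine-tuning" stage is simplified to fitting a small logistic head on top of a frozen encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
n_unlabeled, n_labeled, d, k = 2000, 30, 10, 2
mixing = rng.normal(size=(k, d)) / np.sqrt(k)  # hypothetical generative structure


def sample(n):
    """Toy tabular data: k hidden factors drive d observed features."""
    z = rng.normal(size=(n, k))
    x = z @ mixing + 0.1 * rng.normal(size=(n, d))
    y = (z[:, 0] > 0).astype(float)  # the label depends on one hidden factor
    return x, y


x_unlabeled, _ = sample(n_unlabeled)      # labels deliberately discarded
x_labeled, y_labeled = sample(n_labeled)  # small labeled set

# Stage 1 (self-supervised): learn to reconstruct inputs from masked copies.
w_enc = 0.1 * rng.normal(size=(d, k))
w_dec = 0.1 * rng.normal(size=(k, d))
pretrain_losses = []
for _ in range(1000):
    keep = rng.random(x_unlabeled.shape) > 0.3  # hide roughly 30% of entries
    x_in = x_unlabeled * keep
    h = x_in @ w_enc
    err = h @ w_dec - x_unlabeled
    pretrain_losses.append(0.5 * np.mean(np.sum(err ** 2, axis=1)))
    grad_dec = h.T @ err / n_unlabeled
    grad_enc = x_in.T @ (err @ w_dec.T) / n_unlabeled
    w_dec -= 0.01 * grad_dec
    w_enc -= 0.01 * grad_enc

# Stage 2 (supervised): fit a logistic head on the frozen encoder's output.
h_labeled = x_labeled @ w_enc
w_head, b_head = np.zeros(k), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(h_labeled @ w_head + b_head)))
    w_head -= 0.1 * h_labeled.T @ (p - y_labeled) / n_labeled
    b_head -= 0.1 * np.mean(p - y_labeled)


def predict_proba(x):
    """Predicted probability of the positive class for new samples."""
    return 1.0 / (1.0 + np.exp(-(x @ w_enc @ w_head + b_head)))
```

The encoder never sees a label in stage 1, so the 30 labeled samples are only asked to fit the k + 1 parameters of the head rather than a full model over all d features.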

Can’t we just use existing self-supervised learning?

Naturally, self-supervised learning is not being introduced here as a new concept; in fact, it is an approach that has seen a great deal of success already in areas including image recognition (computer vision) and natural language processing.

The key differences between such areas and applications such as genomics and healthcare are data modality and inherent structure. Image recognition relies on spatial structure, and natural language processing on semantic structure: the spatial correlations between pixels in images or the sequential correlations between words in text data are well-known and consistent across different datasets.

By contrast, healthcare and genomics problems primarily involve high-dimensional tabular data (the most common data type in the real world). The inherent structures among features in tabular datasets are unknown and vary across datasets. In other words, there is no “common” correlation structure in tabular data (unlike in image and text data). This makes self- and semi-supervised learning on tabular data more challenging: the structure must be learned explicitly for each dataset, which is a fundamentally different challenge from image recognition or natural language processing.

Building self-supervised approaches for tabular datasets

As explained above, the genomics and general healthcare settings pose two challenges: the lack of labels, and the lack of common structures within tabular data. Combined, these challenges require us to develop new approaches to self-supervised learning that are effective when working with tabular data.

One such approach developed by our lab is value imputation and mask estimation (VIME), which was introduced in a paper published at NeurIPS 2020.

VIME adopts a novel self- and semi-supervised learning framework. First, VIME’s encoder function learns to construct informative representations from the raw features in the unlabeled data. To do so, VIME 1) generates corrupted samples by randomly masking features of the unlabeled data points, and then 2) trains the encoder to predict which features were masked and to impute their original values. To succeed at these tasks, the encoder must learn how the features are related to one another, and in the process it learns the inherent structure of the data. This makes subsequent learning on the labeled data easier.

VIME proposes a new pretext task to recover the mask vector in addition to the original sample, with a novel corrupted sample generation scheme. VIME also incorporates a novel tabular data augmentation scheme that can be combined with various contrastive learning frameworks, to extend self-supervised learning to tabular domains such as genomic and clinical data.
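The corruption step and the two pretext targets described above can be sketched as follows. This is an illustrative reading of the description, not the reference implementation: the function names, the masking probability `p_mask`, the weight `alpha`, and the choice to replace masked entries with values drawn from the same column of other samples are assumptions made for this example.

```python
import numpy as np


def corrupt(x, p_mask, rng):
    """Mask random entries of x and replace them with values from the same column.

    Returns the mask label (1 where an entry actually changed) and the
    corrupted samples.
    """
    n, d = x.shape
    mask = rng.binomial(1, p_mask, size=(n, d))
    # Shuffle each column independently so replacements follow the
    # empirical marginal distribution of that feature.
    x_bar = np.empty_like(x)
    for j in range(d):
        x_bar[:, j] = x[rng.permutation(n), j]
    x_tilde = np.where(mask == 1, x_bar, x)
    mask_label = (x != x_tilde).astype(int)
    return mask_label, x_tilde


def pretext_loss(mask_pred, x_pred, mask_label, x, alpha=1.0):
    """Mask estimation (cross-entropy) plus value imputation (squared error)."""
    p = np.clip(mask_pred, 1e-7, 1 - 1e-7)
    bce = -np.mean(mask_label * np.log(p) + (1 - mask_label) * np.log(1 - p))
    mse = np.mean((x_pred - x) ** 2)
    return bce + alpha * mse


rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
mask_label, x_tilde = corrupt(x, p_mask=0.3, rng=rng)
```

An encoder trained on these tasks would take `x_tilde` as input, with `mask_label` and the clean `x` as its two targets; no outcome labels are involved at this stage.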

As outlined below, we evaluated VIME against a range of self-supervised, semi-supervised, and supervised methods on several tabular datasets from different application domains, including genomics (for example, from U.K. Biobank) and clinical data. VIME outperformed the existing baseline methods, achieving state-of-the-art performance.

Evaluating VIME on genome-wide polygenic scoring

VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

Jinsung Yoon, Yao Zhang, James Jordon, Mihaela van der Schaar

NeurIPS 2020


Making discoveries with feature selection

So far, this page has explained the potential of self-supervised learning for prediction in genomics, broader omics applications, and healthcare at large. Our description has been limited to standard tasks involving the prediction of outcomes on the basis of features, which (while certainly valuable) provides no additional insights.

We could, for example, vastly improve our understanding of the outcome of interest by determining which features led to that outcome; this is known as feature selection (for more information on feature selection, see the related section in our research pillar page on interpretable machine learning).

Feature selection can provide very powerful insights that drive scientific discovery. For example, while next-generation sequencing can detect the expression of tens of thousands of genes per sample, many genetic disorders stem from the variation in only a few groups of related genes. Identification of such disease-related factors (i.e., genetic associations) is crucial for the design of therapeutic treatments.

As well as providing valuable insights into the outcome of interest, feature selection can help reduce costs by filtering a large number of features down to a handful of relevant features, and can improve the generalizability of a model, since there are fewer features to which to fit.

Again, however, this becomes significantly more difficult when working with datasets comprising largely unlabeled samples, as is often the case with omics data: feature selection models risk picking up on spurious relationships between feature and label (and may therefore fail to identify important features), and may choose correlated variables that are not actually responsible for outcomes.

To address both of these shortcomings, our lab developed self-supervision enhanced feature selection (SEFS). SEFS uses a self-supervised approach to train an encoder using unlabeled data via two pretext tasks: feature vector reconstruction and gate vector estimation. This pre-conditions the encoder to learn informative representations from partial feature sets, aligning the self-supervision with the model selection process of the downstream feature selection task. In addition, SEFS features a novel gating procedure that accounts for the correlation structure of the input features. This ensures the pretext tasks remain challenging by preventing the model from memorizing trivial relations between features. More specifically, unlike previous deep learning-based feature selection methods, the correlated gate vectors encourage SEFS to select the most relevant features by making multiple correlated features compete against each other.

SEFS consists of two training phases:
– a self-supervision phase, in which a network encoder is pre-trained with unlabeled data to learn representations that are favorable for feature selection; and
– a supervision phase, in which feature selection is performed using the pre-trained encoder.
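One way to picture the correlated gating used in the self-supervision phase is to draw binary gate vectors whose dependence mirrors the correlation between features. The sketch below does this by thresholding a correlated Gaussian (a Gaussian-copula construction). Treat it as an assumption-laden illustration rather than the SEFS code: the function names, the gate probability `pi`, the convention that a gate of 1 keeps a feature, and the column-shuffle replacement for gated-out features are all choices made for this example.

```python
import numpy as np
from statistics import NormalDist


def sample_correlated_gates(corr, pi, n_samples, rng):
    """Binary gates with P(gate = 1) = pi and dependence induced by corr."""
    d = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    threshold = NormalDist().inv_cdf(pi)  # standard normal quantile at pi
    return (z <= threshold).astype(int)


def apply_gates(x, gates, rng):
    """Keep gated-in features; replace the rest with shuffled column values."""
    n, d = x.shape
    x_bar = np.column_stack([x[rng.permutation(n), j] for j in range(d)])
    return gates * x + (1 - gates) * x_bar


rng = np.random.default_rng(0)
n = 5000
shared = rng.normal(size=n)
x = np.column_stack([
    shared + 0.3 * rng.normal(size=n),  # features 0 and 1 are strongly correlated
    shared + 0.3 * rng.normal(size=n),
    rng.normal(size=n),                 # feature 2 is independent of the others
])
corr = np.corrcoef(x, rowvar=False)
gates = sample_correlated_gates(corr, pi=0.5, n_samples=n, rng=rng)
x_tilde = apply_gates(x, gates, rng)
gate_corr = np.corrcoef(gates, rowvar=False)
```

Because the gates of correlated features tend to switch on and off together, a gated-out feature can rarely be recovered by copying a near-duplicate that happens to be gated in, which is what keeps the pretext tasks non-trivial.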

As detailed below and in the full paper, SEFS was validated through experiments on synthetic and multiple real-world datasets, including from the clinical, transcriptomics, and proteomics domains, where only a small number of labeled samples are available. Through extensive evaluation, we demonstrated that SEFS discovers relevant features that provide superior prediction performance compared to state-of-the-art benchmarks, and we corroborated these features with supporting medical and scientific literature.

Evaluating SEFS on a proteomics dataset

Self-supervision enhanced feature selection with correlated gates

Changhee Lee*, Fergus Imrie*, Mihaela van der Schaar

ICLR 2022


The insights and principles of SEFS are not limited to this particular architecture, and are applicable to new approaches for feature selection, such as our recent work CompFS.

Related paper:
Composite Feature Selection using Deep Ensembles

Fergus Imrie*, Alexander Norcliffe*, Pietro Lio, Mihaela van der Schaar

NeurIPS 2022


Improving conformal prediction with self-supervision

Often, for real-world applications and particularly in high-stakes domains, predictions alone are not enough. Instead, we seek various forms of model trustworthiness, such as estimates of (or, even better, guarantees on) a model’s predictive uncertainty, i.e., a quantification of how uncertain the model is about each prediction.

See our research pillar on uncertainty quantification for an introduction and more information about our work in that area.

Conformal prediction has emerged as an extremely popular tool for uncertainty quantification. This powerful method provides valid prediction intervals with finite-sample, frequentist guarantees on the marginal coverage of the intervals, with minimal assumptions on the data.

Naturally, we want the prediction intervals to be as narrow as possible, while still maintaining coverage, and this has been the subject of significant research in recent years. One of the most successful approaches uses an auxiliary model to predict the residual errors of the predictive model. This allows the predictive intervals to be adapted based on the perceived difficulty of the sample.

However, the task of predicting residuals can be challenging; indeed, presumably they would not be residuals in the first place otherwise. Therefore, we propose to improve the performance of the residual model with additional signal.

Can self-supervision provide this added signal? With self-supervised pretext tasks, not only do we get their prediction at test time but also the ground truth target. Therefore, we have access to the self-supervised error even at test time. If this error has a relationship with the error of the main model, then it can provide a useful input feature to the residual model.

We propose Self-Supervised Conformal Prediction (SSCP), a framework that provides a recipe to leverage information from self-supervised pretext tasks to improve prediction intervals. Crucially, we note that the auxiliary self-supervision information does not impact the theoretical guarantees of conformal prediction.

Comparison of related approaches to conformal prediction.
Left: Standard inductive conformal prediction results in constant-width confidence intervals. Center: Conformal residual fitting produces adaptive intervals but can be inefficient in some regions. Right: The errors of a self-supervised task are shown above the plot, with red indicating larger self-supervised errors. SSCP leverages these errors to improve the efficiency of conformal residual fitting.

Furthermore, the self-supervised model can simply be trained on the labeled training data as an extra step or, even better, it can also leverage any additional unlabeled data that is available. We highlight that this approach can be applied orthogonally and in addition to any standard self-supervised representation learning of the main model.
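The recipe can be illustrated with a small synthetic example in which the self-supervised error is passed to the residual model as an extra input feature. Everything specific here is an assumption made for the sketch rather than a detail of SSCP: the pretext task (predicting one feature from the other), the linear models, and a data generator deliberately constructed so that samples with a large pretext error also have noisier labels.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_bias(f):
    return np.column_stack([np.ones(len(f)), f])


def fit_linear(f, t):
    coef, *_ = np.linalg.lstsq(add_bias(f), t, rcond=None)
    return coef


def predict_linear(coef, f):
    return add_bias(f) @ coef


def conformal_quantile(scores, alpha):
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if rank > n else np.sort(scores)[rank - 1]


def make_data(n):
    x1 = rng.normal(size=n)
    delta = 0.5 * rng.normal(size=n)  # how atypical the sample is
    x2 = x1 + delta
    y = x1 + x2 + rng.normal(size=n) * (0.2 + 2.0 * np.abs(delta))
    return np.column_stack([x1, x2]), y


x_tr, y_tr = make_data(2000)
x_cal, y_cal = make_data(1000)
x_te, y_te = make_data(2000)

main = fit_linear(x_tr, y_tr)
# Pretext task: predict x2 from x1. It needs no labels, so it could also be
# trained on additional unlabeled data.
pretext = fit_linear(x_tr[:, :1], x_tr[:, 1])


def ss_error(x):
    """Self-supervised error; computable at test time because x2 is observed."""
    return np.abs(x[:, 1] - predict_linear(pretext, x[:, :1]))


def residual_features(x):
    return np.column_stack([x, ss_error(x)])


# Residual model: predicts |y - main(x)| from the features plus the pretext error.
abs_res_tr = np.abs(y_tr - predict_linear(main, x_tr))
res_model = fit_linear(residual_features(x_tr), abs_res_tr)


def sigma(x):
    return np.maximum(predict_linear(res_model, residual_features(x)), 1e-3)


alpha = 0.1
scores = np.abs(y_cal - predict_linear(main, x_cal)) / sigma(x_cal)
q = conformal_quantile(scores, alpha)

abs_res_te = np.abs(y_te - predict_linear(main, x_te))
half_width = q * sigma(x_te)
coverage = np.mean(abs_res_te <= half_width)
err_corr = np.corrcoef(ss_error(x_te), abs_res_te)[0, 1]
```

The calibration step is unchanged from ordinary normalized conformal prediction. The pretext error only alters the scale function `sigma`, which is fitted before calibration, which is consistent with the point above that the extra signal does not affect the coverage guarantee.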

Through a series of empirical evaluations, we demonstrate the benefit of SSCP over state-of-the-art approaches.

We believe this work is particularly interesting not only for improving conformal prediction, but also for demonstrating an additional use of self-supervision. While self-supervised learning has been effectively utilized in many domains to learn general representations for downstream predictors, the use of self-supervision beyond model pretraining and representation learning has been largely unexplored.

Improving Adaptive Conformal Prediction Using Self-Supervised Learning

Nabeel Seedat*, Alan Jeffares*, Fergus Imrie, Mihaela van der Schaar



Multi-view supervised learning: a new solution for multi-omics data integration

Technological advances in high-throughput biology enable integrative analyses that use information across multiple omics layers – including genomics, epigenomics, transcriptomics, proteomics, and metabolomics – to deliver a more comprehensive understanding of biological systems and improve prediction of outcomes of interest (such as disease traits, phenotypes, and drug responses).

Unfortunately, due to limitations of experimental designs or to the way samples are compiled from different data sources (such as The Cancer Genome Atlas), integrated samples commonly have one or more omics layers entirely missing, with various missing patterns. Learning from such incomplete observations is challenging. Discarding samples with missing omics greatly reduces sample sizes (especially when integrating many omics layers), and simple mean imputation can seriously distort the marginal and joint distributions of the data.
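The distortion caused by mean imputation is easy to reproduce on synthetic data. In the toy example below (two correlated variables, with half of the second one missing completely at random; all values are assumptions chosen for illustration), imputing the mean shrinks the variance of the imputed variable and weakens its correlation with the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)  # population correlation with a is 0.8

missing = rng.random(n) < 0.5  # half of b is unobserved
b_imputed = np.where(missing, b[~missing].mean(), b)

var_true, var_imputed = b.var(), b_imputed.var()
corr_true = np.corrcoef(a, b)[0, 1]
corr_imputed = np.corrcoef(a, b_imputed)[0, 1]
```

Comparing `var_imputed` with `var_true` shows the marginal distortion, and comparing `corr_imputed` with `corr_true` shows the joint one. When an entire omics layer is missing for a sample, the same effect applies to every feature in that layer at once.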

Additionally, the interactions between omics layers can be highly complex, and need to be properly modeled to ensure optimal predictive power.

Furthermore, the additive value of incorporating each omics layer needs to be quantified and assessed in order to allow cost-efficient predictions for new samples.

Our lab sees these challenges as a new potential direction in research surrounding machine learning and AI for multi-omics. In a paper published at AISTATS 2021, we modeled multi-omics data integration as learning from incomplete multi-view observations, where we refer to the observations from each omics layer as a view (e.g., DNA copy number and mRNA expression).

However, direct application of existing multi-view learning methods does not address the key challenges of handling missing views in integrating multi-omics data. This is because these methods are typically designed for complete-view observations assuming that all the views are available for every sample. Therefore, we set our goal to develop a model that not only learns complex intra-view and inter-view interactions that are relevant for target tasks but also flexibly integrates observed views regardless of their view-missing patterns in a single framework.

To that end, in the same AISTATS 2021 paper we proposed a deep variational information bottleneck (IB) approach for incomplete multi-view observations, which we refer to as DeepIMV. DeepIMV consists of four network components: a set of view-specific encoders, a set of view-specific predictors, a product-of-experts (PoE) module, and a multi-view predictor. More specifically, for flexible integration of the observed views regardless of the view-missing patterns, we modeled the joint representations as a PoE over the marginal representations, which are further utilized by the multi-view predictor.

An illustration of DeepIMV’s network architecture with 3 views. For illustrative purposes, we assume that the current sample has the second view missing. Here, dotted lines correspond to drawing samples in the latent representation from the respective distributions, and green and blue colored lines indicate marginal representations and the corresponding predictions.

Thus, the joint representations combine both common and complementary information across the observed views. The entire network was trained under the IB principle which encourages the marginal and joint representations to focus on intra-view and inter-view interactions that are relevant to the target, respectively.
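The reason a product of experts handles missing views gracefully can be seen in a few lines. The sketch below assumes each view-specific encoder outputs a diagonal Gaussian over the latent representation and includes a standard normal prior as an additional expert; both are common choices in PoE formulations and are assumptions here rather than details taken from the paper, as is the function name.

```python
import numpy as np


def product_of_experts(means, variances, observed, prior_var=1.0):
    """Combine per-view diagonal Gaussians over the latent space.

    means, variances: arrays of shape (n_views, latent_dim)
    observed: boolean flags of shape (n_views,), False for a missing view
    Returns the mean and variance of the joint Gaussian.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    precisions = 1.0 / variances[observed]
    # Precisions add; the zero-mean prior expert contributes 1 / prior_var.
    total_precision = 1.0 / prior_var + precisions.sum(axis=0)
    joint_var = 1.0 / total_precision
    joint_mean = joint_var * (precisions * means[observed]).sum(axis=0)
    return joint_mean, joint_var


# Three views with a 2-dimensional latent space; the second view is missing,
# so its (arbitrary) values are never used.
means = np.array([[1.0, 0.0], [100.0, -100.0], [3.0, 2.0]])
variances = np.ones((3, 2))
observed = [True, False, True]
joint_mean, joint_var = product_of_experts(means, variances, observed)
```

A missing view is handled by leaving its expert out of the product, so the same function covers every view-missing pattern without imputation, and each additional observed view can only reduce the joint variance.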

As shown in the section below and the full paper, DeepIMV consistently achieved gains from data integration. When evaluated on two real-world multi-omics datasets compiled by The Cancer Genome Atlas (TCGA) and the Cancer Cell Line Encyclopedia (CCLE), DeepIMV significantly outperformed state-of-the-art benchmarks with respect to measures of predictive performance.

Evaluating DeepIMV on real-world multi-omics datasets

A Variational Information Bottleneck Approach to Multi-Omics Data Integration

Changhee Lee, Mihaela van der Schaar



Video: NeurIPS 2021 Self-supervised Learning Workshop

This invited talk, entitled “Self-supervised learning for genomics,” was given by Mihaela van der Schaar on December 14, 2021, as part of the Self-supervised Learning Workshop running alongside NeurIPS 2021.

Self-supervised, semi-supervised and multi-view learning: an evolving agenda

Much of the focus above has been on genomics and other omics data, but the methods described have value when applied to any datasets where the prevalence of unlabeled data would hinder the viability of a standard supervised learning approach (as is often the case in healthcare).

The body of work included in this page represents a fraction of our lab’s broader ongoing work in the areas of self-supervised learning, semi-supervised learning, and multi-view learning; these are some of our first steps towards extending self-supervised learning to a broader class of models that can capture complex non-linearities. We will continue to update this page in line with our development of powerful new methods and publication of related papers.

We would also encourage you to stay up-to-date with ongoing developments in this and other areas of machine learning for healthcare by signing up to take part in one of our two streams of online engagement sessions.

If you are a practicing clinician, please sign up for Revolutionizing Healthcare, which is a forum for members of the clinical community to share ideas and discuss topics that will define the future of machine learning in healthcare (no machine learning experience required).

If you are a machine learning student, you can join our Inspiration Exchange engagement sessions, in which we introduce and discuss new ideas and development of new methods, approaches, and techniques in machine learning for healthcare.

A full list of our papers on this and related topics can be found here.