van der Schaar Lab

Self-supervised, semi-supervised, and multi-view learning



This page offers a basic introduction to our lab’s work around self-supervised, semi-supervised and multi-view learning. This is an important and evolving research area that we are continuing to flesh out over time.

After explaining the difficulties encountered when working with tabular datasets containing largely unlabeled data, we will show how the genomics and healthcare settings differ from domains in which self-supervised learning (and machine learning more broadly) has seen success so far.

We will then introduce a range of our lab’s own self-supervised, semi-supervised and multi-view learning approaches, and will go on to show how such approaches can both enable personalized medicine and drive scientific discovery.

The content of this page is designed to be accessible and useful to a wide range of readers, from machine learning novices to experts.

You can find our publications on self-supervised learning, semi-supervised learning, and multi-view learning, as well as other research areas, here.

This page is one of several introductions to areas that we see as “research pillars” for our lab. It is a living document, and the content here will evolve as we continue to reach out to the machine learning and healthcare communities, building a shared vision for the future of healthcare.

Our primary means of building this shared vision is through two groups of online engagement sessions: Inspiration Exchange (for machine learning students) and Revolutionizing Healthcare (for the healthcare community). If you would like to get involved, please visit the page below.

This page is authored and maintained by Mihaela van der Schaar and Nick Maxfield.


Why do we need self-supervised learning?

Fitting a model to a dataset is considerably easier when labeled data is plentiful. In many circumstances, however, access to labeled data is limited, which makes learning a performant model much harder and carries a high risk of overfitting the limited quantity of data. This is particularly common in healthcare, genomics, and other omics settings, and the problem is compounded by the fact that obtaining additional samples is either costly or altogether impossible.

In such situations, therefore, using unlabeled data effectively is of critical importance. This is where self-supervised learning comes into play. We can adopt a two-stage training process: first, a model is pre-trained using unlabeled data (self-supervised learning); next, the insights learned on the unlabeled data are transferred to a smaller labeled dataset, at which point the model is fine-tuned using the limited quantity of labeled data.

This is a particularly effective approach in domains such as genomics and healthcare at large, where datasets frequently contain a small amount of labeled data and a large amount of unlabeled data. The U.K.’s 100,000 Genomes Project, for instance, sequenced 100,000 genomes from around 85,000 NHS patients affected by a rare disease or cancer. By definition, a rare disease affects fewer than 1 in 2,000 people. Datasets like these present huge opportunities for self- and semi-supervised learning algorithms, which can leverage the unlabeled data to further improve the performance of a predictive model.

Can’t we just use existing self-supervised learning?

Naturally, self-supervised learning is not being introduced here as a new concept; in fact, it is an approach that has seen a great deal of success already in areas including image recognition (computer vision) and natural language processing.

The key differences between such areas and applications such as genomics and healthcare are data modality and inherent structure. Image recognition relies on spatial structure, and natural language processing on semantic structure: the spatial correlations between pixels in images or the sequential correlations between words in text data are well-known and consistent across different datasets.

By contrast, healthcare and genomics problems primarily involve high-dimensional tabular data (the most common data type in the real world). The inherent structures among features in tabular datasets are unknown and vary across different datasets. In other words, there is no “common” correlation structure in tabular data (unlike in image and text data). This makes self- and semi-supervised learning on tabular data more challenging: the structure must be learned explicitly from each dataset, which is a fundamentally different challenge compared with image recognition or natural language processing.

Building self-supervised approaches for tabular datasets

As explained above, the genomics and general healthcare settings pose two challenges: the lack of labels, and the lack of common structures within tabular data. Combined, these challenges require us to develop new approaches to self-supervised learning that are effective when working with tabular data.

One such approach developed by our lab is value imputation and mask estimation (VIME), which was introduced in a paper published at NeurIPS 2020.

VIME adopts a novel self- and semi-supervised learning framework. First, VIME’s encoder function learns to construct informative representations from the raw features in the unlabeled data. To train it, VIME 1) generates corrupted samples by randomly masking features of unlabeled data points and then 2) predicts which features were masked and imputes their original values. In order to do this, the encoder must learn how the features relate to one another, and in the process it learns the inherent structure of the data. This makes subsequent learning on the labeled data easier.

VIME proposes a new pretext task to recover the mask vector in addition to the original sample, with a novel corrupted sample generation scheme. VIME also incorporates a novel tabular data augmentation scheme that can be combined with various contrastive learning frameworks, to extend self-supervised learning to tabular domains such as genomic and clinical data.
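The mask and corrupted-sample generation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the lab's implementation: the function names, the masking probability of 0.3, and the toy data are assumptions made for this example. Each masked entry is replaced with a value drawn from the same feature's column, so that corrupted samples still look plausible feature by feature.

```python
import numpy as np

def mask_generator(p_mask, shape, rng):
    """Draw a binary mask; each entry is 1 (to be corrupted) with probability p_mask."""
    return rng.binomial(1, p_mask, size=shape)

def pretext_generator(mask, x, rng):
    """Corrupt x where mask == 1 using values drawn from each feature's own column."""
    n, d = x.shape
    # Shuffle every column independently so replacements follow the feature's marginal.
    x_bar = np.empty_like(x)
    for j in range(d):
        x_bar[:, j] = x[rng.permutation(n), j]
    x_tilde = x * (1 - mask) + x_bar * mask
    # A masked entry may receive its own value back; keep only entries that truly changed.
    mask_new = (x != x_tilde).astype(int)
    return mask_new, x_tilde

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(8, 5)).astype(float)  # toy data, hypothetical
mask = mask_generator(0.3, x.shape, rng)           # 0.3 is an assumed masking rate
mask_new, x_tilde = pretext_generator(mask, x, rng)
```

An encoder trained on pairs of `x_tilde` and the targets `mask_new` and `x` would then be solving the two pretext tasks together: estimating the mask and recovering the original values.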

As outlined below, we evaluated VIME against a range of self-supervised, semi-supervised, and supervised methods on several tabular datasets from different application domains, including genomics (for example, from U.K. Biobank) and clinical data. VIME exceeded state-of-the-art performance in comparison to the existing baseline methods.

In this subsection, we show how VIME was evaluated on a large genomics dataset from U.K. Biobank consisting of around 400,000 individuals’ genomics information (SNPs) and 6 corresponding blood cell traits:

(1) Mean Reticulocyte Volume (MRV)
(2) Mean Platelet Volume (MPV)
(3) Mean Cell Hemoglobin (MCH)
(4) Reticulocyte Fraction of Red Cells (RET)
(5) Plateletcrit (PCT)
(6) Monocyte Percentage of White Cells (MONO).

The features of the dataset consist of around 700 SNPs (after the standard p-value filtering process), where each SNP, taking a value in {0, 1, 2}, is treated as a categorical variable (with three categories). Here, we have 6 different blood cell traits to predict, and we treat each of them as an independent prediction task (the selected SNPs differ across blood cell traits).
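As an illustration of treating each SNP as a three-category variable, the sketch below one-hot encodes a small genotype matrix. It assumes genotypes are coded as the integers 0, 1 and 2; the function name and the toy array are hypothetical and not taken from the paper.

```python
import numpy as np

def one_hot_snps(genotypes, n_categories=3):
    """Map an (n_samples, n_snps) integer array to (n_samples, n_snps * n_categories)."""
    genotypes = np.asarray(genotypes)
    eye = np.eye(n_categories, dtype=int)
    # Indexing the identity matrix turns every genotype code into a one-hot row.
    return eye[genotypes].reshape(genotypes.shape[0], -1)

snps = np.array([[0, 2],
                 [1, 1]])  # toy genotypes, hypothetical
encoded = one_hot_snps(snps)
```

With around 700 SNPs, this encoding would give roughly 2,100 binary input columns per sample.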

To test the effectiveness of self- and semi-supervised learning in the small labeled data setting, VIME and the benchmarks were tasked with predicting the 6 blood cell traits as we gradually increased the number of labeled data points from 1,000 to 100,000 samples, using the remaining data (more than 300,000 samples) as unlabeled data. We used a linear model (ElasticNet) as the predictive model due to its superior performance on genomics datasets in comparison to non-linear models such as multi-layer perceptrons and random forests.
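The labeled/unlabeled protocol described above amounts to a random partition of sample indices that is repeated for each labeled-set size. The sketch below shows the idea only; the round total of 400,000, the three sizes, and the function name are placeholders chosen for illustration.

```python
import numpy as np

def split_labeled_unlabeled(n_total, n_labeled, seed=0):
    """Randomly pick n_labeled indices to keep labels for; the rest are treated as unlabeled."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_total)
    return order[:n_labeled], order[n_labeled:]

N_TOTAL = 400_000  # placeholder for "around 400,000 individuals"
for n_labeled in (1_000, 10_000, 100_000):
    labeled_idx, unlabeled_idx = split_labeled_unlabeled(N_TOTAL, n_labeled)
    print(n_labeled, len(unlabeled_idx))
```

In each round, the self-supervised stage would see the features of every sample, while the predictive model would be fit only on the rows in `labeled_idx`.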

The figure above shows MSE performance (y-axis) against the number of labeled data points (x-axis, in log scale) increasing from 1,000 to 100,000. VIME outperforms all the benchmarks, including the purely supervised method ElasticNet, the self-supervised method Context Encoder, and the semi-supervised method MixUp. In fact, in many cases VIME performs similarly to the benchmarks even when it has access to only half as many labeled data points.

Further details regarding the evaluation of VIME on this and additional datasets can be found in the full paper (linked directly below).

VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

Jinsung Yoon, Yao Zhang, James Jordon, Mihaela van der Schaar

NeurIPS 2020

Self- and semi-supervised learning frameworks have made significant progress in training machine learning models with limited labeled data in image and language domains. These methods heavily rely on the unique structure in the domain datasets (such as spatial relationships in images or semantic relationships in language). They are not adaptable to general tabular data which does not have the same explicit structure as image and language data.

In this paper, we fill this gap by proposing novel self- and semi-supervised learning frameworks for tabular data, which we refer to collectively as VIME (Value Imputation and Mask Estimation). We create a novel pretext task of estimating mask vectors from corrupted tabular data in addition to the reconstruction pretext task for self-supervised learning. We also introduce a novel tabular data augmentation method for self- and semi-supervised learning frameworks.

In experiments, we evaluate the proposed framework in multiple tabular datasets from various application domains, such as genomics and clinical data. VIME exceeds state-of-the-art performance in comparison to the existing baseline methods.

Making discoveries with feature selection

So far, this page has explained the potential of self-supervised learning for prediction in genomics, broader omics applications, and healthcare at large. Our description has been limited to standard tasks involving prediction of outcomes on the basis of features, which (while certainly valuable) provides no additional insights.

We could, for example, vastly improve our understanding of the outcome of interest by determining which features led to that outcome; this is known as feature selection (for more information on feature selection, see the related section in our research pillar page on interpretable machine learning).

Feature selection can provide very powerful insights that drive scientific discovery. For example, while next generation sequencing can detect the expression of tens of thousands of genes per sample, many genetic disorders stem from the variation in only a few groups of related genes. Identification of such disease-related factors (i.e., genetic associations) is crucial for the design of therapeutic treatments.

As well as providing valuable insights into the outcome of interest, feature selection can help reduce costs by filtering a large number of features down to a handful of relevant features, and can improve the generalizability of a model, since there are fewer features to which to fit.

Again, however, this becomes significantly more difficult when working with datasets comprising largely unlabeled samples, as is often the case with omics data: feature selection models risk picking up on spurious relationships between feature and label (and may therefore fail to identify important features), and may choose correlated variables that are not actually responsible for outcomes.

To solve both of these shortcomings, our lab developed self-supervision enhanced feature selection (SEFS). SEFS uses a self-supervised approach to train an encoder using unlabeled data via two pretext tasks: feature vector reconstruction and gate vector estimation. This pre-conditions the encoder to learn informative representations from partial feature sets, aligning the self-supervision with the model selection process of the downstream feature selection task. In addition, SEFS features a novel gating procedure that accounts for the correlation structure of the input features. This ensures the pretext tasks remain challenging by preventing the model from memorizing trivial relations between features. More specifically, unlike previous deep learning-based feature selection methods, the correlated gate vectors encourage SEFS to select the most relevant features by making multiple correlated features compete against each other.

SEFS consists of two training phases:
– a self-supervision phase, in which a network encoder is pre-trained with unlabeled data to learn representations that are favorable for feature selection; and
– a supervision phase, in which feature selection is performed using the pre-trained encoder.
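To make the idea of correlated gate vectors concrete, the sketch below draws binary gates whose dependence follows a given correlation matrix. It uses a Gaussian copula, which is one common way to sample a multivariate Bernoulli; this page does not state the exact procedure, so the approach, the function name, the selection probability `pi`, and the toy correlation matrix should all be read as assumptions.

```python
from statistics import NormalDist

import numpy as np

def correlated_gates(corr, pi, n_samples, rng):
    """Sample binary gate vectors; each gate is 1 with probability pi, and gates
    for correlated features tend to switch on and off together."""
    d = corr.shape[0]
    # Latent Gaussian draws carry the feature correlation structure.
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    # Thresholding at the pi-quantile of N(0, 1) gives each gate marginal probability pi.
    threshold = NormalDist().inv_cdf(pi)
    return (z < threshold).astype(int)

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.9, 0.0],
                 [0.9, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])  # features 0 and 1 strongly correlated (toy values)
gates = correlated_gates(corr, pi=0.5, n_samples=20_000, rng=rng)
```

Because the gates of features 0 and 1 usually take the same value, a model can rarely recover one of them trivially from the other, which is the kind of shortcut the gating procedure is meant to prevent.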

As detailed below and in the full paper (available soon), SEFS was validated through experiments on synthetic and multiple real-world datasets, including from the clinical, transcriptomics, and proteomics domains, where only a small number of labeled samples are available. Through extensive evaluation, we demonstrated that SEFS discovers relevant features that provide superior prediction performance compared to state-of-the-art benchmarks, and we corroborated these features with supporting medical and scientific literature.

This subsection will explain how we evaluated the performance of SEFS and multiple feature selection methods using a real-world proteomics dataset.

We studied the response of heterogeneous cancer cell lines to 11 different drugs where the goal is to identify proteins associated with the cell line response based on proteomic measurements from the Cancer Cell Line Encyclopedia (CCLE).

CCLE is a small dataset containing 899 cancer cell lines (i.e., samples) described by 196 protein expressions. The real-valued drug response is available for 458 samples and is missing for the remaining 441 samples (thus unlabeled). To benefit from self-supervised learning, we integrated the RPPA measurements on 7,329 samples from The Cancer Genome Atlas (TCGA), creating overall 7,770 unlabeled samples.

The figure above shows a comparison of the ranking of SEFS and the benchmarks across 11 drugs. SEFS is the best performing method for 9 drugs and is always in the top 3 (median rank: 1, average rank: 1.36). Despite the majority of unlabeled data originating from a different source, SEFS outperforms SEFS (no SS), the variant trained without the self-supervision phase, in every experiment.

While we would expect further gain from more similar unlabeled data, our results highlight that there is potential benefit even when unlabeled data is only partially related to the labeled samples.

If you would like to find out more about how we evaluated SEFS on this and additional synthetic and real-world datasets, please read the full paper (which will be available soon).

Self-supervision enhanced feature selection with correlated gates

Changhee Lee, Fergus Imrie, Mihaela van der Schaar

Submitted, 2021

Discovering relevant input features for predicting a target variable is a key scientific question. However, in many domains, such as medicine and biology, feature selection is confounded by a scarcity of labeled samples coupled with significant correlations among features.

In this paper, we propose a novel deep learning approach to feature selection that addresses both challenges simultaneously. First, we pre-train the network using unlabeled samples within a self-supervised learning framework by solving pretext tasks that require the network to learn informative representations from partial feature sets. Then, we fine-tune the pre-trained network to discover relevant features using labeled samples. During both training phases, we explicitly account for the correlation structure of the input features by generating correlated gate vectors from a multivariate Bernoulli distribution.

Experiments on multiple real-world datasets including clinical and omics demonstrate that our model discovers relevant features that provide superior prediction performance compared to the state-of-the-art benchmarks, in practical scenarios where there is often limited labeled data and high correlations among features.

Multi-view supervised learning: a new solution for multi-omics data integration

Technological advances in high-throughput biology enable integrative analyses that use information across multiple omics layers – including genomics, epigenomics, transcriptomics, proteomics, and metabolomics – to deliver a more comprehensive understanding of biological systems and improve prediction of outcomes of interest (such as disease traits, phenotypes, and drug responses).

Unfortunately, due to limitations of experimental designs or compositions from different data sources (such as The Cancer Genome Atlas), integrated samples commonly have one or more entirely missing omics with various missing patterns. Learning from such incomplete observations is challenging. Discarding samples with missing omics greatly reduces sample sizes (especially when integrating many omics layers) and simple mean imputation can seriously distort the marginal and joint distribution of the data.

Additionally, the interactions between omics layers can be highly complex, and need to be properly modeled to ensure optimal predictive power.

Furthermore, the additive value of incorporating each omics layer needs to be quantified and assessed in order to allow cost-efficient predictions for new samples.

Our lab sees these challenges as a new potential direction in research surrounding machine learning and AI for multi-omics. In a paper published at AISTATS 2021, we modeled multi-omics data integration as learning from incomplete multi-view observations where we refer to observations from each omics data as views (e.g., DNA copy number and mRNA expressions).

However, direct application of the existing multi-view learning methods does not address the key challenges of handling missing views in integrating multi-omics data. This is because these methods are typically designed for complete-view observations assuming that all the views are available for every sample. Therefore, we set our goal to develop a model that not only learns complex intra-view and inter-view interactions that are relevant for target tasks but also flexibly integrates observed views regardless of their view-missing patterns in a single framework.

To that end, in the same AISTATS 2021 paper we proposed a deep variational information bottleneck (IB) approach for incomplete multi-view observations—which we referred to as DeepIMV. DeepIMV consists of four network components: a set of view-specific encoders, a set of view-specific predictors, a product-of-experts (PoE) module, and a multi-view predictor. More specifically, for flexible integration of the observed views regardless of the view-missing patterns, we modeled the joint representations as a PoE over the marginal representations, which are further utilized by the multi-view predictor.

An illustration of DeepIMV’s network architecture with 3 views. For illustrative purposes, we assume that the current sample has the second view missing. Here, dotted lines correspond to drawing samples in the latent representation from the respective distributions, and green and blue colored lines indicate marginal representations and the corresponding predictions.

Thus, the joint representations combine both common and complementary information across the observed views. The entire network was trained under the IB principle which encourages the marginal and joint representations to focus on intra-view and inter-view interactions that are relevant to the target, respectively.
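The product-of-experts step can be illustrated with diagonal Gaussian experts, for which the product has a closed form: precisions add, and the joint mean is the precision-weighted average of the expert means. The sketch below is a simplified illustration and not DeepIMV's code. It assumes each view-specific encoder outputs a mean and variance, and it includes a standard normal prior expert, which is an assumption of this example. Missing views are simply left out of the product.

```python
import numpy as np

def product_of_experts(means, variances, observed):
    """Fuse diagonal-Gaussian experts for the observed views plus a N(0, I) prior expert."""
    means = np.asarray(means, dtype=float)                  # (n_views, latent_dim)
    variances = np.asarray(variances, dtype=float)          # (n_views, latent_dim)
    observed = np.asarray(observed, dtype=float)[:, None]   # (n_views, 1); 1 = view observed
    precisions = observed / variances                       # missing views contribute zero
    joint_precision = 1.0 + precisions.sum(axis=0)          # 1.0 is the prior's precision
    joint_var = 1.0 / joint_precision
    joint_mean = joint_var * (precisions * means).sum(axis=0)  # prior mean is zero
    return joint_mean, joint_var

# Three views with a 2-dimensional latent space; the second view is missing.
means = [[1.0, 0.0], [5.0, 5.0], [3.0, 2.0]]
variances = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
joint_mean, joint_var = product_of_experts(means, variances, observed=[1, 0, 1])
```

The same function handles any view-missing pattern, because a missing view only changes which terms enter the sums, not the shape of the computation.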

As shown in the section below and the full paper, DeepIMV consistently achieved gain from data integration. When evaluated on two real-world multi-omics datasets compiled by The Cancer Genome Atlas (TCGA) and the Cancer Cell Line Encyclopedia (CCLE), DeepIMV significantly outperformed state-of-the-art benchmarks with respect to measures of predictive performance.

We evaluated DeepIMV extensively against a range of benchmarks, using different multi-view learning methods on two real-world multi-omics datasets collected by the Cancer Genome Atlas (TCGA) and the Cancer Cell Line Encyclopedia (CCLE). In the former case, the context was integration of multi-omics observations for predicting 1-year mortality; the latter examined drug sensitivity of cancer cells.

TCGA dataset results

We analyzed 1-year mortality based on the comprehensive observations from multiple omics on 7,295 cancer cell lines (i.e., samples). The data consisted of observations from 4 distinct views on each cell line across 3 different omics layers: (View 1) mRNA expressions, (View 2) DNA methylation, (View 3) microRNA expressions, and (View 4) reverse phase protein array.

The table above offers a comparison of the AUROC performance (mean ± 95%-CI) for both the multi-view learning methods trained with only complete multi-view samples (“complete”) and those trained with both complete and incomplete multi-view samples (“incomplete”).

As the table shows, DeepIMV better integrated samples with incomplete views: its performance improvements significantly exceeded those of the benchmarks regardless of the number of observed views. Additionally, even when trained only with complete-view samples, DeepIMV better handled different view-missing patterns during testing, providing the highest performance with partially observed views (except for 1 View). Finally, we noted that MVAE and MOFA sacrificed their discriminative power since the latent representations focused on retaining the information of the input for view generation (reconstruction), which resulted in discarding the task-relevant discriminative information.

CCLE dataset results

For this evaluation, we analyzed sensitivities of heterogeneous cell lines to 4 different drugs – Irinotecan, Panobinostat, Lapatinib, and PLX4720 – based on the multiple omics observations on 504 cancer cell lines (i.e., samples).

Drug response was converted to a binary label by dividing cell lines into quartiles ranked by ActArea; the top 25% were assigned to the “sensitive” class and the rest were assigned to the “nonsensitive” class. The data consisted of observations from 6 distinct views on each cell line across 5 different omics layers: (View 1) DNA copy number, (View 2) DNA methylation, (View 3) mRNA expressions, (View 4) microRNA expressions, (View 5) reverse phase protein array, and (View 6) metabolites.
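The quartile-based labeling described above can be written as a simple threshold at the 75th percentile of ActArea. The sketch below is illustrative only: the function name and the toy values are hypothetical, and the way ties at the threshold are handled (a strict inequality) is an assumption.

```python
import numpy as np

def binarize_by_top_quartile(act_area):
    """Label the top 25% of ActArea values as sensitive (1) and the rest as nonsensitive (0)."""
    act_area = np.asarray(act_area, dtype=float)
    threshold = np.quantile(act_area, 0.75)
    return (act_area > threshold).astype(int), threshold

act_area = np.array([0.5, 1.2, 3.4, 2.2, 0.9, 4.1, 1.8, 2.9])  # toy values, hypothetical
labels, threshold = binarize_by_top_quartile(act_area)
```

By construction, about a quarter of the cell lines fall in the positive class, so the resulting classification task is imbalanced; AUROC is a natural metric for such a setting.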

We explored the benefit of incorporating more samples and more views on predicting drug sensitivities of heterogeneous cell lines. To this end, we increased the set of available views from 2 to 6, and included samples with observations from at least one of the views in this set.

The top row of the figure above compares the prediction of different multi-view methods in terms of AUROC performance as we increased the set of available views. There are a few things to highlight from this figure: First, in most cases DeepIMV provides better discriminative performance on all the tested drug sensitivity datasets as the number of integrated views increases. Second, the performances of DCCA and DCCAE are saturated, since these methods can utilize at most two views, whereas GCCA provides consistently increasing performance since it generalizes to multiple views. Third, MVAE and MOFA sacrifice their discriminative task-relevant information, since the latent representations focus on retaining the information of the input for view generation (reconstruction).

The bottom row of the figure shows the AUROC performance as the rate of samples with missing views ranges from 0.0 (all complete) to 1.0 (all incomplete) with 6 views. This shows how robust the multi-view learning methods are with respect to the view-missing rate. The figure lets us make several observations: First, DeepIMV outperforms all the benchmarks on the Irinotecan and Panobinostat datasets and provides comparable performance on the Lapatinib and PLX4720 datasets to the best performing benchmark across different missing rates. Second, while other methods often fail, DeepIMV provides the most robust performance as the rate of samples with missing views increases. Third, DCCA and DCCAE show poor performance since these methods do not fully utilize the available views. Last, the same trend can be found with regard to MVAE and MOFA as described in the previous paragraph.

For more details regarding how we evaluated DeepIMV, please read the full paper (linked directly below).

A Variational Information Bottleneck Approach to Multi-Omics Data Integration

Changhee Lee, Mihaela van der Schaar

AISTATS 2021

Integration of data from multiple omics techniques is becoming increasingly important in biomedical research. Due to non-uniformity and technical limitations in omics platforms, such integrative analyses on multiple omics, which we refer to as views, involve learning from incomplete observations with various view-missing patterns. This is challenging because i) complex interactions within and across observed views need to be properly addressed for optimal predictive power and ii) observations with various view-missing patterns need to be flexibly integrated.

To address such challenges, we propose a deep variational information bottleneck (IB) approach for incomplete multi-view observations. Our method applies the IB framework on marginal and joint representations of the observed views to focus on intra-view and inter-view interactions that are relevant for the target. Most importantly, by modeling the joint representations as a product of marginal representations, we can efficiently learn from observed views with various view-missing patterns.

Experiments on real-world datasets show that our method consistently achieves gain from data integration and outperforms state-of-the-art benchmarks.

Video: NeurIPS 2021 Self-supervised Learning Workshop

This invited talk, entitled “Self-supervised learning for genomics,” was given by Mihaela van der Schaar on December 14, 2021, as part of the Self-supervised Learning Workshop running alongside NeurIPS 2021.

Self-supervised, semi-supervised and multi-view learning: an evolving agenda

Much of the focus above has been on genomics and other omics data, but the methods described have value when applied to any datasets where the prevalence of unlabeled data would hinder the viability of a standard supervised learning approach (as is often the case in healthcare).

The body of work included in this page represents a fraction of our lab’s broader ongoing work in the areas of self-supervised learning, semi-supervised learning, and multi-view learning; these are some of our first steps towards extending self-supervised learning to a broader class of models that can capture complex non-linearities. We will continue to update this page in line with our development of powerful new methods and publication of related papers.

We would also encourage you to stay up-to-date with ongoing developments in this and other areas of machine learning for healthcare by signing up to take part in one of our two streams of online engagement sessions.

If you are a practicing clinician, please sign up for Revolutionizing Healthcare, which is a forum for members of the clinical community to share ideas and discuss topics that will define the future of machine learning in healthcare (no machine learning experience required).

If you are a machine learning student, you can join our Inspiration Exchange engagement sessions, in which we introduce and discuss new ideas and development of new methods, approaches, and techniques in machine learning for healthcare.

A full list of our papers on this and related topics can be found here.