Interpretable machine learning

Machine learning is capable of enabling truly personalized healthcare; this is what our lab calls “bespoke medicine.”

More info on bespoke medicine can be found here.

Interpretability is essential to the success of the machine learning and AI models that will make bespoke medicine a reality. Despite its acknowledged importance and value, the actual concept of interpretability has resisted definition and is not well understood.

Our lab has conducted field-leading research into a variety of forms of interpretability for years, and has developed a unique and cohesive framework for categorizing and developing interpretable machine learning models. Our framework is presented on this page, alongside much of the accompanying research, in the hope of advancing the discussion on this crucial topic and inspiring readers to engage in new projects and research.

The content of this page is designed to be accessible and useful to a wide range of readers, from machine learning novices to experts.

You can find our publications on interpretability and explainability here.

This page is one of several introductions to areas that we see as “research pillars” for our lab. It is a living document, and the content here will evolve as we continue to reach out to the machine learning and healthcare communities, building a shared vision for the future of healthcare.

Our primary means of building this shared vision is through two groups of online engagement sessions: Inspiration Exchange (for machine learning students) and Revolutionizing Healthcare (for the healthcare community). If you would like to get involved, please visit the page below.

This page is authored and maintained by Mihaela van der Schaar and Jonathan Crabbé.

This page proposes a unique and coherent framework for categorizing and developing interpretable machine learning models. We will demonstrate this framework using a range of examples from the van der Schaar Lab’s extensive research into interpretability, and our ongoing interdisciplinary discussions with members of the clinical and other non-ML communities.

First, we will discuss the many potential definitions and uses of interpretability. We will then lay out a framework of five distinct types of interpretability, and explain the potential roles and applications of each type. Finally, we will turn the debate on its head by examining how interpretability can also be applied to understand and support humans, rather than AI and machine learning models.

Interpretability: a concept with clear value but an unclear definition

There are several reasons to make a “black box” machine learning model interpretable. First, an interpretable output can be more readily understood and trusted by its users (for example, clinicians deciding whether to prescribe a treatment), making its outputs more actionable. Second, a model’s outputs often need to be explained by its users to the subjects of its outputs (for example, patients deciding whether to accept a proposed treatment course). Third, by uncovering valuable information that otherwise would have remained hidden within the model’s opaque inner workings, an interpretable output can empower users such as researchers with powerful new insights.

The value of interpretability as a broad concept is, therefore, clear. Yet despite this, the meaning of the term itself is too seldom discussed and too often oversimplified. There is no single “type” of interpretability, after all, since there are many potential ways to extract and present information from the output of a model, and many types of information to choose to extract.

This is something we explored in 2018, when we designed a reinforcement learning system capable of learning from its interactions with users and accurately predicting which outputs would maximize their confidence in the underlying clinical risk prediction model. This work was introduced in a paper entitled “What is Interpretable? Using Machine Learning to Design Interpretable Decision-Support Systems.”

What is Interpretable? Using Machine Learning to Design Interpretable Decision-Support Systems

Owen Lahav, Nicholas Mastronarde, Mihaela van der Schaar


Our lab has been researching interpretability methods and approaches (for application in healthcare and beyond) for many years. Our work so far has led us to a unique but powerful framework for considering the multiple types of interpretability.

Our framework divides interpretability into 5 broad “types”:
1) feature importance;
2) similarity classification;
3) unraveled rules and laws;
4) transparent risk equations; and
5) concept-based explainability.

Each of these types of interpretability represents a distinct set of challenges from a model development perspective and can benefit different users in a variety of applications. These will be explored below, but an in-depth discussion on each type—driven by insights from colleagues from the clinical community—can be found in a recent piece of content entitled “Making machine learning interpretable: a dialog with clinicians.” Additionally, we further motivated and discussed these types of interpretability in the following paper, published in Nature Machine Intelligence in 2023.

Multiple stakeholders drive diverse interpretability requirements for machine learning in healthcare


Type 1 interpretability: feature importance

This type of interpretability involves identifying and showing which patient-specific features the machine learning model has considered when issuing a prediction for a patient. We can do this either by identifying features that are important for an entire population or by identifying features the model has considered specifically for the patient at hand.
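
As a minimal sketch of population-level feature importance, the example below uses permutation importance, a generic technique rather than any of the methods listed on this page. The `risk_model` function, the patient records, and the feature names are all hypothetical and exist only for illustration. The idea is to shuffle one feature across patients and measure how much the predictions change on average.

```python
import random

def risk_model(patient):
    # Hypothetical black box: risk depends on age and tumor size, not shoe size.
    age, tumor_size, shoe_size = patient
    return 0.02 * age + 0.5 * tumor_size

def permutation_importance(model, patients, feature_index, n_repeats=10, seed=0):
    """Mean absolute change in prediction when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = [model(p) for p in patients]
    total = 0.0
    for _ in range(n_repeats):
        column = [p[feature_index] for p in patients]
        rng.shuffle(column)
        for patient, base, value in zip(patients, baseline, column):
            perturbed = list(patient)
            perturbed[feature_index] = value
            total += abs(model(perturbed) - base)
    return total / (n_repeats * len(patients))

# Made-up records: (age in years, tumor size in cm, shoe size).
patients = [(35, 1.0, 38), (50, 2.5, 42), (62, 0.5, 40), (71, 3.0, 44), (44, 1.5, 39)]
feature_names = ["age", "tumor_size", "shoe_size"]
importances = {
    name: permutation_importance(risk_model, patients, index)
    for index, name in enumerate(feature_names)
}
```

A feature the model ignores (here `shoe_size`) receives an importance of zero, while features the model relies on receive positive scores. Instance-wise methods go further by assigning a separate set of important features to each patient.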

Our lab has already developed a number of models offering this type of interpretability. One such approach is INVASE, which was first introduced in a paper published at ICLR 2019.

INVASE: Instance-wise Variable Selection using Neural Networks

Jinsung Yoon, James Jordon, Mihaela van der Schaar

ICLR 2019


We have continued to make progress in developing methods that offer interpretations based on explanatory patient features. In a paper recently accepted for publication at ICML 2021, for example, we introduced an approach specifically designed for multivariate time series, using saliency masks to identify and highlight important features at each time step.

Explaining Time Series Predictions with Dynamic Masks

Jonathan Crabbé, Mihaela van der Schaar

ICML 2021


While feature importance methods are typically introduced for supervised models, they can be extended to the unsupervised setting. Our lab has formalized this extension by introducing the notion of Label-Free Explainability. Note that this extension also covers Type 2 interpretability, described below.

Label-Free Explainability for Unsupervised Models

Jonathan Crabbé, Mihaela van der Schaar

ICML 2022


In a recent paper, we demonstrated that feature importance methods are of practical interest in the context of treatment effect estimation. We use feature importance to benchmark treatment effect models on their ability to discover covariates that are predictive of the individual treatment effect.

Benchmarking Heterogeneous Treatment Effect Models through the Lens of Interpretability

Jonathan Crabbé, Alicia Curth, Ioana Bica, Mihaela van der Schaar

NeurIPS 2022 (Datasets and Benchmarks)


Clinicians have explained to us that this type of interpretability would be particularly valuable to them: since they are required to work out the best way to treat a patient, it is helpful to understand the features that influenced a model’s output. By contrast, clinicians see the value of this type of interpretability for patients as lower. Patients may not consider it particularly useful to know the relative importance of their features: for example, a patient may not benefit from knowing that the most important features determining her cancer mortality risk are her age and ER status.

Type 2 interpretability: similarity classification

Through similarity classification, we seek to identify and explain the similar patients for whom a machine learning model has issued the same (or different) predictions. An approach based on similarity classification would involve cross-referencing the black box model’s prediction with available observational data regarding the features and outcomes of similar patients, and then explaining the model’s prediction in terms of those features and outcomes.
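
The cross-referencing step can be illustrated with a simple nearest-neighbour lookup. This is only a sketch under stated assumptions: the records, the two features, and the `nearest_patients` helper are hypothetical, and plain Euclidean distance on raw features stands in for whatever notion of similarity a real method would learn.

```python
import math

def nearest_patients(target, records, k=3):
    """Return the k records whose features are closest (Euclidean) to the target."""
    return sorted(records, key=lambda r: math.dist(target, r["features"]))[:k]

# Hypothetical observational data: (age in decades, tumor size in cm) and outcome.
records = [
    {"id": "A", "features": (5.0, 1.0), "survived": True},
    {"id": "B", "features": (5.2, 1.2), "survived": True},
    {"id": "C", "features": (7.0, 3.5), "survived": False},
    {"id": "D", "features": (6.8, 3.0), "survived": False},
    {"id": "E", "features": (5.1, 0.9), "survived": True},
]

target = (5.1, 1.1)
neighbours = nearest_patients(target, records)
# Share of the most similar patients with a favourable observed outcome.
survival_rate = sum(r["survived"] for r in neighbours) / len(neighbours)
```

A prediction for the target patient could then be presented alongside the features and observed outcomes of `neighbours`, which is the kind of explanation this type of interpretability aims for.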

Several of our lab’s projects to date have sought to provide interpretable explanations based on similarity classification. Some—such as the two outlined immediately below—are tailor-made for particular medical problems.

For instance, temporal phenotyping targets the problem of disease progression; it uses deep learning to cluster time series data, where each cluster comprises patients who share similar future outcomes of interest. Meanwhile, SyncTwin is designed to provide interpretable treatment effect estimation; it issues counterfactual predictions for a target patient by constructing a synthetic twin that closely matches the target in representation.

Temporal Phenotyping using Deep Predictive Clustering of Disease Progression

Changhee Lee, Mihaela van der Schaar

ICML 2020


SyncTwin: Treatment Effect Estimation with Longitudinal Outcomes

Zhaozhi Qian, Yao Zhang, Ioana Bica, Angela Wood, Mihaela van der Schaar

NeurIPS 2021


Not all similarity classification methods are created to address a specific need, however. SimplEx, introduced below, is an example of a general approach that enables explanation for models that are not task- or problem-specific: in essence, it can be seen as a post-hoc explainability module that could be used as a plug-in for almost any machine learning model.

Explaining Latent Representations with a Corpus of Examples

Jonathan Crabbé, Zhaozhi Qian, Fergus Imrie, Mihaela van der Schaar

NeurIPS 2021


In our discussions with clinicians, they generally felt that this type of interpretability has far more value to patients than feature importance (type 1). Patients generally find it easier to make a decision based on a prediction or recommendation when it is explained with reference to similarities or differences with patients like them.

Type 3 interpretability: unraveled rules and laws

With this type of interpretability, we seek to discover “rules” and “laws” learned by the machine learning model. These can be in the form of decision rules, or even “counterfactual” explanations in the form of “What if?” question-answer pairs that describe the smallest adjustment to the patient’s features that would change the model’s prediction to a predefined output. For example, a clinician could use this type of interpretability to establish the smallest difference in tumor size that would change the model’s prediction for a patient with cancer.
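
The tumor size example can be sketched as a one-dimensional search. Everything here is an assumption made for illustration: `predicts_high_risk` is a hypothetical stand-in for a black box classifier, and the grid search over a single feature is the simplest possible way to find a counterfactual, not a method from our papers.

```python
def predicts_high_risk(age, tumor_size_cm):
    # Hypothetical black-box classifier, used only for illustration.
    return 0.01 * age + 0.3 * tumor_size_cm > 1.5

def smallest_flip(age, tumor_size_cm, step=0.1, max_change=5.0):
    """Search outward in tumor size for the smallest change that flips the prediction."""
    original = predicts_high_risk(age, tumor_size_cm)
    n_steps = int(max_change / step)
    for i in range(1, n_steps + 1):
        delta = i * step
        # Try shrinking and growing the tumor by the same amount.
        for candidate in (tumor_size_cm - delta, tumor_size_cm + delta):
            if candidate >= 0 and predicts_high_risk(age, candidate) != original:
                return candidate - tumor_size_cm
    return None  # no counterfactual found within max_change

change = smallest_flip(age=60, tumor_size_cm=2.0)
```

For this toy model, the answer to “What if?” is that the prediction flips from low to high risk once the tumor is roughly 1.1 cm larger, up to the resolution set by `step`.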

Our lab’s research into this type of interpretability is still in its early stages, but one particularly relevant recent paper can be found below.

Integrating Expert ODEs into Neural ODEs: Pharmacology and Disease Progression

Zhaozhi Qian, William R. Zame, Lucas M. Fleuren, Paul Elbers, Mihaela van der Schaar


Type 4 interpretability: transparent risk equations

This approach to interpretability allows us to turn black box models into white boxes by generating transparent risk equations that describe the predictions made by machine learning models. Unlike regression models, this involves mapping non-linear interactions between different features. We can then discard the black box model, and rely on the transparent risk equation to issue predictions.

The bulk of our own work focusing on this type of interpretability has involved symbolic metamodeling frameworks for expressing black-box models in terms of transparent mathematical equations that can be easily understood and analyzed by human subjects. A symbolic metamodel is a model of a model—a surrogate model of a trained (machine learning) model expressed through a succinct symbolic expression that comprises familiar mathematical functions and can be subjected to symbolic manipulation. We first introduced symbolic metamodels in a paper published at NeurIPS 2019.
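
To make the “model of a model” idea concrete, the sketch below fits a transparent expression to a black box by querying it on a grid and choosing the best term from a small library. This is a deliberately simplified stand-in, not the symbolic metamodeling algorithm itself: `black_box`, the candidate library, and the single-coefficient least-squares fit are all assumptions made for illustration.

```python
import math

def black_box(x1, x2):
    # Hypothetical opaque model; in practice this would be a trained network.
    return 2.0 * x1 * x2

# Small library of candidate symbolic terms (an assumption for this sketch).
candidates = {
    "x1 + x2": lambda x1, x2: x1 + x2,
    "x1 * x2": lambda x1, x2: x1 * x2,
    "exp(x1)": lambda x1, x2: math.exp(x1),
}

# Query the black box on a grid of inputs.
grid = [(a / 4, b / 4) for a in range(1, 5) for b in range(1, 5)]
targets = [black_box(x1, x2) for x1, x2 in grid]

def fit_term(term):
    """Least-squares coefficient c for black_box(x) ~ c * term(x), plus squared error."""
    values = [term(x1, x2) for x1, x2 in grid]
    c = sum(t * v for t, v in zip(targets, values)) / sum(v * v for v in values)
    error = sum((t - c * v) ** 2 for t, v in zip(targets, values))
    return c, error

fits = {name: fit_term(term) for name, term in candidates.items()}
best = min(fits, key=lambda name: fits[name][1])
coefficient = fits[best][0]
```

Here the surrogate recovered is `2.0 * x1 * x2`, an equation that exposes the interaction between the two features directly and could be used in place of the black box.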

Demystifying Black-box Models with Symbolic Metamodels

Ahmed Alaa, Mihaela van der Schaar

NeurIPS 2019


We built on our symbolic metamodeling work by developing Symbolic Pursuit, which was first introduced in a paper published at NeurIPS 2020. The Symbolic Pursuit algorithm benefits from the ability to produce parsimonious expressions that involve a small number of terms. Such interpretations permit easy understanding of the relative importance of features and feature interactions.

Learning outside the Black-Box: The pursuit of interpretable models

Jonathan Crabbé, Yao Zhang, William Zame, Mihaela van der Schaar

NeurIPS 2020


It should be noted that transparent risk equations can be applied to the other three types of interpretability listed above. Using patient features as inputs and risk as outputs, we can identify variable importance, classify similarities, discover variable interactions, and enable hypothesis induction.

Type 5 interpretability: concept-based explainability

Human beings tend to use high-level concepts to explain their decisions. The purpose of concept-based explainability is to extend this approach to neural networks. This type of explanation makes it possible to analyse how the model relates high-level concepts defined by the user to its predictions. A typical example is an image classifier that identifies zebras through their stripes. In this example, “zebra” is the model’s prediction and “stripes” is a concept. Concepts can be defined arbitrarily by the user through relevant examples illustrating the concept.
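
A rough sketch of how a concept can be defined from examples is given below. It assumes that we can read off a network’s hidden activations for each example; the two-dimensional activations are made up, and the linear concept direction (a difference of class means) is a simplification chosen for brevity rather than the approach of any specific paper.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concept_direction(positives, negatives):
    """Difference between mean activations of concept and non-concept examples."""
    mu_pos, mu_neg = mean_vector(positives), mean_vector(negatives)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def concept_score(activation, direction, positives, negatives):
    """Positive if the activation lies on the concept side of the midpoint."""
    mu_pos, mu_neg = mean_vector(positives), mean_vector(negatives)
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return sum((a - m) * d for a, m, d in zip(activation, midpoint, direction))

# Hypothetical 2-D hidden activations for images with and without stripes.
striped = [[2.0, 0.1], [2.2, -0.1], [1.8, 0.0]]
plain = [[0.1, 0.0], [-0.1, 0.2], [0.0, -0.2]]

direction = concept_direction(striped, plain)
zebra_activation = [1.9, 0.3]
score = concept_score(zebra_activation, direction, striped, plain)
```

A positive `score` suggests that the “stripes” concept is present in the representation of the zebra image. A linear direction like this only works when concept and non-concept examples are linearly separable in the representation space.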

We have developed an extension of the existing formalism for concept-based explainability, called Concept Activation Regions (CARs). This extension makes it possible to relax stringent assumptions made by previous works, such as the linear separability of concept sets in the neural network’s representation space. We also illustrate the value of concept-based explanations in a medical context by showing that neural networks implicitly rediscover medical concepts, such as the prostate cancer grading system.

Concept Activation Regions: A Generalized Framework For Concept-Based Explanations

Jonathan Crabbé, Mihaela van der Schaar

NeurIPS 2022


Robust and trustworthy interpretations

All the interpretability techniques described above are useful only if they are faithful to the model they explain. Indeed, failing in this basic criterion implies that the explanations could be inconsistent with the true model behaviour, hence leading to false insights about the model. For this reason, we believe that guaranteeing an alignment between interpretability methods and the model is just as important as the interpretability methods themselves.

In a work presented at NeurIPS 2023, we explore this faithfulness through the lens of model symmetries. In particular, we consider neural networks whose predictions are invariant under a specific symmetry group. This includes popular architectures, ranging from convolutional to graph neural networks. Any explanation that faithfully explains this type of model needs to be in agreement with this invariance property. We formalise this intuition through the notion of explanation invariance and equivariance by leveraging the formalism from geometric deep learning.

Through this rigorous formalism, we derive (1) two metrics to measure the robustness of any interpretability method with respect to the model symmetry group; (2) theoretical robustness guarantees for some popular interpretability methods; and (3) a systematic approach to increase the invariance of any interpretability method with respect to a symmetry group. By empirically measuring our metrics for explanations of models associated with various modalities and symmetry groups, we derive a set of 5 guidelines, which we present in depth, to allow users and developers of interpretability methods to produce robust explanations.
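
The intuition behind such a robustness check can be sketched on a toy example. The code below is an illustration under stated assumptions, not the metrics as defined in the paper: the model, the cyclic-shift symmetry group, the two explanation functions, and the use of average cosine similarity as a score are all choices made for this sketch.

```python
import math

def model(x):
    # Toy model invariant under cyclic shifts: the output ignores feature order.
    return sum(v * v for v in x)

def shift(x, k):
    """Cyclic shift by k positions, the symmetry group action in this sketch."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)

def gradient_times_input(x):
    # For the toy model, d(model)/dx_i = 2 * x_i, so attribution_i = 2 * x_i ** 2.
    return [2 * v * v for v in x]

def position_biased(x):
    # Deliberately flawed explanation whose weights depend on feature position.
    return [(i + 1) * v * v for i, v in enumerate(x)]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def equivariance_score(explain, x):
    """Average cosine similarity between shift(explain(x)) and explain(shift(x))."""
    shifts = range(len(x))
    return sum(cosine(shift(explain(x), k), explain(shift(x, k))) for k in shifts) / len(x)

x = [1.0, 2.0, 3.0, 4.0]
robust = equivariance_score(gradient_times_input, x)
fragile = equivariance_score(position_biased, x)
```

An explanation that respects the model’s symmetry scores close to 1, whereas the position-dependent explanation scores lower even though both are explaining the same invariant model.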

Evaluating the Robustness of Interpretability Methods through Explanation Invariance and Equivariance

Jonathan Crabbé, Mihaela van der Schaar

NeurIPS 2023


Peering into the ultimate black box

The bulk of this page has been dedicated to exploring what it means to make machine learning models “interpretable,” and showing how this can be done in a variety of ways. In our view, this is still premised on a relatively blinkered view that ignores some very exciting possibilities for interpretability and machine learning—namely, for humans to use interpretability to understand our own decision-making process.

This possibility is at the heart of quantitative epistemology, a new and transformationally significant research pillar pioneered by our lab. The purpose of this research is to develop a strand of machine learning aimed at understanding, supporting, and improving human decision-making. We aim to do so by building machine learning models of decision-making, including how humans acquire and learn from new information, establish and update their beliefs, and act on the basis of their cumulative knowledge. Because our approach is driven by observational data in studying knowledge as well as using machine learning methods for supporting and improving knowledge acquisition and its impact on decision-making, we call this “quantitative epistemology.”

We develop machine learning models that capture how humans acquire new information, how they pay attention to such information, how their beliefs may be represented, how their internal models may be structured, how these different levels of knowledge are leveraged in the form of actions, and how such knowledge is learned and updated over time. Our methods are aimed at studying human decision-making, identifying potential suboptimalities in beliefs and decision processes (such as cognitive biases, selective attention, imperfect retention of past experience, etc.), and understanding risk attitudes and their implications for learning and decision-making. This would allow us to construct decision support systems that provide humans with information pertinent to their intended actions, their possible alternatives and counterfactual outcomes, as well as other evidence to empower better decision-making.

You can learn more about quantitative epistemology and explore some of our first papers in this area in the article below.

Find out more and get involved

Interpretability is one of the van der Schaar Lab’s core research pillars, and we are constantly pushing forward our understanding of the area. We have produced a great deal of content on the topic, some of which has been shared below.

Codebase for Interpretability

We have gathered relevant code from our lab and beyond into an Interpretability Suite. The GitHub repository for this can be viewed here. The front page of the repository provides information about when a user may want to apply each method, and the repository itself contains an interface to help users implement a few of the methods. A talk introducing this suite of interpretability methods can be viewed at the bottom of this page.

Lecture on interpretability at The Alan Turing Institute and related blog post

This Turing Lecture (delivered March 11, 2020) introduces a number of cutting-edge approaches our lab has developed to turn machine learning’s opaque black boxes into transparent and understandable white boxes. A written companion piece from April 2020 can also be found below.

Roundtables on interpretability with clinicians

In March and April 2021, our lab held two roundtables in which we discussed the topic of interpretability with clinicians.

In our first session, we aimed to have a relatively high-level conversation about different definitions and types of interpretability, whereas the second session focused more on how interpretability can help build trust in machine learning models and benefit healthcare stakeholders. Underlying both of these were two recurring questions: to what degree can interpretable machine learning really benefit healthcare stakeholders, and will it provide the key to acceptance of machine learning technologies?

Both roundtables yielded spirited discussions and remarkable insights that could genuinely change the way we design machine learning models for clinical applications. They can be viewed below.

Rob Davis on the ML Interpretability Suite

This is a quick intro to our Interpretability Suite by Rob Davis, research engineer at CCAIM. It discusses why ML interpretability is so important and shows the array of different methods developed by the van der Schaar Lab and CCAIM that are available on the van der Schaar Lab GitHub.

Click here for the Interpretability Suite

Click here for the SimplEx Demonstrator

Our engagement sessions

We encourage you to stay abreast of ongoing developments in this and other areas of machine learning for healthcare by signing up to take part in one of our two streams of online engagement sessions.

If you are a practicing clinician, please sign up for Revolutionizing Healthcare, which is a forum for members of the clinical community to share ideas and discuss topics that will define the future of machine learning in healthcare (no machine learning experience required).

If you are a machine learning student, you can join our Inspiration Exchange engagement sessions, in which we introduce and discuss new ideas and development of new methods, approaches, and techniques in machine learning for healthcare.

A full list of our papers on interpretability and related topics can be found here.