
Interpretable machine learning


Machine learning is capable of enabling truly personalized healthcare; this is what our lab calls “bespoke medicine.”

More info on bespoke medicine can be found here.

Interpretability is essential to the success of the machine learning and AI models that will make bespoke medicine a reality. Despite its acknowledged importance and value, the actual concept of interpretability has resisted definition and is not well understood.

Our lab has conducted field-leading research into a variety of forms of interpretability for years, and has developed a unique and cohesive framework for categorizing and developing interpretable machine learning models. Our framework is presented on this page, alongside much of the accompanying research, in the hope of advancing the discussion on this crucial topic and inspiring readers to engage in new projects and research.

The content of this page is designed to be accessible and useful to a wide range of readers, from machine learning novices to experts.

You can find our publications on interpretability and explainability here.

This page is one of several introductions to areas that we see as “research pillars” for our lab. It is a living document, and the content here will evolve as we continue to reach out to the machine learning and healthcare communities, building a shared vision for the future of healthcare.

Our primary means of building this shared vision is through two groups of online engagement sessions: Inspiration Exchange (for machine learning students) and Revolutionizing Healthcare (for the healthcare community). If you would like to get involved, please visit the page below.

This page is authored and maintained by Mihaela van der Schaar and Nick Maxfield.


This page proposes a unique and coherent framework for categorizing and developing interpretable machine learning models. We will demonstrate this framework using a range of examples from the van der Schaar Lab’s extensive research into interpretability, and our ongoing interdisciplinary discussions with members of the clinical and other non-ML communities.

First, we will discuss the many potential definitions and uses of interpretability. We will then lay out a framework of four distinct types of interpretability, and explain the potential roles and applications of each type. Finally, we will turn the debate on its head by examining how interpretability can also be applied to understand and support humans, rather than AI and machine learning models.

Interpretability: a concept with clear value but an unclear definition

There are several reasons to make a “black box” machine learning model interpretable. First, an interpretable output can be more readily understood and trusted by its users (for example, clinicians deciding whether to prescribe a treatment), making its outputs more actionable. Second, a model’s outputs often need to be explained by its users to the subjects of those outputs (for example, patients deciding whether to accept a proposed treatment course). Third, by uncovering valuable information that would otherwise have remained hidden within the model’s opaque inner workings, an interpretable output can empower users such as researchers with powerful new insights.

The value of interpretability as a broad concept is, therefore, clear. Yet despite this, the meaning of the term itself is too seldom discussed and too often oversimplified. There is no single “type” of interpretability, after all: there are many potential ways to extract and present information from the output of a model, and many types of information that can be extracted.

This is something we explored in 2018, when we designed a reinforcement learning system capable of learning from its interactions with users and accurately predicting which outputs would maximize their confidence in the underlying clinical risk prediction model. This work was introduced in a paper entitled “What is Interpretable? Using Machine Learning to Design Interpretable Decision-Support Systems.”

What is Interpretable? Using Machine Learning to Design Interpretable Decision-Support Systems

Owen Lahav, Nicholas Mastronarde, Mihaela van der Schaar

Recent efforts in Machine Learning (ML) interpretability have focused on creating methods for explaining black-box ML models. However, these methods rely on the assumption that simple approximations, such as linear models or decision-trees, are inherently human-interpretable, which has not been empirically tested. Additionally, past efforts have focused exclusively on comprehension, neglecting to explore the trust component necessary to convince non-technical experts, such as clinicians, to utilize ML models in practice.

In this paper, we posit that reinforcement learning (RL) can be used to learn what is interpretable to different users and, consequently, build their trust in ML models. To validate this idea, we first train a neural network to provide risk assessments for heart failure patients. We then design an RL-based clinical decision-support system (DSS) around the neural network model, which can learn from its interactions with users. We conduct an experiment involving a diverse set of clinicians from multiple institutions in three different countries.

Our results demonstrate that ML experts cannot accurately predict which system outputs will maximize clinicians’ confidence in the underlying neural network model, and suggest additional findings that have broad implications to the future of research into ML interpretability and the use of ML in medicine.

Our lab has been researching interpretability methods and approaches (for application in healthcare and beyond) for many years. Our work so far has led us to a unique and powerful framework for considering the multiple types of interpretability.

Our framework divides interpretability into four broad “types”:
1) feature importance;
2) similarity classification;
3) unraveled rules and laws; and
4) transparent risk equations.

Each of these types of interpretability represents a distinct set of challenges from a model development perspective, and can benefit different users in a variety of applications. These will be explored below, but an in-depth discussion of each type, driven by insights from colleagues in the clinical community, can be found in a recent article entitled “Making machine learning interpretable: a dialog with clinicians.”

Type 1 interpretability: feature importance

This type of interpretability involves identifying and showing which patient-specific features the machine learning model has considered when issuing a prediction for a patient. We can do this either by identifying features that are important for an entire population or by identifying features the model has considered specifically for the patient at hand.
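
To make the distinction concrete, the short Python sketch below contrasts the two views using common stand-ins: permutation importance for population-level importance and a simple input-gradient saliency for patient-level importance. The model, features, and data are synthetic placeholders rather than any of the methods discussed on this page.

```python
# Minimal sketch: population-level vs patient-level feature importance.
# The risk model, features, and data below are synthetic placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(500, 4)                      # 500 patients, 4 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).float()    # synthetic outcome

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):                         # quick training loop
    opt.zero_grad()
    F.binary_cross_entropy_with_logits(model(X).squeeze(), y).backward()
    opt.step()

def global_importance(model, X, y):
    """Population-level: permute one feature at a time, measure loss increase."""
    base = F.binary_cross_entropy_with_logits(model(X).squeeze(), y).item()
    scores = []
    for j in range(X.shape[1]):
        Xp = X.clone()
        Xp[:, j] = Xp[torch.randperm(len(X)), j]
        scores.append(
            F.binary_cross_entropy_with_logits(model(Xp).squeeze(), y).item() - base)
    return scores

def patient_saliency(model, x):
    """Patient-level: gradient of the risk score w.r.t. one patient's features."""
    x = x.clone().requires_grad_(True)
    model(x.unsqueeze(0)).squeeze().backward()
    return x.grad.abs()

print("global importance per feature:", global_importance(model, X, y))
print("patient-level saliency:", patient_saliency(model, X[0]))
```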

Our lab has already developed a number of models offering this type of interpretability. One such approach is INVASE, which was first introduced in a paper published at ICLR 2019.

INVASE: Instance-wise Variable Selection using Neural Networks

Jinsung Yoon, James Jordon, Mihaela van der Schaar

ICLR 2019

The advent of big data brings with it data with more and more dimensions and thus a growing need to be able to efficiently select which features to use for a variety of problems. While global feature selection has been a well-studied problem for quite some time, only recently has the paradigm of instance-wise feature selection been developed.

In this paper, we propose a new instance-wise feature selection method, which we term INVASE. INVASE consists of 3 neural networks: a selector network, a predictor network and a baseline network, which are used to train the selector network using the actor-critic methodology. Using this methodology, INVASE is capable of flexibly discovering feature subsets of a different size for each instance, addressing a key limitation of existing state-of-the-art methods.

We demonstrate through a mixture of synthetic and real data experiments that INVASE significantly outperforms state-of-the-art benchmarks.
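
The sketch below illustrates the selector/predictor/baseline structure described in the abstract in a heavily simplified form. The network sizes, the sparsity weight, and the training loop are illustrative assumptions and do not reproduce the published implementation.

```python
# Simplified sketch of an INVASE-style selector/predictor/baseline setup.
# Network sizes, the sparsity weight lam, and the training loop are
# illustrative choices, not the published configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 10                                           # number of features
selector  = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d))
predictor = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
baseline  = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam([*selector.parameters(),
                        *predictor.parameters(),
                        *baseline.parameters()], lr=1e-3)
lam = 0.1                                        # sparsity penalty weight

def train_step(x, y):
    probs = torch.sigmoid(selector(x))           # per-feature selection probabilities
    mask = torch.bernoulli(probs).detach()       # sampled binary mask
    pred_loss = F.binary_cross_entropy_with_logits(
        predictor(x * mask).squeeze(-1), y, reduction="none")
    base_loss = F.binary_cross_entropy_with_logits(
        baseline(x).squeeze(-1), y, reduction="none")
    # Policy-gradient-style update: favour masks whose masked predictor does
    # no worse than the full-input baseline, while keeping the mask sparse.
    advantage = (pred_loss - base_loss).detach()
    log_prob = (mask * torch.log(probs + 1e-8)
                + (1 - mask) * torch.log(1 - probs + 1e-8)).sum(-1)
    selector_loss = (advantage * log_prob).mean() + lam * probs.mean()
    loss = pred_loss.mean() + base_loss.mean() + selector_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return mask                                  # per-patient selected features

x = torch.randn(64, d)
y = (x[:, 0] * x[:, 1] > 0).float()              # synthetic outcome
for _ in range(200):
    mask = train_step(x, y)
print("features selected for patient 0:", mask[0].nonzero().squeeze(-1).tolist())
```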

We have continued to make progress in developing methods that offer interpretations based on explanatory patient features. In a paper recently accepted for publication at ICML 2021, for example, we introduced an approach specifically designed for multivariate time series, using saliency masks to identify and highlight important features at each time step.

Explaining Time Series Predictions with Dynamic Masks

Jonathan Crabbé, Mihaela van der Schaar

ICML 2021

How can we explain the predictions of a machine learning model? When the data is structured as a multivariate time series, this question induces additional difficulties such as the necessity for the explanation to embody the time dependency and the large number of inputs.

To address these challenges, we propose dynamic masks (Dynamask). This method produces instance-wise importance scores for each feature at each time step by fitting a perturbation mask to the input sequence. In order to incorporate the time dependency of the data, Dynamask studies the effects of dynamic perturbation operators. In order to tackle the large number of inputs, we propose a scheme to make the feature selection parsimonious (to select no more features than necessary) and legible (a notion that we detail by making a parallel with information theory).

With synthetic and real-world data, we demonstrate that the dynamic underpinning of Dynamask, together with its parsimony, offer a neat improvement in the identification of feature importance over time. The modularity of Dynamask makes it ideal as a plug-in to increase the transparency of a wide range of machine learning models in areas such as medicine and finance, where time series are abundant.
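
The following sketch shows the core idea of fitting a perturbation mask to a time series: masked-out entries are replaced by a temporal moving average, and the mask is optimized to preserve the model’s prediction while selecting as little of the input as possible. The perturbation operator, window size, and penalty weight are simplified illustrative choices rather than the published Dynamask configuration, and `model` here is any placeholder time-series predictor.

```python
# Simplified sketch of fitting a perturbation mask to a time series: masked-out
# entries are replaced by a temporal moving average, and the mask is optimised
# to keep the model's prediction while selecting as little input as possible.
# `model`, the window size, and the penalty weight lam are assumptions.
import torch
import torch.nn.functional as F

def moving_average(x, window=5):
    # x: (T, D) -> temporally smoothed signal used as the "uninformative" baseline
    pad = window // 2
    xp = F.pad(x.t().unsqueeze(0), (pad, pad), mode="replicate")
    return F.avg_pool1d(xp, kernel_size=window, stride=1).squeeze(0).t()

def fit_mask(model, x, steps=300, lam=0.1):
    # x: (T, D) input sequence; model maps a (1, T, D) batch to a prediction
    target = model(x.unsqueeze(0)).detach()
    logits = torch.zeros_like(x, requires_grad=True)      # mask parameters
    baseline = moving_average(x)
    opt = torch.optim.Adam([logits], lr=0.05)
    for _ in range(steps):
        m = torch.sigmoid(logits)                         # mask in [0, 1]^(T x D)
        perturbed = m * x + (1 - m) * baseline            # dynamic perturbation
        loss = F.mse_loss(model(perturbed.unsqueeze(0)), target) + lam * m.mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(logits).detach()                 # importance per (t, d)
```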

Clinicians have explained to us that this type of interpretability would be particularly valuable to them: since they are required to work out the best way to treat a patient, it is helpful to understand the features that influenced a model’s output. By contrast, clinicians see the value of this type of interpretability for patients as lower. Patients may not consider it particularly useful to know the relative importance of their features: for example, a patient may not benefit from knowing that the most important features determining her cancer mortality risk are her age and ER (estrogen receptor) status.

Type 2 interpretability: similarity classification

Through similarity classification, we seek to identify and explain which similar patients a machine learning model has given the same (or different) predictions. An approach based on similarity classification would involve cross-referencing the black box model’s prediction with available observational data on the features and outcomes of similar patients, and then explaining the model’s prediction in terms of those features and outcomes.
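
As a minimal illustration of this idea, the sketch below explains a prediction by retrieving the most similar patients in a reference cohort and reporting their observed outcomes alongside the model’s output. The Euclidean distance and neighborhood size are illustrative choices only, and `model` is a placeholder risk predictor.

```python
# Minimal sketch of a similarity-based explanation: retrieve the patients most
# similar to the one being scored and report their observed outcomes alongside
# the model's prediction. Euclidean distance and k=5 are illustrative choices.
import numpy as np

def explain_by_similarity(model, x_new, X_ref, y_ref, k=5):
    dists = np.linalg.norm(X_ref - x_new, axis=1)   # distance to each reference patient
    nearest = np.argsort(dists)[:k]                 # indices of the k most similar
    return {
        "model_prediction": float(model(x_new)),
        "similar_patients": nearest.tolist(),
        "their_outcomes": y_ref[nearest].tolist(),  # what actually happened to them
        "outcome_rate_among_similar": float(y_ref[nearest].mean()),
    }
```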

Several of our lab’s projects have sought to provide interpretable explanations based on similarity classification. Most notable among these is an approach using deep learning to cluster time series data, where each cluster comprises patients who share similar future outcomes of interest. This was introduced in a paper published at ICML 2020.

Temporal Phenotyping using Deep Predictive Clustering of Disease Progression

Changhee Lee, Mihaela van der Schaar

ICML 2020

Due to the wider availability of modern electronic health records, patient care data is often being stored in the form of time-series. Clustering such time-series data is crucial for patient phenotyping, anticipating patients’ prognoses by identifying “similar” patients, and designing treatment guidelines that are tailored to homogeneous patient subgroups.

In this paper, we develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest (e.g., adverse events, the onset of comorbidities). To encourage each cluster to have homogeneous future outcomes, the clustering is carried out by learning discrete representations that best describe the future outcome distribution based on novel loss functions.

Experiments on two real-world datasets show that our model achieves superior clustering performance over state-of-the-art benchmarks and identifies meaningful clusters that can be translated into actionable information for clinical decision-making.
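
The sketch below captures the outcome-driven clustering idea in a heavily simplified form: a recurrent encoder embeds each patient’s history, the embedding is assigned to its nearest learnable cluster centroid, and the outcome of interest is predicted from that centroid, so that clusters are pushed towards homogeneous future outcomes. The architecture sizes, loss terms, and weights are illustrative assumptions, not the published model.

```python
# Heavily simplified sketch of outcome-driven temporal clustering: a GRU encodes
# each patient's history, the embedding is assigned to its nearest learnable
# centroid, and the outcome is predicted from that centroid, so clusters are
# pushed towards homogeneous future outcomes. Sizes and loss weights are
# illustrative; this is not the published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveClustering(nn.Module):
    def __init__(self, n_features, n_clusters=4, hidden=32):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.centroids = nn.Parameter(torch.randn(n_clusters, hidden))
        self.outcome_head = nn.Linear(hidden, 1)       # outcome from the centroid

    def forward(self, x):                              # x: (B, T, n_features)
        _, h = self.encoder(x)
        z = h.squeeze(0)                               # (B, hidden) patient embedding
        d = torch.cdist(z, self.centroids)             # distance to each cluster
        assign = d.argmin(dim=1)                       # hard cluster assignment
        logits = self.outcome_head(self.centroids[assign]).squeeze(-1)
        return logits, assign, d

model = PredictiveClustering(n_features=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 20, 8)                             # 64 patients, 20 visits each
y = torch.randint(0, 2, (64,)).float()                 # future outcome of interest
for _ in range(200):
    logits, assign, d = model(x)
    # Outcome loss keeps clusters predictive of the future; the distance term
    # keeps each embedding close to its assigned centroid.
    loss = (F.binary_cross_entropy_with_logits(logits, y)
            + 0.1 * d.gather(1, assign.unsqueeze(1)).mean())
    opt.zero_grad(); loss.backward(); opt.step()
```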

In addition to the paper above, we have a number of other research projects related to similarity classification underway at the time of writing.

In our discussions with clinicians, they generally felt that this type of interpretability has far more value to patients than feature importance (type 1). Patients generally find it easier to make a decision based on a prediction or recommendation when it is explained with reference to similarities or differences with patients like them.

Type 3 interpretability: unraveled rules and laws

With this type of interpretability, we seek to discover “rules” and “laws” learned by the machine learning model. These can be in the form of decision rules, or even “counterfactual” explanations in the form of “What if?” question-answer pairs that describe the smallest adjustment to the patient’s features that would change the model’s prediction to a predefined output. For example, a clinician could use this type of interpretability to establish the smallest difference in tumor size that would change the model’s prediction for a patient with cancer.
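
A minimal sketch of such a counterfactual query is shown below: a simple search for the smallest change to a single feature that moves the model’s predicted risk across a decision threshold. The feature index, step size, and threshold are illustrative assumptions, and `model` is a placeholder risk predictor.

```python
# Minimal sketch of a counterfactual ("what if?") explanation: find the
# smallest change to a single feature that moves the model's predicted risk
# across a decision threshold. The feature index, step size, and threshold
# are illustrative assumptions; x is a NumPy feature vector.

def smallest_flip(model, x, feature_idx, threshold=0.5, step=0.01, max_delta=5.0):
    original_call = model(x) >= threshold
    delta = step
    while delta <= max_delta:
        for signed in (delta, -delta):              # try both directions
            x_cf = x.copy()
            x_cf[feature_idx] += signed
            if (model(x_cf) >= threshold) != original_call:
                return signed, x_cf                 # smallest change that flips the call
        delta += step
    return None, None                               # no flip within the search range
```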

Our lab’s research into this type of interpretability is at an earlier stage, but one particularly relevant recent paper can be found below.

Integrating Expert ODEs into Neural ODEs: Pharmacology and Disease Progression

Zhaozhi Qian, William R. Zame, Lucas M. Fleuren, Paul Elbers, Mihaela van der Schaar

Modeling a system’s temporal behaviour in reaction to external stimuli is a fundamental problem in many areas. Pure Machine Learning (ML) approaches often fail in the small sample regime and cannot provide actionable insights beyond predictions. A promising modification has been to incorporate expert domain knowledge into ML models.

The application we consider is predicting the progression of disease under medications, where a plethora of domain knowledge is available from pharmacology. Pharmacological models describe the dynamics of carefully-chosen medically meaningful variables in terms of systems of Ordinary Differential Equations (ODEs). However, these models only describe a limited collection of variables, and these variables are often not observable in clinical environments. To close this gap, we propose the latent hybridisation model (LHM) that integrates a system of expert-designed ODEs with machine-learned Neural ODEs to fully describe the dynamics of the system and to link the expert and latent variables to observable quantities.

We evaluated LHM on synthetic data as well as real-world intensive care data of COVID-19 patients. LHM consistently outperforms previous works, especially when few training samples are available such as at the beginning of the pandemic.
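
The sketch below illustrates the hybridisation idea in a heavily simplified form: an expert-specified variable follows a hand-written ODE (a one-compartment drug-elimination model, chosen purely for illustration), latent variables follow a small neural network, and both are integrated jointly with a simple Euler scheme. This is not the published LHM architecture; the dynamics, sizes, and integration scheme are illustrative assumptions.

```python
# Simplified sketch of mixing expert-specified dynamics with a learned neural
# component: the expert variable follows a hand-written one-compartment
# drug-elimination ODE (chosen purely for illustration), the latent variables
# follow a small neural network, and both are integrated jointly with a simple
# Euler scheme. This is not the published LHM architecture.
import torch
import torch.nn as nn

class HybridODE(nn.Module):
    def __init__(self, latent_dim=4, k_elim=0.3):
        super().__init__()
        self.k_elim = k_elim                               # expert-known elimination rate
        self.latent_dynamics = nn.Sequential(              # learned latent component
            nn.Linear(1 + latent_dim, 32), nn.Tanh(),
            nn.Linear(32, latent_dim))
        self.readout = nn.Linear(1 + latent_dim, 1)        # maps state to an observable

    def forward(self, drug0, z0, n_steps=50, dt=0.1):
        drug, z, outputs = drug0, z0, []
        for _ in range(n_steps):
            d_drug = -self.k_elim * drug                   # expert ODE: dD/dt = -k * D
            d_z = self.latent_dynamics(torch.cat([drug, z], dim=-1))
            drug = drug + dt * d_drug                      # Euler integration step
            z = z + dt * d_z
            outputs.append(self.readout(torch.cat([drug, z], dim=-1)))
        return torch.stack(outputs, dim=1)                 # (batch, n_steps, 1)

model = HybridODE()
trajectory = model(torch.ones(8, 1), torch.zeros(8, 4))    # 8 synthetic patients
```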

Type 4 interpretability: transparent risk equations

This approach to interpretability allows us to turn black box models into white boxes by generating transparent risk equations that describe the predictions made by machine learning models. Unlike standard regression models, these equations capture the non-linear interactions between different features that the model has learned. We can then discard the black box model and rely on the transparent risk equation to issue predictions.

The bulk of our own work focusing on this type of interpretability has involved symbolic metamodeling frameworks for expressing black-box models in terms of transparent mathematical equations that can be easily understood and analyzed by human subjects. A symbolic metamodel is a model of a model—a surrogate model of a trained (machine learning) model expressed through a succinct symbolic expression that comprises familiar mathematical functions and can be subjected to symbolic manipulation. We first introduced symbolic metamodels in a paper published at NeurIPS 2019.

Demystifying Black-box Models with Symbolic Metamodels

Ahmed Alaa, Mihaela van der Schaar

NeurIPS 2019

Understanding the predictions of a machine learning model can be as crucial as the model’s accuracy in many application domains. However, the black-box nature of most highly-accurate (complex) models is a major hindrance to their interpretability.

To address this issue, we introduce the symbolic metamodeling framework — a general methodology for interpreting predictions by converting “black-box” models into “white-box” functions that are understandable to human subjects. A symbolic metamodel is a model of a model, i.e., a surrogate model of a trained (machine learning) model expressed through a succinct symbolic expression that comprises familiar mathematical functions and can be subjected to symbolic manipulation.

We parameterize symbolic metamodels using Meijer G-functions — a class of complex-valued contour integrals that depend on scalar parameters, and whose solutions reduce to familiar elementary, algebraic, analytic and closed-form functions for different parameter settings. This parameterization enables efficient optimization of metamodels via gradient descent, and allows discovering the functional forms learned by a machine learning model with minimal a priori assumptions.

We show that symbolic metamodeling provides an all-encompassing framework for model interpretation — all common forms of global and local explanations of a model can be analytically derived from its symbolic metamodel.
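
To convey the metamodelling idea without the machinery of Meijer G-functions, the sketch below fits a deliberately simple symbolic family (a degree-2 polynomial) to a black-box model’s own predictions and prints the result as an explicit equation. The black-box model and data are synthetic placeholders, and the polynomial family is a stand-in for the much richer function class used in the paper.

```python
# Sketch of the metamodelling idea with a deliberately simple symbolic family:
# a degree-2 polynomial is fitted to the black-box model's own predictions and
# printed as an explicit equation. The published framework uses Meijer
# G-functions rather than polynomials; the black box and data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.exp(-X[:, 0] ** 2) + 0.5 * X[:, 1] * X[:, 2]        # synthetic ground truth

black_box = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Fit the surrogate to the black box's predictions, not the raw labels:
# the goal is to describe the trained model itself.
poly = PolynomialFeatures(degree=2, include_bias=True)
Z = poly.fit_transform(X)
surrogate = LinearRegression(fit_intercept=False).fit(Z, black_box.predict(X))

terms = poly.get_feature_names_out(["x1", "x2", "x3"])
equation = " + ".join(f"{c:+.3f}*{t}" for c, t in zip(surrogate.coef_, terms))
print("metamodel:", equation)
```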

We built on our symbolic metamodeling work by developing Symbolic Pursuit, which was first introduced in a paper published at NeurIPS 2020. The Symbolic Pursuit algorithm produces parsimonious expressions involving only a small number of terms; such interpretations permit easy understanding of the relative importance of features and feature interactions.

Learning outside the Black-Box: The pursuit of interpretable models

Jonathan Crabbe, Yao Zhang, William Zame, Mihaela van der Schaar

NeurIPS 2020

Machine learning has proved its ability to produce accurate models — but the deployment of these models outside the machine learning community has been hindered by the difficulties of interpreting these models.

This paper proposes an algorithm that produces a continuous global interpretation of any given continuous black-box function. Our algorithm employs a variation of projection pursuit in which the ridge functions are chosen to be Meijer G-functions, rather than the usual polynomial splines. Because Meijer G-functions are differentiable in their parameters, we can “tune” the parameters of the representation by gradient descent; as a consequence, our algorithm is efficient.

Using five familiar data sets from the UCI repository and two familiar machine learning algorithms, we demonstrate that our algorithm produces global interpretations that are both faithful (highly accurate) and parsimonious (involve a small number of terms). Our interpretations permit easy understanding of the relative importance of features and feature interactions. Our interpretation algorithm represents a leap forward from the previous state of the art.
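
The sketch below mirrors the projection-pursuit loop described in the abstract, with a simple tanh ridge function standing in for the Meijer G-functions: each stage fits one ridge term to the residual left by the terms found so far. The black-box function `f`, the number of terms, and the optimizer settings are illustrative assumptions.

```python
# Sketch of a projection-pursuit style loop with a simple tanh ridge function
# standing in for the Meijer G-functions used by Symbolic Pursuit: each stage
# fits one ridge term a * tanh(w.x + b) to the residual left by the previous
# terms. The black-box f, the number of terms, and the optimiser settings are
# illustrative assumptions.
import torch

def fit_ridge_terms(f, X, n_terms=3, steps=2000, lr=0.05):
    residual = f(X).detach()                         # start from the black-box output
    terms = []
    for _ in range(n_terms):
        w = (0.1 * torch.randn(X.shape[1])).requires_grad_(True)
        b = torch.zeros(1, requires_grad=True)
        a = torch.ones(1, requires_grad=True)
        opt = torch.optim.Adam([w, b, a], lr=lr)
        for _ in range(steps):
            pred = a * torch.tanh(X @ w + b)         # one ridge function
            loss = ((pred - residual) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            residual = residual - a * torch.tanh(X @ w + b)
        terms.append((a.detach(), w.detach(), b.detach()))
    return terms                                     # interpretation = sum of ridge terms
```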

It should be noted that transparent risk equations can also deliver the other three types of interpretability listed above. Using patient features as inputs and risk as outputs, we can identify variable importance, classify similarities, discover variable interactions, and enable hypothesis induction.

Peering into the ultimate black box

The bulk of this page has been dedicated to exploring what it means to make machine learning models “interpretable,” and showing how this can be done in a variety of ways. In our view, this is still premised on a relatively blinkered view that ignores some very exciting possibilities for interpretability and machine learning—namely, for humans to use interpretability to understand our own decision-making process.

This possibility is at the heart of quantitative epistemology, a new and transformationally significant research pillar pioneered by our lab. The purpose of this research is to develop a strand of machine learning aimed at understanding, supporting, and improving human decision-making. We aim to do so by building machine learning models of decision-making, including how humans acquire and learn from new information, establish and update their beliefs, and act on the basis of their cumulative knowledge. Because our approach is driven by observational data in studying knowledge as well as using machine learning methods for supporting and improving knowledge acquisition and its impact on decision-making, we call this “quantitative epistemology.”

We develop machine learning models that capture how humans acquire new information, how they pay attention to such information, how their beliefs may be represented, how their internal models may be structured, how these different levels of knowledge are leveraged in the form of actions, and how such knowledge is learned and updated over time. Our methods are aimed at studying human decision-making, identifying potential suboptimalities in beliefs and decision processes (such as cognitive biases, selective attention, imperfect retention of past experience, etc.), and understanding risk attitudes and their implications for learning and decision-making. This would allow us to construct decision support systems that provide humans with information pertinent to their intended actions, their possible alternatives and counterfactual outcomes, as well as other evidence to empower better decision-making.

You can learn more about quantitative epistemology and explore some of our first papers in this area in the article below.

Find out more and get involved

Interpretability is one of the van der Schaar Lab’s core research pillars, and we are constantly pushing forward our understanding of the area. We have produced a great deal of content on the topic, some of which has been shared below.

Lecture on interpretability at The Alan Turing Institute and related blog post

A Turing Lecture (delivered March 11, 2020) introducing a number of cutting-edge approaches our lab has developed to turn machine learning’s opaque black boxes into transparent and understandable white boxes. A written companion piece from April 2020 can also be found below.

Roundtables on interpretability with clinicians

In March and April 2021, our lab held two roundtables in which we discussed the topic of interpretability with clinicians.

In our first session, we aimed to have a relatively high-level conversation about different definitions and types of interpretability, whereas the second session focused more on how interpretability can help build trust in machine learning models and benefit healthcare stakeholders. Underlying both of these were two recurring questions: to what degree can interpretable machine learning really benefit healthcare stakeholders, and will it provide the key to acceptance of machine learning technologies?

Both roundtables yielded spirited discussions and remarkable insights that could genuinely change the way we design machine learning models for clinical applications. They can be viewed below.

Our engagement sessions

We encourage you to stay abreast of ongoing developments in this and other areas of machine learning for healthcare by signing up to take part in one of our two streams of online engagement sessions.

If you are a practicing clinician, please sign up for Revolutionizing Healthcare, which is a forum for members of the clinical community to share ideas and discuss topics that will define the future of machine learning in healthcare (no machine learning experience required).

If you are a machine learning student, you can join our Inspiration Exchange engagement sessions, in which we introduce and discuss new ideas and development of new methods, approaches, and techniques in machine learning for healthcare.

A full list of our papers on interpretability and related topics can be found here.