van der Schaar Lab

AutoML: powering the new human-machine learning ecosystem

This page is the first in a series of pieces of long-form content highlighting the van der Schaar Lab’s primary “research pillars.”

As a group that develops new and powerful machine learning tools and techniques for healthcare, our aim is to ensure that these are put to practical use, and that this is done in the service of a longer-term vision.

It is our hope that this page, and others that will follow it, will help articulate that vision and encourage the machine learning and healthcare communities to implement new machine learning methods and transform healthcare.

As a “living document,” this page will continue to grow and evolve. Over the coming months, we will reach out to the machine learning and healthcare communities, and will flesh out our vision for the future of AutoML for healthcare based on the discussions we have.

To do this, we will hold two groups of online engagement sessions: Inspiration Exchange (for machine learning students) and Revolutionizing Healthcare (for the healthcare community). The first Inspiration Exchange session will be held on September 2.

This page is authored and maintained by Mihaela van der Schaar and Nick Maxfield.


Automated machine learning (AutoML) will play an extremely important role in the future of machine learning. It provides an essential path to enabling the widespread adoption of machine learning by specialists without machine learning expertise. In healthcare, AutoML can empower the clinical community by enabling the crafting of actionable analytics that inform and improve decision-making, benefiting users from clinicians to administrators, researchers, policymakers, and beyond. In fact, the adoption of recent AutoML techniques could already empower healthcare professionals, and offer significant improvement over the norms and technologies that are currently prevalent in healthcare.

At the same time, AutoML is a broad, nebulous, and often misunderstood area that is still in development, and its potential isn’t even close to being fully realized yet.

This page offers a very quick introduction to automated machine learning (AutoML), and describes some of our lab’s pioneering work in the area. More importantly, we explore the purpose, implications, and limitations of AutoML, and share our vision for this complex but potent area of machine learning. While we write primarily about healthcare (our lab’s specialization), the methods and ideas discussed are applicable in many other areas in which machine learning can be applied.

The content here is intended to be accessible (at least in its broad strokes) to someone with a limited background in machine learning or other quantitative disciplines. We have added optional “focus” sections, which contain in-depth explanations intended for the machine learning community and those who want to dig a little deeper.

This page initially uses relatively simple classification and prediction problems to introduce the basics of AutoML. This is primarily to avoid information overload. The lab’s work on machine learning for healthcare extends beyond such problems, and, in fact, centers around the development of ways to provide much-needed clinical analytics to support decision-making in healthcare, such as survival analysis, personalized treatment effect estimation, and dynamic (time-sensitive) forecasting from time-series data.

What is AutoML?

As the name suggests, AutoML involves building machine learning methods that can create optimized and effective machine learning models.

The best way to get a basic understanding of AutoML is to start by considering the fundamental issues and limitations of machine learning models that are “hand-crafted”—meaning models that are constructed on an ad-hoc basis to address a specific problem.

Hand-crafted machine learning models: one answer to one question

Crafting a machine learning model involves piecing together many complex components and steps (such as missing data imputation, feature engineering, classification, and calibration). For each step, there are numerous algorithms and hyperparameters to choose from. As those familiar with the “no free lunch” theorem will know, no universally applicable “best” algorithm for all problems exists: the effectiveness of any algorithm at any stage will depend on a number of factors, including the dataset at hand and the nature of the problem itself.

Questions facing anyone trying to build a machine learning model include:
– Deciding whether certain steps (such as missing data imputation, or uncertainty estimation) are necessary or not;
– Deciding which algorithm to use for each step, and fine-tuning hyperparameters for each selected algorithm; and
– Assessing the impact of algorithm-hyperparameter selection on other steps within the same model.

As a result, creating a useful machine learning model often involves a painstaking and cyclical process of analyzing the data (e.g. visualization and statistics) and performing fine-tuning, in which combinations of algorithms and hyperparameters are tested over and over again until the combination that best fits the dataset is (eventually) found. This is often extremely time-consuming, and can generally only be done efficiently by experts (thanks to their prior knowledge). Even then, such models may well be prone to biases and assumptions on the part of experts, potentially undermining performance and reliability.
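This cyclical fine-tuning can be sketched as an exhaustive search over component choices. Everything below (the component menus, the `evaluate` scoring stub, and its toy scores) is hypothetical, standing in for real cross-validated evaluation of each candidate pipeline:

```python
import itertools

# Hypothetical menus of components at two pipeline stages, plus
# per-classifier hyperparameter candidates.
imputers = ["mean", "knn"]
classifiers = ["logistic", "random_forest"]
hyperparams = {"logistic": [0.1, 1.0], "random_forest": [50, 200]}

def evaluate(imputer, clf, hp):
    """Stand-in for cross-validated scoring of one pipeline (toy values)."""
    base = {"mean": 0.70, "knn": 0.72}[imputer]
    bonus = {"logistic": 0.05, "random_forest": 0.08}[clf]
    return base + bonus + 0.01 * hyperparams[clf].index(hp)

# The painstaking loop: try every combination, keep the best.
best_score, best_pipeline = -1.0, None
for imputer, clf in itertools.product(imputers, classifiers):
    for hp in hyperparams[clf]:
        score = evaluate(imputer, clf, hp)
        if score > best_score:
            best_score, best_pipeline = score, (imputer, clf, hp)

print(best_pipeline, round(best_score, 2))
```

Even this tiny example must evaluate every combination; with realistic menus and continuous hyperparameters, exhaustive search quickly becomes infeasible, which is the gap AutoML fills.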

Occasionally, those less familiar with machine learning seek to skip this process entirely by picking “well-known” algorithms that have perhaps been used in similar research (such as Random Forest or a specific type of deep learning algorithm) or by only focusing on a certain type of machine learning model (such as classification for risk prediction) without realizing that machine learning can offer numerous other useful analytics (such as dynamic forecasting, estimation of treatment effects or of the value of information acquisition). The former approach achieves results equivalent to putting the wrong kind of fuel in your car and trying to drive it; the latter is like installing an engine but opting to forego wheels. Unfortunately, the suboptimal performance of a lot of models created in this manner can have an (undeserved) adverse impact on the broader reputation of machine learning.

The problems outlined above have held back the progress of machine learning in several ways:
– Machine learning is out of reach for non-expert end users (i.e. individuals with only a rudimentary knowledge of machine learning) and smaller organizations without access to expert users.
– The process of constructing a hand-crafted model is dependent on assumptions, and therefore may yield poor results.
– The time-consuming process of hand-crafting a model needs to be repeated for every new problem or dataset, since the best model will vary (for example, the COVID-19 pandemic has created the need to address new questions with new data, while producing new kinds of analytics and predictions).

This is why many in the field of machine learning have heralded AutoML as a potential solution that could allow non-expert users to create flexible high-quality models to tackle new problems.

AutoML: one answer to many questions

The first precursors to AutoML emerged in the 1990s, and focused primarily on automation of parameter optimization and (to a limited degree) model selection. If we fast-forward through a couple of decades of development, we have now reached a point at which machine learning itself can play a role in practically the entire model construction process, including feature engineering, algorithm selection and hyperparameter optimization, model training, and evaluation (insofar as AutoML enables evaluation with multiple performance metrics).

Among those familiar with the drawbacks of hand-crafted machine learning models, AutoML has quickly become an appealing prospect. It has the potential to liberate data scientists from the chore of manual model tuning, allowing them to focus on more important areas (such as new algorithm development) while also enabling faster iteration and delivery of results. It’s versatile, efficient, and could eventually democratize machine learning by putting powerful tools in the hands of end users.

The potential of AutoML in healthcare

This page has, so far, introduced and examined the broad potential of AutoML without delving into healthcare as a specific domain for application. Healthcare is, of course, the focus of our lab’s work, and presents some unique—and quite fascinating—problems to solve from a machine learning perspective, as described below.

Even in its current form, AutoML can build analytics that can assist decision-making. We are starting to see the emergence of AutoML that can go beyond providing a single answer to a single question, offering an entire analytical system for existing diseases, new diseases, risk scores, treatment effect estimation, healthcare management, and much more.

At the same time, healthcare is a domain in which conventional AutoML frameworks and approaches struggle to function reliably and efficiently.

The sections below will outline our own lab’s work to apply AutoML in healthcare by i) developing AutoML frameworks and tools and ii) engineering new ML tools that synergize particularly well with the aims and capabilities of AutoML.

Shortcomings of many AutoML frameworks when applied in healthcare

Despite the many notable achievements of AutoML methods, these methods achieve only limited performance gains when applied off-the-shelf to clinical datasets.

Most currently prevalent AutoML methods are focused on standard classification or regression problems for risk prediction. By contrast, clinicians and healthcare researchers need a more versatile analytics toolkit. This toolkit must include AutoML methods capable of survival analysis, screening, monitoring, forecasting of disease trajectories, dealing with competing risks, estimation of treatment effects, and more—and its outputs must all be personalized to the specific patient at hand. These tasks pose substantial and complex challenges that many AutoML approaches cannot handle.

In addition, even for standard problems such as classification and regression, many AutoML frameworks currently fail to adequately account for key problems in healthcare—such as dealing with missing data and providing uncertainty estimates associated with each (risk) prediction.

AutoPrognosis: the first comprehensive analytic toolkit for healthcare

For researchers in the van der Schaar Lab, the lack of an AutoML tool capable of effectively supporting the healthcare community presented both a challenge and an opportunity; after all, the lab exists to create cutting-edge machine learning methods and apply them to drive a revolution in healthcare. Since the lab collaborates extensively with clinicians and clinical researchers, we are particularly well-positioned to understand the very specific needs of the healthcare community and reflect them in the models and techniques we create.

Our first major breakthrough in this area was AutoPrognosis, initially presented in a paper for the 2018 International Conference on Machine Learning (ICML). The name leaves little to the imagination: AutoPrognosis is a framework that applies the principles of AutoML to the medical area of prognosis—the calculation of risk of future health outcomes in patients with given features. More specifically, AutoPrognosis was created to automate the design of actionable predictive models that can inform clinicians about the future course of patients’ clinical conditions in order to guide screening and therapeutic decisions.

The core component of AutoPrognosis is an algorithm that automatically configures machine learning pipelines, with each pipeline comprising a combination of algorithms for missing data imputation, feature processing, classification, and calibration, as shown schematically above. The imputation and calibration stages are particularly important for clinical prognostic modeling, and are not supported in many other AutoML frameworks.

The design space of AutoPrognosis contains 5,460 possible machine learning pipelines (7 possible imputation algorithms, 9 feature processing algorithms, 20 classification algorithms, and 3 calibration methods). New algorithms can be—and are—added to the mix whenever needed. While this makes for a potent toolkit, it also means that finding the best pipeline and tuning its parameters becomes a complex optimization problem. Furthermore, since the utility of any given selection of algorithms and parameters is unknown when working with a new dataset, it must be learned. These issues were solved through the joint application of existing Bayesian Optimization (BO) techniques and a new approach called structured kernel learning, an explanation of which is offered in the box below.
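The combinatorics can be illustrated with a toy design space. The stage names and counts below are small placeholders, not AutoPrognosis’s actual menus; every pipeline is one choice per stage, so the design space is the Cartesian product of the stage menus:

```python
import itertools

# Illustrative (not actual) menus for each pipeline stage.
stages = {
    "imputation": ["mean", "MICE", "matrix_completion"],
    "feature_processing": ["none", "PCA", "feature_agglomeration"],
    "classification": ["logistic_regression", "random_forest",
                       "gradient_boosting", "neural_net"],
    "calibration": ["none", "sigmoid", "isotonic"],
}

# The design space is the product of the per-stage menus.
pipelines = list(itertools.product(*stages.values()))
print(len(pipelines))  # 3 * 3 * 4 * 3 = 108
```

The size of the space grows multiplicatively with each new algorithm added to any stage (before even counting hyperparameters), which is why naive enumeration does not scale and a smarter search strategy is needed.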

Focus: under the hood of AutoPrognosis

AutoPrognosis follows a principled Bayesian approach in all of its components.

As described above, configuring a pipeline is challenging: it requires selecting which algorithm to use at each stage of the pipeline and which hyper-parameters to use for each chosen algorithm, while the performance of any one algorithm depends on the algorithms chosen at the other stages. This is a complex combinatorial optimization problem. It is also a challenging learning problem, since the performance of an algorithm on a new dataset is not known in advance; it must be learned. To solve this joint learning and optimization problem, the pipeline configuration algorithm uses Bayesian optimization to estimate the performance of different pipeline configurations (the algorithms chosen at each stage of the pipeline, together with the hyper-parameters used for each algorithm) in a scalable fashion, by learning a structured kernel decomposition that identifies algorithms with similar performance.
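As a loose illustration of this sequential learn-and-optimize loop, the sketch below probes a toy discrete pipeline space using a crude similarity-based surrogate with an exploration bonus. Real BO with structured kernels replaces the hand-rolled `acquisition` function with a Gaussian-process posterior; the search space and scoring function here are invented:

```python
# Toy discrete search space of (imputer, classifier) index pairs, with a
# black-box "cross-validation score" the optimizer can only probe pointwise.
configs = [(imp, clf) for imp in range(3) for clf in range(5)]

def cv_score(cfg):
    imp, clf = cfg
    return 0.6 + 0.02 * imp + 0.03 * clf  # unknown to the optimizer

observations = {}

def acquisition(cfg):
    # Crude surrogate: mean observed score of configs sharing a component,
    # plus an exploration bonus for under-explored regions of the space.
    similar = [s for c, s in observations.items()
               if c[0] == cfg[0] or c[1] == cfg[1]]
    mean = sum(similar) / len(similar) if similar else 0.7
    return mean + 0.05 / (1 + len(similar))

for _ in range(10):  # sequential design: probe the most promising config
    untried = [c for c in configs if c not in observations]
    nxt = max(untried, key=acquisition)
    observations[nxt] = cv_score(nxt)

best = max(observations, key=observations.get)
print(best, round(observations[best], 2))
```

Even this simplistic surrogate finds the best configuration after evaluating only 10 of the 15 candidates, because information gathered about one configuration informs estimates for configurations that share components, which is the intuition behind the structured kernel.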

An in-depth explanation of these approaches is provided below, in a talk given by Mihaela van der Schaar at the Royal Society in 2018.

Pipeline configuration via Bayesian optimization with structured kernels is also discussed here, in our lab’s original paper on AutoPrognosis.

Instead of identifying and choosing a “single best pipeline,” AutoPrognosis uses an ensemble approach, weighting the predictions of each pipeline according to the empirical probability of that pipeline being the best. This is useful because it is uncertain which pipeline is actually best, and also because it makes use of the information in all pipelines. This also helps with model uncertainty, as mentioned in the focus section below.

Focus: post-hoc ensemble construction

The frequentist approach to pipeline configuration would be to choose the pipeline with the best observed performance from the set explored by the BO algorithm. However, such an approach would not capture the uncertainty in the pipelines’ performances, and would wastefully discard the information from all the other evaluated pipelines.

By contrast, AutoPrognosis makes use of all the constructed pipelines via post-hoc Bayesian model averaging, creating an ensemble of weighted pipelines. This model averaging is particularly useful in cohorts with small sample sizes, where large uncertainty about the pipelines’ performances would render frequentist solutions unreliable.
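The weighting idea can be sketched as follows. The per-fold scores, the bootstrap estimate of each pipeline’s “probability of being best,” and the final predictions are all toy stand-ins for the actual Bayesian model averaging used in AutoPrognosis:

```python
import random
random.seed(0)

# Hypothetical per-fold validation scores for three explored pipelines.
fold_scores = {
    "pipeline_A": [0.81, 0.79, 0.83, 0.80, 0.82],
    "pipeline_B": [0.80, 0.82, 0.81, 0.79, 0.81],
    "pipeline_C": [0.74, 0.73, 0.75, 0.74, 0.72],
}

# Estimate P(pipeline is best) by bootstrap-resampling the folds, then use
# those probabilities as ensemble weights rather than keeping one winner.
wins = {name: 0 for name in fold_scores}
B = 2000
for _ in range(B):
    means = {
        name: sum(random.choices(scores, k=len(scores))) / len(scores)
        for name, scores in fold_scores.items()
    }
    wins[max(means, key=means.get)] += 1

weights = {name: wins[name] / B for name in fold_scores}

# Ensemble prediction = weight-averaged pipeline predictions (toy values).
preds = {"pipeline_A": 0.64, "pipeline_B": 0.58, "pipeline_C": 0.70}
ensemble = sum(weights[n] * preds[n] for n in preds)
print({n: round(w, 2) for n, w in weights.items()}, round(ensemble, 3))
```

Note how the clearly inferior pipeline receives (almost) zero weight, while the two closely matched pipelines share the weight in proportion to how often each looks best under resampling, rather than the narrow winner taking all.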

Our post-hoc approach enables ensembles to be built without requiring extra hyperparameters. (Ensemble construction in other AutoML frameworks would require a significant increase in the number of hyperparameters.)

As an AutoML framework for healthcare, AutoPrognosis is inherently adaptable. Since its first introduction in 2018, it has already been applied in a number of clinical settings, including outcome prediction for cardiovascular disease, cystic fibrosis, and breast cancer, as well as ICU admission prediction (initially using past patient data in the U.S. and subsequently using live patient data in the U.K.). Most recently, AutoPrognosis was adapted into a tool for hospital capacity planning as part of the U.K. National Health Service’s response to COVID-19.

AutoPrognosis has consistently displayed accuracy surpassing both frequently used statistical methods and cutting-edge machine learning models. For reference, an array of benchmarks is provided in the focus section below.

Focus: AutoPrognosis benchmarks

Comparison of the performance of various competing prognostic modeling approaches, measured by area under the receiver operating characteristic curve (AUC-ROC) with 5-fold cross-validation.

We compared the performance of AutoPrognosis with the clinical risk scores used for predicting prognosis in each cohort, as well as various AutoML frameworks, and finally a standard Cox proportional hazards (Cox PH) model, which is the model most commonly used in clinical prognostic research.

As the table shows, AutoPrognosis outperforms all the competing models on all the cohorts under consideration. This reflects the robustness of our system, since the 10 cohorts had very different characteristics.

Note: Bold numbers correspond to the best result. The “best predictor” row lists the prediction algorithms picked by vanilla AutoPrognosis.

For more information on the 10 cohorts used for the purpose of this comparison, view our original paper here.

While AutoPrognosis remains an important breakthrough and a key milestone in our lab’s work, we have continued to build on it with new AutoML techniques and frameworks, as described below.

Beyond prediction

For the purpose of narrative simplicity, we have so far used fairly simple prediction problems to show how AutoPrognosis can be used to craft machine learning pipelines for static predictions while outperforming other statistical and machine learning techniques.

AutoML can, however, deliver a far broader range of informative and actionable analytics to support clinical decision-making and research. These include personalized approaches to screening and monitoring, survival analysis (time-to-event analysis), treatment effect estimation, and treatment plans, as well as interpretations and uncertainty estimates. Many of these are discussed below.

Survival analysis

The importance of survival analysis (time-to-event analysis) in healthcare has led to the development of a variety of approaches to modeling the survival function (the probability of surviving past a given time). Models constructed via various approaches offer different strengths and weaknesses in terms of discriminative performance and calibration, but often no one model is best across all datasets or even across all time horizons within a single dataset.
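For readers who want a concrete anchor, the survival function can be estimated from censored data with the classical Kaplan-Meier estimator, shown below on invented data (this is a textbook baseline for orientation, not one of the approaches discussed here):

```python
# Minimal Kaplan-Meier estimate of the survival function S(t) from
# (time, event) pairs, where event=0 marks right-censoring (toy data).
data = [(2, 1), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1)]

S, survival = 1.0, {}
for t in sorted({t for t, e in data if e == 1}):  # distinct event times
    at_risk = sum(1 for ti, _ in data if ti >= t)
    deaths = sum(1 for ti, e in data if ti == t and e == 1)
    S *= 1 - deaths / at_risk  # multiply in the conditional survival
    survival[t] = round(S, 3)

print(survival)
```

Parametric and machine-learning survival models generalize this idea in different directions, which is exactly why their relative strengths (discrimination vs. calibration) vary across datasets and time horizons.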

Because we require both good calibration and good discriminative performance over different time horizons, conventional model selection and ensemble approaches cannot be used. To address this need, our lab developed a novel approach, known as SurvivalQuilts (introduced in a paper for AISTATS 2019), which combines the collective intelligence of different underlying survival models to produce a valid survival function that is well-calibrated and offers superior discriminative performance at different time horizons. Empirical results show that our approach provides significant gains over the benchmarks on a variety of real-world datasets.

One of the virtues of our approach is that it can make use of new survival models as those become available and prove their value (one example being DeepHit, a state-of-the-art deep learning-based survival model introduced by our lab at AAAI in 2018). Importantly, it also frees clinicians from the concern of choosing one particular survival model for each dataset and each time horizon of interest, and can be seamlessly integrated into an AutoML framework.

Focus: how SurvivalQuilts works

Existing survival models may fail to capture true survival behavior in different settings and over different time horizons. SurvivalQuilts addresses both these failings by forming time-varying ensembles of different survival models.

SurvivalQuilts pieces together existing survival analysis models according to endogenously determined, time-varying weights. We refer to our construction as temporal quilting, and to the resultant model as a survival quilt.

An example of temporal quilting with prescribed weights for survival models (COX, RISF, CISF) at t1, t2, and t3. A risk function is constructed by stitching together the weighted increment functions of each survival model between two adjacent time horizons.

The core part of our method is an algorithm for configuring the weights sequentially over a (perhaps very fine) grid of time intervals. To render the problem tractable, we apply constrained Bayesian Optimization (BO), which models the discrimination and calibration performance metrics as black-box functions, whose input is an array of weights (over different time horizons) and whose output is the corresponding performance achieved. Based on the constructed array of weights, our method makes a single predictive model—a survival quilt—that provides a valid risk function.
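The stitching step itself can be sketched as follows, taking the per-interval weights as given (in the real method they are found by the constrained BO described above). The two base survival curves and the weights are invented:

```python
# Toy survival curves from two hypothetical base models, evaluated on a
# common time grid (probability of surviving past each time point).
times = [0, 1, 2, 3, 4]
S_cox = [1.00, 0.90, 0.80, 0.70, 0.60]
S_rsf = [1.00, 0.95, 0.78, 0.65, 0.55]

# Time-varying weights, one per interval (here: trust the first model
# early, the second model late); in practice these come from the BO step.
w_cox = [0.8, 0.8, 0.3, 0.3]

# Temporal quilting: stitch together each model's *increments* between
# adjacent time horizons, so the quilt is a valid, monotone survival curve.
quilt = [1.0]
for i in range(len(times) - 1):
    inc = (w_cox[i] * (S_cox[i + 1] - S_cox[i])
           + (1 - w_cox[i]) * (S_rsf[i + 1] - S_rsf[i]))
    quilt.append(quilt[-1] + inc)

print([round(s, 3) for s in quilt])
```

Averaging the increments between horizons, rather than the curves themselves, is what keeps the quilted survival function monotonically decreasing even as the weights change from one interval to the next.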

A schematic depiction of SurvivalQuilts and its pattern optimization. SurvivalQuilts provides risk functions that are constructed on the basis of the final quilting pattern. Here, colored boxes show the three main components of our method and dotted lines imply feedback loops for sequential computations.

More info is available in our 2019 paper introducing survival quilts (supplementary material here), including a range of benchmarks demonstrating the superiority of survival quilts over previous survival models over six real-world datasets.

An introduction to SurvivalQuilts by Mihaela van der Schaar (an excerpt from her keynote at the ICML 2020 AutoML workshop).

Personalized treatment effects and causal inference

Clinical decision-makers often face difficult choices between alternative treatment plans for patients. Such choices cannot be made properly without reliable estimates of the effects of these treatment plans.

Estimating the effects of treatments—causal inference—from data generally falls beyond the realm of conventional machine learning techniques. The fundamental problem of causal inference is that after a subject receives a treatment and displays an outcome, it is impossible to know what the counterfactual outcome would have been had they received an alternative treatment. Because we never observe these counterfactuals, we can never observe the true causal effects. The problem is exacerbated because, in order to determine the best possible course of treatment, we need to predict effects over time.
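A toy simulation makes the fundamental problem concrete: we generate both potential outcomes for each simulated patient, reveal only the one matching the treatment actually received, and still recover the average effect, but only because assignment is randomized (all numbers below are invented):

```python
import random
random.seed(1)

# Each simulated patient has BOTH potential outcomes, but we only ever
# observe the one corresponding to the treatment actually received.
n = 10000
records = []
for _ in range(n):
    severity = random.random()
    y_untreated = severity            # outcome without treatment
    y_treated = severity - 0.2        # true effect: -0.2 for everyone
    treated = random.random() < 0.5   # randomized assignment
    observed = y_treated if treated else y_untreated
    records.append((treated, observed))

# With randomization, the difference in observed group means recovers the
# average treatment effect even though no counterfactual is ever seen.
treated_mean = sum(y for t, y in records if t) / sum(1 for t, _ in records if t)
control_mean = sum(y for t, y in records if not t) / sum(1 for t, _ in records if not t)
print(round(treated_mean - control_mean, 2))  # ≈ -0.2
```

In observational clinical data, where treatment assignment depends on patient characteristics such as severity, this naive difference of means is biased; that is precisely why dedicated treatment effect estimation methods, and principled ways to validate them, are needed.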

The majority of previous work focuses on the effects of interventions at a single point in time, but clinical data also captures information on complex time-dependent treatment scenarios, such as where the efficacy of treatments changes over time (e.g. drug resistance in cancer patients), or where patients receive multiple interventions administered at different points in time (e.g. radiotherapy followed by chemotherapy).

Our lab developed the first AutoML approach for causal inference based on influence functions. This was presented at ICML 2019, and details are provided in the focus section below.

Focus: new tools for automating causal inference

In 2019, our lab introduced a first-of-its-kind validation procedure for estimating the performance of causal inference methods using influence functions—the functional derivatives of a loss function.

Our procedure utilizes a Taylor-like expansion to approximate the loss function of a method on a given dataset in terms of the influence functions of its loss on a “synthesized”, proximal dataset with known causal effects.

This automated and data-driven approach to model selection enables confident deployment of (black-box) machine learning-based methods, and safeguards against naïve modeling choices.

IF-based model validation was introduced in a 2019 ICML paper. As mentioned earlier, there is no single algorithm that will outperform all others on all problems. This is shown below in a comparison of the performance of methods published at the ICML, NeurIPS and ICLR conferences from 2016 to 2018 on 77 datasets (full details here).

Interpretability

One key factor in gaining trust is interpretability, which—broadly speaking—means ensuring that the outputs made by machine learning models can be understood, rather than remaining “black boxes.” This is particularly important in the healthcare domain where black box predictions are unlikely to be acceptable to patients, clinicians or regulatory bodies.

Interpretability is particularly important in AutoML, in which the pipelines that are created are, by nature, not hand-crafted. This means they may be harder to analyze and understand, and they may also need additional “debugging” due to the lack of hands-on input in the pipeline construction process.

Additionally, interpretation can be performed more efficiently at scale within an AutoML framework: AutoML can generate a suite of interpretations, a task that would be extremely time-consuming if done for a hand-crafted machine learning model. This is particularly important given the need to build user-specific interpretability into machine learning models: a doctor’s needs would vary substantially from those of a clinical researcher, for example.

Our work on interpretability, and its integration into AutoML frameworks, is outlined here.

Uncertainty estimates

Prediction in the face of uncertainty is central to clinical practice. As a group that works extensively with clinicians in formulating problems and developing solutions, our lab places particular emphasis on ensuring that models can be trusted by users who lack extensive machine learning expertise. This is an essential factor in narrowing the divide between the builders and users of machine learning models. Estimation of the uncertainty associated with an inference in healthcare is often as important as the prediction itself, as it allows the clinician to know how much weight to give it.

In addition to interpretability, trust requires robust confidence estimates: users must be provided with an indication of the degree of certainty accompanying predictions or recommendations. This is an area in which our lab has done extensive work, most recently through the development of a new approach called Automatic Machine Learning for Nested Conformal Prediction (AutoNCP). AutoNCP is a simple and powerful AutoML framework for constructing predictive confidence intervals with valid coverage guarantees, without human intervention. Because AutoNCP provides tighter confidence intervals in real-world applications, it allows non-experts to use machine learning methods with greater certainty.

Focus: automating uncertainty estimates with AutoNCP

As a recognized and distribution-free approach to uncertainty estimation, Conformal Prediction achieves valid coverage and provides valid confidence intervals in finite samples. However, the confidence intervals constructed by Conformal Prediction are often (because of over-fitting, inappropriate measures of nonconformity, or other issues) overly conservative, and hence too wide to be useful for the application(s) at hand.
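For orientation, here is a minimal split conformal sketch on synthetic data. The point predictor and noise model are invented, and this plain baseline is what AutoNCP improves upon, not AutoNCP itself:

```python
import random
random.seed(0)

# Split conformal prediction in miniature: calibrate an interval around an
# arbitrary point predictor so it covers the truth ~(1 - alpha) of the time.
def predict(x):
    return 2.0 * x  # stand-in point predictor

data = []
for _ in range(1000):
    x = random.random()
    data.append((x, 2.0 * x + random.gauss(0, 0.1)))
calibration, test = data[:500], data[500:]

alpha = 0.1
residuals = sorted(abs(y - predict(x)) for x, y in calibration)
k = min(len(residuals) - 1, int((1 - alpha) * (len(residuals) + 1)))
q = residuals[k]  # conformal quantile of the calibration residuals

# Interval for any new point: [predict(x) - q, predict(x) + q]
covered = sum(1 for x, y in test if predict(x) - q <= y <= predict(x) + q)
print(round(q, 3), covered / len(test))
```

The coverage guarantee holds for any point predictor, but the interval width depends entirely on how good the predictor and the nonconformity measure are, which is why a poorly chosen pipeline yields valid but uselessly wide intervals.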

AutoNCP is an AutoML framework, but unlike familiar AutoML frameworks that attempt to select the best model (from among a given set of models) for a particular dataset or application, AutoNCP uses frequentist and Bayesian methodologies to construct a prediction pipeline that achieves the desired frequentist coverage while simultaneously optimizing the length of confidence intervals.

A depiction of the operation of AutoNCP. The steps of NCP are enumerated in order. Implementing NCP on a data set D is a compound decision problem, requiring the choice of a model, hyperparameters for that model, an estimator, and a calibration method. AutoNCP aims to solve this compound decision efficiently.

Experiments using real-world datasets from a variety of domains demonstrate that AutoNCP provides much tighter confidence intervals than previous methods (full details here).

The value of information

In addition to providing recommendations that directly support patient treatment decisions, machine learning can help drive discovery in healthcare (and beyond) by providing new insight into the value of specific information.

AutoML is well-positioned to do this at scale, because creating and comparing a large number of potential pipelines offers a particularly effective way to assess the actual benefit of machine learning algorithms given the data at hand, as well as informing operators of the amount of information required in order to make a successful prediction.

Furthermore, on a computational level, assessing the value of information is a particularly complex process that requires many comparisons to be run at once. AutoML shows significant benefits here, since this would be extremely time-consuming to do with a hand-crafted model—especially since these comparisons would need to be run again whenever a substantial amount of new data becomes available. When using a hand-crafted machine learning model or a statistical model, the complexity outlined above could prevent a user such as a clinician from adding “too many” variables; AutoML resolves this quandary, as it can take all the variables provided and assess their importance.

AutoML can also tell us, for any given prediction we may wish to make, what to observe (i.e. what source of information is most valuable) and when to observe it (i.e. when this information will be most valuable). Because making observations is costly, this decision must trade off the value of information against the cost of observation. Making observations (i.e. sensing) should be an active choice. To solve the problem of active sensing, our lab developed a new deep learning architecture called Deep Sensing, which was first introduced in a paper for ICLR 2018.

Deep Sensing learns how to issue predictions at various cost-performance points. At runtime, the operator prescribes a performance level or a cost constraint, and Deep Sensing determines what measurements to take and what to infer from those measurements, and then issues predictions.

Deep Sensing complements other features that can be incorporated into AutoML pipelines, such as personalized predictions and estimation of treatment effects over time, due to its ability to determine what information is needed for these predictions. To learn more about Deep Sensing, view our focus section below.

Additionally, we can use AutoML to determine when a model’s algorithms and hyperparameter configurations need to be re-optimized based on changes or “drifts” caused by newly acquired data, and to conduct this optimization as needed. In a clinical setting, COVID-19 offers a good example: a model optimized for a dataset with 500 patients at the onset of the pandemic would likely be sub-optimal by the time the dataset grows to 50,000 patients, let alone 5 million patients. Another example might be predicting the outcome of transplants: as transplantation methods and techniques have improved substantially in recent years, the likelihood of survival has (in many cases) increased substantially. A model optimized for an older dataset would, therefore, underperform unless reoptimized to reflect the new norm.

In 2019, our lab developed an AutoML technique to perform ongoing re-optimization of models: lifelong Bayesian optimization (LBO). By automating the model optimization process based on new data acquisition, LBO not only speeds up the learning process for newly arriving datasets, but also outperforms the results achieved under the standard approach of repeatedly optimizing a model by hand.

Focus: AutoML and the value of information

Deep Sensing: deciding what to observe, and when

Deep Sensing was developed specifically to meet the need to estimate the value of information, which must be learned at training time. To estimate the value of a specified set of measurements, we first issue a prediction on the basis of all the information we have; we then delete that set of measurements, infer the deleted values from the data that remains, and issue a new prediction based on the inferred values and the remaining data. Comparing the two predictions tells us how much the deleted measurements were worth. (Part of our architecture is designed specifically for these tasks.)
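This predict / delete / infer / re-predict / compare loop can be mimicked in a toy setting. The sketch below substitutes an oracle linear predictor and mean imputation for Deep Sensing's learned networks, purely for illustration:

```python
import random, statistics
random.seed(0)

# Synthetic cohort: the outcome depends strongly on x1, only weakly on x2.
rows = []
for _ in range(500):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    rows.append((x1, x2, 2.0 * x1 + 0.1 * x2 + random.gauss(0, 0.1)))

def mse(preds):
    return statistics.fmean((p - y) ** 2 for p, (_, _, y) in zip(preds, rows))

# 1) Predict with everything observed (oracle coefficients, for simplicity).
full = [2.0 * x1 + 0.1 * x2 for x1, x2, _ in rows]
# 2) "Delete" x1 and impute it with its population mean (zero), then re-predict.
without_x1 = [0.1 * x2 for _, x2, _ in rows]
# 3) Likewise for x2.
without_x2 = [2.0 * x1 for x1, _, _ in rows]

# The value of a measurement = how much the error grows when it must be inferred.
value_x1 = mse(without_x1) - mse(full)
value_x2 = mse(without_x2) - mse(full)
print(value_x1 > value_x2)  # True: x1 is far more valuable to observe
```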

Deep Sensing was introduced in a conference paper for ICLR 2018. To demonstrate its capabilities, we applied it to two real-world medical datasets, achieving significantly improved performance.

Lifelong Bayesian Optimization (LBO)

As already mentioned, LBO is an online, multitask Bayesian optimization (BO) algorithm designed to solve the problem of model selection for datasets arriving and evolving over time. In LBO, we exploit the correlation between black-box functions by using components of previously learned functions to speed up the learning process for newly arriving datasets.

Inspired by the Indian Buffet Process (IBP), as datasets arrive over time, we treat the black-box function on each dataset as a new customer arriving in a restaurant; we apply the IBP to generate dishes (neural networks) that approximate the new black-box function. Using the IBP enables us to limit the number of neural networks used at each time step, while also introducing new neural networks when the new black-box function is distinct from the previous ones. LBO learns a suitable number of neural networks to span the black-box functions, such that correlated functions can share information while the modelling complexity of each function is restricted to ensure a good variance estimate in BO.
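The sketch below conveys only the warm-starting intuition, at an assumption level: it reuses previously evaluated points via a simple deterministic local search, whereas LBO actually shares learned neural-network components through an IBP prior inside Bayesian optimization.

```python
def black_box(x, optimum):
    # Stand-in for one dataset's cross-validation performance as a function
    # of a single hyperparameter x (higher is better).
    return -(x - optimum) ** 2

def local_search(evals, optimum, budget):
    # Deterministic hill climb: repeatedly expand around the best point seen.
    evals = dict(evals)
    for _ in range(budget):
        best = max(evals, key=evals.get)
        for x in (best - 0.25, best + 0.25):
            evals.setdefault(x, black_box(x, optimum))
    return evals

# Dataset 1 (optimum at 1.0), optimized from scratch with a generous budget.
e1 = local_search({0.0: black_box(0.0, 1.0)}, optimum=1.0, budget=5)

# Dataset 2 arrives later and is correlated (optimum at 1.2). Warm-starting
# from dataset 1's evaluations beats a cold start given the same small budget.
warm = local_search({x: black_box(x, 1.2) for x in e1}, optimum=1.2, budget=2)
cold = local_search({0.0: black_box(0.0, 1.2)}, optimum=1.2, budget=2)
print(max(warm.values()) > max(cold.values()))  # True: warm start wins
```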

A depiction of Lifelong Bayesian Optimization. As datasets arrive over time, the cross-validation performance on each is treated as a black-box function f_t. Latent functions g_m are generated in an Indian Buffet Process and trained on the acquisition set to fit f_t.

Through synthetic and real-world experiments, we have demonstrated that LBO can improve the speed and robustness of BO. For more information, refer to our 2019 paper on LBO.

Insights gleaned from techniques such as Deep Sensing and lifelong Bayesian optimization can not only help advance research and policy-making, but also inform and improve the kind of information gathered from patients within the healthcare system.

Computing comprehensive performance reports

Lastly, it’s worth noting that, even if a hand-crafted model is developed using appropriate methods and fine-tuned to perform well, it will usually be optimized to show only one performance metric, such as area under the receiver operating characteristic curve (AUC-ROC), area under the precision-recall curve, the C-index, or the Brier score. By contrast, AutoML frameworks can generate reports that include many such performance metrics at the push of a button. This, in turn, can inform end users about which methods perform best with respect to a variety of metrics, the strengths and weaknesses of a model, or the relative value of including certain information (covariates) in terms of the performance obtained by a specific model.
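For illustration, a minimal report covering two of the metrics mentioned above (AUC-ROC and the Brier score) can be computed from scratch; a real AutoML framework would of course produce a far richer set:

```python
import random
random.seed(2)

def auc_roc(scores, labels):
    # Probability that a randomly chosen positive outranks a random negative.
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(scores, labels):
    # Mean squared difference between predicted risk and observed outcome.
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)

labels = [random.random() < 0.5 for _ in range(200)]
# An informative but noisy synthetic risk score in [0, 1].
scores = [0.4 * y + 0.6 * random.random() for y in labels]

report = {"AUC-ROC": auc_roc(scores, labels), "Brier": brier(scores, labels)}
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```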

In a clinical setting, for example, this kind of detailed analysis may be able to tell us that a certain additional number of people (beyond those identified before applying the model) should now be treated as “high risk” for a disease or condition, or even how many deaths could be prevented by following a certain recommendation (this has already been successfully implemented for cystic fibrosis and cardiovascular disease). After all, the ultimate goal of AutoML is to democratize machine learning by catering to the needs of its end users.

Clairvoyance: a game-changer

Clairvoyance, first announced in May 2020, represents the evolution of our lab’s vision for a comprehensive analytic toolkit for clinical use. While AutoPrognosis operates on static (cross-sectional) data, Clairvoyance operates on time-series (longitudinal) data. Such data are central to clinical practice, and yet comparatively few studies using them have been published, in large part due to the complexity of the methods involved. Moreover, Clairvoyance incorporates a significant range of additional features and capabilities: personalized screening and monitoring, personalized dynamic forecasts of various outcomes, and personalized treatment effect estimation over time. Clairvoyance can be applied to practically any disease or condition for which time-series patient datasets are available—from electronic health records (GP records, clinics, and hospitals) to clinical registries and more.

Time-series data is of vital importance because it provides much more information than the “snapshots” presented by static data, and hence permits much greater insight. Time-series data is the bread and butter of evidence-based clinical decision support. With the increasing availability of electronic patient records, there is enormous untapped potential to apply AutoML to time-series data, providing accurate and actionable predictive models for real-world concerns.

Until now, however, that potential has been hard to harness. Existing applications of machine learning to such problems have treated these component tasks as separate problems, leading to a siloed and stylized development approach that often fails to account for complexities and interdependencies within the real-world machine learning lifecycle. Given the difficulties involved in adapting machine learning approaches to time-series data, AutoML has seemed comparatively even further out of reach. This has created a substantial gulf between the inherent capabilities of machine learning methods and their actual effectiveness in clinical research and decision support. This is why Clairvoyance is a game-changer.

Under a simple, consistent API, Clairvoyance encapsulates all major steps of time-series modeling, including (i) loading and (ii) preprocessing patient records, (iii) handling missing or irregular samples in both static and temporal contexts, (iv) conducting feature selection, (v) fitting personalized dynamic prediction and treatment estimation models, performing (vi) calibration and (vii) uncertainty estimation of model outputs, (viii) applying global or instance-wise methods for interpreting learned models, and (ix) computing evaluation metrics per selected criteria. Many of these are shown in the pipeline overview below.

As shown in the focus section below, Clairvoyance has already been shown to outperform a range of standalone models.

Focus: taking variations over time into account when crafting AutoML pipelines for time-series data

A unique challenge in building AutoML models in the time-series setting is the fact that the optimal model itself varies over time, depending on the entire history of features and labels at each point in time, and not just on the features and labels at that moment.

Our approach develops a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in this setting. This is accomplished by treating the performance at each time step as its own black-box function.

In order to solve the resulting multiple black-box function optimization problems jointly and efficiently, we exploit potential correlations among black-box functions using deep kernel learning (DKL).
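A grossly simplified sketch of the stepwise idea: each time step t has its own black-box validation score f_t(h), and the best hyperparameter is selected per step. Here the search is exhaustive and f_t is an invented stand-in; the actual method replaces this with Bayesian optimization whose deep kernel shares information across the f_t.

```python
candidates = [0.5, 1.0, 2.0]   # hypothetical hyperparameter settings

def step_score(h, t):
    # Stand-in for the black-box validation performance f_t(h); in reality
    # this would be measured, not computed. The optimum drifts over time.
    return -abs(h - (0.5 + 0.15 * t))

# Independent exhaustive selection per step (SMS-DKL instead shares
# information across steps via Bayesian optimization with a deep kernel).
best_per_step = {t: max(candidates, key=lambda h: step_score(h, t))
                 for t in range(11)}
print(best_per_step[0], best_per_step[5], best_per_step[10])  # 0.5 1.0 2.0
```

The winner changes across steps, illustrating why a single model for all time steps is suboptimal.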

Comparison of related methods in the context of model selection for sequence prediction. Each Mx indicates a model (hyper-)parameterized by x. (a) depicts Multi-objective Bayesian optimization, which is constrained to learn a single model for all time steps. (b) depicts Multi-task Bayesian optimization, which can be applied sequentially across time steps. (c) depicts our proposed technique for stepwise model selection via deep kernel learning, which jointly learns all models for all time steps.

To the best of our knowledge, our lab was the first to formulate the problem of stepwise model selection (SMS) for sequence prediction (where the underlying task is to emit a prediction at every time step, given the sequence of observations up to that point), and to design and demonstrate an efficient joint learning algorithm for this purpose.

Comparison of related methods in the context of model selection for sequence prediction. Note that correlations within black-box functions are exploited in all BO methods; SMS-DKL additionally exploits correlations across functions. Some MOBO methods achieve this, but they are not scalable to problems with large numbers of objectives.

Using multiple real-world datasets, we have verified that our proposed method outperforms both standard BO and multi-objective BO algorithms on a variety of sequence prediction tasks. For more information on stepwise model selection and deep kernel learning, view our 2020 AISTATS paper here.

Focus: Clairvoyance performance benchmarks

To illustrate the utility of Clairvoyance for automatic time-series learning and optimization, we employed three datasets for evaluation in comparison with standalone baseline models.

The first was an intensive care dataset (“ICU”) consisting of patients in intensive care units from the Medical Information Mart for Intensive Care, which records physiological data streams for over 23,000 patients. The second was a general wards dataset (“WARD”), consisting of over 6,000 patients hospitalized in the general medicine floor of a major medical center. The third was an outpatient dataset (“OUTP”) consisting of a cohort of patients enrolled in the UK Cystic Fibrosis registry, which records follow-up trajectories for over 10,000 patients. Measurement frequencies for ICU, WARD, and OUTP are one hour, four hours, and six months, respectively.

Both as baselines for comparison and as primitive prediction models within Clairvoyance itself, we used six commonly used models for time-series learning. While these are currently included as the most general/popular classes of primitives, Clairvoyance can easily be extended with additional time-series models—as long as they conform to the fit-transform abstraction.
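As a hypothetical sketch of what conforming to a fit-transform abstraction might look like (the class and method names below are our own illustration, not Clairvoyance's actual API):

```python
class MeanImputer:
    """Fills missing values (None) with per-feature training means."""
    def fit(self, rows):
        cols = list(zip(*rows))
        self.means = [sum(v for v in c if v is not None) /
                      max(1, sum(v is not None for v in c)) for c in cols]
        return self

    def transform(self, rows):
        return [[self.means[j] if v is None else v
                 for j, v in enumerate(r)] for r in rows]

class Pipeline:
    """Chains fit-transform stages, mirroring the plug-and-play idea above."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return rows

train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
clean = Pipeline([MeanImputer()]).fit_transform(train)
print(clean)  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

Any new primitive exposing the same fit/transform pair can be dropped into the chain without changing the surrounding pipeline code.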

For one-shot predictions, endpoints were ICU mortality after 48 hours from ICU admission (ICU), ICU admission after 48 hours from hospital admission (WARD), and 3-year mortality after 2 years from the first hospital visit (OUTP). For online predictions, using static and longitudinal data until each time point, we predicted patient outcomes for specific horizons; endpoints were the use of ventilator support after 4 hours, 12 hours, and 24 hours (ICU), and O2 device support after 12 hours (WARD). As shown below, for one-shot as well as online predictions, Clairvoyance generally achieves the best performance in comparison with its primitives. The performance metrics are area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (APR).

Online predictions performance of Clairvoyance versus its primitives (described above) for various prediction horizons.

One-shot predictions performance of Clairvoyance versus its primitives (described above).

Clairvoyance alpha is publicly available, and can be downloaded and tested. Like AutoPrognosis, current and future versions of Clairvoyance will continue to be made available at no cost.

The current state of AutoML

In the sections above, we have introduced AutoML as a promising solution to the resource-intensive and often inaccessible process of hand-crafting machine learning models. We have also highlighted the ability of AutoML to go beyond producing basic risk scores and predictions into personalized screening and monitoring, personalized dynamic forecasting of biomarkers, disease trajectories and clinical outcomes, and personalized estimation of treatment effects over time. As with the rest of the content on this page, the focus has been on healthcare, but many of the points made apply more broadly.

AutoML can already empower clinical professionals

We have already reached a point at which AutoML can, and should, be implemented within healthcare at scale. Our own lab’s AutoML frameworks have been successfully applied to problems as diverse as breast cancer, cystic fibrosis, cardiovascular disease, and, most recently, COVID-19. We are confident that, with limited investment, AutoML could be used for a variety of purposes at a local (e.g. clinic or hospital), regional, national (e.g. NHS), or even international scale—especially given its capabilities relative to existing digital infrastructure, in which machine learning implementation remains highly limited. AutoML—even in its current form—could be used today to make healthcare professionals (clinicians, nurses, hospital managers, national healthcare providers and regulators) more effective.

AutoML can turn electronic health records from a tedious data-entry infrastructure into a potent tool that empowers healthcare providers and patients alike—especially given the fact that a great deal of potentially relevant information is “lost” because it cannot be appreciated by clinicians, whereas machine learning predictions could make such data salient. In short, AutoML can form the backbone of nationwide engines for healthcare delivery. It can drive clinical discovery.

Why is this not happening yet?

While AutoML can already be used at scale to empower healthcare professionals and patients, the removal of several roadblocks could make its adoption all the more successful.

The first roadblock is data cleaning, preparation, and standardization. As machine learning and data science experts alike will know, a substantial amount of work goes into cleaning and preparing the datasets on which machine learning models are trained and make predictions. This is the case with hand-crafted machine learning models, but it’s also true of AutoML: in almost every case, a data scientist in collaboration with a healthcare professional will need to review, clean, and prepare data so that it can be understood by an AutoML framework.

To support this, common protocols or methods for formatting and preparing data will also need to be developed and adhered to. In the healthcare domain, for example, methods for maintaining and sharing electronic health records (EHR) vary between countries and regions; furthermore, the types of patient data that are recorded, and the manner in which they are recorded, vary between locations, organizations, and even individuals. Taking AutoML from ad-hoc operation and giving it true sustainability and scalability will require data to be recorded and shared in a consistent, high-quality, and easy-to-read manner across the healthcare network.

The second roadblock to using AutoML at scale is the ongoing need to build layers of abstraction that permit various users to interact with an AutoML framework at their own knowledge level, and with their intended usage in mind. This includes creating intuitive, easy-to-use interfaces that enable end users (such as clinicians or clinical researchers) to describe the problems they want to solve in simple terms, and making the descriptions they share visible and understandable to other users (such as data scientists) operating in different layers. Additionally, it is important to ensure that the “hand-off” between humans and AutoML is smooth and that roles are well-defined. For example, end users and data scientists may need to formulate problems together at the beginning of the process and evaluate results at the end; by contrast, AutoML can generally handle tasks in the middle, such as model construction, prediction, treatment effect estimation, and uncertainty quantification.

The third roadblock to enabling large-scale use of AutoML in healthcare is the commonization of components: user-friendly AutoML frameworks must be constructible from modular “blocks” that can be selected depending on the purpose of the model. These blocks must also speak a universal “language,” with plug-and-play-like functionality that allows them to operate seamlessly when added to a framework (in this sense, we consider Clairvoyance to be setting a trend that we hope will be more broadly followed). This will entail developing algorithms with AutoML in mind—including accounting for interdependency between different stages of a pipeline. In essence, algorithms will need to be conceived from the start as “part of a whole.”

A simplified visualization showing how humans and machine learning may interact within the healthcare ecosystem in the near future.

Our lab’s future projects

In the short term, the removal of the roadblocks outlined above would be an important foundational step toward the realization of an ambitious and meaningful long-term vision: the creation of an AutoML-powered “ecosystem” in which machine learning can understand, collaborate with, and empower humans. In addition to improving outcomes for patients, this could enable professionals such as clinicians, administrators, and researchers to allocate their time and energy to areas where they can add the most value.

It is no secret that professionals throughout healthcare are currently required to spend an unenviable amount of time on highly repetitive and relatively low-value tasks, such as updating electronic health records and scheduling basic tests. As noted in the 2019 NHS Topol Review, between 15 and 70 per cent of a clinician’s working time is spent on administrative tasks, many of which could be partially or fully automated through machine learning.

Building this ecosystem would require the integration of machine learning across multiple levels of operation. On one level, AutoML pipelines such as those described earlier on this page (for example, a descendant of Clairvoyance) would analyze datasets and offer predictions and recommendations for the patient at hand related to personalized screening and monitoring, diagnosis (including early diagnosis!), and personalized treatment—while also providing interpretable results and uncertainty estimates associated with the various predictions and recommendations. On another level, AutoML would provide recommendations (including offering to automate mundane or repetitive work) tailored to individual decision-makers, such as clinicians, administrators, or researchers. This would be based on a deeper, more fundamental understanding of human behavior and decision-making, combined with informed judgements regarding which tasks can be automated, which tasks can be recommended (but not fully automated), and which tasks should be left entirely to humans.

While the ecosystem described above requires a level of cognition that is arguably beyond the current abilities of machine learning, it could be within reach in the relatively near future. In fact, in recent years our lab has been conducting pioneering research on developing new strands of machine learning which are able to understand, collaborate with and empower humans, especially healthcare personnel such as clinicians, nurses and hospital managers.

One of our most recent studies led to the development of a technique we call Inverse Active Sensing: a novel machine learning method that can uncover the decision-making of individuals (clinicians) under time pressure. This entails examining what an individual clinician appears to effectively prioritize. In a healthcare context, this could yield actionable findings, such as which tests are over-prescribed, which are prescribed too late and for which patients, and which diseases are diagnosed early or too late. Inverse Active Sensing, and similar new research projects being undertaken by our lab, will form the basis of a new area of study in machine learning for healthcare—an area built around understanding, automating, and improving human decision-making by building ecosystems that can empower healthcare professionals and patients alike.

The ecosystem outlined above is doubtless many years away, and is built upon many prerequisite steps. In addition to near-term improvements related to the accessibility and commonization of AutoML, further cognitive advances are needed with regard to understanding how decisions are made, and which tasks can be handled by machine learning. Additionally, machine learning will need to function simultaneously on a system-wide level and an individual level, requiring a degree of interoperability that will only be achievable through AutoML.

Given the many challenges we face in turning this vision into reality, some may question whether AutoML is, in fact, an area worth investing time and resources in pursuing. Our view, as a machine learning lab that works extensively alongside clinicians, medical researchers, hospital managers, and patients, is that AutoML is essential to ensuring that machine learning can be applied effectively and consistently throughout healthcare, and that its impact can be felt and appreciated by communities outside our own. A substantial area of focus for our future work will be encouraging and supporting the development of an ecosystem to unlock the true potential of AutoML.

To find out more about our lab’s work on AutoML, visit our publications page.

For a general-purpose introduction to our lab’s broad vision for the future of machine learning in healthcare, see Mihaela van der Schaar’s chapter in the 2018 Annual Report of the Chief Medical Officer for England, entitled “Machine learning for individualised medicine.”


Get involved!

We’ve decided to create a number of engagement events to share ideas and discuss topics that will define the future of machine learning in healthcare.

At present, we’re planning to hold two types of session on a regular basis: Inspiration Exchange will provide a forum to brainstorm methods and techniques with machine learning students, while Revolutionizing Healthcare will target the healthcare community and focus on challenges and opportunities in the clinical application of machine learning.

Sign up for Inspiration Exchange or Revolutionizing Healthcare

Further reading and references

You can find our publications on AutoML here, and our software here.

Mihaela van der Schaar’s keynote at the 7th ICML Workshop on Automated Machine Learning (AutoML 2020).

Resources and papers cited on this page, in order of appearance

Acknowledgements

We would like to thank the many reviewers who have kindly reviewed and improved this piece. They include:

– Prof. Bill Zame
– Dr. Ahmed Alaa
– Dr. Jinsung Yoon
– Ioana Bica
– Yao Zhang
– Zhaozhi Qian
– Alicia Curth

– Dr. Alexander Gimson
– Dr. Andres Floto
– Dr. Ari Ercole
– Dr. Eoin McKinney
– Dr. Paul Elbers