van der Schaar Lab

Data Imputation: An essential yet overlooked problem in machine learning

Missing data is a problem that’s often overlooked, especially by ML researchers who assume access to complete input datasets when training their models.

Yet it is a problem haunting not only healthcare professionals and researchers but anyone working with scientific methods. Data might be missing because it was never collected, because entries were lost, or for many other reasons; some pieces of information may simply be difficult or costly to acquire.

In the past, data imputation has mostly been done using statistical methods, ranging from simple approaches such as mean imputation to more sophisticated iterative imputation. These algorithms estimate missing values based on the data that has been observed and measured.
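To make that spectrum concrete, here is a minimal numpy sketch (an illustration, not a library implementation) of both ends: mean imputation, and a toy round-robin iterative imputer that repeatedly re-predicts each column’s missing entries with a linear regression on the other columns.

```python
import numpy as np

def mean_impute(X):
    """Fill each column's NaNs with that column's observed mean."""
    X = X.astype(float)  # astype copies, so the caller's array is untouched
    for j in range(X.shape[1]):
        col = X[:, j]
        mask = np.isnan(col)
        col[mask] = col[~mask].mean()
    return X

def iterative_impute(X, n_iter=10):
    """Toy round-robin iterative imputation: start from column means, then
    repeatedly re-predict each column's missing entries with a linear
    regression (fit on the rows where that column is observed)."""
    X = X.astype(float)
    missing = np.isnan(X)
    filled = X.copy()
    for j in range(X.shape[1]):                 # initialise with column means
        filled[missing[:, j], j] = np.nanmean(X[:, j])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)
            A = np.hstack([others, np.ones((len(X), 1))])   # add intercept
            obs = ~missing[:, j]
            coef, *_ = np.linalg.lstsq(A[obs], filled[obs, j], rcond=None)
            filled[missing[:, j], j] = A[missing[:, j]] @ coef
    return filled
```

On a dataset where columns are correlated, the iterative variant exploits that correlation while mean imputation cannot.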

But doing imputation well raises very interesting ML challenges. The van der Schaar Lab is leading work on data imputation with the help of machine learning. Pioneering novel approaches, we create methodologies that not only deal with the most common missing-data problems but also address new scenarios. Solving this problem has required us to incorporate and extend ideas from fields such as causality, AutoML, generative modelling, and even time-series modelling.

There is, however, a plethora of methods one can use to impute the missing values in a dataset. Our lab has therefore created a package, called HyperImpute, that selects the best method for you. From its internal library of imputation methods, HyperImpute uses AutoML principles to match a method to your data. An overview is provided below, followed by our presentation at ICML 2022.


Generalized Iterative Imputation with Automatic Model Selection

Daniel Jarrett*, Bogdan Cebere*, Tennison Liu, Alicia Curth, Mihaela van der Schaar

ICML 2022

Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
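The adaptive column-wise selection idea can be sketched simply: hide a held-out subset of a column’s observed entries, let each candidate imputer fill them back in, and keep the candidate that recovers them best. This is a toy illustration of the principle only; HyperImpute’s actual search tunes full learners and their hyperparameters.

```python
import numpy as np

def best_imputer(col, candidates, held):
    """Hide the entries at `held` (a subset of observed indices), let each
    candidate imputer fill them back in, and return the name of the
    candidate with the lowest RMSE on those entries. In practice the
    held-out set would be sampled at random."""
    masked = col.copy()
    masked[held] = np.nan
    best_name, best_rmse = None, np.inf
    for name, impute in candidates.items():
        pred = impute(masked)
        rmse = np.sqrt(np.mean((pred[held] - col[held]) ** 2))
        if rmse < best_rmse:
            best_name, best_rmse = name, rmse
    return best_name

# Two trivial per-column candidates: fill with the mean or the median.
candidates = {
    "mean":   lambda c: np.where(np.isnan(c), np.nanmean(c), c),
    "median": lambda c: np.where(np.isnan(c), np.nanmedian(c), c),
}
```

On a column dominated by one outlier, the median candidate wins this contest, which is exactly the kind of per-column adaptivity the framework automates.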

HyperImpute is a very useful tool for anyone trying to solve their missing-data problems easily and quickly. Beyond building tools, however, we also think about missingness as a theoretical problem.

Causal networks show us that missing data is a hard problem, especially in settings where missingness may not occur completely at random. Imagine missingness arising in the data because some confounder is present. In a recent paper, our lab investigates this in the setting of treatment effects. In particular, we find that current solutions for missing-data imputation may introduce bias into treatment effect estimates.

The reason is that there exist scenarios (for example, in healthcare) where treatment causes missingness, but also where treatment is chosen based on the presence (or absence) of other variables. This realisation leads to a causal structure (depicted below) that includes both a confounded path and a collider path between covariates and treatment.

To Impute or not to Impute?
Missing Data in Treatment Effect Estimation

Jeroen Berrevoets, Fergus Imrie, Trent Kyono, James Jordon, Mihaela van der Schaar


Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work, we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor-performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population into distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.
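The selective-imputation idea can be illustrated with a toy numpy sketch. Mean imputation stands in here for any imputer, and which columns are safe to impute is an input the analyst derives from the causal structure, not something the code discovers.

```python
import numpy as np

def selective_impute(X, safe_cols):
    """Impute (here, with column means) only the columns in `safe_cols`,
    i.e. those whose missingness is not determined by treatment.
    Columns whose missingness is caused by treatment selection are left
    as-is, so downstream models can handle them explicitly."""
    X = X.astype(float)  # astype copies, so the caller's array is untouched
    for j in safe_cols:
        col = X[:, j]
        mask = np.isnan(col)
        col[mask] = col[~mask].mean()
    return X
```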


One method included in HyperImpute’s library is one of the lab’s earliest and most widely adopted: GAIN. GAIN is based on the well-known GAN framework, where missing values are treated as corrupted samples to be completed by the generative network. An architectural overview of this method can be seen below.

Generative Adversarial Imputation Nets (GAIN)

Jinsung Yoon*, James Jordon*, Mihaela van der Schaar

ICML 2018

We propose a novel method for imputing missing data by adapting the well-known Generative Adversarial Nets (GAN) framework. Accordingly, we call our method Generative Adversarial Imputation Nets (GAIN). The generator (G) observes some components of a real data vector, imputes the missing components conditioned on what is actually observed and outputs a completed vector. The discriminator (D) then takes a completed vector and attempts to determine which components were actually observed and which were imputed. To ensure that D forces G to learn the desired distribution, we provide D with some additional information in the form of a hint vector. The hint reveals to D partial information about the missingness of the original sample, which is used by D to focus its attention on the imputation quality of particular components. This hint ensures that G does in fact learn to generate according to the true data distribution. We tested our method on various datasets and found that GAIN significantly outperforms state-of-the-art imputation methods.
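The hint mechanism is easy to state concretely. With mask m (1 = observed) and a hint rate, the hint reveals the true mask entry for a random subset of components and the non-committal value 0.5 elsewhere. A numpy sketch of the generator input and hint construction follows; it illustrates the paper’s construction but is not the lab’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gain_inputs(x, m, hint_rate=0.9):
    """x: data vector with missing slots zeroed; m: mask (1 = observed).
    Returns the generator's input (observed values, with noise seeding
    the missing slots) and the discriminator's hint vector."""
    z = rng.uniform(0.0, 0.01, size=x.shape)   # noise for the missing slots
    g_in = m * x + (1 - m) * z
    b = (rng.uniform(size=m.shape) < hint_rate).astype(float)
    h = b * m + 0.5 * (1 - b)                  # reveal m where b == 1, else 0.5
    return g_in, h
```

The 0.5 entries are exactly where the discriminator must judge imputation quality on its own, which is what forces the generator toward the true data distribution.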

While GAIN builds a generative model using purely neural networks, one could imagine a more principled approach through causality. To this end, the lab has developed MIRACLE, which completes data with missingness using a causal deep learning approach. Specifically, MIRACLE regularises the hypothesis space of a neural network by simultaneously learning a causal graph, as depicted below.

MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms

Trent Kyono*, Yao Zhang*, Alexis Bellot, Mihaela van der Schaar

NeurIPS 2021

Missing data is an important problem in machine learning practice. Starting from the premise that imputation methods should preserve the causal structure of the data, we develop a regularization scheme that encourages any baseline imputation method to be causally consistent with the underlying data generating mechanism. Our proposal is a causally-aware imputation algorithm (MIRACLE). MIRACLE iteratively refines the imputation of a baseline by simultaneously modelling the missingness generating mechanism, encouraging imputation to be consistent with the causal structure of the data. We conduct extensive experiments on synthetic data and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation over a variety of benchmark methods across all three missingness scenarios: at random, completely at random, and not at random.

Parallel to causality is time series data. The work above all assumes a static setting, yet time series are incredibly common in all sorts of applications. Our lab has introduced M-RNN, a method based on recurrent neural networks. With M-RNN we interpolate within as well as across data streams, dramatically improving the estimation of missing data. We show this in the architectural overview below.

Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks

Jinsung Yoon, William R. Zame, Mihaela van der Schaar


Most time-series datasets with multiple data streams have (many) missing measurements that need to be estimated. Most existing methods address this estimation problem either by interpolating within data streams or imputing across data streams; we develop a novel approach that does both. Our approach is based on a deep learning architecture that we call a Multi-directional Recurrent Neural Network (M-RNN). An M-RNN differs from a bi-directional RNN in that it operates across streams in addition to within streams, and because the timing of inputs into the hidden layers is both lagged and advanced. To demonstrate the power of our approach we apply it to a familiar real-world medical dataset and demonstrate significantly improved performance.
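The within-stream versus across-stream distinction can be illustrated with two toy baselines; M-RNN itself learns both directions jointly with lagged and advanced hidden states, so this sketch only contrasts the two sources of information it combines.

```python
import numpy as np

def within_stream(ts):
    """Interpolate over time inside a single stream (here, linearly)."""
    t = np.arange(len(ts))
    obs = ~np.isnan(ts)
    return np.interp(t, t[obs], ts[obs])

def across_streams(X):
    """Impute each stream at each time step from the other streams
    measured at the same time (here, with the cross-sectional mean).
    X has shape (time steps, streams)."""
    row_mean = np.nanmean(X, axis=1, keepdims=True)
    return np.where(np.isnan(X), row_mean, X)
```

Each baseline ignores half the available signal; a method that uses both temporal and cross-stream structure can do strictly better.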

Clearly, imputation is an important problem in machine learning. Our lab recognises this and is actively working to resolve the many issues involved in performing accurate and reliable imputation.

Jeroen Berrevoets

Jeroen Berrevoets joined the van der Schaar Lab from the Vrije Universiteit Brussel (VUB). Prior to this, he analyzed traffic data at 4 of Belgium’s largest media outlets and performed structural dynamics analysis at BMW Group in Munich.

As a PhD student in the van der Schaar Lab, Jeroen plans to explore the potential of machine learning in aiding medical discovery, rather than simply applying it to non-obvious predictions. His main research interests involve using machine learning and causal inference to gain understanding of various diseases and medications.

Much of this draws from his firmly-held belief that, “while learning to predict, machine learning models capture some of the underlying dynamics and structure of the problem. Exposing this structure in fields such as medicine could prove groundbreaking for disease understanding, and consequently drug discovery.”

Jeroen’s studentship is supported under the W. D. Armstrong Trust Fund. He will be supervised jointly by Mihaela van der Schaar and Dr. Eoin McKinney.

Alicia Curth

Alicia Curth, a self-described “full-blooded applied statistician,” recently completed an MSc in Statistical Science at the University of Oxford, where she graduated with distinction and was awarded the Gutiérrez Toscano Prize (awarded to the best-performing MSc candidates in Statistical Science each year). Her previous professional experience includes a data science role for Media Analytics, and a research internship at Pacmed, a healthcare tech start-up.

Alicia also holds a BSc in Econometrics and Operations Research and a BSc in Economics and Business Economics from the Erasmus University Rotterdam.

After meeting Mihaela van der Schaar at Oxford, Alicia says she’s “been fascinated by the diverse, creative and bleeding edge work of everyone in the lab ever since.”

Alicia hopes to explore ways of making machine learning ready for use in applied statistics, where problems are inferential rather than purely predictive in nature and the ability to give theoretical guarantees is essential. As she sees it, “there is much to gain by replacing linear regression with more flexible machine learning models.” She is particularly excited by potential applications in the areas of personalized and precision medicine, where she hopes machine learning can help healthcare “consider more than just the average patient in the future.”

Alicia is interested in building a better understanding of which algorithms work when and why, and aims to contribute to bridging the gap between theory and practice in machine learning. She is particularly interested in building decision support systems for doctors, and aiding knowledge discovery through next-generation clinical trials as well as analyses of genomics (and other omics) data.

Alicia’s studentship is funded by AstraZeneca.

Alicia has played waterpolo since the age of 12, and was German champion during high school. At Oxford, she represented the university as part of the women’s Blues team.

Bogdan Cebere

Bogdan is one of the lab’s research engineers, having joined the team in 2021. He received his bachelor’s degree in computer science in 2012 and his master’s degree in distributed systems in 2014, both from the University of Bucharest.

Prior to joining the van der Schaar Lab, Bogdan worked for roughly 10 years at a cybersecurity company. During this time, he contributed to a range of research projects related to network security, cryptography, and data privacy, which required high-performance solutions in embedded or cloud environments.

Bogdan has also made substantial contributions to open-source projects, mostly focused on privacy preserving techniques for machine learning. Some of his key contributions in this space have been for the OpenMined community; he and his collaborators published this work in workshops at the prominent NeurIPS and ICLR conferences.

Bogdan is driven to keep learning new things every day, and to keep improving—that’s his main reason for joining the van der Schaar lab.

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.