van der Schaar Lab

Data Imputation: An essential yet overlooked problem in machine learning

Missing data is a problem that’s often overlooked, especially by ML researchers, who typically assume access to complete input datasets when training their models.

Yet it is a problem haunting not only healthcare professionals and researchers but anyone engaging with scientific methods. Data might be missing because it was never collected, because entries were lost, or for many other reasons; other pieces of information may simply be difficult or costly to acquire.

In the past, data imputation has mostly been done using statistical methods, ranging from simple approaches such as mean imputation to more sophisticated iterative imputation. These algorithms estimate missing values based on the data that has been observed.
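To make the simplest of these concrete, here is a minimal, self-contained sketch of mean imputation in plain Python (the function name and the use of `None` as the missing-value marker are our own illustrative choices):

```python
def mean_impute(columns):
    """Replace each missing entry (None) with its column's observed mean.

    `columns` is a list of columns, each a list of floats, with None
    marking a missing value.
    """
    imputed = []
    for col in columns:
        observed = [v for v in col if v is not None]
        mean = sum(observed) / len(observed)  # assumes each column has at least one observed value
        imputed.append([mean if v is None else v for v in col])
    return imputed

# A column with a missing second entry: the observed mean (1 + 3) / 2 = 2 fills the gap.
print(mean_impute([[1.0, None, 3.0]]))  # [[1.0, 2.0, 3.0]]
```

Iterative imputation refines this idea: each feature is repeatedly re-predicted from the others until the filled-in values stabilise.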

But to do imputation well, we have to solve very interesting ML challenges. The van der Schaar Lab is a leader in applying machine learning to data imputation. Pioneering novel approaches, we create methodologies that not only deal with the most common problems of missing data, but also address new scenarios. Solving this problem has required us to incorporate and extend ideas from fields such as causality, AutoML, generative modelling, and even time series modelling.

There is, however, a plethora of methods one can use to impute the missing values in a dataset. Our lab has therefore created a package, called HyperImpute, that selects the best method for you. From its internal library of imputation methods, HyperImpute uses AutoML principles to match a method to your data. An overview is provided below, followed by our presentation at ICML 2022.
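The selection principle at the heart of this kind of AutoML approach can be sketched in a few lines: hide some values that were actually observed, let each candidate imputer reconstruct them, and keep the candidate with the lowest reconstruction error. The sketch below uses only two toy candidates with hypothetical names; the real HyperImpute searches a far larger model space and works per column over a full dataset:

```python
def mean_value(xs):
    return sum(xs) / len(xs)

def median_value(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def select_imputer(column, holdout_idx):
    """Pick the fill strategy that best reconstructs held-out observed values.

    `column` is a fully observed list of floats (for simplicity);
    `holdout_idx` is a set of positions we pretend are missing.
    """
    candidates = {"mean": mean_value, "median": median_value}
    train = [v for i, v in enumerate(column) if i not in holdout_idx]
    best_name, best_err = None, float("inf")
    for name, fn in candidates.items():
        fill = fn(train)
        err = sum((column[i] - fill) ** 2 for i in holdout_idx)
        if err < best_err:
            best_name, best_err = name, err
    return best_name

# With an outlier in the column, the median reconstructs held-out values better.
print(select_imputer([1.0, 1.0, 1.0, 1.0, 100.0], {0}))  # median
```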


Generalized Iterative Imputation with Automatic Model Selection

Daniel Jarrett*, Bogdan Cebere*, Tennison Liu, Alicia Curth, Mihaela van der Schaar

ICML 2022


HyperImpute is a very useful tool for anyone trying to solve their missing data issues quickly and easily. Beyond tools, however, we also think about missingness as a theoretical problem.

Causal networks show us that missing data is a hard problem, especially when missingness may not occur completely at random. Imagine that data is missing because some confounder is present. In a recent paper, our lab investigates this in the setting of treatment effects. In particular, we find that current solutions for missing data imputation may introduce bias in treatment effect estimates.

The reason is that there exist scenarios (for example in healthcare) where treatment causes missingness, but also where treatment is chosen based on the presence (or absence) of other variables. This leads to a causal structure (depicted below) that includes both a confounded path and a collider path between covariates and treatment.
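A toy simulation (our own illustration, not the paper's estimator) shows how this bias can arise. The outcome is built so the true treatment effect is exactly 2, and a confounder x drives treatment choice. Adjusting for the real x recovers the effect; when treatment itself makes x go unrecorded and we mean-impute it, the same adjustment no longer works:

```python
import numpy as np

# Toy population where y = 2*t + 3*x exactly, so the true treatment effect is 2.
# Treatment is confounded: treated units mostly have x = 1.
t = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
x = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)
y = 2 * t + 3 * x

def effect(t, x, y):
    """Treatment coefficient from a least-squares fit of y on [1, t, x]."""
    X = np.column_stack([np.ones_like(t), t, x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

print(effect(t, x, y))  # ~2.0: adjusting for the true confounder recovers the effect

# Now treatment causes missingness: every treated unit's x is unrecorded,
# and we mean-impute it from the controls (their mean is 0.25).
x_imputed = x.copy()
x_imputed[t == 1] = x[t == 0].mean()
print(effect(t, x_imputed, y))  # ~3.5: the imputed confounder fails to remove the bias
```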

To Impute or not to Impute?
Missing Data in Treatment Effect Estimation

Jeroen Berrevoets, Fergus Imrie, Trent Kyono, James Jordon, Mihaela van der Schaar




One method included in HyperImpute’s library is one of the lab’s earliest and most widely adopted: GAIN. GAIN is based on the well-known GAN framework, where missing values are treated as corrupted samples to be completed by the generative network. An architectural overview of this method can be seen below.
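GAIN's core bookkeeping can be sketched briefly: observed entries are kept, generator output fills only the missing slots, and a "hint" matrix reveals part of the missingness mask to the discriminator, which must guess which entries were truly observed. The training loop is omitted here, and the generator below is a crude stand-in (it proposes column means), not GAIN's learned network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mask m: 1 = observed, 0 = missing. Missing slots in x hold arbitrary values.
x = np.array([[1.0, 0.0],
              [3.0, 2.0]])
m = np.array([[1.0, 0.0],
              [1.0, 1.0]])

def generator(x, m):
    """Stand-in for GAIN's trained generator: propose each column's
    observed mean for every entry (the real generator is learned)."""
    col_means = (x * m).sum(axis=0) / m.sum(axis=0)
    return np.broadcast_to(col_means, x.shape)

# GAIN keeps observed values and takes generator output only where data is missing.
g = generator(x, m)
x_hat = m * x + (1 - m) * g

# The hint matrix reveals a random subset of the mask to the discriminator
# (1 = known observed, 0 = known missing, 0.5 = discriminator must guess).
hint_rate = 0.9
b = (rng.random(x.shape) < hint_rate).astype(float)
h = b * m + 0.5 * (1 - b)
```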

Generative Adversarial Imputation Nets (GAIN)

Jinsung Yoon*, James Jordon*, Mihaela van der Schaar

ICML 2018


While GAIN builds a generative model using purely neural networks, one could imagine a more principled approach through causality. The lab has therefore developed MIRACLE, which completes data with missingness using a causal deep learning approach. Specifically, MIRACLE regularises the hypothesis space of a neural network by simultaneously learning a causal graph, as depicted below.

MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms

Trent Kyono*, Yao Zhang*, Alexis Bellot, Mihaela van der Schaar

NeurIPS 2021


Parallel to causality, there is time series data. The work above all assumes static settings, yet time series are incredibly common. Our lab has introduced M-RNN, a method based on recurrent neural networks. With M-RNN we interpolate within as well as across data streams for dramatically improved estimation of missing data, as shown in the architectural overview below.
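The within-stream half of that idea can be illustrated with a deliberately simple stand-in: fill each gap from its nearest observed neighbours on both sides. M-RNN replaces this fixed rule with forward- and backward-running RNNs and adds the cross-stream component, which this sketch omits entirely:

```python
def interpolate_stream(values):
    """Fill None gaps in a single stream by linear interpolation between the
    nearest observed neighbours; lone-sided gaps copy the nearest value."""
    filled = list(values)
    n = len(filled)
    for i, v in enumerate(filled):
        if v is not None:
            continue
        # Nearest observed positions before and after i (None if no such position).
        prev = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
        nxt = next((j for j in range(i + 1, n) if values[j] is not None), None)
        if prev is not None and nxt is not None:
            w = (i - prev) / (nxt - prev)
            filled[i] = values[prev] * (1 - w) + values[nxt] * w
        elif prev is not None:
            filled[i] = values[prev]
        elif nxt is not None:
            filled[i] = values[nxt]
    return filled

print(interpolate_stream([0.0, None, None, 3.0]))  # [0.0, 1.0, 2.0, 3.0]
```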

Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks

Jinsung Yoon, William R. Zame, Mihaela van der Schaar



Data not missing at random and Informative missingness

Imputation is not always a good idea. At times, data may not be missing at random; instead, the missingness itself may be informative, and we can learn from it. Some of our works on this topic are listed below.
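One simple way to let a model learn from missingness, rather than paper over it, is to expose the missingness pattern as extra indicator features. A minimal sketch (the function name and encoding are our own illustrative choices, far simpler than the methods below):

```python
def add_missingness_indicators(rows, fill=0.0):
    """Append one 0/1 indicator per feature flagging whether it was missing,
    so a downstream model can learn from the missingness pattern itself.
    None entries are replaced with a neutral fill value only after being flagged.
    """
    augmented = []
    for row in rows:
        indicators = [1.0 if v is None else 0.0 for v in row]
        values = [fill if v is None else v for v in row]
        augmented.append(values + indicators)
    return augmented

# A lab test that was never ordered may itself signal the clinician's judgement.
print(add_missingness_indicators([[7.2, None]]))  # [[7.2, 0.0, 0.0, 1.0]]
```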

Learning from Clinical Judgements: Semi-Markov-Modulated Marked Hawkes Processes for Risk Prognosis

Ahmed M. Alaa, Scott Hu, Mihaela van der Schaar

ICML 2017


Accounting for Informative Sampling when Learning to Forecast Treatment Outcomes over Time

Toon Vanderschueren*, Alicia Curth*, Wouter Verbeke, Mihaela van der Schaar

ICML 2023


Active Sensing

The field of imputation is also closely related to active sensing, in which costly information must be acquired judiciously, based on a careful tradeoff between the benefits and costs of acquiring it.
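The cost-benefit tradeoff at the core of active sensing can be sketched with a greedy rule: acquire a measurement only when its expected benefit exceeds its cost, most profitable first. The inputs here are hypothetical; the lab's methods below learn these quantities from data rather than taking them as given:

```python
def plan_acquisitions(candidates):
    """Greedily choose which measurements to acquire.

    `candidates` maps measurement name -> (expected_benefit, cost), in the
    same units (e.g. expected reduction in prediction loss, priced in the
    same currency as the measurement).
    """
    chosen = [name for name, (benefit, cost) in candidates.items() if benefit > cost]
    # Acquire the most profitable measurements first (most negative cost - benefit).
    chosen.sort(key=lambda name: candidates[name][1] - candidates[name][0])
    return chosen

print(plan_acquisitions({
    "blood_panel": (5.0, 1.0),   # cheap and informative: acquire first
    "mri": (2.0, 10.0),          # informative but too costly: skip
    "heart_rate": (1.5, 0.1),    # nearly free: acquire
}))  # ['blood_panel', 'heart_rate']
```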

Our group has done extensive work on active sensing over the years. Below are listed a few of these works.

Deep Sensing: Active Sensing using Multi-Directional Recurrent Neural Networks

Jinsung Yoon, William R. Zame, Mihaela van der Schaar

ICLR 2018


ASAC: Active Sensing using Actor-Critic models

Jinsung Yoon, James Jordon, Mihaela van der Schaar

Machine Learning for Healthcare Conference 2019


Clearly, imputation is an important problem in machine learning. Our lab recognises this and is actively contributing to resolving the many issues involved in performing accurate and reliable imputation.

Jeroen Berrevoets

Jeroen Berrevoets joined the van der Schaar Lab from the Vrije Universiteit Brussel (VUB). Prior to this, he analyzed traffic data at 4 of Belgium’s largest media outlets and performed structural dynamics analysis at BMW Group in Munich.

As a PhD student in the van der Schaar Lab, Jeroen plans to explore the potential of machine learning in aiding medical discovery, rather than simply applying it to non-obvious predictions. His main research interests involve using machine learning and causal inference to gain understanding of various diseases and medications.

Much of this draws from his firmly-held belief that, “while learning to predict, machine learning models capture some of the underlying dynamics and structure of the problem. Exposing this structure in fields such as medicine could prove groundbreaking for disease understanding, and consequently drug discovery.”

Jeroen’s studentship is supported under the W. D. Armstrong Trust Fund. He will be supervised jointly by Mihaela van der Schaar and Dr. Eoin McKinney.

Alicia Curth

Alicia Curth, a self-described “full-blooded applied statistician,” recently completed an MSc in Statistical Science at the University of Oxford, where she graduated with distinction and was awarded the Gutiérrez Toscano Prize (awarded to the best-performing MSc candidates in Statistical Science each year). Her previous professional experience includes a data science role for Media Analytics, and a research internship at Pacmed, a healthcare tech start-up.

Alicia also holds a BSc in Econometrics and Operations Research and a BSc in Economics and Business Economics from the Erasmus University Rotterdam.

Since meeting Mihaela van der Schaar at Oxford, Alicia says she’s “been fascinated by the diverse, creative and bleeding edge work of everyone in the lab.”

Alicia hopes to explore ways of making machine learning ready for use in applied statistics, where problems are inferential rather than purely predictive in nature and the ability to give theoretical guarantees is essential. As she sees it, “there is much to gain by replacing linear regression with more flexible machine learning models.” She is particularly excited by potential applications in the areas of personalized and precision medicine, where she hopes machine learning can help healthcare “consider more than just the average patient in the future.”

Alicia is interested in building a better understanding of which algorithms work when and why, and aims to contribute to bridging the gap between theory and practice in machine learning. She is particularly interested in building decision support systems for doctors, and aiding knowledge discovery through next-generation clinical trials as well as analyses of genomics (and other omics) data.

Alicia’s studentship is funded by AstraZeneca.

Alicia has played waterpolo since the age of 12, and was German champion during high school. At Oxford, she represented the university as part of the women’s Blues team.

Bogdan Cebere

Bogdan is one of the lab’s research engineers, having joined the team in 2021. He received his bachelor’s degree in computer science in 2012 and his master’s degree in distributed systems in 2014, both from the University of Bucharest.

Prior to joining the van der Schaar Lab, Bogdan worked for roughly 10 years at a cybersecurity company. During this time, he contributed to a range of research projects related to network security, cryptography, and data privacy, which required high-performance solutions in embedded or cloud environments.

Bogdan has also made substantial contributions to open-source projects, mostly focused on privacy preserving techniques for machine learning. Some of his key contributions in this space have been for the OpenMined community; he and his collaborators published this work in workshops at the prominent NeurIPS and ICLR conferences.

Bogdan is driven to keep learning new things every day, and to keep improving—that’s his main reason for joining the van der Schaar lab.

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.