Missing data is a problem that’s often overlooked, especially by ML researchers who assume access to complete input datasets when training their models.
Yet it is a problem haunting not only healthcare professionals and researchers, but anyone engaging with scientific methods. Data might be missing because it was never collected, because entries were lost, or for many other reasons; some pieces of information are simply difficult or costly to acquire.
In the past, data imputation has mostly been done using statistical methods, ranging from simple approaches such as mean imputation to more sophisticated iterative imputation. These algorithms estimate missing values based on the data that has been observed or measured.
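To make the gap between these two classic approaches concrete, here is a toy, plain-Python sketch (illustrative only, not production code): mean imputation fills every gap with the column mean, while iterative imputation repeatedly refits a simple model on another column and re-imputes until the filled values stabilise.

```python
# Two classic statistical imputers, sketched for a small dataset where
# None marks a missing entry.

def mean_impute(column):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def iterative_impute(x, y_with_gaps, rounds=20):
    """One-covariate iterative imputation: initialise y's gaps with the
    mean, then repeatedly refit a least-squares line y ~ a*x + b on all
    rows and re-impute only the originally missing entries. Real iterative
    imputers cycle like this over many columns."""
    y = mean_impute(y_with_gaps)
    for _ in range(rounds):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        a = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
        b = my - a * mx
        y = [a * xi + b if orig is None else orig
             for xi, orig in zip(x, y_with_gaps)]
    return y
```

On data with a clear linear trend, the iterative estimate converges to the trend value rather than the column mean, which is exactly why iterative imputation tends to beat mean imputation.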
But to do imputation well, we have to solve very interesting ML challenges. The van der Schaar Lab leads in data imputation with the help of machine learning. Pioneering novel approaches, we create methodologies that not only deal with the most common problems of missing data, but also address new scenarios. Solving this problem required us to incorporate and extend ideas from fields such as causality, AutoML, generative modelling, and even time series modelling.
There is a plethora of methods one can use to impute the missing values in a dataset. Our lab has therefore created a package, called HyperImpute, that selects the best method for you. From its internal library of imputation methods, HyperImpute uses AutoML principles to match a method to your data. An overview is provided below, followed by our presentation at ICML 2022.
Generalized Iterative Imputation with Automatic Model Selection
Daniel Jarrett*, Bogdan Cebere*, Tennison Liu, Alicia Curth, Mihaela van der Schaar
Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
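The column-wise model-selection idea can be sketched in a deliberately tiny form (this is the principle, not the actual HyperImpute API): score a few candidate imputers on the entries we can actually check, then impute the gaps with the winner. The real framework searches a much larger space of learners and hyperparameters, and scores on proper validation splits rather than the in-sample fit used here.

```python
# A toy version of column-wise model selection for imputation.

def candidates(x, y):
    """Return named imputers mapping a covariate value xi to a guess for y,
    fitted on the rows where y is observed."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_y = sum(yi for _, yi in obs) / len(obs)
    mx = sum(xi for xi, _ in obs) / len(obs)
    sxx = sum((xi - mx) ** 2 for xi, _ in obs)
    a = sum((xi - mx) * (yi - mean_y) for xi, yi in obs) / sxx
    b = mean_y - a * mx
    return {"mean": lambda xi: mean_y, "linear": lambda xi: a * xi + b}

def select_and_impute(x, y):
    """Pick the candidate with the lowest squared error on observed rows
    (a real system would use held-out entries), then fill the gaps."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    models = candidates(x, y)
    def score(f):
        return sum((f(xi) - yi) ** 2 for xi, yi in obs)
    best_name = min(models, key=lambda name: score(models[name]))
    best = models[best_name]
    return best_name, [best(xi) if yi is None else yi for xi, yi in zip(x, y)]
```

On a column with a linear relationship to its covariate, the selector picks the linear imputer; on a noisy constant column, the mean imputer wins. HyperImpute automates exactly this choice, per column, over a far richer candidate set.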
HyperImpute is a very useful tool for anyone trying to resolve their missing-data issues easily and quickly. Beyond tools, however, we also think about missingness as a theoretical problem.
Causal networks show us that missing data is a hard problem, especially in settings where missingness may not occur completely at random. Imagine missingness arising in the data because some confounder is present. In a recent paper, our lab investigates this in the setting of treatment effects. In particular, we find that current solutions for missing data imputation may introduce bias into treatment effect estimates.
The reason is that there exist scenarios (for example, in healthcare) where treatment causes missingness, but also where treatment is chosen based on the presence (or absence) of other variables. This realisation leads to a particular causal structure (depicted below), which includes both a confounded path and a collider path between covariates and treatment.
To Impute or not to Impute?
Missing Data in Treatment Effect Estimation
Jeroen Berrevoets, Fergus Imrie, Trent Kyono, James Jordon, Mihaela van der Schaar
Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work, we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poorly performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population into distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.
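A minimal sketch of the selective-imputation idea, assuming the analyst can label each covariate by its role in the MCM structure (the column names and fill values below are hypothetical inputs, and the fill strategy stands in for any imputer):

```python
# Selective imputation, sketched: under MCM, covariates whose missingness
# *drives* treatment selection should be imputed, while covariates whose
# missingness is *caused by* the treatment should be left missing, since
# imputing them erases information needed for unbiased estimates.

def selective_impute(rows, impute_cols, keep_missing_cols, fill):
    """Impute only the columns listed in impute_cols; leave
    treatment-induced missingness (keep_missing_cols) untouched, e.g. to
    be encoded with an explicit indicator downstream."""
    out = []
    for row in rows:
        new = dict(row)
        for col in impute_cols:
            if new.get(col) is None:
                new[col] = fill[col]
        # columns in keep_missing_cols stay None on purpose
        out.append(new)
    return out
```

The point of the sketch is the split itself: which columns land in `impute_cols` versus `keep_missing_cols` comes from domain knowledge of the causal structure, not from the data alone.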
One such method included in HyperImpute’s library is one of the lab’s earliest and most widely adopted: GAIN. GAIN is based on the well-known GAN framework, where missing values are treated as corrupted samples to be completed by the generative network. An architectural overview of this method can be seen below.
Generative Adversarial Imputation Nets (GAIN)
Jinsung Yoon*, James Jordon*, Mihaela van der Schaar
We propose a novel method for imputing missing data by adapting the well-known Generative Adversarial Nets (GAN) framework. Accordingly, we call our method Generative Adversarial Imputation Nets (GAIN). The generator (G) observes some components of a real data vector, imputes the missing components conditioned on what is actually observed and outputs a completed vector. The discriminator (D) then takes a completed vector and attempts to determine which components were actually observed and which were imputed. To ensure that D forces G to learn the desired distribution, we provide D with some additional information in the form of a hint vector. The hint reveals to D partial information about the missingness of the original sample, which is used by D to focus its attention on the imputation quality of particular components. This hint ensures that G does in fact learn to generate according to the true data distribution. We tested our method on various datasets and found that GAIN significantly outperforms state-of-the-art imputation methods.
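The hint mechanism lends itself to a short sketch. Following the construction in the abstract, the hint reveals the true mask M (1 = observed, 0 = imputed) on a random subset of components, and signals “unknown” (0.5) everywhere else, so the discriminator must judge imputation quality on the hidden components:

```python
import random

# GAIN's hint vector, sketched: B selects which components of the mask M
# are revealed to the discriminator; the rest are set to 0.5 ("unknown").
# H = B * M + 0.5 * (1 - B), component-wise.

def hint_vector(mask, hint_rate, rng):
    b = [1.0 if rng.random() < hint_rate else 0.0 for _ in mask]
    return [bi * mi + 0.5 * (1.0 - bi) for bi, mi in zip(b, mask)]
```

Each hint component is therefore either the true mask value (revealed) or 0.5 (hidden); tuning `hint_rate` controls how much of the missingness pattern the discriminator gets to see.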
While GAIN builds a generative model using purely neural networks, one could imagine a more principled approach through causality. As such, the lab has developed MIRACLE, which completes data with missingness using a causal deep learning approach. Specifically, MIRACLE regularises the hypothesis space of a neural network by simultaneously learning a causal graph, as depicted below.
MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms
Trent Kyono*, Yao Zhang*, Alexis Bellot, Mihaela van der Schaar
Missing data is an important problem in machine learning practice. Starting from the premise that imputation methods should preserve the causal structure of the data, we develop a regularization scheme that encourages any baseline imputation method to be causally consistent with the underlying data generating mechanism. Our proposal is a causally-aware imputation algorithm (MIRACLE). MIRACLE iteratively refines the imputation of a baseline by simultaneously modelling the missingness generating mechanism, encouraging imputation to be consistent with the causal structure of the data. We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation over a variety of benchmark methods across all three missingness scenarios: at random, completely at random, and not at random.
Parallel to causality is time series data. The work above assumes static settings, yet time series are incredibly common in all sorts of domains. Our lab has introduced M-RNN, a method based on recurrent neural networks. With M-RNN, we interpolate within as well as across data streams for dramatically improved estimation of missing data. We show this in the architectural overview below.
Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks
Jinsung Yoon, William R. Zame, Mihaela van der Schaar
IEEE TBME 2018
Most time-series datasets with multiple data streams have (many) missing measurements that need to be estimated. Most existing methods address this estimation problem either by interpolating within data streams or imputing across data streams; we develop a novel approach that does both. Our approach is based on a deep learning architecture that we call a Multi-directional Recurrent Neural Network (M-RNN). An M-RNN differs from a bi-directional RNN in that it operates across streams in addition to within streams, and because the timing of inputs into the hidden layers is both lagged and advanced. To demonstrate the power of our approach we apply it to a familiar real-world medical dataset and demonstrate significantly improved performance.
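Stripped of the recurrent networks, the within-and-across intuition can be sketched as follows. The fixed 50/50 blend at the end is a placeholder assumption: where this sketch averages the two estimates, M-RNN learns the combination.

```python
# The core M-RNN intuition without neural networks: estimate a missing
# point both within its stream (temporal interpolation between the nearest
# observed neighbours) and across streams (from the other streams' values
# at the same timestep), then blend the two estimates.

def interpolate_within(stream, t):
    """Linear interpolation between the nearest observed neighbours in time."""
    left = next((i for i in range(t, -1, -1) if stream[i] is not None), None)
    right = next((i for i in range(t, len(stream)) if stream[i] is not None), None)
    if left is None:
        return stream[right]
    if right is None:
        return stream[left]
    if left == right:
        return stream[left]
    w = (t - left) / (right - left)
    return (1 - w) * stream[left] + w * stream[right]

def impute_across(streams, s, t):
    """Average the other streams' observed values at the same timestep."""
    vals = [streams[k][t] for k in range(len(streams))
            if k != s and streams[k][t] is not None]
    return sum(vals) / len(vals) if vals else None

def mrnn_style_impute(streams):
    out = [list(st) for st in streams]
    for s, stream in enumerate(streams):
        for t, v in enumerate(stream):
            if v is None:
                within = interpolate_within(stream, t)
                across = impute_across(streams, s, t)
                out[s][t] = within if across is None else 0.5 * within + 0.5 * across
    return out
```

Using both directions is what distinguishes this from plain interpolation: when streams are correlated, the across-stream estimate catches deviations that within-stream interpolation misses.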
Clearly, imputation is an important problem in machine learning. Our lab recognises this and is actively contributing to resolving the many issues involved in performing accurate and reliable imputation.