van der Schaar Lab

Why Tabular Foundation Models Should Be a Research Priority

Recent text and image foundation models are incredibly impressive, and these models are attracting an ever-increasing portion of research resources (see figure below, representing different modalities in foundation model research across recent ML conferences).

In our ICML 2024 position paper, we argue that foundation model research should explore other modalities more, in particular tabular data. We believe the time is now to start developing tabular foundation models, or what we coin Large Tabular Models (LTMs). LTMs could revolutionise the way science and ML use tabular data: not as single datasets analysed in a vacuum, but contextualised with respect to related datasets.

Why we should care about tabular data

Tabular data is ubiquitous in the real world, from electronic healthcare records to census data, from finance to the natural sciences. These datasets are fundamental to the progress of scientific knowledge and to informing public policy. And yet, the tabular domain offers uniquely exciting, large, unsolved challenges for researchers. For example, in contrast to other domains, deep learning models still often perform only on par with tree-based models.

Lastly, humans are very limited at understanding and analysing tabular data themselves (in contrast to images and text), hence a good LTM could truly extend human capabilities.

Why the tabular domain is so challenging

In our paper, we discuss four requirements for an LTM to be adaptable to a wide range of different tasks.

Because of these requirements, building an LTM is conceptually and architecturally complicated and poses unique and exciting challenges for researchers. We discuss how current work on LTMs falls short of these requirements, and why LLM-based approaches are ill-suited.

Building generative LTMs presents significant challenges due to the complexity of modelling diverse datasets and data types in context. These models are currently limited in scale and generalisation, often trained on small, varied datasets that can be noisy or incomplete. The diversity and quality of data, particularly in the tabular domain, complicate the creation of broadly applicable LTMs. Evaluation is difficult because it requires both intrinsic metrics (how well the model captures the data) and extrinsic metrics (how useful the model is downstream), and these models must also be assessed for privacy and bias, as tabular data can contain discriminatory features and biases similar to those found in other data types.

Why solving the challenge is worth it

Tabular foundation models can significantly enhance machine learning by improving the representation and inclusiveness of underrepresented groups and scarce data domains such as healthcare. They could enable both direct adaptation for specific tasks and indirect adaptation through the generation of synthetic data, which can augment real datasets and address data scarcity. LTMs also have the potential to support responsible AI by improving robustness, privacy, and data democratisation. In scientific applications, LTMs may facilitate meta-analyses by consolidating heterogeneous datasets and serve as powerful assistants for data scientists, enhancing productivity and automating complex data tasks.
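To make the indirect-adaptation idea concrete, here is a minimal, purely illustrative sketch (not an LTM, and not the paper's method): it fits simple per-column marginals to a tiny tabular dataset and draws synthetic rows to augment it. All names and values are hypothetical; a real LTM would model columns jointly and in context with related datasets, but the augmentation workflow looks the same.

```python
import random
import statistics

# Hypothetical toy dataset (illustration only).
real_rows = [
    {"age": 34, "sex": "F", "bmi": 22.1},
    {"age": 51, "sex": "M", "bmi": 27.4},
    {"age": 45, "sex": "F", "bmi": 24.9},
    {"age": 29, "sex": "M", "bmi": 21.7},
]

def fit_marginals(rows):
    """Per-column summary: (mean, std) for numeric columns,
    observed values (for frequency-weighted sampling) for categorical ones."""
    model = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            model[col] = ("num", statistics.mean(values), statistics.stdev(values))
        else:
            model[col] = ("cat", values)
    return model

def sample_row(model, rng):
    """Draw one synthetic row, column by column, from the fitted marginals."""
    row = {}
    for col, spec in model.items():
        if spec[0] == "num":
            _, mu, sigma = spec
            row[col] = round(rng.gauss(mu, sigma), 1)
        else:
            row[col] = rng.choice(spec[1])
    return row

rng = random.Random(0)
model = fit_marginals(real_rows)
synthetic = [sample_row(model, rng) for _ in range(4)]
augmented = real_rows + synthetic  # a downstream model would train on this
print(len(augmented))  # 8
```

A per-column marginal model ignores the correlations that make real tabular data hard; the point of the sketch is only the pipeline shape — fit a generative model, sample synthetic rows, concatenate with the scarce real data.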

Boris van Breugel

Boris van Breugel most recently completed an MSc in Machine Learning at University College London. His study was supported under a Young Talent Award by Prins Bernhard Cultuurfonds, and a VSBfonds scholarship. Prior to this, he received an MASt in Applied Mathematics from the University of Cambridge.

Reflecting his broad research background, Boris’ current research interests range from model interpretability to learning from missing data, from modelling treatment effects to high-dimensional omics data.

While studying for his MSc in Machine Learning at UCL, Boris developed a model to detect Alzheimer’s disease in different forms of medical imaging data, potentially enabling diagnosis at an earlier stage and thereby aiding the development of more effective treatment plans. He found the healthcare domain uniquely challenging and rewarding, and decided to continue research in the domain.

As a PhD student with the van der Schaar Lab, Boris aims to develop methods for finding meaningful structure in omics data—in essence, he says, “the amount of omics data is increasing at a huge speed, and machine learning methods can allow us to interpret and make sense of all this data.”

Boris’ studentship is funded by the Office of Naval Research (ONR).

Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.