van der Schaar Lab

Clarification on Strictly Batch Imitation Learning by Energy-based Distribution Matching

The purpose of this short note is to prevent possible confusion regarding Strictly Batch Imitation Learning by Energy-based Distribution Matching, one of our lab’s papers published in 2020.

It has come to our attention that the notation employed in this paper may have inadvertently blurred the line between online and offline distributions. This distinction is clarified below.

We can speak of two distinct “state distribution” terms when referring to ρθ:
(1) a state distribution parameterized directly by θ, which we call ρθ, and
(2) a state distribution induced by some πθ, which is perhaps more explicitly denoted ρπ_θ.

Importantly, (1) and (2) are entirely distinct, and have no prior relation to one another. Since we cannot learn the latter in the batch setting, all we are doing is exploring whether there is any benefit in learning the former instead.
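For concreteness (using notation we introduce only for this note): ρπ_θ is the usual occupancy measure obtained by actually rolling out the policy,

\[
\rho_{\pi_\theta}(s) \;\propto\; \sum_{t \ge 0} \gamma^{t}\, \mathbb{P}\!\left(s_t = s \,\middle|\, s_0 \sim p_0,\; a_t \sim \pi_\theta(\cdot \mid s_t)\right),
\]

where γ is a discount factor and p_0 the initial-state distribution, whereas ρθ is simply a density over states whose parameters happen to be θ, and requires no environment interaction to evaluate.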

When we wrote about these quantities in the paper, we noted that the two are not the same thing:

[p4] “The actual occupancy measure corresponds to rolling out πθ, and if we could do that, we would naturally recover an approach not unlike the variety of distribution-aware algorithms in the literature; see e.g. [50]. In the strictly batch setting, we clearly cannot sample directly from this (online) distribution. However, as a matter of multitask learning, we still hope to gain from jointly learning an (offline) model of the state distribution.”

[p6] “In the online setting, minimizing Equation 8 is equivalent to injecting temporal consistency into behavioral cloning. In the offline setting, instead of this temporal relationship we are now leveraging the parameter relationship between πθ and ρθ—that is, from the joint EBM.”

Again, there is no guarantee that the state distribution defined by ρθ is consistent with what (in this clarificatory note) we are explicitly referring to as ρπ_θ.
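To make the parameter relationship concrete, one way such tying can be written (a sketch for intuition only, with f_θ a single function mapping states to per-action scores) is

\[
\pi_\theta(a \mid s) \;=\; \frac{\exp f_\theta(s)[a]}{\sum_{a'} \exp f_\theta(s)[a']},
\qquad
\rho_\theta(s) \;=\; \frac{\sum_{a} \exp f_\theta(s)[a]}{Z(\theta)},
\]

so that the policy and the state distribution share every parameter through f_θ, yet nothing forces ρθ to coincide with the rollout distribution ρπ_θ.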

Rather, the point of this exercise is more primitive: when very flexible policy classes such as deep networks are trained on very small datasets, it has been shown that straightforward forms of regularization, such as weight decay or assumed reward sparsity [9], can already alleviate overfitting in behavioral cloning. Using the idea in [47], we are simply testing the hypothesis that some amount of arbitrary parameter tying would empirically do something similar.
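As a purely illustrative sketch of what "parameter tying" looks like in code (this is not the paper's implementation or training objective; the architecture and names below are placeholders), a single network can define both the policy used for behavioral cloning and an unnormalised state density, so that any term involving the latter regularises the very same weights:

```python
import torch
import torch.nn as nn

class TiedPolicyAndStateModel(nn.Module):
    """Toy illustration: one set of weights defines both pi_theta and rho_theta."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # f_theta maps a state to one score per action.
        self.f = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def policy_logits(self, s: torch.Tensor) -> torch.Tensor:
        # pi_theta(a|s) = softmax over the per-action scores f_theta(s).
        return self.f(s)

    def state_log_density(self, s: torch.Tensor) -> torch.Tensor:
        # log rho_theta(s) = logsumexp_a f_theta(s)[a], up to the log-partition term.
        return torch.logsumexp(self.f(s), dim=-1)


# Behavioral cloning trains only policy_logits; any additional objective placed on
# state_log_density ties back into the same parameters, acting as a regulariser.
model = TiedPolicyAndStateModel(state_dim=4, n_actions=2)
states = torch.randn(8, 4)
print(model.policy_logits(states).shape)      # torch.Size([8, 2])
print(model.state_log_density(states).shape)  # torch.Size([8])
```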

Daniel Jarrett

Ioana Bica

Ioana Bica is a second-year PhD student at the University of Oxford and at the Alan Turing Institute. She previously completed a BA and an MPhil in Computer Science at the University of Cambridge, where she specialised in machine learning and its applications to biomedicine.

Ioana’s PhD research focuses on building machine learning methods for causal inference and individualised treatment effect estimation from observational data. In particular, she has developed methods capable of estimating the heterogeneous effects of time-dependent treatments, thus enabling us to determine when to give treatments to patients and how to select among multiple treatments over time.

Recently, Ioana has started working on methods for understanding and modelling clinical decision making through causality, inverse reinforcement learning and imitation learning.

Nick Maxfield

From 2020 to 2022, Nick oversaw the van der Schaar Lab’s communications, including media relations, content creation, and maintenance of the lab’s online presence.