The purpose of this short note is to prevent possible confusion regarding Strictly Batch Imitation Learning by Energy-based Distribution Matching, one of our lab’s papers published in 2020.
It has come to our attention that the notation employed in this paper may have inadvertently blurred the line between online and offline distributions. This distinction is clarified below.
When we write ρθ, there are two distinct “state distribution” objects one could mean:
(1) a state distribution parameterized directly by θ, which we call ρθ, and
(2) a state distribution induced by some πθ, which is perhaps more explicitly denoted ρπ_θ.
Importantly, (1) and (2) are entirely distinct objects with no prior relation to one another. Since we cannot learn the latter in the strictly batch setting, all we are doing is exploring whether there is any benefit in learning the former instead.
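To make the distinction concrete, here is a minimal numerical sketch in a tabular setting. Everything in it is an illustrative assumption rather than the paper’s exact construction: we tie a state marginal ρθ to the policy logits via a logsumexp (in the spirit of joint energy-based models), and we compute ρπ_θ as the stationary state distribution of rolling out πθ in a randomly generated MDP. The point is only that both quantities are well defined from the same θ, yet nothing forces them to agree.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 3
theta = rng.normal(size=(S, A))  # shared parameters: logits f_theta(s, a)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# (1) Policy and an EBM-style state marginal tied to the SAME theta
#     (illustrative tying, not the paper's exact model):
#     pi_theta(a|s) = softmax_a f_theta(s, a)
#     rho_theta(s) proportional to exp(logsumexp_a f_theta(s, a))
pi_theta = softmax(theta, axis=1)                      # (S, A) policy
m = theta.max(axis=1)
log_rho = m + np.log(np.exp(theta - m[:, None]).sum(axis=1))  # logsumexp over a
rho_theta = softmax(log_rho)                           # (S,) offline state model

# (2) rho_{pi_theta}: the occupancy measure obtained by actually rolling
#     out pi_theta in some MDP -- here, random Dirichlet transitions.
P = rng.dirichlet(np.ones(S), size=(S, A))             # P[s, a, s']
P_pi = np.einsum('sa,sat->st', pi_theta, P)            # state->state kernel under pi
d = np.ones(S) / S
for _ in range(1000):                                  # power iteration to stationarity
    d = d @ P_pi

print("rho_theta      :", np.round(rho_theta, 3))
print("rho_{pi_theta} :", np.round(d, 3))
# The two distributions generally differ: the tied marginal rho_theta knows
# nothing about the environment dynamics that shape rho_{pi_theta}.
```

Both vectors are valid distributions over states, but only the second depends on the transition dynamics, which is exactly why it is out of reach in the strictly batch setting.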
When we wrote about these quantities in the paper, we noted that the two are not the same thing:
[p4] “The actual occupancy measure corresponds to rolling out πθ, and if we could do that, we would naturally recover an approach not unlike the variety of distribution-aware algorithms in the literature; see e.g. . In the strictly batch setting, we clearly cannot sample directly from this (online) distribution. However, as a matter of multitask learning, we still hope to gain from jointly learning an (offline) model of the state distribution.”
[p6] “In the online setting, minimizing Equation 8 is equivalent to injecting temporal consistency into behavioral cloning. In the offline setting, instead of this temporal relationship we are now leveraging the parameter relationship between πθ and ρθ—that is, from the joint EBM.”
Again, there is no guarantee that the state distribution defined by ρθ is consistent with what (in this clarificatory note) we are explicitly referring to as ρπ_θ.
Rather, the point of this exercise is more primitive: with very flexible policy classes such as deep networks and very small training datasets, it has been shown that adding straightforward forms of regularization, such as weight decay or assumed reward sparsity , can already alleviate overfitting in behavioral cloning. Using the idea in , we are simply testing the hypothesis that some amount of arbitrary parameter tying would empirically do something similar.