Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally

With Yanjun Han, Lin F. Yang, Kannan Ramchandran and Jiantao Jiao

With a known model, Mimic-MD provably breaks the error-compounding barrier in IL. However, it is a priori unclear whether the dependence on the horizon, $$H^{3/2}$$, can be improved further given an accurate model. To this end, we establish a lower bound showing that for any learner that knows the MDP transition function, there exists an IL instance on $$|S| = 3$$ states such that the suboptimality of the learner scales as $$\Omega(H^{3/2}/N)$$ with constant probability. This result shows that, in terms of the horizon dependence, Mimic-MD is optimal in the worst case. The lower bound construction relies on a novel reduction to mean estimation with subsampled observations, and on establishing statistical limits for this problem, which may be of independent interest.
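To give intuition for the subsampled-observation model, here is a toy simulation: each sample from the underlying distribution is revealed only with some probability, and we measure the error of the empirical mean over the revealed samples. The specific observation model and all numbers below are our own illustrative assumptions, not the paper's lower-bound construction.

```python
import random


def subsampled_mean_estimate(samples, keep_prob, rng):
    """Empirical mean when each draw is independently revealed
    only with probability `keep_prob` (toy subsampling model)."""
    observed = [x for x in samples if rng.random() < keep_prob]
    if not observed:
        return 0.0
    return sum(observed) / len(observed)


rng = random.Random(0)
true_mean = 0.3
errors = []
for _ in range(200):
    # Bernoulli(0.3) samples; roughly half of them are revealed.
    samples = [1.0 if rng.random() < true_mean else 0.0 for _ in range(1000)]
    est = subsampled_mean_estimate(samples, keep_prob=0.5, rng=rng)
    errors.append(abs(est - true_mean))

avg_error = sum(errors) / len(errors)
```

The average error shrinks with the number of revealed samples, as expected for mean estimation; the paper's contribution is pinning down the statistical limits of a related (but more structured) subsampled-observation problem.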

In practice, the demonstrator often carries out the task at hand very efficiently and is near optimal. In this work, we also explore IL under the additional restriction that the expert is an optimal policy under the true underlying reward function. This setting turns out to be quite challenging because of the highly non-convex reward structure imposed by the observations in the expert dataset. While it is trivial to construct good estimates of the expert state distribution, realizing good state distributions via Markovian policies is a far more challenging problem, and overcoming this hurdle is necessary for arriving at policies with small suboptimality. In this paper, we propose an efficient algorithm, termed Mimic-Mixture, which can provably realize a near-unbiased estimate (up to an error of $$O(1/N)$$) of the visitation probability of any single state in the MDP. As a consequence, for MDPs on $$|S| = 3$$ states with rewards only on the terminal layer, Mimic-Mixture returns a policy incurring suboptimality $$O(1/N)$$.
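The underlying idea of matching the visitation probability of a single designated state can be sketched with a toy calculation: given two base policies whose terminal-state distributions are known (the model is known), a mixture weight can be chosen so that the mixture hits a target probability at that state exactly. All policies and numbers below are hypothetical illustrations, not the paper's Mimic-Mixture construction.

```python
def mixture_weight(p0, p1, p_target):
    """Weight w on policy 0 so that the mixture w*pi0 + (1-w)*pi1
    places exactly p_target mass on the chosen state:
        w*p0 + (1-w)*p1 = p_target."""
    assert min(p0, p1) <= p_target <= max(p0, p1), "target must be bracketed"
    return (p_target - p1) / (p0 - p1)


# Terminal-state distributions of two base policies in a toy 3-state MDP
# (illustrative numbers only).
d0 = [0.7, 0.2, 0.1]
d1 = [0.1, 0.6, 0.3]

# Target visitation probability of state 0, e.g. estimated from expert data.
p_target = 0.4

w = mixture_weight(d0[0], d1[0], p_target)
realized = w * d0[0] + (1 - w) * d1[0]  # equals p_target by construction
```

With rewards supported only on the terminal layer and a single state whose probability must be matched, controlling this one visitation probability is what drives the suboptimality of the returned policy.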

In contrast, we show that no algorithm can achieve suboptimality $$O(\sqrt{H/N})$$ with high probability if the expert is not constrained to be optimal. Thus, our work formally establishes the benefit of imposing optimality of the expert when the model is known. This contrasts with the result of Rajaraman et al. (2020), which shows that expert optimality does not help in the worst case when the learner cannot interact with the environment.