## On the Value of Interaction and Function Approximation in Imitation Learning

With Jingbo Liu, Yanjun Han, Lin F. Yang, Jiantao Jiao and Kannan Ramchandran

In practice, often learners implement algorithms such as DAgger which actively query the expert while interacting with the environment. In contrast, our previous work studies approaches such as behavior cloning as well as Mimic-MD which passively interact with the expert. In the worst case Rajaraman et al (2020) establishes that there is no benefit of actively querying the expert if the model is not known. The lower bound relies on constructing an MDP with a “bad” state which is absorbing and offers no reward and showing that any learner visits this state with reasonable probability. This worst case example is pathological in the following sense: in practical situations such as driving a car, experts often can recover at states and collect a high reward even if a “mistake” is made locally. This assumption is captured by the $$\mu$$-recoverability assumption of Ross and Bagnell (2011) which assumes that the difference in the $$Q$$-value under the expert policy across different actions in a state does not deviate by more than $$\mu$$ from the maximum.

In addition, Ross and Bagnell (2011) propose a reduction showing that any policy which minimizes the $$0$$-$$1$$ loss with the expert action distribution to $$\le \epsilon$$ when the state is generated by the learner's own state distribution incurs suboptimality $$\le \mu H \epsilon$$ under this assumption. In this work, we show that this reduction is in fact statistically optimal - the resulting policy upon interacting with the MDP for $$N$$ episodes incurs a suboptimality of $$O ( \mu S H / N )$$ which we show is optimal up to log-factors. In contrast, we show that any algorithm which does not interact with the MDP and uses an offline dataset of $$N$$ expert trajectories must incur suboptimality growing as $$\Omega (S H^2/N)$$ even under the $$\mu$$-recoverability assumption. This result establishes a clear and provable separation of the minimax suboptimality of IL when the learner can actively query the expert, and when the model is unknown on instances satisfying $$\mu$$-recoverability.

In this paper, we also study of IL going beyond the tabular setting, in the presence of function approximation. We initiate the study in the linear-expert setting, where the expert plays actions according to a linear classifier of known state-action features. This setting is a generalization of the linear-$$Q^*$$ setting when the expert policy is optimal. Here, we reduce IL to multi-class classification and show that with high probability, the suboptimality incurred by behavior cloning is in fact $$\tilde{O} (dH^2/N)$$. This is optimal up to log-factors, but can be improved to $$\tilde{O} (dH/N)$$ if we have a linear expert with parameter-sharing across time steps.

In contrast, when the MDP transition structure is known to the learner such as in the case of simulators, we demonstrate a fundamental difference compared to the tabular setting in terms of the performance Mimic-MD when extended to the function approximation setting. We introduce a new problem called confidence set linear classification, that can be used to construct sample-efficient IL algorithms. Informally, in confidence set linear classification, the learner must output a classifier as well as a measurable set of inputs (confidence set) on which the output classifier certifiably does not make a mistake. The loss incurred is the probability measure of the confidence set. The intuition behind this reduction is that a good confidence set linear classifier can be used to construct a large measure of states on which the expert action is known - this is amenable for the learner to generate large volumes of training data when the model is known by artificially rolling out the expert policy wherever possible. Indeed, we show that with a linear expert and with linearly realizable reward functions, a confidence set linear classification algorithm with worst case expected error $$\le \epsilon$$ can be used to instantiate Mimic-MD with suboptimality incurred $$O (\epsilon H^{3/2} \sqrt{d/N})$$.

We instantiate this reduction in the case of $$|A| = 2$$ actions and with state-action features are uniformly distributed on the unit sphere. Here, we show that the minimax rate of confidence set linear classification is $$\tilde{\Theta} (d^{3/2} / N)$$. As a corollary, this result establishes preliminary evidence towards breaking the quadratic-$$H$$ barrier going beyond the tabular setting