## On the Value of Interaction and Function Approximation in Imitation Learning

In practice, learners often implement algorithms such as DAgger which actively query the expert while interacting with the environment. In contrast, our previous work studies approaches such as behavior cloning and Mimic-MD which interact with the expert only passively. In the worst case, Rajaraman et al. (2020) establish that there is no benefit to actively querying the expert if the model is not known. The lower bound relies on constructing an MDP with a "bad" state which is absorbing and offers no reward, and showing that any learner visits this state with reasonable probability. This worst-case example is pathological in the following sense: in practical situations such as driving a car, experts can often recover and collect high reward even after making a local "mistake". This is captured by the \(\mu\)-recoverability assumption of Ross and Bagnell (2011), which requires that, in every state, the \(Q\)-value of any action under the expert policy is within \(\mu\) of the maximum. Moreover, Ross and Bagnell (2011) propose a reduction showing that any policy which minimizes the \(0\)-\(1\) loss against the expert's action distribution to \(\le \epsilon\), when states are generated from the learner's own state distribution, incurs suboptimality \(\le \mu H \epsilon\) under this assumption. In this work, we show that this reduction is in fact statistically optimal: the resulting policy, upon interacting with the MDP for \(N\) episodes, incurs a suboptimality of \(O(\mu S H / N)\), which we show is optimal up to log-factors. In contrast, we show that any algorithm which does not interact with the MDP and uses an offline dataset of \(N\) expert trajectories must incur suboptimality growing as \(\Omega(S H^2 / N)\), even under the \(\mu\)-recoverability assumption.
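The assumption and the reduction guarantee above can be stated compactly; the notation below (in particular \(Q^{\pi^*}_t\) for the expert's \(Q\)-function at step \(t\), \(J(\cdot)\) for expected return, and \(d^{\hat\pi}\) for the state distribution induced by the learner's policy \(\hat\pi\)) is ours, chosen for illustration:

\[
\text{\(\mu\)-recoverability:}\quad Q^{\pi^*}_t\big(s, \pi^*(s)\big) - Q^{\pi^*}_t(s, a) \;\le\; \mu \quad \text{for all } s, a, t,
\]

\[
\text{Reduction:}\quad \mathbb{E}_{s \sim d^{\hat\pi}}\Big[\mathbf{1}\big\{\hat\pi(s) \neq \pi^*(s)\big\}\Big] \le \epsilon \;\;\Longrightarrow\;\; J(\pi^*) - J(\hat\pi) \;\le\; \mu H \epsilon.
\]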
This result establishes a clear and provable separation in the minimax suboptimality of IL between learners that can and cannot actively query the expert, on instances satisfying \(\mu\)-recoverability when the model is unknown. In this paper, we also study IL beyond the tabular setting, in the presence of function approximation. When the MDP transition structure is known to the learner, such as in the case of simulators, we demonstrate a fundamental difference from the tabular setting in the performance of Mimic-MD when extended to the function approximation setting. We introduce a new problem, called confidence set linear classification, that can be used to construct sample-efficient IL algorithms. Informally, in confidence set linear classification, the learner must output a classifier as well as a measurable set of inputs (a confidence set) on which the output classifier is guaranteed to be accurate. We instantiate this reduction in the case of \(|A| = 2\) actions, with state-action features distributed uniformly on the unit sphere. Here, we show that the minimax rate of confidence set linear classification is \(\tilde{\Theta}(d^{3/2} / N)\). As a corollary, this result establishes preliminary evidence towards breaking the quadratic-\(H\) barrier beyond the tabular setting.
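One way to formalize the confidence-set requirement, as a sketch in our own notation (the learner outputs a classifier \(\hat f\) and a measurable set \(\hat S\) from \(N\) samples; \(f^*\) denotes the ground-truth labeling), is to demand that \(\hat f\) make no errors on \(\hat S\) while excluding as little probability mass as possible:

\[
\Pr_{x}\big(x \in \hat S,\; \hat f(x) \neq f^*(x)\big) = 0, \qquad \text{minimize} \quad \Pr_{x}\big(x \notin \hat S\big).
\]

Under this reading, the \(\tilde{\Theta}(d^{3/2}/N)\) minimax rate bounds the smallest achievable excluded mass for \(d\)-dimensional features uniform on the unit sphere.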