Online Convex Programming

This blog post discusses some aspects of Online Convex Optimization, centered around Martin Zinkevich’s 2003 paper Online Convex Programming and Generalized Infinitesimal Gradient Ascent.

The Online Convex Programming framework introduced in Zinkevich’s 2003 paper models a wide range of “decision-under-uncertainty” optimization problems. As the name suggests, OCP is a natural online extension of convex optimization. This article is intended as a basic and by no means comprehensive introduction to a problem that has since evolved into a subject of its own.

The Problem

At each time $t$, an agent must pick a point $x_t$ from a given time-invariant compact convex domain $D$. Once the agent makes a choice, the environment reveals a convex function $g_t : D \to \mathbb{R}$ and the agent incurs a cost of $g_t(x_t)$. The goal of the agent is to minimize the total cost incurred. The regret associated with a sequence of choices $\mathbf{x} = (x_1, x_2, \ldots)$ made by the agent is measured against the cost of the optimal static decision as \begin{equation} R(\mathbf{x}) = \sum_t g_t (x_t) - \min_{x^* \in D} \sum_{t} g_t (x^*) \end{equation} There are some important things to discuss about this problem:

  1. Even if $D = [0, 1]$ and each $g_t$ is linear on $D$, it is apparent that no deterministic Turing machine can solve the problem with a regret bound of $o(T)$ when the regret is defined against the per-round minima as \begin{equation} R(\mathbf{x}) = \sum_t g_t (x_t) - \sum_t \min_{x \in D} g_t (x) \end{equation} This can be argued by the simple adversarial choice of \begin{equation} g_t (x) = \begin{cases} \| x \|_\infty \quad & \| x_t \|_\infty \ge 0.5 \\ 1 - \| x \|_\infty & \| x_t \|_\infty < 0.5 \end{cases} \end{equation} where $x_t$ is the choice made by the agent at time $t$ and $\| \cdot \|_\infty$ is the $\ell_\infty$-norm. At every time, the agent incurs a cost of at least $0.5$ while the minimum of $g_t$ over $D$ is always $0$, so this notion of regret grows linearly in $T$. On the other hand, it is also apparent that a non-deterministic Turing machine can solve the problem “super-optimally” by picking $x_t \in \mathrm{arg} \min_{x \in D} g_t (x)$ at every time $t$. In this case, the regret (measured against the best static decision) can even be negative.

  2. The optimal static decision is a natural notion to define the regret against in many classes of problems. An example is that of finding the value of the optimal solution to an online constrained minimization problem. Strong Lagrangian duality allows the constrained minimization primal problem to be converted into an equivalent unconstrained maximization dual problem which is concave. The catch is that the optimal dual solution cannot usually be computed causally, since the Lagrange dual function itself cannot be computed causally. However, this framework raises a crucial question - can we design a primal-dual algorithm that “learns” the optimal dual solution for certain problems where the Lagrange dual function can be thought of as being revealed incrementally, one term per round? This question is the subject of discussion of Shipra Agarwal et al., which I may talk about in a future blog post.
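Returning to the basic setup, here is a minimal sketch (in Python, not from the paper) of the protocol and of the static-regret computation. The names `online_protocol` and `static_regret` are mine, the domain is assumed to be an interval, the losses are assumed to be fixed in advance (an oblivious environment), and the best static decision is approximated over a finite grid.

```python
import numpy as np

def online_protocol(agent, losses, domain_min=0.0, domain_max=1.0):
    """Simulate the OCP protocol: the agent commits to x_t, then g_t is revealed.

    `agent` maps the history of revealed losses to the next point in
    D = [domain_min, domain_max]; `losses` is the sequence of convex
    functions g_t chosen (here, in advance) by the environment.
    """
    choices, costs = [], []
    for t, g in enumerate(losses):
        x_t = np.clip(agent(losses[:t]), domain_min, domain_max)  # decided before g_t is seen
        choices.append(x_t)
        costs.append(g(x_t))                                      # cost revealed along with g_t
    return np.array(choices), np.array(costs)

def static_regret(costs, losses, grid):
    """Regret against the best fixed decision in hindsight, approximated on a grid over D."""
    best_static = min(sum(g(x) for g in losses) for x in grid)
    return float(np.sum(costs)) - best_static
```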

The Greedy Projection algorithm

When is a sublinear (in $T$) regret bound attainable? Zinkevich shows that if the $g_t$’s are Lipschitz, one can attain a regret bound of $O(\sqrt{T})$ using a very simple greedy projection algorithm: \begin{equation*} x_{t+1} \leftarrow P_D(x_t - \eta_t \nabla g_t (x_t)), \quad \text{where } \eta_t = t^{-\frac{1}{2}} \end{equation*} where $P_D$ denotes projection onto the domain $D$. The proof relies on two observations:

  1. The convexity of $g_t$ and the fact that it satisfies the Lipschitz condition can be used to bound the 1-step regret as: \begin{equation} \label{eq:000} g_t (x_t) - g_t (x^*) \le (\nabla g_t) (x_t)^T \cdot (x_t - x^*) \end{equation}
  2. The update rule of the algorithm generates an upper-bound on this linear function comprising two components: a difference of potentials that captures the $\ell_2$ convergence of $x_t$ to $x^*$, and a second-order error term due to the update rule moving against the local gradient $\nabla g_t (x_t)$ and not the “global” gradient of the cumulative cost $\sum_{\tau \le t} g_\tau$.

Informally, the rate of decay of $\eta_t$ plays a role in capturing the correct tradeoff between remembering past gradient information and the rate of convergence. Zinkevich also presents the “lazy” greedy projection algorithm. The laziness comes from using the gradient to update the pre-projection vector $y_t$ and then projecting this onto $D$ to obtain the played point $x_t$.
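As a concrete illustration, here is a minimal sketch of greedy projection and of its lazy variant. It assumes the domain $D$ is a Euclidean ball (so the projection has a simple closed form) and that the caller supplies a hypothetical `grad_oracle(t, x)` returning $\nabla g_t(x)$; the lazy variant is shown with a constant learning rate for simplicity.

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto D = {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def greedy_projection(grad_oracle, x1, T, radius=1.0):
    """Greedy projection: x_{t+1} = P_D(x_t - eta_t * grad g_t(x_t)) with eta_t = t^{-1/2}."""
    x = project_onto_ball(np.asarray(x1, dtype=float), radius)
    iterates = [x]
    for t in range(1, T + 1):
        eta_t = t ** -0.5
        x = project_onto_ball(x - eta_t * grad_oracle(t, x), radius)
        iterates.append(x)
    return iterates

def lazy_projection(grad_oracle, y1, T, eta=0.1, radius=1.0):
    """Lazy variant: accumulate gradient steps in an unprojected vector y_t and
    project only to produce the played point x_t = P_D(y_t)."""
    y = np.asarray(y1, dtype=float)
    played = []
    for t in range(1, T + 1):
        x = project_onto_ball(y, radius)   # point actually played at round t
        played.append(x)
        y = y - eta * grad_oracle(t, x)    # gradient step on the unprojected iterate
    return played
```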

Minimizing regret against dynamic strategies

A natural question to ask is - can good strategies be designed when the regret is measured not against the optimal static strategy, but against the optimal dynamic strategy having a total “path length” of at most $L$? The path length of a dynamic strategy $(x_1^*, \ldots, x_T^*)$ is defined as: \begin{equation} \sum_{t=1}^{T-1} \| x_{t+1}^* - x_t^* \| \end{equation} It is apparent from the arguments made previously that it is not possible to get sublinear regret guarantees when $L$ is allowed to grow linearly in $T$. However, the definition of path length makes it reasonable to hope for good first-order incremental algorithms. Zinkevich shows that the same greedy projection algorithm attains a regret guarantee of $O(\sqrt{T}\,(1 + L))$ (for an appropriate choice of learning rate).
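The quantities involved are simple to compute; the following helpers (hypothetical names, assuming the comparator sequence is given as a $T \times d$ array of points) merely restate the definitions above in code.

```python
import numpy as np

def path_length(comparators):
    """Total path length sum_t ||x*_{t+1} - x*_t||_2 of a dynamic comparator sequence."""
    comparators = np.asarray(comparators, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(comparators, axis=0), axis=1)))

def dynamic_regret(losses, played, comparators):
    """Cost of the played points minus the cost of the time-varying comparator sequence."""
    return sum(g(x) for g, x in zip(losses, played)) - \
           sum(g(c) for g, c in zip(losses, comparators))
```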

Logarithmic regret?: The greedy projection algorithm achieves a regret guarantee that is sublinear in $T$, but can one hope to get $O(\log T)$? In general, this is not possible, but in the case that the functions are all $H$-strongly convex, the greedy projection algorithm with learning rate $\eta_t = \frac{1}{Ht}$ achieves the logarithmic regret bound. The proof follows the same lines as that for general greedy projection.
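Under the strong-convexity assumption, the only change to the greedy projection sketch above is the step-size schedule. The snippet below (reusing `project_onto_ball` and the hypothetical `grad_oracle` from the earlier sketch) uses the schedule $\eta_t = \frac{1}{Ht}$.

```python
import numpy as np

def strongly_convex_greedy_projection(grad_oracle, x1, T, H, radius=1.0):
    """Greedy projection with eta_t = 1/(H*t), the step size used when every g_t is
    H-strongly convex; the update rule itself is unchanged."""
    x = project_onto_ball(np.asarray(x1, dtype=float), radius)
    iterates = [x]
    for t in range(1, T + 1):
        eta_t = 1.0 / (H * t)
        x = project_onto_ball(x - eta_t * grad_oracle(t, x), radius)
        iterates.append(x)
    return iterates
```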

Regularized Approaches

A natural algorithm one can think of for OCO is one which chooses $x_{t+1}$ as the optimal choice in hindsight: \begin{equation} x_{t+1} \leftarrow \mathrm{arg} \min_{x \in D} \sum_{\tau = 1}^t g_{\tau} (x) \end{equation} Algorithms in this spirit are referred to by the self-explanatory name “Follow the Leader” (FTL). Arguments similar to those made previously show that every such update rule may face an adversarial choice of functions for which it performs extremely poorly (linear regret). An informal reason why such a phenomenon may occur is that $x_{t+1}$ may be far away from $x_t$ (a non-convergent sequence of iterates), and in what we have seen so far, approaches that converge to the static optimal solution would be expected to perform better. It will become more apparent why convergence is important through the proof sketch below.
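To see the failure mode concretely, here is a small simulation of the classic construction (my example, not from the post): linear losses $g_t(x) = c_t x$ on $D = [-1, 1]$ with coefficients $0.5, -1, +1, -1, \ldots$. FTL oscillates between the endpoints and pays roughly $1$ per round, while the best fixed decision pays almost nothing, so the regret grows linearly in $T$.

```python
import numpy as np

def ftl_linear_losses(coeffs, domain=(-1.0, 1.0)):
    """Follow the Leader for linear losses g_t(x) = c_t * x on an interval:
    x_{t+1} minimizes the cumulative loss seen so far, so it sits at an endpoint."""
    lo, hi = domain
    xs, cum = [], 0.0
    for c in coeffs:
        xs.append(lo if cum >= 0 else hi)  # argmin of cum * x over [lo, hi] (ties -> lo)
        cum += c                           # reveal g_t and update the cumulative loss
    return np.array(xs)

T = 1000
coeffs = np.array([0.5] + [(-1.0) ** t for t in range(1, T)])  # 0.5, -1, +1, -1, ...
xs = ftl_linear_losses(coeffs)
ftl_cost = float(np.sum(coeffs * xs))
best_static = min(float(np.sum(coeffs)) * x for x in (-1.0, 0.0, 1.0))  # linear loss: endpoints suffice
print(ftl_cost - best_static)   # grows linearly with T
```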

In order to circumvent non-convergence, regularization may be a means of “stabilizing” the sequence generated by the FTL approach. This motivates the class of RFTL (Regularized Follow the Leader) algorithms, which consider a strongly convex and smooth regularization function $R(x)$. RFTL updates are of the form: \begin{equation} \label{eq:002} x_{t+1} \leftarrow \mathrm{arg} \min_{x \in D} \left\{ \eta \sum_{\tau = 1}^t \nabla g_\tau(x_\tau)^T \cdot x + R(x) \right\} \end{equation}

Standard gradient descent can be thought of as a special case of this scheme: with the Euclidean regularizer $R(x) = \frac{1}{2} \| x \|_2^2$, the update (\ref{eq:002}) reduces to (lazy) projected gradient descent on the linearized losses.
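A minimal sketch of this special case, assuming $D$ is a Euclidean ball and a fixed $\eta$; `grad_oracle(t, x)` is again a hypothetical callback returning $\nabla g_t(x)$.

```python
import numpy as np

def rftl_quadratic(grad_oracle, dim, T, eta=0.1, radius=1.0):
    """RFTL with R(x) = 0.5 * ||x||_2^2 on the ball D = {x : ||x||_2 <= radius}.

    For this R the update x_{t+1} = argmin_{x in D} { eta * <g_sum, x> + R(x) },
    where g_sum accumulates the gradients of the losses at the played points,
    has the closed form x_{t+1} = P_D(-eta * g_sum), i.e. lazy projected
    gradient descent started from the origin.
    """
    def project(v):
        n = np.linalg.norm(v)
        return v if n <= radius else v * (radius / n)

    grad_sum = np.zeros(dim)
    played = []
    for t in range(1, T + 1):
        x_t = project(-eta * grad_sum)              # current RFTL iterate
        played.append(x_t)
        grad_sum = grad_sum + grad_oracle(t, x_t)   # add the new linearized loss
    return played
```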

In order to understand this algorithm, we first introduce some definitions. For a regularization function $R$, the Bregman divergence is defined as the deviation of $R(x)$ from the first order Taylor expansion of $R$ about $y$: \begin{equation} B_R (x || y) = R(x) - R(y) - \nabla R(y)^T \cdot (x-y) \end{equation} The mean value theorem ascertains that for every $x$ and $y$, there exists a $z$ on the line segment joining $x$ and $y$ such that \begin{equation} \label{eq:003} B_R (x || y) = \frac{1}{2} (x-y)^T \cdot \nabla^2 R(z) \cdot (x-y) \end{equation} Observe that equation (\ref{eq:003}) takes the form of a (squared) matrix norm, which can be seen by expanding $x - y$ in the eigenbasis of $\nabla^2 R(z)$ (by the convexity of $R$, the Hessian is positive semidefinite). The proof of low regret of the algorithm proceeds in three steps:
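As a quick sanity check (my own example, not from the post): for a quadratic regularizer the Hessian is constant, so the Bregman divergence coincides exactly with the squared matrix norm above.

```python
import numpy as np

def bregman_divergence(R, grad_R, x, y):
    """B_R(x || y) = R(x) - R(y) - <grad R(y), x - y>."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return R(x) - R(y) - float(np.dot(grad_R(y), x - y))

# Quadratic regularizer R(x) = 0.5 * x^T A x with A symmetric positive definite:
# the divergence equals 0.5 * (x - y)^T A (x - y).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
R = lambda v: 0.5 * v @ A @ v
grad_R = lambda v: A @ v
x, y = np.array([1.0, -1.0]), np.array([0.5, 2.0])
assert np.isclose(bregman_divergence(R, grad_R, x, y), 0.5 * (x - y) @ A @ (x - y))
```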

  1. Observe that one can replace the convex functions by linear functions to bound the regret (refer to equation (\ref{eq:000})). A simple inductive argument gives: \begin{equation*} \sum_{t=1}^T g_t (x^*) + \frac{1}{\eta} R(x^*) \ge \sum_{t=1}^T g_t (x_{t+1}) + \frac{1}{\eta} R(x_1) \end{equation*}
  2. This can be used to bound the regret, simply as: \begin{equation*} \sum_{t=1}^T \left( g_t (x_t) - g_t (x^*) \right) \le \sum_{t=1}^T \left( \underbrace{g_t (x_t) - g_t (x_{t+1})}_{(a)} \right) + \frac{R(x^*) - R(x_1)}{\eta} \end{equation*} Observe the point made previously - if $\{x_t\}$ is a non-convergent sequence, then the sum of the terms $(a)$ can grow linearly with $T$, and regularization seeks to counter this problem.
  3. The last step is to bound term $(a)$ using the convergence of $x_{t+1}$ to $x_t$. An application of the Cauchy-Schwarz inequality can be used to bound this term as a function of the Bregman divergence $B_R (x_t || x_{t+1})$ (observe from equation (\ref{eq:003}) that this is essentially a squared norm).
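To make the last step a little more concrete, here is a sketch in this spirit (my notation; the local norm and its dual are taken with respect to the Hessian of $R$ at the intermediate point $z$ furnished by the mean value theorem): \begin{equation*} g_t (x_t) - g_t (x_{t+1}) \le \nabla g_t (x_t)^T \cdot (x_t - x_{t+1}) \le \| \nabla g_t (x_t) \|^*_z \, \| x_t - x_{t+1} \|_z, \quad \text{where } \| v \|_z^2 = v^T \cdot \nabla^2 R(z) \cdot v \end{equation*} By equation (\ref{eq:003}), $\| x_t - x_{t+1} \|_z^2 = 2 B_R (x_t || x_{t+1})$, so controlling the Bregman divergence between consecutive iterates controls term $(a)$; choosing $\eta$ to balance this against the $\frac{R(x^*) - R(x_1)}{\eta}$ term yields the final sublinear regret bound.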


Written on August 28, 2018