## Adaptive bound optimization for online convex optimization (extended version) (2010)

Citations: 11 (3 self)

### BibTeX

@MISC{Mcmahan10adaptivebound,

author = {H. Brendan McMahan and Matthew Streeter},

title = {Adaptive bound optimization for online convex optimization (extended version)},

year = {2010}

}

### Abstract

We introduce a new online convex optimization algorithm that adaptively chooses its regularization function based on the loss functions observed so far. This is in contrast to previous algorithms that use a fixed regularization function such as L2-squared, and modify it only via a single time-dependent parameter. Our algorithm’s regret bounds are worst-case optimal, and for certain realistic classes of loss functions they are much better than existing bounds. These bounds are problem-dependent, which means they can exploit the structure of the actual problem instance. Critically, however, our algorithm does not need to know this structure in advance. Rather, we prove competitive guarantees that show the algorithm provides a bound within a constant factor of the best possible bound (of a certain functional form) in hindsight.
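The idea of adapting to the observed losses can be illustrated with a much simpler stand-in than the paper's algorithm: projected online gradient descent on linear losses over a hypercube, where each coordinate's step size shrinks with that coordinate's accumulated squared gradients. This is a minimal sketch under those assumptions, not the authors' method. The name `per_coordinate_ogd`, the `radius` parameter, and the step-size rule are all hypothetical choices made for illustration.

```python
import math


def per_coordinate_ogd(grads, radius=1.0):
    """Projected online gradient descent with per-coordinate step sizes.

    Assumptions (illustrative, not taken from the paper):
      * round t has the linear loss f_t(x) = g_t . x, with g_t = grads[t]
      * the feasible set is the hypercube [-radius, radius]^dim
      * coordinate i uses step size radius / sqrt(sum of its squared gradients)
    Returns the final point and the total loss incurred.
    """
    dim = len(grads[0])
    x = [0.0] * dim
    sum_sq = [0.0] * dim
    total_loss = 0.0
    for g in grads:
        # Suffer the loss at the point chosen before seeing g.
        total_loss += sum(gi * xi for gi, xi in zip(g, x))
        for i, gi in enumerate(g):
            if gi == 0.0:
                continue  # coordinates with no signal are left untouched
            sum_sq[i] += gi * gi
            eta = radius / math.sqrt(sum_sq[i])
            # Gradient step followed by projection onto [-radius, radius].
            x[i] = max(-radius, min(radius, x[i] - eta * gi))
    return x, total_loss


if __name__ == "__main__":
    final_x, loss = per_coordinate_ogd([[1.0, 0.0]] * 4)
    print(final_x, loss)
```

In this toy run only the first coordinate ever receives a nonzero gradient, so only that coordinate moves, while the second keeps its initial value.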

### Citations

3667 | Convex Optimization - Boyd, Vandenberghe - 2004 |
Citation Context ...d $u_2 = -Q^{-1}(v + g)$. Then, letting $x_1 = P_{F,A}(u_1)$ and $x_2 = P_{F,A}(u_2)$, $g^\top (x_1 - x_2) \le \|A^{-1} g\|^2$. Proof: The fact that $Q = A^\top A \succ 0$ implies that $\|A \cdot\|$ and $\|A^{-1} \cdot\|$ are dual norms (see for example (Boyd & Vandenberghe, 2004, Sec. 9.4.1, pg. 476)). Using this fact, $g^\top (x_1 - x_2) \le \|A^{-1} g\| \cdot \|A(x_1 - x_2)\| \le \|A^{-1} g\| \cdot \|A(u_1 - u_2)\|$ (Lemma 5) $= \|A^{-1} g\| \cdot \|A(Q^{-1} g)\| = \|A^{-1} g\| \cdot \|A(A^{-1} A^{-1}) g\|$ (because $Q^{-1} = (AA)^{-1}$) ...
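The dual-norm step in the context above, $g^\top y \le \|A^{-1} g\| \cdot \|A y\|$, can be checked numerically. The sketch below assumes a diagonal positive definite $A$ purely to keep the arithmetic dependency-free. The helper name `dual_norm_gap` is hypothetical and does not come from the paper.

```python
import math


def dual_norm_gap(a_diag, g, y):
    """Return ||A^{-1} g|| * ||A y|| - g.y for A = diag(a_diag), all a_i > 0.

    A diagonal A is an assumption made here for simplicity. The inequality in
    the citation context says this gap is non-negative.
    """
    inner = sum(gi * yi for gi, yi in zip(g, y))
    norm_ainv_g = math.sqrt(sum((gi / ai) ** 2 for gi, ai in zip(g, a_diag)))
    norm_a_y = math.sqrt(sum((ai * yi) ** 2 for ai, yi in zip(a_diag, y)))
    return norm_ainv_g * norm_a_y - inner


if __name__ == "__main__":
    a = [2.0, 0.5]
    g = [1.0, 1.0]
    # A generic direction leaves a strictly positive gap.
    print(dual_norm_gap(a, g, [1.0, -1.0]))
    # y = A^{-2} g makes the bound tight, up to floating-point rounding.
    print(dual_norm_gap(a, g, [gi / ai ** 2 for gi, ai in zip(g, a)]))
```

The second call illustrates why the chain of inequalities can end in $\|A^{-1} g\|^2$: when $y = Q^{-1} g$ with $Q = AA$, both norms coincide.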

280 | Pegasos: Primal estimated sub-gradient solver for SVM - Shalev-Shwartz, Singer, et al. - 2011 |
Citation Context ...nvex optimization applied to learning problems. Many of these algorithms can be thought of as (significant) extensions of online subgradient descent, including (Duchi & Singer, 2009; Do et al., 2009; Shalev-Shwartz et al., 2007). Apart from the very general work of (Kalai & Vempala, 2005), few general follow-the-regularized-leader algorithms have been analyzed, with the notable exception of the recent work of Xiao (2009). i=...

183 | Online convex programming and generalized infinitesimal gradient ascent - Zinkevich - 2003 |

133 | The tradeoffs of large scale learning - Bottou, Bousquet - 2007 |
Citation Context ...tion 3. 1.2 The practical importance of adaptive regularization. In the past few years, online algorithms have emerged as state-of-the-art techniques for solving large-scale machine learning problems (Bottou & Bousquet, 2008; Zhang, 2004). Two canonical examples of such large-scale learning problems are text classification on large datasets and predicting click-through rates for ads on a search engine. For such problems, ...

131 | Efficient algorithms for online decision problems - Kalai, Vempala - 2005 |
Citation Context ...e point. To prove Theorem 2 we will make use of the following bound on the regret of FTRL, which holds for arbitrary (possibly non-convex) loss functions. This lemma can be proved along the lines of (Kalai & Vempala, 2005); for a complete proof see (McMahan & Streeter, 2010, Appendix A). Lemma 3 Let $r_1, r_2, \ldots, r_T$ be a sequence of non-negative functions. The regret of FTPRL (which plays $x_t$ as defined by Equation (...

64 | Prediction, Learning and Games - Cesa-Bianchi, Lugosi - 2006 |
Citation Context ...egularized leader (FTPRL). This proximal centering of additional regularization is similar in spirit to the optimization solved by online gradient descent (and more generally, online mirror descent, (Cesa-Bianchi & Lugosi, 2006)). However, rather than considering only the current gradient, our algorithm considers the sum of all previous gradients, and so solves a global rather than local optimization on each round. We discu...

62 | Adaptive and self-confident on-line learning algorithms - Auer, Cesa-Bianchi, et al. |
Citation Context ...s of generality, as we can always hallucinate an initial loss function with arbitrarily small components, and this changes regret by an arbitrarily small amount. We will also use the following Lemma (Auer & Gentile, 2000): Lemma 7 For any non-negative real numbers $x_1, x_2, \ldots, x_n$, $\sum_{i=1}^{n} \frac{x_i}{\sqrt{\sum_{j=1}^{i} x_j}} \le 2 \sqrt{\sum_{i=1}^{n} x_i}$. 3.1 Adaptive coordinate-constant regularization. We derive bounds where $Q_t$ is chosen from the set $Q_{\mathrm{const}}$, ...
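The inequality in Lemma 7 bounds a sum of terms, each divided by the square root of its running prefix sum, by twice the square root of the total. It is easy to sanity-check numerically. The following is a minimal sketch, not a proof. The helper names `lemma7_sides` and `check_random` are hypothetical, and terms whose prefix sum is still zero are treated as contributing zero, which is an assumption about how to read the 0/0 case.

```python
import math
import random


def lemma7_sides(xs):
    """Return (lhs, rhs) of the Lemma 7 inequality for non-negative xs.

    lhs = sum_i x_i / sqrt(x_1 + ... + x_i),  rhs = 2 * sqrt(x_1 + ... + x_n).
    Terms with a zero prefix sum are skipped (assumed to contribute 0).
    """
    lhs = 0.0
    prefix = 0.0
    for x in xs:
        prefix += x
        if prefix > 0.0:
            lhs += x / math.sqrt(prefix)
    rhs = 2.0 * math.sqrt(prefix)
    return lhs, rhs


def check_random(trials, seed):
    """Check lhs <= rhs on random non-negative sequences (small float slack)."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.random() for _ in range(rng.randint(1, 50))]
        lhs, rhs = lemma7_sides(xs)
        if lhs > rhs + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(lemma7_sides([1.0, 1.0, 1.0, 1.0]))
    print(check_random(200, seed=0))
```

With accumulated squared gradients in the role of the $x_i$, a bound of this shape is what lets step sizes proportional to $1/\sqrt{\text{prefix sum}}$ yield regret on the order of the square root of the total.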

59 | Dual averaging methods for regularized stochastic learning and online optimization - Xiao |

56 | Solving large scale linear prediction problems using stochastic gradient descent algorithms - Zhang - 2004 |
Citation Context ... importance of adaptive regularization. In the past few years, online algorithms have emerged as state-of-the-art techniques for solving large-scale machine learning problems (Bottou & Bousquet, 2008; Zhang, 2004). Two canonical examples of such large-scale learning problems are text classification on large datasets and predicting click-through rates for ads on a search engine. For such problems, extremely lar...

43 | Adaptive subgradient methods for online learning and stochastic optimization - Duchi, Hazan, et al. |

32 | Adaptive regularization of weight vectors - Crammer, Kulesza, et al. - 2009 |
Citation Context ...(the online analogue of convergence rates). Perhaps the closest algorithms in spirit to our diagonal adaptation algorithm are confidence-weighted linear classification (Dredze et al., 2008) and AROW (Crammer et al., 2009), in that they make different-sized adjustments for different coordinates. Unlike our algorithm, these algorithms apply only to classification problems and not to general online convex optimization, ...

21 | Optimal strategies and minimax lower bounds for online convex games - Abernethy, Bartlett, et al. - 2008 |
Citation Context ...al sense, as on easy problem instances such an algorithm is still allowed to incur the worst-case regret. In particular, although this bound is minimax optimal when the feasible set is a hypersphere (Abernethy et al., 2008), we will see that much better algorithms exist when the feasible set is the hypercube. To improve over the existing worst-case guarantees, we introduce additional parameters that capture more of the...

16 | Efficient learning using forward-backward splitting. NIPS - Duchi, Singer - 2009 |
Citation Context ...n fact general algorithms for online convex optimization applied to learning problems. Many of these algorithms can be thought of as (significant) extensions of online subgradient descent, including (Duchi & Singer, 2009; Do et al., 2009; Shalev-Shwartz et al., 2007). Apart from the very general work of (Kalai & Vempala, 2005), few general follow-the-regularized-leader algorithms have been analyzed, with the notable e...

6 | Theoretical guarantees for algorithms in multi-agent settings - Zinkevich - 2004 |
Citation Context ... and not to general online convex optimization, and the guarantees are in the form of mistake bounds rather than regret bounds. FTPRL is similar to the lazily-projected gradient descent algorithm of (Zinkevich, 2004, Sec. 5.2.3), but with a critical difference: the latter effectively centers regularization outside of the current feasible region (at $u_t$ rather than $x_t$). As a consequence, lazily-projected gradient ...

4 | Less regret via online conditioning - Streeter, McMahan |
Citation Context ...α 2 ), which is sublinear for α > 1. This performance difference is not merely a weakness in the regret bounds for ordinary gradient descent, but is a difference in actual regret. In concurrent work (Streeter & McMahan, 2010), we showed that for some problem families, a per-coordinate learning rate for online gradient descent provides asymptotically less regret than even the best non-increasing global learning rate (chos...

1 | Adaptive online gradient descent. NIPS - Bartlett, Hazan, et al. - 2008 |

1 | Proximal regularization for online and batch learning. ICML - Do, Le, et al. - 2009 |
Citation Context ...hms for online convex optimization applied to learning problems. Many of these algorithms can be thought of as (significant) extensions of online subgradient descent, including (Duchi & Singer, 2009; Do et al., 2009; Shalev-Shwartz et al., 2007). Apart from the very general work of (Kalai & Vempala, 2005), few general follow-the-regularized-leader algorithms have been analyzed, with the notable exception of the r...