## Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (2010)

Citations: | 43 - 0 self |

### BibTeX

@MISC{Duchi10adaptivesubgradient,

author = {John Duchi and Elad Hazan and Yoram Singer},

title = {Adaptive Subgradient Methods for Online Learning and Stochastic Optimization},

year = {2010}

}

### Abstract

Stochastic subgradient methods are widely used, well analyzed, and constitute effective tools for optimization and online learning. Stochastic gradient methods' popularity and appeal are largely due to their simplicity, as they largely follow predetermined procedural schemes. However, most common subgradient approaches are oblivious to the characteristics of the data being observed. We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. The adaptation, in essence, allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. In a companion paper, we validate experimentally our theoretical analysis and show that the adaptive subgradient approach outperforms state-of-the-art, but non-adaptive, subgradient algorithms.
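
As a rough illustration of the adaptive idea described in the abstract, the sketch below implements a diagonal variant in which each coordinate's step is divided by the root of its accumulated squared gradients, in the spirit of the matrix δI + diag(G_t)^{1/2} that the paper attributes to its Algorithm 1. It is a minimal sketch and not the authors' full method: it assumes an unconstrained domain and no regularizer, and the function names (`adagrad_diagonal`, `quadratic_grad`), the step size `eta`, the damping `delta`, and the toy objective are all hypothetical choices made for illustration.

```python
import math


def adagrad_diagonal(grad_fn, x0, eta=0.5, delta=1e-8, steps=200):
    """Diagonal adaptive subgradient sketch (unconstrained, no regularizer).

    grad_fn: callable returning a (sub)gradient as a list of floats.
    eta, delta, steps: illustrative defaults, not values from the paper.
    """
    x = list(x0)
    sum_sq = [0.0] * len(x)  # running sum of squared gradient coordinates
    for _ in range(steps):
        g = grad_fn(x)
        for i, g_i in enumerate(g):
            sum_sq[i] += g_i * g_i
            # Per-coordinate step size: eta / (delta + ||g_{1:t,i}||_2)
            x[i] -= eta * g_i / (delta + math.sqrt(sum_sq[i]))
    return x


def quadratic_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (100 * x[0]**2 + x[1]**2),
    # whose two coordinates have very different curvature.
    return [100.0 * x[0], 1.0 * x[1]]


x_final = adagrad_diagonal(quadratic_grad, [1.0, 1.0])
print(x_final)
```

On this toy problem the per-coordinate scaling makes the two coordinates follow essentially the same trajectory even though their curvatures differ by a factor of 100, which is the kind of geometry adaptation the abstract refers to. A single global step size would have to be tuned to the steeper coordinate.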

### Citations

4688 | Matrix Analysis
- Horn, Johnson
- 1985
Citation Context ... $\operatorname{tr}(X^p)$ is easy to compute for integer values of $p$. However, when $p$ is real we need the following lemma. The lemma tacitly uses the fact that there is a unique positive semidefinite $X^p$ when $X \succeq 0$ (Horn and Johnson, 1985, Theorem 7.2.6). Lemma 18. Let $p \in \mathbb{R}$ and $X \succ 0$. Then $\nabla_X \operatorname{tr}(X^p) = p X^{p-1}$. Proof. We do a first order expansion of $(X + A)^p$ when $X \succ 0$ and $A$ is symmetric. Let $X = U \Lambda U^\top$ be the symmetric eigen-decompo... |

3701 | Convex Optimization
- Boyd, Vandenberghe
- 2004
Citation Context ...to get $L(s, \lambda, \theta) = \sum_{i=1}^{d} \frac{\|g_{1:T,i}\|_2^2}{s_i} - \langle \lambda, s \rangle + \theta (\langle \mathbf{1}, s \rangle - c)$. Taking partial derivatives to find the infimum of $L$, we see that $-\|g_{1:T,i}\|_2^2 / s_i^2 - \lambda_i + \theta = 0$, and complementarity conditions on $\lambda_i s_i$ (Boyd and Vandenberghe, 2004) imply that $\lambda_i = 0$. Thus we have $s_i = \theta^{-1/2} \|g_{1:T,i}\|_2$, and normalizing appropriately using $\theta$ gives that $s_i = c \|g_{1:T,i}\|_2 / \sum_{j=1}^{d} \|g_{1:T,j}\|_2$. As one final note, we can plug $s_i$ in to the above to see ... |

1538 | Term-weighting approaches in automatic text retrieval
- Salton, Buckley
- 1988
Citation Context ...that infrequently occurring features are highly informative and discriminative. The informativeness of rare features has led practitioners to craft domain-specific feature weightings, such as TF-IDF (Salton and Buckley, 1988), which pre-emphasize infrequently occurring features. We use this old idea as a motivation for applying modern learning-theoretic techniques to the problem of online and stochastic learning, focusin... |

744 | Neuro-dynamic programming. Athena Scientific
- Bertsekas, Tsitsiklis
- 1996
Citation Context ...or any $x^*$, $\eta (f_t(x_t) - f_t(x^*)) + \eta (\varphi(x_{t+1}) - \varphi(x^*)) \le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{\eta^2}{2\sigma} \|f_t'(x_t)\|_{\psi_t^*}^2$. Proof. The optimality of $x_{t+1}$ for Eq. (3) implies for all $x \in X$ and $\varphi'(x_{t+1}) \in \partial \varphi(x_{t+1})$ (Bertsekas, 1999) $\langle x - x_{t+1}, \eta f'(x_t) + \nabla \psi_t(x_{t+1}) - \nabla \psi_t(x_t) + \eta \varphi'(x_{t+1}) \rangle \ge 0$. (23) In particular, this obtains for $x = x^*$. From the subgradient inequality for convex functions, we have $f_t(x^*) \ge f_t(x_t) + \langle f_t'(x_t), x^*$ ... |

640 | UCI machine learning repository
- Asuncion, Newman
- 2011
Citation Context ...tabase (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from the UCI repository (Asuncion and Newman, 2007). For uniformity across experiments, we focus on the completely online (fully stochastic) optimization setting, in which at each iteration the learning algorithm receives a single example. We measure... |

441 | Rcv1: A new benchmark collection for text categorization research
- Lewis, Yang, et al.
- 2004
Citation Context ... Experiments We performed experiments with several real world data sets with different characteristics: the ImageNet image database (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from the UCI repository (Asuncion and Newman, 2007). For uniformity across experiments, we focus on the completely onl... |

293 | ImageNet: A Large-Scale Hierarchical Image Database
- Deng, Dong, et al.
Citation Context ...riments with multiclass prediction problems in the next section. 6. Experiments We performed experiments with several real world data sets with different characteristics: the ImageNet image database (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from the UCI repository (Asuncion and Newman, 2007... |

254 | Smooth minimization of non-smooth functions
- Nesterov
- 2005
Citation Context ...convex with respect to the norm $\|\cdot\|_{\psi_t}$, the function $\psi_t^*$ has $\eta$-Lipschitz continuous gradients with respect to $\|\cdot\|_{\psi_t^*}$: $\|\nabla \psi_t^*(g_1) - \nabla \psi_t^*(g_2)\|_{\psi_t} \le \eta \|g_1 - g_2\|_{\psi_t^*}$ (30) for any $g_1, g_2$ (see, e.g., Nesterov, 2005, Theorem 1 or Hiriart-Urruty and Lemaréchal, 1996, Chapter X). Further, a simple argument with the fundamental theorem of calculus gives that if $f$ has $L$-Lipschitz gradients, $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + ($... |

150 | A new approach to variable metric algorithms
- Fletcher
- 1970
Citation Context ... no means new and can be traced back at least to the 1970s. There, we find Shor’s work on space dilation methods (1972) as well as variable metric methods, such as the BFGS family of algorithms (e.g. Fletcher, 1970). This older work usually assumes that the function to be minimized is differentiable and, to our knowledge, did not consider stochastic, online, or composite optimization. More recently, Bordes et a... |

135 | On the generalization ability of on-line learning algorithms
- Cesa-Bianchi, Conconi, et al.
- 2004
Citation Context ... of each subgradient we observe, by $g_{1:t,i}$. We also define the outer product matrix $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$. Online learning and stochastic optimization are closely related and basically interchangeable (Cesa-Bianchi et al., 2004). In order to keep our presentation simple, we confine our discussion and algorithmic descriptions to the online setting with the regret bound model. In online learning, the learner repeatedly predic... |

134 | Efficient algorithms for online decision problems - Kalai, Vempala - 2003 |

122 | Logarithmic regret algorithms for online convex optimization
- Hazan, Agarwal, et al.
- 2007
Citation Context ...), and in particular its specialized versions: regularized dual ascent (RDA) of Xiao (2009) and the follow-the-regularized-leader (FTRL) family of algorithms (see for instance Kalai and Vempala, 2003; Hazan et al., 2006; or Abernethy et al., 2008b). In the primal-dual subgradient method we make a prediction $x_t$ on round $t$ using the average gradient $\bar{g}_t = \frac{1}{t} \sum_{\tau=1}^{t} g_\tau$. The update encompasses a trade-off between a gra... |

92 | Stochastic Approximation Approach to Stochastic Programming
- Nemirovski, Juditsky, et al.
Citation Context ...i et al., 2010), which in turn include as special cases projected gradients (Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Recent work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010) considered efficient and robust methods for stochastic optimization, especially in the case when the expected objective f is smooth. It may be interesti... |

84 | Mirror descent and nonlinear projected subgradient methods for convex optimization - Beck, Teboulle - 2003 |

73 | Primal-Dual Subgradient Methods for Convex Problems
- Nesterov
Citation Context ...lly sub-linear, namely, $R_\phi(T) = o(T)$. Our analysis applies to related, yet different, methods for minimizing the regret defined in Eq. (1). The first is Nesterov’s primal-dual subgradient method (Nesterov, 2009), and in particular its specialized versions: regularized dual ascent (RDA) of Xiao (2009) and the follow-the-regularized-leader (FTRL) family of algorithms (see for instance Kalai and Vempala, 2003; ... |

70 | On Accelerated Proximal Gradient Methods for Convex-Concave Optimization
- Tseng
- 2008
Citation Context ... $x_{t+1} = \operatorname{argmin}_{x \in X} \left\{ \eta \langle \bar{g}_t, x \rangle + \eta \varphi(x) + \frac{1}{t} \psi_t(x) \right\}$, (5) where $\eta$ is a step-size. The second method also has many names, such as proximal gradient, forward-backward splitting, and composite mirror descent (Tseng, 2008; Duchi and Singer, 2009; Duchi et al., 2010b). We use the term composite mirror descent. The composite mirror descent method employs a more immediate trade-off between the current gradient $g_t$, $\varphi$, and... |

63 | Introductory Lectures on Convex Optimization
- Nesterov
Citation Context ...ically, $\forall\, g_1, g_2 : \|\nabla V_t(g_1) - \nabla V_t(g_2)\|_{\psi_t} \le \eta \|g_1 - g_2\|_{\psi_t^*}$. The main consequence of the above reasoning with which we are concerned is the following consequence of the fundamental theorem of calculus (Nesterov, 2004, Theorem 2.1.5): $V_t(g + h) \le V_t(g) + \langle h, \nabla V_t(g) \rangle + \frac{\eta}{2} \|h\|_{\psi_t^*}^2$. For the remainder of this proof, we set $\bar{g}_t = \frac{1}{t} \sum_{\tau=1}^{t} g_\tau$. ... lemma characterizing $V_t$ as a function of $\bar{g}_t$. Lemma 16. For any $t \ge 1$, we have $V_t(-t \bar{g}_t)$... |

62 | Adaptive and self-confident on-line learning algorithms - Auer, Cesa-Bianchi, et al. |

62 | Online passive aggressive algorithms - Crammer, Dekel, et al. - 2003 |

59 | Dual averaging methods for regularized stochastic learning and online optimization. JMLR
- Xiao
- 2010
Citation Context ... and (3) either remained intact or was simply multiplied by a time-dependent scalar throughout the run of the algorithm. Zinkevich’s projected gradient, for example, uses $\psi_t(x) = \|x\|_2^2$, while RDA (Xiao, 2009) employs $\psi_t(x) = \sqrt{t}\, \psi(x)$ where $\psi$ is a strongly convex function. The bounds for both types of algorithms are similar, and both rely on the norm $\|\cdot\|$ (and its associated dual $\|\cdot\|_*$) with respect to whic... |

58 | Concavity of certain maps on positive definite matrices and applications to Hadamard products
- Ando
Citation Context ...efinite. We also use the previous lemma which implies that the gradient of $\operatorname{tr}(A^{1/2})$ is $\frac{1}{2} A^{-1/2}$ when $A \succ 0$. First, $A^p$ is matrix-concave for $A \succ 0$ and $0 \le p \le 1$ (see, for example, Corollary 4.1 in Ando, 1979 or Theorem 16.1 in Bondar, 1994). That is, for $A, B \succ 0$ and $\alpha \in [0, 1]$ we have $(\alpha A + (1 - \alpha) B)^p \succeq \alpha A^p + (1 - \alpha) B^p$. (25) Now suppose simply $A, B \succeq 0$ (but neither is necessarily strict). Then for a... |

57 | A second-order perceptron algorithm - Cesa-Bianchi, Conconi, et al. - 2005 |

55 | Efficient online and batch learning using forward backward splitting - Duchi, Singer - 2009 |

50 | Algebra and its applications - Linear |

49 | Competing in the dark: An efficient algorithm for bandit linear optimization - Abernethy, Hazan, et al. - 2008 |

48 | Convex analysis and minimization algorithms I and II
- Hiriart-Urruty, Lemaréchal
- 1993
Citation Context ...t, the function $\psi_t^*$ has $\eta$-Lipschitz continuous gradients with respect to $\|\cdot\|_{\psi_t^*}$: $\|\nabla \psi_t^*(g_1) - \nabla \psi_t^*(g_2)\|_{\psi_t} \le \eta \|g_1 - g_2\|_{\psi_t^*}$ (30) for any $g_1, g_2$ (see, e.g., Nesterov, 2005, Theorem 1 or Hiriart-Urruty and Lemaréchal, 1996, Chapter X). Further, a simple argument with the fundamental theorem of calculus gives that if $f$ has $L$-Lipschitz gradients, $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + (L/2) \|y - x\|^2$, and $\nabla \psi_t^*(g) = \operatorname{argmin}$ ... |

45 | A discriminative kernel-based model to rank images from text queries - Grangier, Bengio - 2008 |

44 | Improved second-order bounds for prediction with expert advice - Cesa-Bianchi, Mansour, et al. - 2007 |

33 | SGD-QN: Careful quasi-Newton stochastic gradient descent - Bordes, Bottou, et al. - 2009 |

32 | Adaptive regularization of weight vectors
- Crammer, Kulesza, et al.
- 2009
Citation Context ...$\alpha_t = [1 - y_t \langle z_t, \mu_t \rangle]_+$, $\mu_{t+1} = \mu_t + \alpha_t \Sigma_t y_t z_t$, $\Sigma_{t+1} = \Sigma_t - \beta_t \Sigma_t x_t x_t^\top \Sigma_t$. (8) In the above, one can set $\Sigma_t$ to be diagonal, which reduces run-time and storage requirements but still gives good performance (Crammer et al., 2009). In contrast to AROW, the ADAGRAD family uses the root of a covariance-like matrix, a consequence of our formal analysis. Crammer et al.’s algorithm and our algorithms have similar run times, linear ... |

32 | An O(n) Algorithm For Quadratic Knapsack Problems
- Brucker
- 1984
Citation Context ...to $\{z : \langle a, z \rangle \le c,\ z \succeq 0\}$. We next consider the setting in which $\varphi \equiv 0$ and $X = \{x : \|x\|_1 \le c\}$, for which it is straightforward to adapt efficient solutions to continuous quadratic knapsack problems (Brucker, 1984). We use the matrix $H_t = \delta I + \operatorname{diag}(G_t)^{1/2}$ from Algorithm 1. We provide a brief derivation sketch and an $O(d \log d)$ algorithm in this section. First, we convert the problem (18) into a projection pro... |

30 | Adaptive online gradient descent
- Bartlett, Hazan, et al.
- 2007
Citation Context ...s associated dual $\|\cdot\|_*$) with respect to which $\psi$ is strongly convex. Mirror-descent type first order algorithms, such as projected gradient methods, attain regret bounds of the form (Zinkevich, 2003; Bartlett et al., 2007; Duchi et al., 2010) $R_\phi(T) \le \frac{1}{\eta} B_\psi(x^*, x_1) + \frac{\eta}{2} \sum_{t=1}^{T} \|f_t'(x_t)\|_*^2$. (7) Choosing $\eta \propto 1/\sqrt{T}$ gives $R_\phi(T) = O(\sqrt{T})$. When $B_\psi(x, x^*)$ is bounded for all $x \in X$, we choose step sizes $\eta_t \propto 1/\sqrt{t}$ wh... |

29 | An optimal method for stochastic composite optimization
- Lan
- 2009
Citation Context ...l cases projected gradients (Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Recent work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010) considered efficient and robust methods for stochastic optimization, especially in the case when the expected objective f is smooth. It may be interesting to investigate adaptive metric ... |

27 | Composite objective mirror descent
- Duchi, Shalev-Shwartz, et al.
- 2010
Citation Context ...re $\eta$ is a fixed step-size and $x_1 = \operatorname{argmin}_{x \in X} \varphi(x)$. The second method similarly has numerous names, including proximal gradient, forward-backward splitting, and composite mirror descent (Tseng, 2008; Duchi et al., 2010). We use the term composite mirror descent. The composite mirror descent method employs a more immediate trade-off between the current gradient $g_t$, $\varphi$, and staying close to $x_t$ using the proximal funct... |

24 | Exact convex confidence-weighted learning
- Crammer, Dredze, et al.
- 2008
Citation Context ...dependent bounds for algorithms such as those above is a natural one and has been studied extensively in the past. The framework that is most related to ours is probably confidence weighted learning (Crammer et al., 2008) and the adaptive regularization of weights algorithm (AROW) of Crammer et al. (2009). These papers give a mistake-bound analysis for second-order algorithms for the Perceptron, which are similar in ... |

24 | An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds - Pardalos, Kovoor - 1990 |

21 | Optimal strategies and minimax lower bounds for online convex games
- Abernethy, Bartlett, et al.
- 2008
Citation Context ...ch we take the infimum. To conclude the outline of results, we would like to point to two relevant research papers. First, Zinkevich’s regret bound is tight and cannot be improved in a minimax sense (Abernethy et al., 2008). Therefore, improving the regret bound requires further reasonable assumptions on the input space. Second, in an independent work, performed concurrently to the research presented in this paper, McMa... |

20 | Utilization of the operation of space dilatation in the minimization of convex functions, Kibernetika 1 - Shor - 1970 |

16 | Extracting certainty from uncertainty: regret bounded by variation in costs
- Hazan, Kale
Citation Context ...ery iteration. Plain usage of the FTRL approach fails to achieve low regret; however, adding a proximal term to the past predictions leads to numerous low regret algorithms (Kalai and Vempala, 2003; Hazan and Kale, 2008; Rakhlin, 2009). The proximal term strongly affects the performance of the learning algorithm. Therefore, adapting the proximal function to the characteristics of the problem at hand is desirable. Ou... |

16 | Logarithmic regret algorithms for strongly convex repeated games
- Shalev-Shwartz, Singer
- 2007
Citation Context ...gly Convex Functions It is now well established that strong convexity of the functions $f_t$ can give significant improvements in the regret of online convex optimization algorithms (Hazan et al., 2006; Shalev-Shwartz and Singer, 2007). We can likewise derive lower regret bounds in the presence of strong convexity. We assume that our functions $f_t + \varphi$ are strongly convex with respect to a norm $\|\cdot\|$. For simplicity, we assume that ea... |

15 | UCI machine learning repository, 2007, URL: http://www.ics.uci.edu/~mlearn/MLRepository.html - Asuncion, Newman |

12 | Online and Batch Learning using Forward-Backward Splitting - Duchi, Singer - 2009 |

12 | Solving variational inequalities with stochastic mirror prox algorithm
- Juditsky, Nemirovski, et al.
- 2008
Citation Context ... turn include as special cases projected gradients (Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Recent work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010) considered efficient and robust methods for stochastic optimization, especially in the case when the expected objective f is smooth. It may be interesting to investigate adapt... |

11 | Algebra and its - Linear |

11 | Adaptive bound optimization for online convex optimization - McMahan, Streeter - 2010 |

10 | Joint covariate selection for grouped classification
- Obozinski, Taskar, et al.
- 2007
Citation Context ...egularization We now turn to the case where $\varphi(x) = \lambda \|x\|_2$ while $X = \mathbb{R}^d$. This type of regularization is useful for zeroing multiple weights in a group, for example in multi-task or multiclass learning (Obozinski et al., 2007). Recalling the general proximal step (18), we must solve $\min_x \langle u, x \rangle + \frac{1}{2} \langle x, Hx \rangle + \lambda \|x\|_2$. (21) There is no closed form solution for this problem, but we give an efficient bisection-based procedure f... |

9 | Notions generalizing convexity for functions defined on spaces of matrices
- Davis
- 1963
Citation Context ...e inequality to get $\delta_t \le \frac{1}{\eta} \psi_t(x^*) + \frac{\eta}{2} \sum_{\tau=1}^{t} \|g_\tau\|_{\psi_{\tau-1}^*}^2$. Combining the above equation with the lower bound on $\delta_t$ from Eq. (28) finishes the proof. C. Technical Lemmas. Lemma 17 (Example 3, Davis, 1963). Let $A \succeq B \succeq 0$ be symmetric $d \times d$ PSD matrices. Then $A^{1/2} \succeq B^{1/2}$. The gradient of the function $\operatorname{tr}(X^p)$ is easy to compute for integer values of $p$. However, when $p$ is real we need the following l... |

7 | Subgradient Methods for Convex Minimization - Nedić - 2002 |

6 | Comments on and complements to Inequalities: Theory of Majorization and Its Applications. Linear Algebra and its Applications
- Bondar
- 1994
Citation Context ...ous lemma which implies that the gradient of $\operatorname{tr}(A^{1/2})$ is $\frac{1}{2} A^{-1/2}$ when $A \succ 0$. First, $A^p$ is matrix-concave for $A \succ 0$ and $0 \le p \le 1$ (see, for example, Corollary 4.1 in Ando, 1979 or Theorem 16.1 in Bondar, 1994). That is, for $A, B \succ 0$ and $\alpha \in [0, 1]$ we have $(\alpha A + (1 - \alpha) B)^p \succeq \alpha A^p + (1 - \alpha) B^p$. (25) Now suppose simply $A, B \succeq 0$ (but neither is necessarily strict). Then for any $\delta > 0$, we have $A + \delta I \succ 0$ and... |

3 | Translated from - Cybernetics, Analysis - 1972 |