## Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (2010)

Citations: 285 (3 self)

### Citations

7509 | Matrix Analysis
- Horn, Johnson
- 1985
Citation Context ...tr(X^p) is easy to compute for integer values of p. However, when p is real we need the following lemma. The lemma tacitly uses the fact that there is a unique positive semidefinite X^p when X ≽ 0 (Horn and Johnson, 1985, Theorem 7.2.6). Lemma 18. Let p ∈ R and X ≻ 0. Then ∇_X tr(X^p) = p X^{p−1}. Proof. We do a first-order expansion of (X + A)^p when X ≻ 0 and A is symmetric. Let X = UΛU^⊤ be the symmetric eigen-decompo...
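Lemma 18's identity ∇_X tr(X^p) = p X^{p−1} is easy to sanity-check numerically; a minimal sketch, using an eigendecomposition-based matrix power (an implementation choice of ours, not from the paper):

```python
import numpy as np

def mat_pow(M, p):
    """Matrix power of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w**p) @ V.T

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
X = B @ B.T + 4 * np.eye(4)        # X ≻ 0
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2                  # symmetric perturbation direction

p, eps = 1.7, 1e-6
# directional derivative of tr(X^p) along A, by central differences
fd = (np.trace(mat_pow(X + eps * A, p)) - np.trace(mat_pow(X - eps * A, p))) / (2 * eps)
# Lemma 18 predicts the same value as p * tr(X^{p-1} A)
pred = p * np.trace(mat_pow(X, p - 1) @ A)
print(abs(fd - pred))              # tiny
```

The finite-difference value agrees with p·tr(X^{p−1}A) to several decimal places, consistent with the lemma.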

7188 | Convex Optimization
- Boyd, Vandenberghe
- 2004
Citation Context ...to get L(s, λ, θ) = Σ_{i=1}^d ‖g_{1:T,i}‖_2^2 / s_i − ⟨λ, s⟩ + θ(⟨1, s⟩ − c). Taking partial derivatives to find the infimum of L, we see that −‖g_{1:T,i}‖_2^2 / s_i^2 − λ_i + θ = 0, and complementarity conditions on λ_i s_i (Boyd and Vandenberghe, 2004) imply that λ_i = 0. Thus we have s_i = θ^{−1/2} ‖g_{1:T,i}‖_2, and normalizing appropriately using θ gives that s_i = c ‖g_{1:T,i}‖_2 / Σ_{j=1}^d ‖g_{1:T,j}‖_2. As one final note, we can plug s_i in to the above to see ...
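The closed form s_i = c‖g_{1:T,i}‖_2 / Σ_j ‖g_{1:T,j}‖_2 derived above can be checked against the constrained objective; a small sketch with synthetic norm values (the data here is illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, c = 5, 3.0
norms = rng.uniform(0.5, 2.0, size=d)     # stand-ins for the ||g_{1:T,i}||_2 values

def objective(s):
    # sum_i ||g_{1:T,i}||_2^2 / s_i, minimized over s > 0 subject to <1, s> = c
    return np.sum(norms**2 / s)

s_star = c * norms / norms.sum()          # closed form from the derivation above

# the closed form should beat any other feasible point, e.g. the uniform one
print(objective(s_star) <= objective(c * np.ones(d) / d))   # True
```

By Cauchy-Schwarz the minimum value is (Σ_i ‖g_{1:T,i}‖_2)^2 / c, which s_star attains.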

2143 | Term-weighting approaches in automatic text retrieval. Information Processing & Management
- Salton, Buckley
- 1988
Citation Context ...that infrequently occurring features are highly informative and discriminative. The informativeness of rare features has led practitioners to craft domain-specific feature weightings, such as TF-IDF (Salton and Buckley, 1988), which pre-emphasize infrequently occurring features. We use this old idea as a motivation for applying modern learning-theoretic techniques to the problem of online and stochastic learning, focusin...

1122 | Nonlinear Programming. Athena Scientific
- Bertsekas
- 1999
Citation Context ...for any x*, η(f_t(x_t) − f_t(x*)) + η(ϕ(x_{t+1}) − ϕ(x*)) ≤ B_{ψ_t}(x*, x_t) − B_{ψ_t}(x*, x_{t+1}) + (η^2 / 2σ)‖f′_t(x_t)‖^2_{ψ*_t}. Proof. The optimality of x_{t+1} for Eq. (3) implies for all x ∈ X and ϕ′(x_{t+1}) ∈ ∂ϕ(x_{t+1}) (Bertsekas, 1999) ⟨x − x_{t+1}, η f′_t(x_t) + ∇ψ_t(x_{t+1}) − ∇ψ_t(x_t) + η ϕ′(x_{t+1})⟩ ≥ 0. (23) In particular, this obtains for x = x*. From the subgradient inequality for convex functions, we have f_t(x*) ≥ f_t(x_t) + ⟨f′_t(x_t), x* ...

954 | UCI machine learning repository
- Asuncion, Newman
- 2007
Citation Context ...tabase (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from the UCI repository (Asuncion and Newman, 2007). For uniformity across experiments, we focus on the completely online (fully stochastic) optimization setting, in which at each iteration the learning algorithm receives a single example. We measure...

796 | ImageNet: a large-scale hierarchical image database
- Deng, Dong, et al.
Citation Context ...riments with multiclass prediction problems in the next section. 6. Experiments We performed experiments with several real world data sets with different characteristics: the ImageNet image database (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from the UCI repository (Asuncion and Newman, 2007...

642 | RCV1: A new benchmark collection for text categorization research
- Lewis, Yang, et al.
- 2004
Citation Context ...Experiments We performed experiments with several real world data sets with different characteristics: the ImageNet image database (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from the UCI repository (Asuncion and Newman, 2007). For uniformity across experiments, we focus on the completely onl...

498 | Smooth minimization of non-smooth functions
- Nesterov
Citation Context ...convex with respect to the norm ‖·‖_{ψ_t}, the function ψ*_t has η-Lipschitz continuous gradients with respect to ‖·‖_{ψ*_t}: ‖∇ψ*_t(g_1) − ∇ψ*_t(g_2)‖_{ψ_t} ≤ η‖g_1 − g_2‖_{ψ*_t} (30) for any g_1, g_2 (see, e.g., Nesterov, 2005, Theorem 1 or Hiriart-Urruty and Lemaréchal, 1996, Chapter X). Further, a simple argument with the fundamental theorem of calculus gives that if f has L-Lipschitz gradients, f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (...

260 | A new approach to variable metric algorithms
- Fletcher
- 1970
Citation Context ...no means new and can be traced back at least to the 1970s. There, we find Shor's work on space dilation methods (1972) as well as variable metric methods, such as the BFGS family of algorithms (e.g. Fletcher, 1970). This older work usually assumes that the function to be minimized is differentiable and, to our knowledge, did not consider stochastic, online, or composite optimization. More recently, Bordes et a...

255 | Robust stochastic approximation approach to stochastic programming
- Nemirovski, Juditsky, et al.
- 2009
Citation Context ...i et al., 2010), which in turn include as special cases projected gradients (Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Recent work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010) considered efficient and robust methods for stochastic optimization, especially in the case when the expected objective f is smooth. It may be interesti...

202 | Logarithmic regret algorithms for online convex optimization
- Hazan, Kalai, et al.
- 2006
Citation Context ...), and in particular its specialized versions: regularized dual averaging (RDA) of Xiao (2009) and the follow-the-regularized-leader (FTRL) family of algorithms (see for instance Kalai and Vempala, 2003; Hazan et al., 2006; or Abernethy et al., 2008b). In the primal-dual subgradient method we make a prediction x_t on round t using the average gradient ḡ_t = (1/t) Σ_{τ=1}^t g_τ. The update encompasses a trade-off between a gra...

182 | Efficient algorithms for online decision problems - Kalai, Vempala

176 | On the generalization ability of on-line learning algorithms
- Cesa-Bianchi, Conconi, et al.
- 2004
Citation Context ...of each subgradient we observe, by g_{1:t,i}. We also define the outer product matrix G_t = Σ_{τ=1}^t g_τ g_τ^⊤. Online learning and stochastic optimization are closely related and basically interchangeable (Cesa-Bianchi et al., 2004). In order to keep our presentation simple, we confine our discussion and algorithmic descriptions to the online setting with the regret bound model. In online learning, the learner repeatedly predic...
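The per-coordinate norms ‖g_{1:t,i}‖_2 defined in this excerpt (the square roots of the diagonal of G_t) are exactly what the diagonal variant of the paper's algorithm divides by. A minimal sketch, assuming a pre-collected gradient stream and no projection or regularizer:

```python
import numpy as np

def adagrad_diagonal(grads, x0, eta=0.1, delta=1e-8):
    """Diagonal update sketch: x_{t+1,i} = x_{t,i} - eta * g_{t,i} / ||g_{1:t,i}||_2."""
    x = np.asarray(x0, dtype=float).copy()
    sq_sum = np.zeros_like(x)              # running diag(G_t): sum of squared gradients
    for g in grads:
        g = np.asarray(g, dtype=float)
        sq_sum += g * g
        x -= eta * g / (delta + np.sqrt(sq_sum))
    return x

# with a constant gradient, the effective per-coordinate step decays as 1/sqrt(t),
# independent of the gradient's magnitude
x = adagrad_diagonal([np.array([1.0, -2.0])] * 50, np.zeros(2))
print(x)   # first coordinate pushed negative, second positive
```

Note how the coordinate with the larger gradient magnitude does not move farther: the division by ‖g_{1:t,i}‖_2 equalizes the effective step sizes.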

173 | On accelerated proximal gradient methods for convex-concave optimization
- Tseng
- 2008
Citation Context ...x_{t+1} = argmin_{x ∈ X} { η⟨ḡ_t, x⟩ + η ϕ(x) + (1/t) ψ_t(x) }, (5) where η is a step-size. The second method also has many names, such as proximal gradient, forward-backward splitting, and composite mirror descent (Tseng, 2008; Duchi and Singer, 2009; Duchi et al., 2010b). We use the term composite mirror descent. The composite mirror descent method employs a more immediate trade-off between the current gradient g_t, ϕ, and...

158 | Mirror descent and nonlinear projected subgradient methods for convex optimization - Beck, Teboulle

140 | Primal-dual subgradient methods for convex problems
- Nesterov
Citation Context ...lly sub-linear, namely, R_φ(T) = o(T). Our analysis applies to related, yet different, methods for minimizing the regret defined in Eq. (1). The first is Nesterov's primal-dual subgradient method (Nesterov, 2009), and in particular its specialized versions: regularized dual averaging (RDA) of Xiao (2009) and the follow-the-regularized-leader (FTRL) family of algorithms (see for instance Kalai and Vempala, 2003; ...

129 | Introductory Lectures on Convex Optimization
- Nesterov
- 2004
Citation Context ...ically, ∀ g_1, g_2 : ‖∇V_t(g_1) − ∇V_t(g_2)‖_{ψ_t} ≤ η‖g_1 − g_2‖_{ψ*_t}. The main consequence of the above reasoning with which we are concerned is the following consequence of the fundamental theorem of calculus (Nesterov, 2004, Theorem 2.1.5): V_t(g + h) ≤ V_t(g) + ⟨h, ∇V_t(g)⟩ + (η/2)‖h‖^2_{ψ*_t}. For the remainder of this proof, we set ḡ_t = (1/t) Σ_{τ=1}^t g_τ and state a lemma characterizing V_t as a function of ḡ_t. Lemma 16. For any t ≥ 1, we have V_t(−tḡ_t)...

128 | Dual averaging methods for regularized stochastic learning and online optimization. Microsoft Research
- Xiao
- 2009
Citation Context ...and (3) either remained intact or was simply multiplied by a time-dependent scalar throughout the run of the algorithm. Zinkevich's projected gradient, for example, uses ψ_t(x) = ‖x‖_2^2, while RDA (Xiao, 2009) employs ψ_t(x) = √t ψ(x) where ψ is a strongly convex function. The bounds for both types of algorithms are similar, and both rely on the norm ‖·‖ (and its associated dual ‖·‖*) with respect to whic...

127 | Efficient online and batch learning using forward backward splitting - Duchi, Singer - 2009

96 | Adaptive and self-confident online learning algorithms - Auer, Gentile - 2000

90 | Concavity of certain maps on positive definite matrices and applications to Hadamard products. Linear Algebra and its Applications
- Ando
- 1979
Citation Context ...efinite. We also use the previous lemma which implies that the gradient of tr(A^{1/2}) is (1/2) A^{−1/2} when A ≻ 0. First, A^p is matrix-concave for A ≻ 0 and 0 ≤ p ≤ 1 (see, for example, Corollary 4.1 in Ando, 1979 or Theorem 16.1 in Bondar, 1994). That is, for A, B ≻ 0 and α ∈ [0, 1] we have (αA + (1 − α)B)^p ≽ αA^p + (1 − α)B^p. (25) Now suppose simply A, B ≽ 0 (but neither is necessarily strict). Then for a...

78 | Competing in the dark: An efficient algorithm for bandit linear optimization - Abernethy, Hazan, et al. - 2008

77 | A second-order perceptron algorithm - Cesa-Bianchi, Conconi

68 | SGD-QN: Careful quasi-Newton stochastic gradient descent - Bordes, Bottou, et al.

68 | Online passive-aggressive algorithms - Crammer, Dekel, et al.

67 | Improved second-order bounds for prediction with expert advice - Cesa-Bianchi, Mansour, et al.

65 | A discriminative kernel-based model to rank images from text queries - Grangier, Bengio - 2008

64 | Composite objective mirror descent
- Duchi, Shalev-Shwartz, et al.
- 2010
Citation Context ...re η is a fixed step-size and x_1 = argmin_{x ∈ X} ϕ(x). The second method similarly has numerous names, including proximal gradient, forward-backward splitting, and composite mirror descent (Tseng, 2008; Duchi et al., 2010). We use the term composite mirror descent. The composite mirror descent method employs a more immediate trade-off between the current gradient g_t, ϕ, and staying close to x_t using the proximal funct...
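In the simplest instantiation, ψ(x) = (1/2)‖x‖_2^2 and ϕ(x) = λ‖x‖_1, the composite mirror-descent step described above reduces to a gradient step followed by componentwise soft-thresholding. A hedged sketch of that special case (the helper names are ours, not the paper's):

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau*||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_md_step(x, g, eta, lam):
    """One step of x_{t+1} = argmin_x eta*<g, x> + eta*lam*||x||_1 + 0.5*||x - x_t||_2^2."""
    return soft_threshold(x - eta * g, eta * lam)

step = composite_md_step(np.array([1.0, 0.05]), np.zeros(2), eta=1.0, lam=0.1)
print(step)   # [0.9, 0.0]: large weight shrunk, small weight zeroed exactly
```

Handling ϕ inside the argmin rather than linearizing it is what makes the iterates exactly sparse, which is the point of the composite formulation.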

64 | An optimal method for stochastic composite optimization
- Lan
- 2010
Citation Context ...l cases projected gradients (Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Recent work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010) considered efficient and robust methods for stochastic optimization, especially in the case when the expected objective f is smooth. It may be interesting to investigate adaptive metric ...

61 | Adaptive regularization of weight vectors
- Crammer, Dredze, et al.
- 2009
Citation Context ...α_t = [1 − y_t⟨z_t, µ_t⟩]_+, µ_{t+1} = µ_t + α_t Σ_t y_t z_t, Σ_{t+1} = Σ_t − β_t Σ_t x_t x_t^⊤ Σ_t. (8) In the above, one can set Σ_t to be diagonal, which reduces run-time and storage requirements but still gives good performance (Crammer et al., 2009). In contrast to AROW, the ADAGRAD family uses the root of a covariance-like matrix, a consequence of our formal analysis. Crammer et al.'s algorithm and our algorithms have similar run times—linear ...
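The update (8) quoted above can be sketched with the diagonal Σ_t the excerpt mentions. Note that β_t is not defined in this excerpt; the sketch assumes the usual AROW choice β_t = 1/(⟨z_t, Σ_t z_t⟩ + r), which is our assumption, not a quote:

```python
import numpy as np

def arow_diag_step(mu, sigma, z, y, r=1.0):
    """One diagonal AROW update in the spirit of Eq. (8).
    mu: mean weights; sigma: diagonal of the confidence matrix Sigma_t."""
    margin = y * np.dot(z, mu)
    beta = 1.0 / (np.dot(z * sigma, z) + r)    # assumed AROW confidence rate
    alpha = max(0.0, 1.0 - margin) * beta      # hinge-style update size
    mu = mu + alpha * sigma * y * z
    sigma = sigma - beta * (sigma * z) ** 2    # diagonal of beta * Sigma z z^T Sigma
    return mu, sigma

mu, sigma = arow_diag_step(np.zeros(2), np.ones(2), np.array([1.0, 0.0]), y=1)
print(mu, sigma)   # [0.5 0. ] [0.5 1. ]
```

Only the seen coordinate's confidence shrinks, which is the second-order behavior the excerpt contrasts with ADAGRAD's use of the matrix square root.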

56 | An O(n) algorithm for quadratic knapsack problems
- Brucker
- 1984
Citation Context ...to {z : ⟨a, z⟩ ≤ c, z ≽ 0}. We next consider the setting in which ϕ ≡ 0 and X = {x : ‖x‖_1 ≤ c}, for which it is straightforward to adapt efficient solutions to continuous quadratic knapsack problems (Brucker, 1984). We use the matrix H_t = δI + diag(G_t)^{1/2} from Algorithm 1. We provide a brief derivation sketch and an O(d log d) algorithm in this section. First, we convert the problem (18) into a projection pro...
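In the unweighted special case H_t = I, the projection the excerpt describes is the classical Euclidean projection onto the ℓ1 ball, solvable by the standard sort-based O(d log d) procedure; a sketch of that special case (the weighted H_t version in the paper requires a further change of variables):

```python
import numpy as np

def project_l1_ball(v, c):
    """Euclidean projection of v onto {x : ||x||_1 <= c}, O(d log d) via sorting."""
    if np.abs(v).sum() <= c:
        return v.copy()                               # already feasible
    u = np.sort(np.abs(v))[::-1]                      # magnitudes, descending
    cssv = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > cssv - c)[0][-1]         # last coordinate kept active
    theta = (cssv[rho] - c) / (rho + 1.0)             # shrinkage threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

print(project_l1_ball(np.array([3.0, 1.0]), 2.0))     # [2. 0.]
```

The projection soft-thresholds all magnitudes by a single θ chosen so that the result sits exactly on the ℓ1 sphere of radius c.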

54 | Convex Analysis and Minimization Algorithms II
- Hiriart-Urruty, Lemaréchal
- 1996
Citation Context ...t, the function ψ*_t has η-Lipschitz continuous gradients with respect to ‖·‖_{ψ*_t}: ‖∇ψ*_t(g_1) − ∇ψ*_t(g_2)‖_{ψ_t} ≤ η‖g_1 − g_2‖_{ψ*_t} (30) for any g_1, g_2 (see, e.g., Nesterov, 2005, Theorem 1 or Hiriart-Urruty and Lemaréchal, 1996, Chapter X). Further, a simple argument with the fundamental theorem of calculus gives that if f has L-Lipschitz gradients, f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖^2, and ∇ψ*_t(g) = argmin ...

41 | Adaptive online gradient descent
- Bartlett, Hazan, et al.
- 2007
Citation Context ...s associated dual ‖·‖*) with respect to which ψ is strongly convex. Mirror-descent type first order algorithms, such as projected gradient methods, attain regret bounds of the form (Zinkevich, 2003; Bartlett et al., 2007; Duchi et al., 2010) R_φ(T) ≤ (1/η) B_ψ(x*, x_1) + (η/2) Σ_{t=1}^T ‖f′_t(x_t)‖^2_*. (7) Choosing η ∝ 1/√T gives R_φ(T) = O(√T). When B_ψ(x, x*) is bounded for all x ∈ X, we choose step sizes η_t ∝ 1/√t wh...

38 | Extracting certainty from uncertainty: regret bounded by variation in costs
- Hazan, Kale
- 2010
Citation Context ...ery iteration. Plain usage of the FTRL approach fails to achieve low regret; however, adding a proximal term to the past predictions leads to numerous low regret algorithms (Kalai and Vempala, 2003; Hazan and Kale, 2008; Rakhlin, 2009). The proximal term strongly affects the performance of the learning algorithm. Therefore, adapting the proximal function to the characteristics of the problem at hand is desirable. Ou...

38 | An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds - Pardalos, Rosen - 1990

35 | Solving variational inequalities with Stochastic Mirror-Prox algorithm. arXiv preprint arXiv:0809.0815
- Juditsky, Nemirovskii, et al.
- 2008
Citation Context ...turn include as special cases projected gradients (Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Recent work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010) considered efficient and robust methods for stochastic optimization, especially in the case when the expected objective f is smooth. It may be interesting to investigate adapt...

34 | Optimal strategies and minimax lower bounds for online convex games
- Abernethy, Bartlett, et al.
- 2008
Citation Context ...ch we take the infimum. To conclude the outline of results, we would like to point to two relevant research papers. First, Zinkevich's regret bound is tight and cannot be improved in a minimax sense (Abernethy et al., 2008). Therefore, improving the regret bound requires further reasonable assumptions on the input space. Second, in an independent work, performed concurrently to the research presented in this paper, McMa...

32 | Exact convex confidence-weighted learning
- Crammer, Dredze, et al.
- 2008
Citation Context ...dependent bounds for algorithms such as those above is a natural one and has been studied extensively in the past. The framework that is most related to ours is probably confidence weighted learning (Crammer et al., 2008) and the adaptive regularization of weights algorithm (AROW) of Crammer et al. (2009). These papers give a mistake-bound analysis for second-order algorithms for the Perceptron, which are similar in ...

28 | Logarithmic regret algorithms for strongly convex repeated games
- Shalev-Shwartz, Singer
- 2007
Citation Context ...gly Convex Functions It is now well established that strong convexity of the functions f_t can give significant improvements in the regret of online convex optimization algorithms (Hazan et al., 2006; Shalev-Shwartz and Singer, 2007). We can likewise derive lower regret bounds in the presence of strong convexity. We assume that our functions f_t + ϕ are strongly convex with respect to a norm ‖·‖. For simplicity, we assume that ea...

28 | Utilization of the operation of space dilation in the minimization of convex functions. Cybernetics and Systems Analysis, 6 - Shor - 1970

27 | Adaptive bound optimization for online convex optimization - McMahan, Streeter - 2010

15 | Joint covariate selection for grouped classification
- Obozinski
- 2006
Citation Context ...egularization We now turn to the case where ϕ(x) = λ‖x‖_2 while X = R^d. This type of regularization is useful for zeroing multiple weights in a group, for example in multi-task or multiclass learning (Obozinski et al., 2007). Recalling the general proximal step (18), we must solve min_x ⟨u, x⟩ + (1/2)⟨x, Hx⟩ + λ‖x‖_2. (21) There is no closed form solution for this problem, but we give an efficient bisection-based procedure f...
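The bisection the excerpt alludes to can be sketched for diagonal H (as in the diagonal version of the algorithm): for x ≠ 0, stationarity of (21) gives x_i = −u_i/(H_ii + λ/θ) with θ = ‖x‖_2, and x = 0 is optimal exactly when ‖u‖_2 ≤ λ, so one can bisect on the scalar θ. A sketch under those assumptions (the paper's exact procedure may differ):

```python
import numpy as np

def prox_l2_diag(u, h, lam, tol=1e-12):
    """min_x <u,x> + 0.5*<x, diag(h) x> + lam*||x||_2, with h > 0, by bisection
    on theta = ||x||_2. Stationarity for x != 0: x_i = -u_i / (h_i + lam/theta)."""
    if np.linalg.norm(u) <= lam:
        return np.zeros_like(u)                 # 0 is optimal iff ||u||_2 <= lam
    lo, hi = 0.0, np.linalg.norm(u / h)         # |x_i| <= |u_i|/h_i bounds theta
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if np.linalg.norm(u / (h + lam / theta)) > theta:
            lo = theta                          # the fixed point lies above theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    return -u / (h + lam / theta)

x = prox_l2_diag(np.array([-3.0, 0.0]), np.ones(2), 1.0)
print(x)   # [2. 0.]: satisfies u + h*x + lam*x/||x||_2 = 0
```

Bisection converges because ‖x(θ)‖_2/θ is strictly decreasing in θ, so the fixed-point equation θ = ‖x(θ)‖_2 has exactly one positive root.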

12 | Notions generalizing convexity for functions defined on spaces of matrices
- Davis
- 1963
Citation Context ...e inequality to get δ_t ≤ (1/η) ψ_t(x*) + (η/2) Σ_{τ=1}^t ‖g_τ‖^2_{ψ*_{τ−1}}. Combining the above equation with the lower bound on δ_t from Eq. (28) finishes the proof. C. Technical Lemmas. Lemma 17 (Example 3, Davis, 1963). Let A ≽ B ≽ 0 be symmetric d × d PSD matrices. Then A^{1/2} ≽ B^{1/2}. The gradient of the function tr(X^p) is easy to compute for integer values of p. However, when p is real we need the following l...

12 | Subgradient methods for convex minimization - Nedić - 2002

7 | Comments on and complements to Inequalities: Theory of Majorization and Its Applications. Linear Algebra and its Applications
- Bondar
- 1994
Citation Context ...ous lemma which implies that the gradient of tr(A^{1/2}) is (1/2) A^{−1/2} when A ≻ 0. First, A^p is matrix-concave for A ≻ 0 and 0 ≤ p ≤ 1 (see, for example, Corollary 4.1 in Ando, 1979 or Theorem 16.1 in Bondar, 1994). That is, for A, B ≻ 0 and α ∈ [0, 1] we have (αA + (1 − α)B)^p ≽ αA^p + (1 − α)B^p. (25) Now suppose simply A, B ≽ 0 (but neither is necessarily strict). Then for any δ > 0, we have A + δI ≻ 0 and...

1 | Problem Complexity and Method Efficiency in Optimization
- Nemirovski, Yudin
- 1983
Citation Context ...ger, 2009) and its composite mirror-descent generalizations (Duchi et al., 2010), which in turn include as special cases the method of projected gradient descent (Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Prior to the analysis presented in this paper, the strongly convex function ψ in the update equations (2) and (3) either remained intact or was simply multiplied by a time-...

1 | Lecture notes on online learning. For the Statistical Machine Learning Course at
- Rakhlin
- 2009
Citation Context ...ave been developed over the past few years to minimize regret in the online learning setting. A modern view of these algorithms casts the problem as the task of following the (regularized) leader (see Rakhlin, 2009 and the references therein), or FTRL in short. Informally, FTRL methods choose the best decision in hindsight at every iteration. Plain usage of the FTRL approach fails to achieve low regret; however,...
