## Dual averaging methods for regularized stochastic learning and online optimization (2009)

Venue: Advances in Neural Information Processing Systems 23

Citations: 62 (3 self)

### BibTeX

    @INPROCEEDINGS{Xiao09dualaveraging,
      author    = {Lin Xiao},
      title     = {Dual averaging methods for regularized stochastic learning and online optimization},
      booktitle = {Advances in Neural Information Processing Systems 23},
      year      = {2009}
    }

### Abstract

We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term, such as the ℓ1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.
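For the ℓ1-regularized case, the per-iteration minimization described in the abstract has a closed-form, coordinate-wise solution. The following is a minimal sketch of one step of the simple ℓ1-RDA variant; the function name and the √t/γ scaling are our reading of the method, not the paper's exact pseudocode.

```python
import numpy as np

def l1_rda_step(g_bar, g_t, t, lam, gamma):
    """One sketched l1-RDA iteration: average all past subgradients of the
    loss, then solve the simple minimization in closed form.

    g_bar : running average of past subgradients (before this step)
    g_t   : subgradient of the loss at the current iterate
    t     : iteration counter (1-based)
    lam   : l1-regularization weight
    gamma : step-size constant (assumed name, from our reading)
    """
    # running average of all past subgradients of the loss
    g_bar = ((t - 1) * g_bar + g_t) / t
    # the whole l1 term is kept in the subproblem: any coordinate with
    # |g_bar| <= lam is set exactly to zero, which produces sparse iterates
    w_next = np.where(
        np.abs(g_bar) <= lam,
        0.0,
        -(np.sqrt(t) / gamma) * (g_bar - lam * np.sign(g_bar)),
    )
    return w_next, g_bar
```

Note how a coordinate is zeroed whenever its averaged subgradient is small, rather than relying on a gradient step to cancel exactly; this is why RDA-style updates are effective at producing sparsity while plain SGD is not.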

### Citations

4084 | Convex Optimization - Boyd, Vandenberghe - 2004 |

3777 | Convex Analysis - Rockafellar - 1970 |

Citation Context: ...pair of data drawn from an (unknown) underlying distribution, f(w,z) is the loss function of using w and x to predict y, and Ψ(w) is a regularization term. We assume Ψ(w) is a closed convex function (Rockafellar, 1970), and its effective domain, dom Ψ = {w ∈ R^n | Ψ(w) < +∞}, is closed. We also assume that f(w,z) is convex in w for each z, and it is subdifferentiable (a subgradient always exists) on dom Ψ. Examples o... |

2066 | Regression shrinkage and selection via the Lasso - Tibshirani - 1996 |

Citation Context: ...usually enough to produce desired sparsity. As a result, ℓ1-regularization has been very effective in obtaining sparse solutions using the batch optimization approach in statistical learning (e.g., Tibshirani, 1996) and signal processing (e.g., Chen et al., 1998). In contrast, the SGD method (3) hardly generates any sparse solution, and its inherent low accuracy makes the simple rounding approach very unreliabl... |

1799 | Atomic Decomposition by Basis Pursuit - Chen, Donoho, et al. - 1999 |

Citation Context: ..., ℓ1-regularization has been very effective in obtaining sparse solutions using the batch optimization approach in statistical learning (e.g., Tibshirani, 1996) and signal processing (e.g., Chen et al., 1998). In contrast, the SGD method (3) hardly generates any sparse solution, and its inherent low accuracy makes the simple rounding approach very unreliable. Several principled soft-thresholding or trunc... |

814 | Gradient-based learning applied to document recognition - LeCun, Bottou, et al. - 1998 |

Citation Context: ...end for. 6. Computational Experiments with ℓ1-Regularization: In this section, we provide computational experiments of the ℓ1-RDA method on the MNIST dataset of handwritten digits (LeCun et al., 1998). Our purpose here is mainly to illustrate the basic characteristics of the ℓ1-RDA method, rather than comprehensive performance evaluation on a wide range of datasets. First, we describe a variant o... |

608 | A stochastic approximation method - Robbins, Monro |

415 | A fast iterative shrinkage-thresholding algorithm for linear inverse problems - Beck, Teboulle - 2009 |

Citation Context: ...Various methods for rounding or truncating the solutions are proposed to generate sparse solutions (e.g., [5]). Inspired by recently developed first-order methods for optimizing composite functions [6, 7, 8], the regularized dual averaging (RDA) method we develop exploits the full regularization structure at each online iteration. In other words, at each iteration, the learning variables are adjusted by ... |

318 | Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems - Figueiredo, Nowak, et al. - 2007 |

307 | Pegasos: Primal Estimated sub-GrAdient SOlver for SVM - Shalev-Shwartz, Singer, et al. - 2007 |

Citation Context: ...ular in the machine learning community due to its capability of scaling with very large data sets and good generalization performances observed in practice (e.g., Bottou and LeCun, 2004; Zhang, 2004; Shalev-Shwartz et al., 2007). Nevertheless, a main drawback of the SGD method is its lack of capability in exploiting problem structure, especially for problems with explicit regularization. More specifically, the SGD method (3... |

279 | Smooth minimization of non-smooth functions - Nesterov - 2005 |

258 | Exponentiated gradient versus gradient descent for linear predictors - Kivinen, Warmuth - 1994 |

Citation Context: ...w_{t+1}^(i) = (1/Z_{t+1}) exp(−(√t/γ) ḡ_t^(i)), i = 1,...,n, where Z_{t+1} is a normalization parameter such that ∑_{i=1}^n w_{t+1}^(i) = 1. This is the dual averaging version of the exponentiated gradient algorithm (Kivinen and Warmuth, 1997); see also Tseng and Bertsekas (1993) and Juditsky et al. (2005). We note that this example is also covered by Nesterov’s dual averaging method. We discuss in detail the special case of p-norm RDA me... |
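The exponentiated-gradient flavor of dual averaging quoted in this context can be sketched directly. The function name below is ours; it illustrates the update w_{t+1}^(i) ∝ exp(−(√t/γ) ḡ_t^(i)) normalized onto the probability simplex.

```python
import numpy as np

def eg_dual_averaging_step(g_bar_t, t, gamma):
    """Sketch of the entropy-regularized dual averaging update: each
    coordinate is exponentiated at -(sqrt(t)/gamma) times the averaged
    subgradient, then divided by the normalizer Z_{t+1} so the new
    iterate sums to 1 (i.e., stays in the standard simplex)."""
    u = np.exp(-(np.sqrt(t) / gamma) * g_bar_t)
    return u / u.sum()  # division by Z_{t+1} = sum of the unnormalized weights
```

Coordinates whose averaged subgradient is large are driven multiplicatively toward zero, which is the exponentiated-gradient behavior the context describes.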

242 | Weighted sums of certain dependent random variables - Azuma - 1967 |

Citation Context: ...∆_t = O(ln t) using β_t = O(ln t). However, the high probability bound in Theorem 5 does not improve: we still have φ(w̄_t) − φ⋆ = O(1/√t), not O(ln t/t). The reason is that the concentration inequality (Azuma, 1967) used in proving Theorem 5 cannot take advantage of the strong-convexity property. By using a refined concentration inequality due to Freedman (1975), Kakade and Tewari (2009, Theorem 2) showed that ... |

222 | Introductory lectures on convex optimization: Basic course - Nesterov - 2003 |

207 | Gradient methods for minimizing composite objective function - Nesterov - 2007 |

Citation Context: ...ems, they can be solved efficiently by interior-point methods (e.g., Ferris and Munson, 2003; Koh et al., 2007), quasi-Newton methods (e.g., Andrew and Gao, 2007), or accelerated first-order methods (Nesterov, 2007; Tseng, 2008; Beck and Teboulle, 2009). However, this batch optimization approach may not scale well for very large problems: even with first-order methods, evaluating one single gradient of the obje... |

195 | Online convex programming and generalized infinitesimal gradient ascent - Zinkevich - 2003 |

185 | Sparse reconstruction by separable approximation - Wright, Nowak, et al. |

183 | Problem Complexity and Method Efficiency in Optimization - Nemirovsky, Yudin - 1983 |

Citation Context: ...is a positive constant. The corresponding convergence rate is O(1/√t), which is indeed best possible for subgradient schemes with a black-box model, even in the case of deterministic optimization (Nemirovsky and Yudin, 1983). Despite such slow convergence and the associated low accuracy in the solutions (compared with batch optimization using, for example, interior-point methods), the SGD method has been very popular in... |

157 | Stochastic estimation of the maximum of a regression function - Kiefer, Wolfowitz - 1952 |

148 | The tradeoffs of large scale learning - Bottou, Bousquet - 2008 |

Citation Context: ...y. The low computational complexity (per iteration) of online algorithms is often associated with their slow convergence and low accuracy in solving the underlying optimization problems. As argued in [1, 2], the combined low complexity and low accuracy, together with other tradeoffs in statistical learning theory, still make online algorithms a favorite choice for solving large-scale learning problems. N... |

131 | Scalable training of L1-regularized log-linear models - Andrew, Gao - 2007 |

Citation Context: ...ion problem. Depending on the structure of particular problems, they can be solved efficiently by interior-point methods (e.g., Ferris and Munson, 2003; Koh et al., 2007), quasi-Newton methods (e.g., Andrew and Gao, 2007), or accelerated first-order methods (Nesterov, 2007; Tseng, 2008; Beck and Teboulle, 2009). However, this batch optimization approach may not scale well for very large problems: even with first-orde... |

128 | Logarithmic regret algorithms for online convex optimization - Hazan, Kalai, et al. - 2006 |

Citation Context: ...al for convex cost functions. However, if the cost functions are strongly convex, say with convexity parameter σ, then the same algorithm with stepsize α_t = 1/(σt) gives an O(ln t) regret bound (e.g., Hazan et al., 2006; Bartlett et al., 2008). Similar to the discussions on regularized stochastic learning, the online subgradient method (3) in general lacks the capability of exploiting the regularization structure. I... |

128 | Fundamentals of Convex Analysis - Hiriart-Urruty, Lemaréchal - 2001 |

125 | Splitting algorithms for the sum of two nonlinear operators - Lions, Mercier - 1979 |

112 | Acceleration of stochastic approximation by averaging - Polyak, Juditsky - 1992 |

Citation Context: ...en the objective function is strongly convex. In this case, the convergence rate for stochastic optimization problems can be improved to O(ln t/t) (e.g., Nesterov and Vial, 2008), or even O(1/t) (e.g., Polyak and Juditsky, 1992; Nemirovski et al., 2009). For online convex optimization problems, the regret bound can be improved to O(ln t) (Hazan et al., 2006; Bartlett et al., 2008). But these are still far short of the best c... |

106 | Robust stochastic approximation approach to stochastic programming - Nemirovski, Juditsky, et al. |

Citation Context: ...own in advance, then using a constant stepsize in the classical gradient method (3), say α_t = (√2 D)/(G√T) for all t = 1,...,T (19), gives a slightly improved bound R_T(w) ≤ √2 GD√T (see, e.g., Nemirovski et al., 2009). The bound in part (b) does not converge to zero. This result is still interesting because there is no special caution taken in the RDA method, more specifically in (8), to ensure the boundedness of... |

78 | On accelerated proximal gradient methods for convex-concave optimization - Tseng - 2008 |

Citation Context: ...Various methods for rounding or truncating the solutions are proposed to generate sparse solutions (e.g., [5]). Inspired by recently developed first-order methods for optimizing composite functions [6, 7, 8], the regularized dual averaging (RDA) method we develop exploits the full regularization structure at each online iteration. In other words, at each iteration, the learning variables are adjusted by ... |

77 | Primal-dual subgradient methods for convex problems - Nesterov - 2007 |

Citation Context: ...mal convergence theorems are given in Section 3. Here are some examples: ∙ Nesterov’s dual averaging method. Let Ψ(... |

70 | Efficient projections onto the ℓ1-ball for learning in high dimensions - Duchi, Shalev-Shwartz, et al. - 2008 |

67 | The robustness of the p-norm algorithms - Gentile |

65 | On tail probabilities for martingales - Freedman - 1975 |

65 | Sparse online learning via truncated gradient - Langford, Li, et al. |

Citation Context: ...t generate sparse solutions because only in very rare cases two float numbers add up to zero. Various methods for rounding or truncating the solutions are proposed to generate sparse solutions (e.g., [5]). Inspired by recently developed first-order methods for optimizing composite functions [6, 7, 8], the regularized dual averaging (RDA) method we develop exploits the full regularization structure at... |

63 | Adaptive subgradient methods for online learning and stochastic optimization - Duchi, Hazan, et al. - 2011 |

63 | Solving large scale linear prediction problems using stochastic gradient descent algorithms - Zhang - 2004 |

Citation Context: ...been very popular in the machine learning community due to its capability of scaling with very large data sets and good generalization performances observed in practice (e.g., Bottou and LeCun, 2004; Zhang, 2004; Shalev-Shwartz et al., 2007). Nevertheless, a main drawback of the SGD method is its lack of capability in exploiting problem structure, especially for problems with explicit regularization. More sp... |

60 | Efficient online and batch learning using forward backward splitting - Duchi, Singer - 2009 |

Citation Context: ...d by t to obtain the convergence rate. 4. Related work: There have been several recent works that address online algorithms for regularized learning problems, especially with ℓ1-regularization; see, e.g., [14, 15, 16, 5, 17]. In particular, a forward-backward splitting method (FOBOS) is studied in [17] for solving the same problems we consider. In an online setting, each iteration of the FOBOS method can be written as ... |
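For comparison with the RDA update, the FOBOS iteration mentioned in this context reduces, for ℓ1-regularization, to a gradient step followed by soft thresholding. A minimal sketch under that standard reading (function and variable names are ours):

```python
import numpy as np

def fobos_l1_step(w, g, alpha, lam):
    """Sketch of one FOBOS iteration for an l1-regularized problem:
    a forward (subgradient) step on the loss, then the backward/proximal
    step for alpha*lam*||w||_1, i.e., entrywise soft thresholding."""
    v = w - alpha * g  # forward step: plain subgradient step on the loss
    # backward step: prox of the l1 term shrinks each coordinate toward zero
    return np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)
```

Unlike RDA, the shrinkage here is applied to the current iterate after a single gradient step rather than to the running average of all past subgradients.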

59 | SVM optimization: Inverse dependence on training set size - Shalev-Shwartz, Srebro - 2008 |

Citation Context: ...y. The low computational complexity (per iteration) of online algorithms is often associated with their slow convergence and low accuracy in solving the underlying optimization problems. As argued in [1, 2], the combined low complexity and low accuracy, together with other tradeoffs in statistical learning theory, still make online algorithms a favorite choice for solving large-scale learning problems. N... |

56 | An interior-point method for large-scale ℓ1-regularized logistic regression - Koh, Kim, et al. |

Citation Context: ...lar convergence rate and the same order of computational complexity per iteration. We also compare them with the batch optimization approach, using an efficient interior-point method (IPM) developed by [19]. Each pair of digits has about 12,000 training examples and 2,000 testing examples. We use online algorithms to go through the (randomly permuted) data only once, therefore the algorithms stop at ... |

51 | A modified forward-backward splitting method for maximal monotone mappings - Tseng |

48 | Large scale online learning - Bottou, LeCun - 2003 |

47 | Interior point methods for massive support vector machines - Ferris, Munson - 2000 |

44 | Convergence analysis of a proximal-like minimization algorithm using Bregman functions - Chen, Teboulle - 1993 |

44 | On the convergence of exponential multiplier method for convex programming - Tseng, Bertsekas - 1993 |

31 | An optimal method for stochastic composite optimization - Lan - 2012 |

30 | Adaptive online gradient descent - Bartlett, Hazan, et al. - 2007 |

Citation Context: ...unctions. However, if the cost functions are strongly convex, say with convexity parameter σ, then the same algorithm with stepsize α_t = 1/(σt) gives an O(ln t) regret bound (e.g., Hazan et al., 2006; Bartlett et al., 2008). Similar to the discussions on regularized stochastic learning, the online subgradient method (3) in general lacks the capability of exploiting the regularization structure. In this paper, we show t... |

30 | Differentiable sparse coding - Bradley, Bagnell - 2009 |

Citation Context: ...ivergence regularization has the pseudo-sparsity effect, meaning that most elements in w can be replaced by elements in the constant vector p without significantly increasing the loss function (e.g., Bradley and Bagnell, 2009). 3. Regret Bounds for Online Optimization: In this section, we give the precise regret bounds of the RDA method for solving regularized online optimization problems. The convergence rates for stoch... |

30 | Convex repeated games and fenchel duality - Shalev-Shwartz, Singer - 2006 |

26 | Incremental Gradient(-Projection) method with momentum term and adaptive stepsize rule - Tseng - 1998 |

Citation Context: ...distribution on a finite support; more specifically, f_k(w) = (1/m) f(w, z_k) for k = 1,...,m. The unregularized version, i.e., with Ψ(w) = 0, has been addressed by incremental subgradient methods (e.g., Tseng, 1998; Nedić and Bertsekas, 2001). At each iteration of such methods, a step is taken along the negative subgradient of a single function f_k, which is chosen either in a round-robin manner or randomly with... |

25 | On the generalization ability of online strongly convex programming algorithms - Kakade, Tewari - 2009 |

25 | Confidence level solutions for stochastic programming - Vial |

25 | Mind the duality gap: Logarithmic regret algorithms for online optimization - Kakade, Shalev-Shwartz - 2008 |

23 | Iterated hard shrinkage for minimization problems with sparsity constraints - Bredies, Lorenz |