## Better Mini-Batch Algorithms via Accelerated Gradient Methods

Citations: 6 (3 self)

### BibTeX

```bibtex
@misc{Cotter_bettermini-batch,
  author = {Andrew Cotter and Nathan Srebro and Ohad Shamir and Karthik Sridharan},
  title  = {Better Mini-Batch Algorithms via Accelerated Gradient Methods},
  year   = {}
}
```

### Abstract

Mini-batch algorithms have been proposed as a way to speed up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up, and propose a novel accelerated gradient algorithm which deals with this deficiency, enjoys a uniformly superior guarantee, and works well in practice.

### Citations

280 | Pegasos: Primal estimated sub-gradient solver for SVM - Shalev-Shwartz, Singer, et al.

251 | Smooth minimization of non-smooth functions - Nesterov
Citation Context: ...ing distributed framework is capable of attaining asymptotically optimal speed-up in general (see also [1]). A parallel development has been the popularization of accelerated gradient descent methods [7, 8, 15, 5]. In a deterministic optimization setting and for general smooth convex functions, these methods enjoy a rate of O(1/n²) (where n is the number of iterations) as opposed to O(1/n) using standard metho...
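The O(1/n²) versus O(1/n) contrast in the snippet above is easy to see empirically. Below is a minimal sketch (an illustration, not code from the paper or the cited works): Nesterov's accelerated method against plain gradient descent on a smooth convex quadratic, both using the standard 1/L step size.

```python
import numpy as np

def grad_descent(grad, x0, L, n):
    # Plain gradient descent with step size 1/L: O(1/n) suboptimality
    # on general smooth convex functions.
    x = x0.copy()
    for _ in range(n):
        x = x - grad(x) / L
    return x

def nesterov_agd(grad, x0, L, n):
    # Nesterov's accelerated gradient method: O(1/n^2) suboptimality.
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n):
        x_next = y - grad(y) / L            # gradient step from the momentum point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x

# Ill-conditioned convex quadratic f(x) = 0.5 x^T A x, with minimum f(x*) = 0.
A = np.diag(np.linspace(0.001, 1.0, 50))    # L = largest eigenvalue = 1
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x0 = np.random.default_rng(0).standard_normal(50)

f_gd = f(grad_descent(grad, x0, L=1.0, n=300))
f_agd = f(nesterov_agd(grad, x0, L=1.0, n=300))
# After the same number of gradient evaluations, the accelerated
# iterate is much closer to the optimum than the plain one.
```

The momentum extrapolation step is the only difference from plain gradient descent; it costs one extra vector operation per iteration but changes the worst-case rate from O(1/n) to O(1/n²).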

132 | The tradeoffs of large scale learning - Bottou, Bousquet - 2007
Citation Context: ...d gradient method we propose is E[L(w_n^ag)] − L(w*) ≤ Õ(√(L(w*)/(bn)) + 1/(√b·n) + 1/n²) = Õ(√(L(w*)/m) + √b/m + b²/m²). (6) To understand the implications of these bounds, we follow the approach described in [3, 12] to analyze large-scale learning algorithms. First, we fix a desired suboptimality parameter ε, which measures how close to L(w*) we want to get. Then, we assume that both algorithms are run until the s...

96 | Robust stochastic approximation approach to stochastic programming - Nemirovski, Juditsky, et al. - 2009

82 | Mirror descent and nonlinear projected subgradient methods for convex optimization - Beck, Teboulle
Citation Context: ...(x, y) and ℓ(w, (x, y)) is a prediction loss. In recent years, there has been much interest in developing efficient first-order stochastic optimization methods for these problems, such as mirror descent [2, 6] and dual averaging [9, 16]. These methods are characterized by incremental updates based on subgradients ∂ℓ(w, z_i) of individual instances, and enjoy the advantages of being highly scalable and simple...
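As a concrete illustration of such incremental per-instance updates, here is a minimal sketch (my own toy example, not code from the cited works) of the Euclidean special case of mirror descent: stochastic subgradient descent on the hinge loss, one instance per step, with iterate averaging.

```python
import numpy as np

def sgd_hinge(X, Y, steps, eta0=0.5):
    # Incremental subgradient updates on individual instances
    # (the Euclidean special case of mirror descent):
    #   w_{i+1} = w_i - eta_i * g_i,  g_i a subgradient of
    #   hinge(w, (x, y)) = max(0, 1 - y * <w, x>)  at w_i.
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    rng = np.random.default_rng(1)
    for i in range(steps):
        j = rng.integers(n)                  # draw one instance z_j = (x_j, y_j)
        if Y[j] * (w @ X[j]) < 1.0:          # margin violated: subgradient is -y*x
            w = w + (eta0 / np.sqrt(i + 1)) * Y[j] * X[j]
        w_sum += w
    return w_sum / steps                     # averaged iterate, as in SGD analyses

# Hypothetical toy data: labels generated by a linear rule.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
Y = np.sign(X @ np.ones(5))
w = sgd_hinge(X, Y, steps=2000)
acc = np.mean(np.sign(X @ w) == Y)
```

Each update touches a single instance, which is what makes these methods so scalable; mini-batching replaces the single subgradient with an average over b instances.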

73 | Primal-Dual subgradient methods for convex problems - Nesterov - 2009
Citation Context: ...prediction loss. In recent years, there has been much interest in developing efficient first-order stochastic optimization methods for these problems, such as mirror descent [2, 6] and dual averaging [9, 16]. These methods are characterized by incremental updates based on subgradients ∂ℓ(w, z_i) of individual instances, and enjoy the advantages of being highly scalable and simple to implement. An important...

70 | On accelerated proximal gradient methods for convex-concave optimization - Tseng - 2008
Citation Context: ...ing distributed framework is capable of attaining asymptotically optimal speed-up in general (see also [1]). A parallel development has been the popularization of accelerated gradient descent methods [7, 8, 15, 5]. In a deterministic optimization setting and for general smooth convex functions, these methods enjoy a rate of O(1/n²) (where n is the number of iterations) as opposed to O(1/n) using standard metho...

59 | A method for unconstrained convex minimization problem with the rate of convergence O(1/k²) - Nesterov - 1983
Citation Context: ...ing distributed framework is capable of attaining asymptotically optimal speed-up in general (see also [1]). A parallel development has been the popularization of accelerated gradient descent methods [7, 8, 15, 5]. In a deterministic optimization setting and for general smooth convex functions, these methods enjoy a rate of O(1/n²) (where n is the number of iterations) as opposed to O(1/n) using standard metho...

59 | Dual averaging methods for regularized stochastic learning and online optimization. JMLR - Xiao - 2010
Citation Context: ...prediction loss. In recent years, there has been much interest in developing efficient first-order stochastic optimization methods for these problems, such as mirror descent [2, 6] and dual averaging [9, 16]. These methods are characterized by incremental updates based on subgradients ∂ℓ(w, z_i) of individual instances, and enjoy the advantages of being highly scalable and simple to implement. An important...

52 | SVM optimization: inverse dependence on training set size - Shalev-Shwartz, Srebro - 2008
Citation Context: ...d gradient method we propose is E[L(w_n^ag)] − L(w*) ≤ Õ(√(L(w*)/(bn)) + 1/(√b·n) + 1/n²) = Õ(√(L(w*)/m) + √b/m + b²/m²). (6) To understand the implications of these bounds, we follow the approach described in [3, 12] to analyze large-scale learning algorithms. First, we fix a desired suboptimality parameter ε, which measures how close to L(w*) we want to get. Then, we assume that both algorithms are run until the s...

16 | Distributed delayed stochastic optimization - Agarwal, Duchi - 2011
Citation Context: ...er in a distributed framework (see for instance [11]). Recently, [10] has shown that a mini-batching distributed framework is capable of attaining asymptotically optimal speed-up in general (see also [1]). A parallel development has been the popularization of accelerated gradient descent methods [7, 8, 15, 5]. In a deterministic optimization setting and for general smooth convex functions, these meth...

16 | Optimal distributed online prediction using mini-batches - Dekel, Gilad-Bachrach, et al.
Citation Context: ...gence speed. Moreover, in certain regimes acceleration is actually necessary in order to allow significant speedups. The potential benefit of acceleration to mini-batching has been briefly noted in [4], but here we study this issue in much more depth. In particular, we make the following contributions: • We develop novel convergence bounds for the standard gradient method, which refine the result ...

15 | Smoothness, low-noise and fast rates - Srebro, Sridharan, et al. - 2010
Citation Context: ...to just choosing γ_i ∝ i as in [5]), might yield superior results. The key observation used for analyzing the dependence on L(w*) is that for any non-negative H-smooth convex function f : W → ℝ, we have [13]: ‖∇f(w)‖² ≤ 4H·f(w) (3). This self-bounding property tells us that the norm of the gradient is small at a point if the loss is itself small at that point. This self-bounding property has been used in [14]...
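The self-bounding property is easy to verify numerically. The sketch below (an illustration of the inequality, not code from the cited works) checks ‖∇f(w)‖² ≤ 4H·f(w) for the squared loss f(w) = (⟨w, x⟩ − y)², which is H-smooth with H = 2‖x‖².

```python
import numpy as np

# Check the self-bounding property  ||grad f(w)||^2 <= 4 * H * f(w)
# for the non-negative H-smooth convex loss f(w) = (<w, x> - y)^2,
# which is H-smooth with H = 2 * ||x||^2.
rng = np.random.default_rng(3)
x = rng.standard_normal(10)
y = 0.7
H = 2.0 * (x @ x)

f = lambda w: (w @ x - y) ** 2
grad = lambda w: 2.0 * (w @ x - y) * x   # here ||grad||^2 = 4 * ||x||^2 * f(w)

ok = all(
    grad(w) @ grad(w) <= 4.0 * H * f(w) + 1e-12
    for w in rng.standard_normal((100, 10))
)
# ok is True: wherever the loss is small, the gradient norm must be small too.
```

For this particular loss the inequality even holds with slack (‖∇f‖² = 4‖x‖²·f ≤ 8‖x‖²·f), which is exactly the behavior the bound exploits: near-optimal points with small loss also have small gradients.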

11 | Optimal distributed online prediction - Dekel, Gilad-Bachrach, et al.
Citation Context: ...(1/b) ∑_j ℓ(w, z_{i+j}) over a mini-batch of b instances. The gradient computations for each mini-batch can be parallelized, allowing these methods to perform faster in a distributed framework (see for instance [11]). Recently, [10] has shown that a mini-batching distributed framework is capable of attaining asymptotically optimal speed-up in general (see also [1]). A parallel development has been the popularization of accelerat...
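The parallelization opportunity comes from the fact that a mini-batch gradient is a plain average of independent per-instance gradients. A minimal sketch (a hypothetical squared-loss example, not the cited works' code) showing that the per-instance loop computes the same average a parallel or vectorized implementation would:

```python
import numpy as np

def minibatch_grad(w, X, Y, idx):
    # Average gradient over one mini-batch: (1/b) * sum_j grad l(w, z_j).
    # Each term depends only on its own instance (x_j, y_j), so the loop
    # body can be farmed out to b workers and the results averaged.
    g = np.zeros_like(w)
    for j in idx:                            # embarrassingly parallel
        g += 2.0 * (w @ X[j] - Y[j]) * X[j]  # squared-loss gradient, instance j
    return g / len(idx)

rng = np.random.default_rng(4)
X = rng.standard_normal((64, 8))
Y = X @ np.ones(8)
w = rng.standard_normal(8)

g_loop = minibatch_grad(w, X, Y, range(64))
g_vec = 2.0 * X.T @ (X @ w - Y) / 64         # the same average, vectorized
```

Because the b per-instance gradients are independent, distributing them across machines changes only wall-clock time, not the update itself, which is why mini-batching composes cleanly with both standard and accelerated gradient steps.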

4 | Online Learning: Theory, Algorithms, and Applications - 2007
Citation Context: ...[13]: ‖∇f(w)‖² ≤ 4H·f(w) (3). This self-bounding property tells us that the norm of the gradient is small at a point if the loss is itself small at that point. This self-bounding property has been used in [14] in the online setting and in [13] in the stochastic setting to get better (faster) rates of convergence for non-negative smooth losses. The implication of this observation is that for any w ∈ W, ‖∇L(w...

1 | An optimal method for stochastic convex optimization - Lan - 2009