Incremental majorization-minimization optimization with application to large-scale machine learning. (2015)

by Julien Mairal
Venue: SIAM Journal on Optimization
Results 1 - 10 of 23

SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

by Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014
"... In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and ha ..."
Abstract - Cited by 30 (3 self)
In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.
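The update at the heart of SAGA is compact enough to sketch. Below is a minimal NumPy illustration of a SAGA-style pass for a composite objective min_w (1/n) sum_i f_i(w) + h(w), following the update rule the abstract describes; grad_i and prox_h are assumed user-supplied callables and the step size gamma is a tuning parameter, so this is a sketch rather than the authors' implementation:

    import numpy as np

    def saga_pass(w, grad_i, prox_h, table, gamma, rng):
        """One pass of SAGA-style updates (sketch, not the authors' code).

        table[i] holds the last gradient evaluated for component i;
        its running mean is maintained incrementally."""
        n = table.shape[0]
        avg = table.mean(axis=0)                  # (1/n) sum_i f'_i(phi_i)
        for _ in range(n):
            j = rng.integers(n)                   # sample one component
            g_new = grad_i(j, w)                  # f'_j at the current point
            v = g_new - table[j] + avg            # variance-reduced gradient
            w = prox_h(w - gamma * v, gamma)      # proximal step on the regulariser
            avg += (g_new - table[j]) / n         # keep the average current
            table[j] = g_new
        return w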

Citation Context

... = n; however, doing so does not give anywhere near the best practical performance. Having to tune one parameter instead of two is a practical advantage for SAGA. Finito/MISOµ: The Finito [5] and MISOµ [6] methods are also closely related to SAGA. Both Finito and MISOµ use updates of the following form, for a step length η: $x^{k+1} = \frac{1}{n}\sum_i \phi_i^k - \frac{1}{\eta}\sum_{i=1}^{n} f'_i(\phi_i^k)$. Note that the step size used is ...
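Read literally, the quoted update maintains a table of stored iterates φ_i together with their gradients and recomputes x from their aggregates. A minimal sketch under that reading; the refresh of a single randomly chosen component is our assumption about the selection rule, and all names are illustrative:

    import numpy as np

    def finito_like_step(phi, grads, grad_i, eta, rng):
        """One Finito/MISO-style step (sketch). phi[i] is the stored iterate
        for component i and grads[i] = f'_i(phi[i])."""
        n = phi.shape[0]
        x = phi.mean(axis=0) - grads.sum(axis=0) / eta  # the quoted update
        j = rng.integers(n)          # refresh one randomly chosen component
        phi[j] = x
        grads[j] = grad_i(j, x)
        return x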

Randomized dual coordinate ascent with arbitrary sampling

by Zheng Qu, Peter Richtárik, Tong Zhang, 2014
"... We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to ..."
Abstract - Cited by 7 (4 self)
We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient mini-batch importance sampling. The distributed variant of Quartz is the first distributed SDCA-like method with an analysis for non-separable data.
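The "arbitrary sampling" ingredient is easy to illustrate on its own: draw a random subset of dual coordinates from a user-chosen distribution and update only those. A minimal NumPy sketch of the mechanism; the per-coordinate dual update is method-specific and stubbed here as a hypothetical callable:

    import numpy as np

    def sample_dual_subset(p, tau, rng):
        """Draw tau dual coordinates without replacement, weighted by p
        (one simple way to realize an arbitrary sampling)."""
        return rng.choice(len(p), size=tau, replace=False, p=p / p.sum())

    def quartz_like_iteration(alpha, dual_update, p, tau, rng):
        """Update a random subset of dual variables (sketch); dual_update
        stands in for the method's actual per-coordinate rule."""
        for i in sample_dual_subset(p, tau, rng):
            alpha[i] = dual_update(i, alpha)
        return alpha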

A universal catalyst for first-order optimization.

by Hongzhou Lin, Julien Mairal, Zaid Harchaoui - In Advances in Neural Information Processing Systems, 2015
"... Abstract We introduce a generic scheme for accelerating first-order optimization methods in the sense of Nesterov, which builds upon a new analysis of the accelerated proximal point algorithm. Our approach consists of minimizing a convex objective by approximately solving a sequence of well-chosen ..."
Abstract - Cited by 6 (0 self)
We introduce a generic scheme for accelerating first-order optimization methods in the sense of Nesterov, which builds upon a new analysis of the accelerated proximal point algorithm. Our approach consists of minimizing a convex objective by approximately solving a sequence of well-chosen auxiliary problems, leading to faster convergence. This strategy applies to a large class of algorithms, including gradient descent, block coordinate descent, SAG, SAGA, SDCA, SVRG, Finito/MISO, and their proximal variants. For all of these methods, we provide acceleration and explicit support for non-strongly convex objectives. In addition to theoretical speed-up, we also show that acceleration is useful in practice, especially for ill-conditioned problems where we measure significant improvements.
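The scheme the abstract describes fits in a few lines: wrap any linearly convergent inner solver around a sequence of proximal-point subproblems and extrapolate between their solutions. A minimal sketch of such an outer loop, assuming a hypothetical inner_solve that approximately minimizes f(x) + (kappa/2)*||x - y||^2 and a user-supplied extrapolation schedule (the exact coefficients come from the paper's analysis and are not reproduced here):

    import numpy as np

    def catalyst_outer_loop(x0, inner_solve, kappa, beta_schedule, iters):
        """Generic acceleration wrapper (sketch of the outer loop only)."""
        x_prev = x0.copy()
        y = x0.copy()
        for k in range(iters):
            x = inner_solve(y, kappa)    # solve the auxiliary problem approximately
            beta = beta_schedule(k)      # Nesterov-style extrapolation coefficient
            y = x + beta * (x - x_prev)  # extrapolate between solutions
            x_prev = x
        return x_prev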

Citation Context

...th constant µ, the rate of convergence becomes linear in $O((1-\mu/L)^k)$. These rates were shown by Nesterov [16] to be suboptimal for the class of first-order methods, and instead optimal rates, $O(1/k^2)$ for the convex case and $O((1-\sqrt{\mu/L})^k)$ for the µ-strongly convex one, could be obtained by taking gradient steps at well-chosen points. Later, this acceleration technique was extended to deal with non-differentiable regularization functions ψ [4, 19]. For modern machine learning problems involving a large sum of n functions, a recent effort has been devoted to developing fast incremental algorithms [6, 7, 14, 24, 25, 27] that can exploit the particular structure of (2). Unlike full gradient approaches, which require computing and averaging n gradients $\nabla f(x) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x)$ at every iteration, incremental techniques have a per-iteration cost that is independent of n. The price to pay is the need to store a moderate amount of information regarding past iterates, but the benefit is significant in terms of computational complexity. Main contributions. Our main achievement is a generic acceleration scheme that applies to a large class of optimization methods. By analogy with substances that increase chemical ...

Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems

by Aaron J. Defazio, Tibério S. Caetano, Justin Domke
"... Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box ”batch ” problem. In this work we introduce a new method in this class with a theo-retical convergence rate four times faster than ex-isting methods, ..."
Abstract - Cited by 6 (1 self)
Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amenable to a sampling-without-replacement scheme that in practice gives further speed-ups. We give empirical results showing state-of-the-art performance.

Coordinate descent with arbitrary sampling I: Algorithms and complexity

by Zheng Qu, 2014
"... The design and complexity analysis of randomized coordinate descent methods, and in par-ticular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an in-equality involving the objec ..."
Abstract - Cited by 5 (1 self)
The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for special classes of samplings only, almost invariably for uniform samplings. In this paper we develop a systematic technique for deriving these inequalities for a large class of functions and for arbitrary samplings. We demonstrate that one can recover existing ESO results using our general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function.
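For concreteness, an ESO inequality is commonly written in the following form (this is the standard shape from the randomized coordinate descent literature, stated here as background rather than taken from this paper's abstract); $\hat S$ is the random sampling and $v$ collects the certified smoothness parameters:

    \mathbb{E}\big[f(x + h_{[\hat S]})\big] \;\le\; f(x) + \frac{\mathbb{E}|\hat S|}{n}\Big(\langle \nabla f(x), h\rangle + \tfrac{1}{2}\|h\|_v^2\Big),

where $h_{[\hat S]}$ keeps only the coordinates of $h$ indexed by $\hat S$ and $\|h\|_v^2 = \sum_i v_i h_i^2$.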

Citation Context

... [46, 24, 36, 40, 48], randomized coordinate descent methods [4, 8, 34, 33, 40, 42, 41, 7, 30, 5, 48, 38, 39, 15, 28, 12] and semi-stochastic gradient descent methods [35, 45, 9, 13, 19, 20, 3, 44, 11, 10, 12]. 1.1 Randomized coordinate descent: In this paper we focus on randomized coordinate descent methods. After the seminal work of Nesterov [26], which provided an early theoretical justification of these...

A stochastic coordinate descent primal-dual algorithm and applications to large-scale composite optimization,

by P. Bianchi, W. Hachem, F. Iutzeler, 2014
"... Abstract-Based on the idea of randomized coordinate descent of α-averaged operators, a randomized primal-dual optimization algorithm is introduced, where a random subset of coordinates is updated at each iteration. The algorithm builds upon a variant of a recent (deterministic) algorithm proposed b ..."
Abstract - Cited by 5 (2 self)
Based on the idea of randomized coordinate descent of α-averaged operators, a randomized primal-dual optimization algorithm is introduced, where a random subset of coordinates is updated at each iteration. The algorithm builds upon a variant of a recent (deterministic) algorithm proposed by Vũ and Condat that includes the well-known ADMM as a particular case. The obtained algorithm is used to solve asynchronously a distributed optimization problem. A network of agents, each having a separate cost function containing a differentiable term, seek to find a consensus on the minimum of the aggregate objective. The method yields an algorithm where at each iteration, a random subset of agents wake up, update their local estimates, exchange some data with their neighbors, and go idle. Numerical results demonstrate the attractive performance of the method. The general approach can be naturally adapted to other situations where coordinate descent convex optimization algorithms are used with a random choice of the coordinates.
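The wake-up pattern in the abstract can be pictured with a short sketch. The gossip-style neighbor averaging below is only a stand-in for the method's actual primal-dual exchange, and local_update, neighbors and wake_prob are illustrative assumptions:

    import numpy as np

    def async_round(x, neighbors, local_update, wake_prob, rng):
        """One round (sketch): a random subset of agents wakes up, takes a
        local step, and averages with its neighbors before going idle."""
        awake = [i for i in range(len(x)) if rng.random() < wake_prob]
        for i in awake:
            x[i] = local_update(i, x[i])               # local descent step
            for j in neighbors[i]:                     # data exchange
                mean = (x[i] + x[j]) / 2
                x[i], x[j] = mean, mean
        return x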

Citation Context

...d algorithms are: - SGD: the stochastic (sub-)gradient descent (see [21] and references therein) applied to Problem (20) with 1/k stepsize. - MISO: the MISO algorithm for composite optimization [22], [23] applied to Problem (20) with 1/L stepsize, where L is set to the maximum of the upper bounds of the Lipschitz constants per batch, $L = 0.25 \max_{n=1,\dots,N} \|a_n\|_2^2$. - SMPD: our SMPD algorithm described abo...

Parallel successive convex approximation for nonsmooth nonconvex optimization

by Meisam Razaviyayn, Mingyi Hong, Zhi-Quan Luo, Jong-Shi Pang, 2014
"... Consider the problem of minimizing the sum of a smooth (possibly non-convex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solve this problem is the block coordinate descent (BCD) method whereby at each iteration only one variable block is up ..."
Abstract - Cited by 4 (2 self)
Consider the problem of minimizing the sum of a smooth (possibly non-convex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solve this problem is the block coordinate descent (BCD) method, whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With the recent advances in the development of multi-core parallel processing technology, it is desirable to parallelize the BCD method by allowing multiple blocks to be updated simultaneously at each iteration of the algorithm. In this work, we propose an inexact parallel BCD approach where at each iteration, a subset of the variables is updated in parallel by minimizing convex approximations of the original objective function. We investigate the convergence of this parallel BCD method for both randomized and cyclic variable selection rules. We analyze the asymptotic and non-asymptotic convergence behavior of the algorithm for both convex and non-convex objective functions. The numerical experiments suggest that for a special case of the Lasso minimization problem, the cyclic block selection rule can outperform the randomized rule.

Citation Context

... $\tilde f(x_i, y) = \langle \nabla_{y_i} f(y), x_i - y_i \rangle + \frac{\alpha}{2}\|x_i - y_i\|^2$. • $\tilde f(x_i, y) = f(x_i, y_{-i}) + \frac{\alpha}{2}\|x_i - y_i\|^2$, for α large enough. For other practically useful approximations of f(·) and the stochastic/incremental counterparts, see [21, 25, 26]. With the recent advances in the development of parallel processing machines, it is desirable to take advantage of multi-core machines by updating multiple blocks simultaneously in (3). Unfortuna...
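Under the first (quadratic) approximation above, each selected block solves an independent subproblem, which for a separable nonsmooth term reduces to a proximal gradient step per block; this is what makes the parallel update concrete. A minimal sketch, with per-block callables grad_block and prox_g assumed (all names are illustrative):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def parallel_bcd_step(y, blocks, grad_block, prox_g, alpha, chosen):
        """Update the chosen blocks in parallel (sketch). Minimizing
        <grad_i f(y), x_i - y_i> + (alpha/2)||x_i - y_i||^2 + g_i(x_i)
        is a prox-gradient step; blocks are disjoint, so writes never collide."""
        x = y.copy()

        def update(i):
            lo, hi = blocks[i]
            step = y[lo:hi] - grad_block(i, y) / alpha
            x[lo:hi] = prox_g(i, step, 1.0 / alpha)

        with ThreadPoolExecutor() as pool:
            list(pool.map(update, chosen))
        return x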

Stochastic dual coordinate ascent with adaptive probabilities. ICML 2015.

by Dominik Csiba, Zheng Qu
"... This paper introduces AdaSDCA: an adap-tive variant of stochastic dual coordinate as-cent (SDCA) for solving the regularized empir-ical risk minimization problems. Our modifica-tion consists in allowing the method adaptively change the probability distribution over the dual variables throughout the ..."
Abstract - Cited by 3 (2 self)
This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving regularized empirical risk minimization problems. Our modification consists in allowing the method to adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character, as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing non-adaptive methods.
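The adaptive ingredient can be sketched generically: recompute a per-coordinate score and sample the next dual coordinate from the induced distribution. In the following minimal sketch the score function is left abstract; recomputing scores at every iteration is precisely what makes the exact method expensive, as the abstract notes. All names are illustrative:

    import numpy as np

    def adaptive_dual_ascent(alpha, score, coord_update, iters, rng):
        """SDCA-style loop with adaptive probabilities (sketch).

        score(alpha) returns nonnegative per-coordinate scores; coordinates
        with larger scores are sampled more often."""
        for _ in range(iters):
            s = score(alpha)
            p = s / s.sum()                    # adaptive distribution
            i = rng.choice(len(alpha), p=p)
            alpha[i] = coord_update(i, alpha)  # maximize the dual in coordinate i
        return alpha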

Citation Context

...se include primal methods such as SAG (Schmidt et al., 2013), SVRG (Johnson & Zhang, 2013), S2GD (Konečný & Richtárik, 2014), SAGA (Defazio et al., 2014), mS2GD (Konečný et al., 2014a) and MISO (Mairal, 2014). Importance sampling was considered in ProxSVRG (Xiao & Zhang, 2014) and S2CD (Konečný et al., 2014b). Stochastic Dual Coordinate Ascent: One of the most successful methods in this category is sto...

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

by Zheng Qu, Peter Richtárik, Martin Takáč, Olivier Fercoq, 2015
"... Abstract We propose a new algorithm for minimizing regularized empirical loss: Stochastic Dual Newton Ascent (SDNA). Our method is dual in nature: in each iteration we update a random subset of the dual variables. However, unlike existing methods such as stochastic dual coordinate ascent, SDNA is c ..."
Abstract - Cited by 1 (1 self)
We propose a new algorithm for minimizing regularized empirical loss: Stochastic Dual Newton Ascent (SDNA). Our method is dual in nature: in each iteration we update a random subset of the dual variables. However, unlike existing methods such as stochastic dual coordinate ascent, SDNA is capable of utilizing all local curvature information contained in the examples, which leads to striking improvements in both theory and practice, sometimes by orders of magnitude. In the special case when an L2-regularizer is used in the primal, the dual problem is a concave quadratic maximization problem plus a separable term. In this regime, SDNA in each step solves a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function; whence the name of the method.
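In the quadratic regime the abstract singles out, a step is concrete: sample a subset S of coordinates, extract the corresponding principal submatrix of the Hessian, and solve the small Newton system over those coordinates. A minimal sketch for a concave quadratic D(a) = b'a - a'Ma/2, with the separable term omitted for brevity (names are illustrative):

    import numpy as np

    def sdna_like_step(alpha, M, b, S):
        """One curvature-aware step restricted to the sampled set S (sketch)."""
        g = b - M @ alpha                        # gradient of the quadratic dual
        M_SS = M[np.ix_(S, S)]                   # random principal submatrix of the Hessian
        alpha[S] += np.linalg.solve(M_SS, g[S])  # Newton step on the subset
        return alpha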

Citation Context

...ems with a massive number of examples, which leads to new algorithmic challenges. State-of-the-art optimization methods for ERM include i) stochastic (sub)gradient descent (Shalev-Shwartz et al., 2011; Takac et al., 2013), ii) methods based on stochastic estimates of the gradient with diminishing variance such as SAG (Schmidt et al., 2013), SVRG (Johnson & Zhang, 2013), S2GD (Konecny & Richtarik, 2014), proxSVRG (Xiao & Zhang, 2014), MISO (Mairal, 2015), SAGA (Defazio et al., 2014), minibatch S2GD (Konecny et al., 2014a), S2CD (Konecny et al., 2014b), and iii) variants of stochastic dual coordinate ascent (Shalev-Shwartz & Zhang, 2013b; Zhao & Zhang, 2014; Takac et al., 2013; Shalev-Shwartz & Zhang, 2013a; Lin et al., 2014; Qu et al., 2014). There have been several attempts at designing methods that combine randomization with the use of curvature (second-order) information. For example, methods based on running coordinate ascent in the dual such as those mentioned above and also (Richtarik & Takac, 2014; 2015; Fercoq & Richtarik, ...

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

by Jinghui Chen, Quanquan Gu
"... Abstract We propose an accelerated stochastic block coordinate descent algorithm for nonconvex optimization under sparsity constraint in the high dimensional regime. The core of our algorithm is leveraging both stochastic partial gradient and full partial gradient restricted to each coordinate bloc ..."
Abstract
We propose an accelerated stochastic block coordinate descent algorithm for nonconvex optimization under a sparsity constraint in the high-dimensional regime. The core of our algorithm is leveraging both the stochastic partial gradient and the full partial gradient restricted to each coordinate block to accelerate convergence. We prove that the algorithm converges to the unknown true parameter at a linear rate, up to the statistical error of the underlying model. Experiments on both synthetic and real datasets back up our theory.
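A rough sketch of the mechanism the abstract describes: a variance-reduced partial gradient (a stochastic term corrected by a full gradient stored at a reference point, restricted to one block) drives the update, and the sparsity constraint is enforced by projection. The hard-thresholding projection below is our assumption, since the abstract does not name one, and all identifiers are illustrative:

    import numpy as np

    def hard_threshold(x, s):
        """Keep the s largest-magnitude entries of x (sparsity projection)."""
        out = np.zeros_like(x)
        keep = np.argsort(np.abs(x))[-s:]
        out[keep] = x[keep]
        return out

    def vr_block_step(x, x_ref, g_ref, part_grad, n, block, eta, s, rng):
        """One variance-reduced block step (sketch): the stochastic partial
        gradient is corrected by the full gradient g_ref stored at x_ref."""
        i = rng.integers(n)                 # sample a component
        v = part_grad(i, x, block) - part_grad(i, x_ref, block) + g_ref[block]
        x = x.copy()
        x[block] -= eta * v                 # block gradient step
        return hard_threshold(x, s)         # enforce sparsity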