Accelerated, parallel and proximal coordinate descent
2014
Cited by 31 (6 self)
We propose a new stochastic coordinate descent method for minimizing the sum of convex functions, each of which depends on a small number of coordinates only. Our method (APPROX) is simultaneously Accelerated, Parallel and PROXimal; this is the first time such a method is proposed. In the special case when the number of processors is equal to the number of coordinates, the method converges at the rate $2\bar{\omega}\bar{L}R^2/(k+1)^2$, where $k$ is the iteration counter, $\bar{\omega}$ is an average degree of separability of the loss function, $\bar{L}$ is the average of Lipschitz constants associated with the coordinates and individual functions in the sum, and $R$ is the distance of the initial point from the minimizer. We show that the method can be implemented without the need to perform full-dimensional vector operations, which is the major bottleneck of accelerated coordinate descent. The fact that the method depends on the average degree of separability, and not on the maximum degree of separability, can be attributed to the use of new safe large stepsizes, leading to improved expected separable overapproximation (ESO). These are of independent interest and can be utilized in all existing parallel stochastic coordinate descent algorithms based on the concept of ESO.
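To give a feel for the rate quoted in this abstract, the following minimal sketch evaluates the bound $2\bar{\omega}\bar{L}R^2/(k+1)^2$ and inverts it to estimate how many iterations bring the bound below a target accuracy. The function names and the sample values are illustrative assumptions, not taken from the paper, and the bound applies only to the special case stated in the abstract.

```python
import math


def approx_rate_bound(omega_bar, l_bar, radius, k):
    """Evaluate 2 * omega_bar * l_bar * radius**2 / (k + 1)**2.

    omega_bar: average degree of separability (assumed known)
    l_bar:     average Lipschitz constant (assumed known)
    radius:    distance of the initial point from the minimizer
    k:         iteration counter
    """
    return 2.0 * omega_bar * l_bar * radius ** 2 / (k + 1) ** 2


def iterations_for_accuracy(omega_bar, l_bar, radius, eps):
    """Smallest k for which the bound above is at most eps.

    Solves (k + 1)**2 >= 2 * omega_bar * l_bar * radius**2 / eps for k.
    """
    needed = math.sqrt(2.0 * omega_bar * l_bar * radius ** 2 / eps) - 1.0
    return max(0, math.ceil(needed))


if __name__ == "__main__":
    # Hypothetical problem constants, chosen only for illustration.
    k = iterations_for_accuracy(omega_bar=3.0, l_bar=2.0, radius=1.5, eps=1e-3)
    print(k, approx_rate_bound(3.0, 2.0, 1.5, k))
```

Because the bound decays like $1/(k+1)^2$, tightening the target accuracy by a factor of 100 multiplies the estimated iteration count by roughly 10.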
Randomized dual coordinate ascent with arbitrary sampling
2014
Cited by 7 (4 self)
We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard minibatching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient minibatch importance sampling. The distributed variant of Quartz is the first distributed SDCA-like method with an analysis for nonseparable data.
Distributed Block Coordinate Descent for Minimizing Partially Separable Functions
Cited by 6 (0 self)
In this work we propose a distributed randomized block coordinate descent method for minimizing a convex function with a huge number of variables/coordinates. We analyze its complexity under the assumption that the smooth part of the objective function is partially block separable, and show that the degree of separability directly influences the complexity. This extends the results in [22] to a distributed environment. We first show that partially block separable functions admit an expected separable overapproximation (ESO) with respect to a distributed sampling, compute the ESO parameters, and then specialize complexity results from recent literature that hold under the generic ESO assumption. We describe several approaches to distribution and synchronization of the computation across a cluster of multicore computers and provide promising computational results.
Coordinate descent with arbitrary sampling I: Algorithms and complexity
2014
Cited by 5 (1 self)
The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for special classes of samplings only, almost invariably for uniform samplings. In this paper we develop a systematic technique for deriving these inequalities for a large class of functions and for arbitrary samplings. We demonstrate that one can recover existing ESO results using our general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function.
Parallel successive convex approximation for nonsmooth nonconvex optimization
2014
Cited by 4 (2 self)
Consider the problem of minimizing the sum of a smooth (possibly nonconvex) function and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to this problem is the block coordinate descent (BCD) method, in which only one variable block is updated at each iteration while the remaining variables are held fixed. With recent advances in multicore parallel processing technology, it is desirable to parallelize the BCD method by allowing multiple blocks to be updated simultaneously at each iteration. In this work, we propose an inexact parallel BCD approach in which, at each iteration, a subset of the variables is updated in parallel by minimizing convex approximations of the original objective function. We investigate the convergence of this parallel BCD method for both randomized and cyclic variable selection rules, and analyze the asymptotic and nonasymptotic convergence behavior of the algorithm for both convex and nonconvex objective functions. Numerical experiments suggest that, for a special case of the Lasso minimization problem, the cyclic block selection rule can outperform the randomized rule.
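The difference between the cyclic and randomized selection rules mentioned in this abstract can be illustrated with a much simpler method than the one the paper proposes. The sketch below is plain serial coordinate descent with exact soft-thresholding updates for the Lasso objective 0.5*||Ax - b||^2 + lam*||x||_1; it is not the inexact parallel scheme of the paper. The function names, the column-list data layout, and the assumption that every column of A is nonzero are all choices made for this illustration.

```python
import random


def soft_threshold(z, t):
    """Proximal operator of t * |.| applied to the scalar z."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_objective(columns, b, lam, x):
    """0.5 * ||A x - b||^2 + lam * ||x||_1, with A given as a list of columns."""
    resid = list(b)
    for col, xj in zip(columns, x):
        resid = [r - c * xj for c, r in zip(col, resid)]
    return 0.5 * sum(r * r for r in resid) + lam * sum(abs(v) for v in x)


def lasso_bcd(columns, b, lam, sweeps, rule="cyclic", seed=0):
    """Serial coordinate descent; rule is "cyclic" or "randomized".

    Assumes every column of A is nonzero.
    """
    rng = random.Random(seed)
    d = len(columns)
    x = [0.0] * d
    residual = list(b)  # equals b - A x for x = 0
    col_sq = [sum(v * v for v in col) for col in columns]
    for _ in range(sweeps):
        for step in range(d):
            # Cyclic: visit 0, 1, ..., d-1 in order. Randomized: uniform draw.
            j = step if rule == "cyclic" else rng.randrange(d)
            col = columns[j]
            # Correlation of column j with the residual that excludes x[j].
            rho = sum(c * r for c, r in zip(col, residual)) + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_xj - x[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                x[j] = new_xj
    return x
```

Running both rules for the same number of coordinate updates and comparing `lasso_objective` values gives a small-scale version of the comparison the abstract describes; which rule does better on a given instance is an empirical question.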
S2CD: Semi-Stochastic Coordinate Descent
2014
Cited by 4 (3 self)
We propose a novel variance-reduced method, semi-stochastic coordinate descent (S2CD), for the problem of minimizing a strongly convex function represented as the average of a large number of smooth convex functions: $f(x) = \frac{1}{n}\sum_i f_i(x)$. Our method first performs a deterministic step (computation of the gradient of $f$ at the starting point), followed by a large number of stochastic steps. The process is repeated a few times, with the last stochastic iterate becoming the new starting point where the deterministic step is taken. The novelty of our method is in how the stochastic steps are performed. In each such step, we pick a random function $f_i$ and a random coordinate $j$, both using nonuniform distributions, and update a single coordinate of the decision vector only, based on the computation of the $j$th partial derivative of $f_i$ at two different points. Each random step of the method constitutes an unbiased estimate of the gradient of $f$ and moreover, the squared norm of the steps goes to zero in expectation, meaning that the method enjoys a reduced variance property. The complexity of the method is the sum of two terms: $O(n \log(1/\epsilon))$ evaluations of gradients $\nabla f_i$ and $O(\hat{\kappa} \log(1/\epsilon))$ evaluations of partial derivatives $\nabla_j f_i$, where $\hat{\kappa}$ is a novel condition number.
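The outer/inner structure described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it samples the function index i and the coordinate j uniformly, whereas the paper uses nonuniform distributions, and it is specialized to a ridge-regularized least-squares problem with f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*lam*||x||^2. The function names, the step size `h`, and the data layout are assumptions made for the example.

```python
import random


def ridge_gradient(rows, b, lam, x):
    """Full gradient of f(x) = (1/n) * sum_i f_i(x) for the ridge problem."""
    n, d = len(rows), len(x)
    resid = [sum(a * v for a, v in zip(rows[i], x)) - b[i] for i in range(n)]
    return [sum(rows[i][j] * resid[i] for i in range(n)) / n + lam * x[j]
            for j in range(d)]


def s2cd_ridge(rows, b, lam, h, epochs, inner_steps, seed=0):
    """Semi-stochastic coordinate descent sketch with uniform sampling."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    y = [0.0] * d
    for _ in range(epochs):
        # Deterministic step: residuals and full gradient at the start point y.
        resid_y = [sum(a * v for a, v in zip(rows[i], y)) - b[i]
                   for i in range(n)]
        full_grad = [sum(rows[i][j] * resid_y[i] for i in range(n)) / n
                     + lam * y[j] for j in range(d)]
        x = list(y)
        for _ in range(inner_steps):
            i = rng.randrange(n)  # random function f_i (uniform here)
            j = rng.randrange(d)  # random coordinate j (uniform here)
            # j-th partial derivative of f_i at the two points x and y.
            resid_x = sum(a * v for a, v in zip(rows[i], x)) - b[i]
            gij_x = rows[i][j] * resid_x + lam * x[j]
            gij_y = rows[i][j] * resid_y[i] + lam * y[j]
            # The factor d = 1 / P(j) keeps the one-coordinate step an
            # unbiased estimate of the full gradient step.
            x[j] -= h * d * (full_grad[j] + gij_x - gij_y)
        y = x  # the last stochastic iterate becomes the new starting point
    return y
```

For clarity the sketch recomputes the inner product a_i^T x from scratch in every inner step, which costs O(d); it therefore illustrates the update rule rather than the per-step cost that the complexity result refers to.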
Stochastic dual coordinate ascent with adaptive probabilities. ICML 2015.
Cited by 3 (2 self)
This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving regularized empirical risk minimization problems. Our modification consists in allowing the method to adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character, as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing nonadaptive methods.
Hybrid Random/Deterministic Parallel Algorithms for Convex and Nonconvex Big Data Optimization
Cited by 3 (2 self)
We propose a decomposition framework for the parallel optimization of the sum of a differentiable (possibly nonconvex) function and a nonsmooth (possibly nonseparable), convex one. The latter term is usually employed to enforce structure in the solution, typically sparsity. The main contribution of this work is a novel parallel, hybrid random/deterministic decomposition scheme wherein, at each iteration, a subset of (block) variables is updated at the same time by minimizing a convex surrogate of the original nonconvex function. To tackle huge-scale problems, the (block) variables to be updated are chosen according to a mixed random and deterministic procedure, which captures the advantages of both pure deterministic and random update-based schemes. Almost sure convergence of the proposed scheme is established. Numerical results show that on huge-scale problems the proposed hybrid random/deterministic algorithm compares favorably to random and deterministic schemes on both convex and nonconvex problems.
Primal Method for ERM with Flexible Minibatching Schemes and Nonconvex Losses
2015
Cited by 1 (0 self)
In this work we develop a new algorithm for regularized empirical risk minimization. Our method extends recent techniques of Shalev-Shwartz [02/2015], which enable a dual-free analysis of SDCA, to arbitrary minibatching schemes. Moreover, our method is able to better utilize the information in the data defining the ERM problem. For convex loss functions, our complexity results match those of QUARTZ, which is a primal-dual method also allowing for arbitrary minibatching schemes. The advantage of a dual-free analysis comes from the fact that it guarantees convergence even for nonconvex loss functions, as long as the average loss is convex. We illustrate through experiments the utility of being able to design arbitrary minibatching schemes.