On the complexity analysis of randomized block-coordinate descent methods
2013
Cited by 33 (3 self)
In this paper we analyze the randomized block-coordinate descent (RBCD) methods proposed in [11, 15] for minimizing the sum of a smooth convex function and a block-separable convex function, and derive improved bounds on their convergence rates. In particular, we extend Nesterov’s technique developed in [11] for analyzing the RBCD method for minimizing a smooth convex function over a block-separable closed convex set to the aforementioned more general problem and obtain a sharper expected-value type of convergence rate than the one implied in [15]. As a result, we also obtain a better high-probability type of iteration complexity. In addition, for unconstrained smooth convex minimization, we develop a new technique called randomized estimate sequence to analyze the accelerated RBCD method proposed by Nesterov [11] and establish a sharper expected-value type of convergence rate than the one given in [11]. Key words: Randomized block-coordinate descent, accelerated coordinate descent, iteration complexity, convergence rate, composite minimization.
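As a rough illustration of the problem class in this abstract (a smooth convex term plus a block-separable convex term), the sketch below runs a randomized coordinate descent on a small lasso objective. It is not the method analyzed in the paper: the blocks are single coordinates, the sampling is uniform, and the names `rbcd_lasso` and `soft_threshold` are hypothetical.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t*|.| applied to the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def rbcd_lasso(A, b, lam, iters=2000, seed=0):
    """Randomized coordinate descent for 0.5*||Ax - b||^2 + lam*||x||_1.

    Assumption: every block is one coordinate, drawn uniformly at random.
    A is a list of rows, b a list of floats.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    x = [0.0] * n
    r = [-bi for bi in b]  # residual Ax - b at the starting point x = 0
    # Coordinate-wise Lipschitz constants of the smooth part: ||A[:, j]||^2.
    L = [sum(A[i][j] ** 2 for i in range(m)) for j in range(n)]
    for _ in range(iters):
        j = rng.randrange(n)
        if L[j] == 0.0:
            continue  # a zero column never changes the smooth part
        # Partial derivative of the smooth part with respect to x[j].
        g = sum(A[i][j] * r[i] for i in range(m))
        # Gradient step on the chosen coordinate, then the prox of the L1 term.
        new_xj = soft_threshold(x[j] - g / L[j], lam / L[j])
        delta = new_xj - x[j]
        if delta != 0.0:
            for i in range(m):
                r[i] += A[i][j] * delta  # keep the residual in sync
            x[j] = new_xj
    return x
```

With `A` the 2x2 identity, `b = [3.0, 0.5]`, and `lam = 1.0`, the iterate settles at the soft-thresholded values `[2.0, 0.0]`.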
Lifted coordinate descent for learning with trace-norm regularization
AISTATS, 2012
Cited by 31 (5 self)
We consider the minimization of a smooth loss with trace-norm regularization, which is a natural objective in multiclass and multitask learning. Even though the problem is convex, existing approaches rely on optimizing a nonconvex variational bound, which is not guaranteed to converge, or repeatedly perform singular-value decomposition, which prevents scaling beyond moderate matrix sizes. We lift the nonsmooth convex problem into an infinitely dimensional smooth problem and apply coordinate descent to solve it. We prove that our approach converges to the optimum, and is competitive with or outperforms the state of the art.
Sample Size Selection in Optimization Methods for Machine Learning
2012
Cited by 20 (3 self)
This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. We establish an O(1/ε) complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian-vector products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. The focus shifts in the third part of the paper to L1-regularized problems designed to produce sparse solutions. We propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and a subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
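The abstract does not state the sample-size criterion itself. The sketch below shows one plausible form of a variance-based test, written as an assumption rather than as the paper's rule: the sample is judged too small when the estimated variance of the batch gradient is large relative to the squared norm of that gradient. The function name `needs_larger_sample` and the threshold `theta` are hypothetical.

```python
def needs_larger_sample(per_example_grads, theta=0.5):
    """Return True when the batch gradient looks too noisy for its sample size.

    Assumed test (not taken from the abstract):
        sum_k Var(g_k) / n  >  theta^2 * ||mean gradient||^2
    per_example_grads is a list of n >= 2 gradient vectors (lists of floats).
    """
    n = len(per_example_grads)
    if n < 2:
        raise ValueError("need at least two per-example gradients")
    d = len(per_example_grads[0])
    # Batch gradient: the component-wise mean of the per-example gradients.
    mean = [sum(g[k] for g in per_example_grads) / n for k in range(d)]
    # Unbiased sample variance of each gradient component.
    var = [
        sum((g[k] - mean[k]) ** 2 for g in per_example_grads) / (n - 1)
        for k in range(d)
    ]
    variance_of_mean = sum(var) / n
    squared_norm = sum(m * m for m in mean)
    return variance_of_mean > theta ** 2 * squared_norm
```

When the per-example gradients agree, the test leaves the sample size alone; when they point in opposite directions and nearly cancel, it asks for a larger sample.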
Inexact Coordinate Descent: Complexity and Preconditioning
2013
Cited by 18 (3 self)
In this paper we consider the problem of minimizing a convex function using a randomized block coordinate descent method. One of the key steps at each iteration of the algorithm is determining the update to a block of variables. Existing algorithms assume that in order to compute the update, a particular subproblem is solved exactly. In this work we relax this requirement, and allow for the subproblem to be solved inexactly, leading to an inexact block coordinate descent method. Our approach incorporates the best known results for exact updates as a special case. Moreover, these theoretical guarantees are complemented by practical considerations: the use of iterative techniques to determine the update as well as the use of preconditioning for further acceleration.
Maximum block improvement and polynomial optimization
SIAM Journal on Optimization
Cited by 16 (5 self)
In this paper we propose an efficient method for solving the spherically constrained homogeneous polynomial optimization problem. The new approach has the following three main ingredients. First, we establish a block coordinate descent type search method for nonlinear optimization, with the novelty being that we only accept a block update that achieves the maximum improvement, hence the name of our new search method: Maximum Block Improvement (MBI). Convergence of the sequence produced by the MBI method to a stationary point is proven. Second, we establish that maximizing a homogeneous polynomial over a sphere is equivalent to its tensor relaxation problem, thus we can maximize a homogeneous polynomial function over a sphere by its tensor relaxation via the MBI approach. Third, we propose a scheme to reach a KKT point of the polynomial optimization, provided that a stationary solution for the relaxed tensor problem is available. Numerical experiments have shown that our new method works very efficiently: for a majority of the test instances that we have experimented with, the method finds the global optimal solution at a low computational cost.
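The MBI acceptance rule described in this abstract can be sketched on a deliberately small case: maximizing the bilinear form x^T M y over unit vectors x and y, which has only two blocks. This is an assumed toy instance, not the paper's tensor setting, and the names `mbi_bilinear` and `normalize` are hypothetical. At each step the best update of every block is computed, and only the block giving the largest objective value is changed.

```python
import math


def normalize(v):
    """Scale a nonzero vector to unit Euclidean length."""
    nrm = math.sqrt(sum(c * c for c in v))
    return [c / nrm for c in v]


def mbi_bilinear(M, x, y, iters=200, tol=1e-12):
    """Maximize x^T M y over unit vectors x, y with an MBI-style rule.

    M is a list of rows; x and y are nonzero starting vectors.
    """
    rows, cols = len(M), len(M[0])
    x, y = normalize(x), normalize(y)
    best = sum(x[i] * M[i][j] * y[j] for i in range(rows) for j in range(cols))
    for _ in range(iters):
        # Best response of each block with the other block held fixed.
        My = [sum(M[i][j] * y[j] for j in range(cols)) for i in range(rows)]
        Mtx = [sum(M[i][j] * x[i] for i in range(rows)) for j in range(cols)]
        val_x = math.sqrt(sum(c * c for c in My))   # objective after updating x
        val_y = math.sqrt(sum(c * c for c in Mtx))  # objective after updating y
        if max(val_x, val_y) - best <= tol:
            break  # no block offers a meaningful improvement
        # Accept only the block update with the maximum improvement.
        if val_x >= val_y:
            x, best = normalize(My), val_x
        else:
            y, best = normalize(Mtx), val_y
    return x, y, best
```

For a diagonal `M` with entries 3 and 1 and a starting point with a nonzero first component in both blocks, the returned objective approaches 3, the largest singular value of `M`.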
Learning Higher-Order Graph Structure with Features by Structure Penalty
Cited by 9 (3 self)
In discrete undirected graphical models, the conditional independence of node labels Y is specified by the graph structure. We study the case where there is another input random vector X (e.g. observed features) such that the distribution P(Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y's. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to multivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher-order interactions to obtain a sparse graph structure.