Results 1 - 10 of 70
Online learning for matrix factorization and sparse coding
Cited by 110 (20 self)
Sparse coding, that is, modelling data vectors as sparse linear combinations of basis elements, is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set, adapting it to specific data. Variations of this problem include dictionary learning in signal processing, nonnegative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large datasets.
Sparse Online Learning via Truncated Gradient
Cited by 62 (1 self)
We propose a general method called truncated gradient to induce sparsity in the weights of online learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous: a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find that for datasets with large numbers of features, substantial sparsity is discoverable.
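As a rough illustration of the truncation idea summarized in this abstract, the sketch below applies a plain gradient step and then shrinks small weights toward zero. It is a minimal sketch, not the authors' implementation: the function names and the parameters `gravity` (the sparsification rate) and `theta` (the magnitude above which weights are left untouched) are assumptions chosen for this example, and truncation is applied at every step.

```python
def truncate(value, alpha, theta):
    """Shrink value toward zero by alpha, but only when |value| <= theta."""
    if 0.0 <= value <= theta:
        return max(0.0, value - alpha)
    if -theta <= value < 0.0:
        return min(0.0, value + alpha)
    return value  # large weights are left unchanged


def truncated_gradient_step(weights, grad, eta, gravity, theta):
    """One online step: gradient descent followed by truncation.

    eta is the learning rate; gravity = 0 recovers plain gradient descent,
    and a larger gravity drives more weights to exactly zero.
    """
    return [
        truncate(w - eta * g, eta * gravity, theta)
        for w, g in zip(weights, grad)
    ]
```

With `gravity` set to zero the step reduces to ordinary online gradient descent, which matches the abstract's description of a continuous knob running from no sparsification to total sparsification.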
Dual averaging methods for regularized stochastic learning and online optimization
In Advances in Neural Information Processing Systems 23, 2009
Cited by 59 (3 self)
We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.
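To make the "simple minimization problem" concrete for the ℓ1 case, the sketch below solves it in closed form, coordinate by coordinate. It assumes a squared-Euclidean auxiliary term weighted by gamma / sqrt(t), which is one common choice and may differ from the variants analyzed in the paper. The names `l1_rda_update`, `update_average`, `lam` and `gamma` are illustrative.

```python
import math


def update_average(avg_grad, grad, t):
    """Running average of subgradients after seeing the t-th one (t >= 1)."""
    return [a + (g - a) / t for a, g in zip(avg_grad, grad)]


def l1_rda_update(avg_grad, t, lam, gamma):
    """Minimize <avg_grad, w> + lam * ||w||_1 + (gamma / sqrt(t)) * ||w||^2 / 2.

    The minimizer is a soft-threshold of the averaged subgradient:
    coordinates with |avg_grad_i| <= lam are set exactly to zero.
    """
    scale = math.sqrt(t) / gamma
    weights = []
    for g in avg_grad:
        if abs(g) <= lam:
            weights.append(0.0)
        else:
            weights.append(-scale * (g - lam * math.copysign(1.0, g)))
    return weights
```

Because the threshold `lam` is compared against the averaged subgradient rather than a single noisy one, coordinates tend to stay at exactly zero, which is consistent with the abstract's remark about sparse solutions.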
Efficient Online and Batch Learning using Forward Backward Splitting
Cited by 58 (1 self)
We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two-phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural datasets.
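For ℓ1 regularization, the two phases described in this abstract can be sketched as a gradient step followed by soft-thresholding. This is a minimal sketch under the assumption that both phases share one step size `eta`. The function names and the parameter `lam` (regularization strength) are chosen for this example.

```python
def soft_threshold(value, threshold):
    """Move value toward zero by threshold, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def forward_backward_l1_step(weights, grad, eta, lam):
    """Phase 1: unconstrained gradient step.
    Phase 2: minimize 0.5 * ||w - w_half||^2 + eta * lam * ||w||_1,
    whose solution is coordinate-wise soft-thresholding of w_half.
    """
    half_step = [w - eta * g for w, g in zip(weights, grad)]
    return [soft_threshold(w, eta * lam) for w in half_step]
```

The second phase sets any coordinate with magnitude at most `eta * lam` to exactly zero, which is how the two-phase view yields sparse iterates.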
Projected Subgradient Methods for Learning Sparse Gaussians
Cited by 31 (0 self)
Gaussian Markov random fields (GMRFs) are useful in a broad range of applications. In this paper we tackle the problem of learning a sparse GMRF in a high-dimensional space. Our approach uses the ℓ1-norm as a regularization on the inverse covariance matrix. We utilize a novel projected gradient method, which is faster than previous methods in practice and equal to the best performing of these in asymptotic complexity. We also extend the ℓ1-regularized objective to the problem of sparsifying entire blocks within the inverse covariance matrix. Our methods generalize fairly easily to this case, while other methods do not. We demonstrate that our extensions give better generalization performance on two real domains: biological network analysis and a 2D-shape modeling image task.
Composite Objective Mirror Descent
Cited by 28 (5 self)
We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known first-order algorithms, such as the projected gradient method, mirror descent, and forward-backward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as ℓ1, mixed norm, and trace-norm.
Large Graph Construction for Scalable Semi-Supervised Learning
Cited by 27 (9 self)
In this paper, we address the scalability issue plaguing graph-based semi-supervised learning via a small number of anchor points which adequately cover the entire point cloud. Critically, these anchor points enable nonparametric regression that predicts the label for each data point as a locally weighted average of the labels on anchor points. Because conventional graph construction is inefficient in large scale, we propose to construct a tractable large graph by coupling anchor-based label prediction and adjacency matrix design. Contrary to the Nyström approximation of adjacency matrices which results in indefinite graph Laplacians and in turn leads to potential nonconvex optimization over graphs, the proposed graph construction approach based on a unique idea called AnchorGraph provides nonnegative adjacency matrices to guarantee positive semidefinite graph Laplacians. Our approach scales linearly with the data size and in practice usually produces a large sparse graph. Experiments on large datasets demonstrate the significant accuracy improvement and scalability of the proposed approach.
Large-scale sparse logistic regression
In ACM SIGKDD International Conference On Knowledge Discovery and Data Mining, 2009
Cited by 23 (7 self)
Logistic regression is a well-known classification method that has been used widely in many applications of data mining, machine learning, computer vision, and bioinformatics. Sparse logistic regression embeds feature selection in the classification framework using ℓ1-norm regularization, and is attractive in many applications involving high-dimensional data. In this paper, we propose Lassplore for solving large-scale sparse logistic regression. Specifically, we formulate the problem as ℓ1-ball constrained smooth convex optimization, and propose to solve the problem using Nesterov’s method, an optimal first-order black-box method for smooth convex optimization. One of the critical issues in the use of Nesterov’s method is the estimation of the step size at each of the optimization iterations. Previous approaches either apply a constant step size, which assumes that the Lipschitz constant of the gradient is known in advance, or require a sequence of decreasing step sizes, which leads to slow convergence in practice. In this paper, we propose an adaptive line search scheme which tunes the step size adaptively while guaranteeing the optimal convergence rate. Empirical comparisons with several state-of-the-art algorithms demonstrate the efficiency of the proposed Lassplore algorithm for large-scale problems.
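A first-order method applied to an ℓ1-ball constrained problem typically needs a Euclidean projection onto that ball after each gradient step. The sketch below shows a standard sort-based projection as one way to do this. It is not taken from the paper and may differ from the projection Lassplore actually uses. The function name and the `radius` parameter are assumptions for this example.

```python
import math


def project_onto_l1_ball(vector, radius):
    """Euclidean projection of vector onto {w : ||w||_1 <= radius}."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in vector]
    if sum(magnitudes) <= radius:
        return list(vector)  # already feasible

    # Find the shrinkage amount theta from the sorted magnitudes.
    sorted_mags = sorted(magnitudes, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, mag in enumerate(sorted_mags, start=1):
        cumulative += mag
        candidate = (cumulative - radius) / j
        if mag - candidate > 0:
            theta = candidate  # last j satisfying the test gives theta

    # Soft-threshold every coordinate by theta, keeping its sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in vector]
```

The sort makes this O(n log n) per call. The projected point has ℓ1 norm equal to `radius` whenever the input lies outside the ball.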
On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms
In: Proceedings of the 21st Annual Conference on Computational Learning Theory
Cited by 20 (6 self)
Boosting algorithms build highly accurate prediction mechanisms from a collection of low-accuracy predictors. To do so, they employ the notion of weak learnability. The starting point of this paper is a proof which shows that weak learnability is equivalent to linear separability with ℓ1 margin. While this equivalence is a direct consequence of von Neumann’s minimax theorem, we derive the equivalence directly using Fenchel duality. We then use our derivation to describe a family of relaxations to the weak-learnability assumption that readily translates to a family of relaxations of linear separability with margin. This alternative perspective sheds new light on known soft-margin boosting algorithms and also enables us to derive several new relaxations of the notion of linear separability. Last, we describe and analyze an efficient boosting framework that can be used for minimizing the loss functions derived from our family of relaxations. In particular, we obtain efficient boosting algorithms for maximizing hard and soft versions of the ℓ1 margin.
High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity
2011
Cited by 19 (4 self)
Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently nonconvex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing nonconvex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide nonasymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings.