Results 1-10 of 120
An interior-point method for large-scale ℓ1-regularized logistic regression
 Journal of Machine Learning Research
, 2007
Cited by 153 (6 self)
Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method that uses a preconditioned conjugate gradient method to compute the search step can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
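The problem and the warm-start idea in this abstract can be sketched in a few lines. The sketch below is not the paper's interior-point method: it swaps in a plain proximal gradient (ISTA) solver because that fits in a short example. The function names, the omission of an intercept term, the fixed iteration count, and the 1/m scaling of the loss are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_logreg(X, y, lam, w0=None, n_iter=500):
    """Approximately minimize (1/m) * sum_i log(1 + exp(-y_i * x_i.w)) + lam * ||w||_1.

    Labels y are assumed to be in {-1, +1}; no intercept is fitted (a simplification).
    Solver: proximal gradient (ISTA), NOT the interior-point method of the paper.
    """
    m, n = X.shape
    w = np.zeros(n) if w0 is None else w0.copy()
    # The gradient of the average logistic loss is Lipschitz with constant ||X||_2^2 / (4m).
    step = 4.0 * m / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        margins = y * (X @ w)
        # 1 / (1 + exp(z)) written as 0.5 * (1 - tanh(z / 2)) to avoid overflow.
        weights = 0.5 * (1.0 - np.tanh(margins / 2.0))
        grad = -(X.T @ (y * weights)) / m
        w = soft_threshold(w - step * grad, step * lam)
    return w

def regularization_path(X, y, lams):
    """Solve for each lam from largest to smallest, warm-starting from the previous solution."""
    w = None
    path = []
    for lam in sorted(lams, reverse=True):
        w = l1_logreg(X, y, lam, w0=w)
        path.append((lam, w.copy()))
    return path
```

Under these assumptions the solution is exactly zero once lam reaches max|X^T y| / (2m), so a path is usually started just below that value and each solve is initialized at the previous solution.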
Online learning for matrix factorization and sparse coding
Cited by 97 (18 self)
Sparse coding, that is, modelling data vectors as sparse linear combinations of basis elements, is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set, adapting it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large datasets.
Piecewise linear regularized solution paths
 Ann. Statist
, 2007
Cited by 83 (8 self)
We consider the generic regularized optimization problem β̂(λ) = arg min_β L(y, Xβ) + λJ(β). Recently, Efron et al. (2004) have shown that for the Lasso (that is, if L is squared error loss and J(β) = ‖β‖1 is the ℓ1 norm of β) the optimal coefficient path is piecewise linear, i.e., ∂β̂(λ)/∂λ is piecewise constant. We derive a general characterization of the properties of (loss L, penalty J) pairs which give piecewise linear coefficient paths. Such pairs allow for efficient generation of the full regularized coefficient paths. We investigate the nature of efficient path following algorithms which arise. We use our results to suggest robust versions of the Lasso for regression and classification, and to develop new, efficient algorithms for existing problems in the literature, including Mammen & van de Geer’s Locally Adaptive Regression Splines.
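A small worked case may make the piecewise-linear claim concrete. It is an illustration under added assumptions, not taken from the paper: squared error loss scaled by 1/2, an orthonormal design (X^T X = I), and the shorthand b = X^T y. The Lasso problem then separates across coordinates and is solved by soft thresholding, so each coefficient is linear in λ until it reaches zero and stays there:

```latex
\hat\beta(\lambda) = \arg\min_{\beta}\; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1,
\qquad
\hat\beta_j(\lambda) = \operatorname{sign}(b_j)\,\max\bigl(\lvert b_j\rvert - \lambda,\; 0\bigr),
\qquad
\frac{\partial \hat\beta_j(\lambda)}{\partial \lambda} =
\begin{cases}
-\operatorname{sign}(b_j), & 0 \le \lambda < \lvert b_j\rvert,\\[2pt]
0, & \lambda > \lvert b_j\rvert.
\end{cases}
```

The derivative is piecewise constant with a single breakpoint per coordinate at λ = |b_j|, which is the simplest instance of the path structure described in the abstract.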
A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers
Structured sparsity-inducing norms through submodular functions
 In Advances in Neural Information Processing Systems
, 2010
Cited by 30 (9 self)
Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1-norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
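The construction in this abstract can be illustrated with the standard sorting formula for the Lovász extension. The sketch below is a minimal illustration, not the paper's code: the function names are hypothetical, the set-function F is passed as a Python callable on frozensets of indices, and F is assumed to be nondecreasing and submodular with F(∅) = 0.

```python
import numpy as np

def lovasz_extension(F, w):
    """Lovász extension of the set-function F, evaluated at the vector w.

    Components are visited in decreasing order; each one is weighted by the
    marginal gain in F from adding its index to the set built so far.
    """
    w = np.asarray(w, dtype=float)
    order = np.argsort(-w)
    chosen = set()
    prev = F(frozenset())
    value = 0.0
    for j in order:
        chosen.add(int(j))
        cur = F(frozenset(chosen))
        value += w[j] * (cur - prev)
        prev = cur
    return value

def submodular_norm(F, w):
    """Candidate norm obtained by applying the Lovász extension to |w| (assumes F as above)."""
    return lovasz_extension(F, np.abs(np.asarray(w, dtype=float)))

# Two example set-functions (illustrative choices):
cardinality = len                              # F(A) = |A|
nonempty = lambda A: 1.0 if len(A) > 0 else 0.0  # F(A) = min(|A|, 1)
```

With the cardinality function every marginal gain equals 1, so `submodular_norm(cardinality, w)` reduces to the ℓ1-norm; with `nonempty` only the largest component has a nonzero gain, which gives the ℓ∞-norm.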
Recovering time-varying networks of dependencies in social and biological studies
 Proc. Nat. Acad. Sci
, 2009
Cited by 29 (8 self)
A plausible representation of the relational information among entities in dynamic systems such as a living cell or a social community is a stochastic network that is topologically rewiring and semantically evolving over time. While there is a rich literature in modeling static or temporally invariant networks, little has been done toward recovering the network structure when the networks are not observable in a dynamic context. In this paper, we present a new machine learning method called TESLA, which builds on a temporally smoothed ℓ1-regularized logistic regression formalism that can be cast as a standard convex-optimization problem and solved efficiently using generic solvers scalable to large networks. We report promising results on recovering simulated time-varying networks, and on reverse engineering the latent sequence of temporally rewiring political and academic social networks from longitudinal data, and the evolving gene networks over more than 4000 genes during the life cycle of Drosophila melanogaster from a microarray time course at a resolution limited only by sample frequency.
Gradient Directed Regularization for Linear Regression and Classification
, 2004
Cited by 26 (4 self)
Regularization in linear modeling is viewed as a two-stage process. First a set of candidate models is defined by a path through the space of joint parameter values, and then a point on this path is chosen to be the final model. Various pathfinding strategies for the first stage of this process are examined, based on the notion of generalized gradient descent. Several of these strategies are seen to produce paths that closely correspond to those induced by commonly used penalization methods. Others give rise to new regularization techniques that are shown to be advantageous in some situations. In all cases, the gradient descent pathfinding paradigm can be readily generalized to include the use of a wide variety of loss criteria, leading to robust methods for regression and classification, as well as to apply user-defined constraints on the parameter values, all with highly efficient computational implementations.
Smoothing Proximal Gradient Method for General Structured Sparse Learning
Cited by 23 (5 self)
We study the problem of learning high dimensional regression models regularized by a structured-sparsity-inducing penalty that encodes prior structural information on either input or output sides. We consider two widely adopted types of such penalties as our motivating examples: 1) overlapping group lasso penalty, based on the ℓ1/ℓ2 mixed-norm penalty, and 2) graph-guided fusion penalty. For both types of penalties, due to their non-separability, developing an efficient optimization method has remained a challenging problem. In this paper, we propose a general optimization approach, called smoothing proximal gradient method, which can solve the structured sparse regression problems with a smooth convex loss and a wide spectrum of structured-sparsity-inducing penalties. Our approach is based on a general smoothing technique of Nesterov [17]. It achieves a convergence rate faster than the standard first-order method, subgradient method, and is much more scalable than the most widely used interior-point method. Numerical results are reported to demonstrate the efficiency and scalability of the proposed method.
ℓ1 Trend Filtering
, 2007
Cited by 18 (6 self)
The problem of estimating underlying trends in time series data arises in a variety of disciplines. In this paper we propose a variation on Hodrick-Prescott (HP) filtering, a widely used method for trend estimation. The proposed ℓ1 trend filtering method substitutes a sum of absolute values (i.e., an ℓ1-norm) for the sum of squares used in HP filtering to penalize variations in the estimated trend. The ℓ1 trend filtering method produces trend estimates that are piecewise linear, and therefore is well suited to analyzing time series with an underlying piecewise linear trend. The kinks, knots, or changes in slope, of the estimated trend can be interpreted as abrupt changes or events in the underlying dynamics of the time series. Using specialized interior-point methods, ℓ1 trend filtering can be carried out with not much more effort than HP filtering; in particular, the number of arithmetic operations required grows linearly with the number of data points. We describe the method and some of its basic properties, and give some illustrative examples. We show how the method is related to ℓ1 regularization-based methods in sparse signal recovery and feature selection, and list some extensions of the basic method.
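The substitution described in this abstract can be written out directly. The sketch below only evaluates the two objectives for a candidate trend x and data y; it does not solve either problem and is not the specialized interior-point method mentioned above. The function names and the 1/2 scaling of the fit term are assumptions made for illustration.

```python
import numpy as np

def second_differences(x):
    """Return x[t-1] - 2*x[t] + x[t+1] for every interior point t."""
    x = np.asarray(x, dtype=float)
    return x[:-2] - 2.0 * x[1:-1] + x[2:]

def fit_term(x, y):
    """Half the squared distance between the trend x and the data y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 0.5 * np.sum((y - x) ** 2)

def hp_objective(x, y, lam):
    """HP filtering objective: the penalty is the sum of SQUARED second differences."""
    return fit_term(x, y) + lam * np.sum(second_differences(x) ** 2)

def l1_trend_objective(x, y, lam):
    """l1 trend filtering objective: the penalty is the sum of ABSOLUTE second differences."""
    return fit_term(x, y) + lam * np.sum(np.abs(second_differences(x)))
```

Second differences vanish on any straight segment, so both penalties charge only at changes in slope. The absolute-value penalty tends to concentrate those changes in a few kinks, which is consistent with the piecewise linear estimates described in the abstract.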