Results 1 - 10 of 15
An interior-point method for large-scale ℓ1-regularized logistic regression
 Journal of Machine Learning Research
, 2007
Abstract

Cited by 153 (5 self)
Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, which uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
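The objective this paper solves can, for small problems, be minimized with a simple proximal-gradient (soft-thresholding) loop. The sketch below uses that generic method rather than the paper's interior-point algorithm; the data and all parameter values are illustrative only:

```python
import numpy as np

def l1_logreg_prox_grad(X, y, lam, step=0.1, iters=500):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i * x_i . w)) + lam * ||w||_1
    by proximal gradient descent; labels y are in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        # gradient of the average logistic loss at w
        grad = -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / n
        w = w - step * grad
        # proximal step for the l1 term: soft-thresholding
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
    return w

# tiny synthetic problem: only the first of five features is informative
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(200))
w = l1_logreg_prox_grad(X, y, lam=0.1)
```

On data like this, the ℓ1 penalty drives the coefficients of the uninformative features toward zero while the informative one stays large, which is the feature-selection effect the abstract refers to.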
High-dimensional graphical model selection using ℓ1-regularized logistic regression
 Advances in Neural Information Processing Systems 19
, 2007
Abstract

Cited by 73 (5 self)
We consider the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on ℓ1-regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main results provide sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain assumptions on the population Fisher information matrix, we prove that consistent neighborhood selection can be obtained for sample sizes n = Ω(d^3 log p), with the error decaying as O(exp(−Cn/d^3)) for some constant C. If these same assumptions are imposed directly on the sample matrices, we show that n = Ω(d^2 log p) samples are sufficient.
HIGH-DIMENSIONAL ISING MODEL SELECTION USING ℓ1-REGULARIZED LOGISTIC REGRESSION
 SUBMITTED TO THE ANNALS OF STATISTICS
Abstract

Cited by 37 (12 self)
We consider the problem of estimating the graph associated with a binary Ising Markov random field. We describe a method based on ℓ1-regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint. The method is analyzed under high-dimensional scaling, in which both the number of nodes p and maximum neighborhood size d are allowed to grow as a function of the number of observations n. Our main results provide sufficient conditions on the triple (n, p, d) and the model parameters for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. With coherence conditions imposed on the population Fisher information matrix, we prove that consistent neighborhood selection can be obtained for sample sizes n = Ω(d^3 log p), with exponentially decaying error. When these same conditions are imposed directly on the sample matrices, we show that a reduced sample size of n = Ω(d^2 log p) suffices for the method to estimate neighborhoods consistently. Although this paper focuses on binary graphical models, we indicate how a generalization of the method would apply to general discrete Markov random fields.
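The node-wise idea in this abstract can be sketched end-to-end on a toy 3-node chain: each node's neighborhood is taken to be the support of an ℓ1-regularized logistic regression of that node on all the others. This is a simplified illustration (a plain proximal-gradient solver and an arbitrary support threshold, not the paper's analysis or tuning):

```python
import numpy as np

def l1_logreg(X, y, lam, step=0.1, iters=400):
    # proximal-gradient solver for l1-regularized logistic regression, y in {-1, +1}
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / n
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
    return w

def neighborhoods(S, lam=0.05, thresh=0.1):
    """Estimate each node's neighborhood in a binary (Ising) model: regress
    node r on all other nodes, keep coefficients with magnitude above thresh."""
    n, p = S.shape
    nb = {}
    for r in range(p):
        others = [c for c in range(p) if c != r]
        w = l1_logreg(S[:, others].astype(float), S[:, r].astype(float), lam)
        nb[r] = {others[j] for j in range(p - 1) if abs(w[j]) > thresh}
    return nb

# exact samples from a 3-node chain 0-1-2 with coupling theta, drawn by
# enumerating all 8 spin configurations of the Ising distribution
rng = np.random.default_rng(1)
theta = 1.0
states = np.array([[a, b, c] for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)])
logp = theta * (states[:, 0] * states[:, 1] + states[:, 1] * states[:, 2])
probs = np.exp(logp)
probs /= probs.sum()
S = states[rng.choice(8, size=2000, p=probs)]
nb = neighborhoods(S)
```

With enough samples the estimated neighborhoods recover the chain: node 1 is linked to both endpoints, and nodes 0 and 2 only to node 1, even though 0 and 2 are marginally correlated.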
Efficient Euclidean Projections in Linear Time
Abstract

Cited by 21 (7 self)
We consider the problem of computing the Euclidean projection of a vector of length n onto a closed convex set, including the ℓ1 ball and the specialized polyhedra employed in (Shalev-Shwartz & Singer, 2006). These problems have served as building blocks in solving several ℓ1-norm based sparse learning problems. Existing methods have a worst-case time complexity of O(n log n). In this paper, we propose to cast both Euclidean projections as root-finding problems associated with specific auxiliary functions, which can be solved in linear time via bisection. We further make use of the special structure of the auxiliary functions and propose an improved bisection algorithm. Empirical studies demonstrate that the proposed algorithms are much more efficient than the competing ones for computing the projections.
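The root-finding idea is easy to make concrete for the ℓ1 ball: the projection reduces to soft-thresholding at the threshold θ where a piecewise-linear, decreasing auxiliary function equals the ball radius. A minimal sketch (plain bisection, not the paper's improved variant):

```python
import numpy as np

def project_l1_ball(v, z, tol=1e-10):
    """Euclidean projection of v onto {x : ||x||_1 <= z}. The auxiliary
    function f(theta) = sum(max(|v_i| - theta, 0)) is decreasing in theta;
    the projection soft-thresholds v at the theta where f(theta) = z."""
    a = np.abs(v)
    if a.sum() <= z:
        return v.copy()  # already inside the ball
    lo, hi = 0.0, a.max()
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if np.maximum(a - theta, 0.0).sum() > z:
            lo = theta  # threshold too small: result still outside the ball
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    return np.sign(v) * np.maximum(a - theta, 0.0)

x = project_l1_ball(np.array([3.0, -1.0, 0.5]), 2.0)  # projects to [2, 0, 0]
```

Each bisection step costs O(n), and the number of steps depends only on the tolerance, which is the sense in which the projection runs in linear time.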
Scalable Discriminative Learning for Natural Language Parsing and Translation
 In Proceedings of the 2006 Neural Information Processing Systems (NIPS)
, 2006
Abstract

Cited by 20 (1 self)
Parsing and translating natural languages can be viewed as problems of predicting tree structures. For machine learning approaches to these predictions, the diversity and high dimensionality of the structures involved mandate very large training sets. This paper presents a purely discriminative learning method that scales up well to problems of this size. Its accuracy was at least as good as other comparable methods on a standard parsing task. To our knowledge, it is the first purely discriminative learning algorithm for translation with tree-structured models. Unlike other popular methods, this method does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Experiments demonstrate the method’s versatility, accuracy, and efficiency. Relevant software is freely available at
Advances in discriminative parsing
 In Proceedings of the Joint International Conference on Computational Linguistics and Association of Computational Linguistics (COLING/ACL)
, 2006
Abstract

Cited by 10 (1 self)
The present work advances the accuracy and training speed of discriminative parsing. Our discriminative parsing method has no generative component, yet surpasses a generative baseline on constituent parsing, and does so with minimal linguistic cleverness. Our model can incorporate arbitrary features of the input and parse state, and performs feature selection incrementally over an exponential feature space during training. We demonstrate the flexibility of our approach by testing it with several parsing strategies and various feature sets. Our implementation is freely available at:
Scalable purely-discriminative training for word and tree transducers
, 2006
Abstract

Cited by 7 (0 self)
Discriminative training methods have recently led to significant advances in the state of the art of machine translation (MT). Another promising trend is the incorporation of syntactic information into MT systems. Combining these trends is difficult for reasons of system complexity and computational complexity. The present study makes progress towards a syntax-aware MT system whose every component is trained discriminatively. Our main innovation is an approach to discriminative learning that is computationally efficient enough for large statistical MT systems, yet whose accuracy on translation subtasks is near the state of the art. Our source code is downloadable from
On Learning Discrete Graphical Models using Group-Sparse
Abstract

Cited by 6 (2 self)
We study the problem of learning the graph structure associated with general discrete graphical models (each variable can take any of m > 1 values, and the clique factors have maximum size c ≥ 2) from samples, under high-dimensional scaling where the number of variables p could be larger than the number of samples n. We provide a quantitative consistency analysis of a procedure based on node-wise multiclass logistic regression with group-sparse regularization. We first consider general m-ary pairwise models, where each factor depends on at most two variables. We show that when
A coordinate gradient descent method for ℓ1-regularized convex minimization
 Department of Mathematics, National University of Singapore
, 2008
Abstract

Cited by 4 (1 self)
In applications such as signal processing and statistics, many problems involve finding sparse solutions to underdetermined linear systems of equations. These problems can be formulated as structured nonsmooth optimization problems, i.e., ℓ1-regularized linear least-squares problems. In this paper, we propose a block coordinate gradient descent method (abbreviated as CGD) to solve the more general ℓ1-regularized convex minimization problem, i.e., the problem of minimizing an ℓ1-regularized convex smooth function. We establish a Q-linear convergence rate for our method when the coordinate block is chosen by a Gauss-Southwell-type rule to ensure sufficient descent. We propose efficient implementations of the CGD method and report numerical results for solving large-scale ℓ1-regularized linear least-squares problems arising in compressed sensing and image deconvolution, as well as large-scale ℓ1-regularized logistic regression problems for feature selection in data classification. Comparison with several state-of-the-art algorithms specifically designed for solving large-scale ℓ1-regularized linear least-squares or logistic regression problems suggests that an efficiently implemented CGD method may outperform these algorithms even though the CGD method is not specifically designed for these special classes of problems. Key words: coordinate gradient descent, Q-linear convergence, ℓ1-regularization, compressed sensing, image deconvolution, linear least squares, logistic regression, convex optimization
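A single-coordinate (non-block) version of coordinate descent on the ℓ1-regularized least-squares problem is easy to sketch: each coordinate update is an exact minimization, i.e., a soft-thresholding step. This illustrative loop uses a cyclic sweep rather than the paper's Gauss-Southwell rule, and the test problem is made up:

```python
import numpy as np

def lasso_cd(X, y, lam, iters=200):
    """Cyclic coordinate descent for min_w 0.5*||X w - y||^2 + lam*||w||_1.
    Each coordinate update is an exact soft-thresholding minimization."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)  # per-column squared norms
    r = y - X @ w                  # running residual
    for _ in range(iters):
        for j in range(d):
            r = r + X[:, j] * w[j]   # remove coordinate j's contribution
            rho = X[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r = r - X[:, j] * w[j]   # restore with the updated value
    return w

# noiseless test problem with a 2-sparse ground truth
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[0], w_true[1] = 2.0, -3.0
y = X @ w_true
w = lasso_cd(X, y, lam=1.0)
```

Maintaining the residual incrementally keeps each coordinate update at O(n) cost, which is what makes coordinate methods competitive on the large sparse problems the abstract describes.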
Multiplicative Updates for L1-Regularized Linear and Logistic Regression
Abstract

Cited by 3 (1 self)
Multiplicative update rules have proven useful in many areas of machine learning. Simple to implement and guaranteed to converge, they account in part for the widespread popularity of algorithms such as nonnegative matrix factorization and Expectation-Maximization. In this paper, we show how to derive multiplicative updates for problems in L1-regularized linear and logistic regression. For L1-regularized linear regression, the updates are derived by reformulating the required optimization as a problem in nonnegative quadratic programming (NQP). The dual of this problem, itself an instance of NQP, can also be solved using multiplicative updates; moreover, the observed duality gap can be used to bound the error of intermediate solutions. For L1-regularized logistic regression, we derive similar updates using an iteratively reweighted least squares approach. We present illustrative experimental results and describe efficient implementations for large-scale problems of interest (e.g., with tens of thousands of examples and over one million features).
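The NQP building block behind these updates can be sketched directly: splitting the quadratic's matrix into elementwise positive and negative parts yields a multiplicative rescaling of each coordinate that preserves nonnegativity. The sketch below shows only this generic NQP update on a hypothetical 2-variable problem, not the paper's regression reformulation:

```python
import numpy as np

def nqp_multiplicative(A, b, iters=200, eps=1e-12):
    """Minimize 0.5*v'Av + b'v subject to v >= 0 with a multiplicative
    update: split A = Ap - An into elementwise positive/negative parts,
    then rescale each coordinate v_i by a nonnegative factor."""
    Ap = np.maximum(A, 0.0)
    An = np.maximum(-A, 0.0)
    v = np.ones(len(b))
    for _ in range(iters):
        a = Ap @ v
        c = An @ v
        # each factor is >= 0, so iterates stay in the feasible orthant
        v = v * (-b + np.sqrt(b * b + 4.0 * a * c)) / (2.0 * a + eps)
    return v

# toy problem whose constrained minimum sits on the boundary, at v = (1.5, 0)
A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([-3.0, 1.0])
v = nqp_multiplicative(A, b)
```

Because the update is a pure rescaling, a coordinate driven to zero stays at zero, which is how the boundary solution is reached without any explicit projection step.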