Results 1  10
of
95
An interiorpoint method for largescale l1regularized logistic regression
 Journal of Machine Learning Research
, 2007
"... Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interiorpoint method for solving largescale ℓ1regularized logistic regression problems. Small problems with up to a thousand ..."
Abstract

Cited by 167 (5 self)
 Add to MetaCart
Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interiorpoint method for solving largescale ℓ1regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warmstart techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
Scalable training of L1regularized loglinear models
 In ICML ’07
, 2007
"... The lbfgs limitedmemory quasiNewton method is the algorithm of choice for optimizing the parameters of largescale loglinear models with L2 regularization, but it cannot be used for an L1regularized loss due to its nondifferentiability whenever some parameter is zero. Efficient algorithms have ..."
Abstract

Cited by 131 (2 self)
 Add to MetaCart
The lbfgs limitedmemory quasiNewton method is the algorithm of choice for optimizing the parameters of largescale loglinear models with L2 regularization, but it cannot be used for an L1regularized loss due to its nondifferentiability whenever some parameter is zero. Efficient algorithms have been proposed for this task, but they are impractical when the number of parameters is very large. We present an algorithm OrthantWise Limitedmemory QuasiNewton (owlqn), based on lbfgs, that can efficiently optimize the L1regularized loglikelihood of loglinear models with millions of parameters. In our experiments on a parse reranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than lbfgs on the analogous L2regularized problem. We also present a proof that owlqn is guaranteed to converge to a globally optimal parameter vector. 1.
Sparse multinomial logistic regression: fast algorithms and generalization bounds
 IEEE Trans. on Pattern Analysis and Machine Intelligence
"... Abstract—Recently developed methods for learning sparse classifiers are among the stateoftheart in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsitypromoting priors encouraging the weight estimates to be either significantly larg ..."
Abstract

Cited by 128 (1 self)
 Add to MetaCart
(Show Context)
Abstract—Recently developed methods for learning sparse classifiers are among the stateoftheart in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsitypromoting priors encouraging the weight estimates to be either significantly large or exactly zero. From a learningtheoretic perspective, these methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. This paper presents three contributions related to learning sparse classifiers. First, we introduce a true multiclass formulation based on multinomial logistic regression. Second, by combining a bound optimization approach with a componentwise update procedure, we derive fast exact algorithms for learning sparse multiclass classifiers that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in highdimensional feature spaces. To the best of our knowledge, these are the first algorithms to perform exact multinomial logistic regression with a sparsitypromoting prior. Third, we show how nontrivial generalization bounds can be derived for our classifier in the binary case. Experimental results on standard benchmark data sets attest to the accuracy, sparsity, and efficiency of the proposed methods.
Trust region Newton method for largescale logistic regression
 In Proceedings of the 24th International Conference on Machine Learning (ICML
, 2007
"... Largescale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the loglikelihood of the logistic regression model. The proposed method uses only approximate Newton steps in ..."
Abstract

Cited by 69 (12 self)
 Add to MetaCart
Largescale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the loglikelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also compare it with existing linear SVM implementations. 1
Exponentiated gradient algorithms for conditional random fields and maxmargin Markov networks
, 2008
"... Loglinear and maximummargin models are two commonlyused methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large dat ..."
Abstract

Cited by 64 (1 self)
 Add to MetaCart
Loglinear and maximummargin models are two commonlyused methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the loglinear or maxmargin objective function; the dual in both the loglinear and maxmargin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the maxmargin case, O ( 1 ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for loglinear models only O(log (1/ε)) updates are required. For both the maxmargin and loglinear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be
MultiClass Segmentation with Relative Location Prior
 INTERNATIONAL JOURNAL OF COMPUTER VISION
, 2008
"... Multiclass image segmentation has made significant advances in recent years through the combination of local and global features. One important type of global feature is that of interclass spatial relationships. For example, identifying “tree” pixels indicates that pixels above and to the sides ar ..."
Abstract

Cited by 62 (4 self)
 Add to MetaCart
Multiclass image segmentation has made significant advances in recent years through the combination of local and global features. One important type of global feature is that of interclass spatial relationships. For example, identifying “tree” pixels indicates that pixels above and to the sides are more likely to be “sky” whereas pixels below are more likely to be “grass.” Incorporating such global information across the entire image and between all classes is a computational challenge as it is imagedependent, and hence, cannot be precomputed. In this work we propose a method for capturing global information from interclass spatial relationships and encoding it as a local feature. We employ a twostage classification process to label all image pixels. First, we generate predictions which are used to compute a local relative location feature from learned relative location maps. In the second stage, we combine this with appearancebased features to provide a final segmentation. We compare our results to recent published results on several multiclass image segmentation databases and show that the incorporation of relative location information allows us to significantly outperform the current stateoftheart.
Efficient l1 regularized logistic regression
 In AAAI06
, 2006
"... L1 regularized logistic regression is now a workhorse of machine learning: it is widely used for many classification problems, particularly ones with many features. L1 regularized logistic regression requires solving a convex optimization problem. However, standard algorithms for solving convex opti ..."
Abstract

Cited by 47 (4 self)
 Add to MetaCart
(Show Context)
L1 regularized logistic regression is now a workhorse of machine learning: it is widely used for many classification problems, particularly ones with many features. L1 regularized logistic regression requires solving a convex optimization problem. However, standard algorithms for solving convex optimization problems do not scale well enough to handle the large datasets encountered in many practical settings. In this paper, we propose an efficient algorithm for L1 regularized logistic regression. Our algorithm iteratively approximates the objective function by a quadratic approximation at the current point, while maintaining the L1 constraint. In each iteration, it uses the efficient LARS (Least Angle Regression) algorithm to solve the resulting L1 constrained quadratic optimization problem. Our theoretical results show that our algorithm is guaranteed to converge to the global optimum. Our experiments show that our algorithm significantly outperforms standard algorithms for solving convex optimization problems. Moreover, our algorithm outperforms four previously published algorithms that were specifically designed to solve the L1 regularized logistic regression problem.
Bundle Methods for Regularized Risk Minimization
"... A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional ..."
Abstract

Cited by 40 (2 self)
 Add to MetaCart
(Show Context)
A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for datalocality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ɛ) steps to ɛ precision for general convex problems and in O(log(1/ɛ)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available datasets.
A leastsquares approach to direct importance estimation
 Journal of Machine Learning Research
, 2009
"... We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estima ..."
Abstract

Cited by 39 (24 self)
 Add to MetaCart
We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closedform solution; the leaveoneout crossvalidation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches.
Exponentiated gradient algorithms for loglinear structured prediction
 In Proc. ICML, 2007
"... Conditional loglinear models are a commonly used method for structured prediction. Efficient learning of parameters in these models is therefore an important problem. This paper describes an exponentiated gradient (EG) algorithm for training such models. EG is applied to the convex dual of the maxi ..."
Abstract

Cited by 29 (5 self)
 Add to MetaCart
Conditional loglinear models are a commonly used method for structured prediction. Efficient learning of parameters in these models is therefore an important problem. This paper describes an exponentiated gradient (EG) algorithm for training such models. EG is applied to the convex dual of the maximum likelihood objective; this results in both sequential and parallel update algorithms, where in the sequential algorithm parameters are updated in an online fashion. We provide a convergence proof for both algorithms. Our analysis also simplifies previous results on EG for maxmargin models, and leads to a tighter bound on convergence rates. Experiments on a largescale parsing task show that the proposed algorithm converges much faster than conjugategradient and LBFGS approaches both in terms of optimization objective and test error. 1.