Results 1-10 of 49
An interior-point method for large-scale ℓ1-regularized logistic regression
Journal of Machine Learning Research, 2007
"... Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interiorpoint method for solving largescale ℓ1regularized logistic regression problems. Small problems with up to a thousand ..."
Abstract

Cited by 153 (6 self)
 Add to MetaCart
Abstract: Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, which uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
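
To make the optimization problem concrete, here is a minimal sketch of the objective this paper's interior-point method minimizes; the function and variable names are my own, and this evaluates the objective only rather than implementing the paper's solver.

```python
import numpy as np

def l1_logreg_objective(w, v, X, y, lam):
    """Average logistic loss plus an l1 penalty:
    (1/m) * sum_i log(1 + exp(-y_i * (x_i . w + v))) + lam * ||w||_1.
    Labels y are assumed to be in {-1, +1}; the intercept v is unpenalized.
    """
    margins = y * (X @ w + v)
    loss = np.logaddexp(0.0, -margins).mean()  # stable log(1 + exp(-t))
    return loss + lam * np.abs(w).sum()
```

Any convex solver minimizing this objective recovers the same estimator; the paper's contribution is doing so efficiently at large scale.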
Active learning literature survey
2010
"... The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer labeled training instances if it is allowed to choose the data from which is learns. An active learner may ask queries in the form of unlabeled instances to be labeled by an oracle (e.g., ..."
Abstract

Cited by 132 (1 self)
 Add to MetaCart
Abstract: The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer labeled training instances if it is allowed to choose the data from which it learns. An active learner may ask queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). Active learning is well motivated in many modern machine learning problems, where unlabeled data may be abundant but labels are difficult, time-consuming, or expensive to obtain. This report provides a general introduction to active learning and a survey of the literature. This includes a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for active learning, a summary of several problem-setting variants, and a discussion …
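
As a concrete illustration of the query loop described above, here is a minimal pool-based uncertainty-sampling sketch; the model choice, the least-confident criterion, and the `oracle` callback are illustrative assumptions, not prescriptions from the survey.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_lab, y_lab, X_pool, oracle, n_queries=10):
    """Repeatedly query the pool instance the current model is least
    certain about, label it via the oracle, and retrain."""
    for _ in range(n_queries):
        model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        proba = model.predict_proba(X_pool)
        i = int(np.argmin(proba.max(axis=1)))  # least-confident instance
        X_lab = np.vstack([X_lab, X_pool[i:i + 1]])
        y_lab = np.append(y_lab, oracle(X_pool[i]))  # e.g., a human annotator
        X_pool = np.delete(X_pool, i, axis=0)
    return LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
```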
Scalable training of L1-regularized log-linear models
In ICML ’07, 2007
"... The lbfgs limitedmemory quasiNewton method is the algorithm of choice for optimizing the parameters of largescale loglinear models with L2 regularization, but it cannot be used for an L1regularized loss due to its nondifferentiability whenever some parameter is zero. Efficient algorithms have ..."
Abstract

Cited by 122 (2 self)
 Add to MetaCart
Abstract: The L-BFGS limited-memory quasi-Newton method is the algorithm of choice for optimizing the parameters of large-scale log-linear models with L2 regularization, but it cannot be used for an L1-regularized loss due to its non-differentiability whenever some parameter is zero. Efficient algorithms have been proposed for this task, but they are impractical when the number of parameters is very large. We present an algorithm, Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters. In our experiments on a parse re-ranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than L-BFGS on the analogous L2-regularized problem. We also present a proof that OWL-QN is guaranteed to converge to a globally optimal parameter vector.
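
The key device in OWL-QN is a pseudo-gradient that substitutes for the gradient wherever the L1 term is non-differentiable. A minimal sketch of that computation follows, with names of my own choosing; the full algorithm additionally projects each step onto the orthant selected by the pseudo-gradient.

```python
import numpy as np

def pseudo_gradient(x, grad_loss, lam):
    """Pseudo-gradient of f(x) = loss(x) + lam * ||x||_1.
    Where x_i != 0 the l1 term is differentiable; at x_i == 0 we take the
    one-sided derivative that permits descent, or 0 if neither does."""
    pg = np.where(x > 0, grad_loss + lam,
                  np.where(x < 0, grad_loss - lam, 0.0))
    right = grad_loss + lam  # right partial derivative at x_i == 0
    left = grad_loss - lam   # left partial derivative at x_i == 0
    at_zero = (x == 0)
    pg = np.where(at_zero & (right < 0), right, pg)
    pg = np.where(at_zero & (left > 0), left, pg)
    return pg
```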
Efficient structure learning of Markov networks using L1-regularization
In NIPS, 2006
"... Markov networks are widely used in a wide variety of applications, in problems ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to ..."
Abstract

Cited by 107 (2 self)
 Add to MetaCart
Abstract: Markov networks are used in a wide variety of applications, in problems ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally effective method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature-introduction schemes and compare their performance. We provide results for our method on synthetic data and on two real-world data sets: modeling the joint distribution of pixel values in the MNIST data, and modeling the joint distribution of genetic sequence variations in the human HapMap data. We show that our L1-based method achieves considerably higher generalization performance than the more standard L2-based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
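
The sparsity effect described above (L1 driving many weights exactly to zero) comes from the non-smooth corner of the penalty. As a generic illustration, not the paper's specific optimizer, here is the soft-thresholding step a proximal-gradient method would apply to the log-linear weights:

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_1: shrinks every weight toward
    zero and sets small ones exactly to zero, pruning the corresponding
    features (edges) of the log-linear model."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def proximal_gradient_step(w, grad_negloglik, step, lam):
    # gradient step on the smooth negative log-likelihood part,
    # followed by soft-thresholding for the L1 penalty
    return soft_threshold(w - step * grad_negloglik, step * lam)
```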
Performance guarantees for regularized maximum entropy density estimation
Proceedings of the 17th Annual Conference on Computational Learning Theory, 2004
"... Abstract. We consider the problem of estimating an unknown probability distribution from samples using the principle of maximum entropy (maxent). To alleviate overfitting with a very large number of features, we propose applying the maxent principle with relaxed constraints on the expectations of th ..."
Abstract

Cited by 53 (8 self)
 Add to MetaCart
Abstract: We consider the problem of estimating an unknown probability distribution from samples using the principle of maximum entropy (maxent). To alleviate overfitting with a very large number of features, we propose applying the maxent principle with relaxed constraints on the expectations of the features. By convex duality, this turns out to be equivalent to finding the Gibbs distribution minimizing a regularized version of the empirical log loss. We prove non-asymptotic bounds showing that, with respect to the true underlying distribution, this relaxed version of maxent produces density estimates that are almost as good as the best possible. These bounds are in terms of the deviation of the empirical feature averages relative to their true expectations, a quantity that can be bounded using standard uniform-convergence techniques. In particular, this leads to bounds that drop quickly with the number of samples, and that depend very moderately on the number or complexity of the features. We also derive and prove convergence for both sequential-update and parallel-update algorithms. Finally, we briefly describe experiments on data relevant to the modeling of species geographical distributions.
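
The duality mentioned in the abstract can be sketched as follows; the notation (features f_j, relaxation widths β_j, sample size m) is assumed here, and this is a schematic statement rather than a quotation of the paper's theorem.

```latex
% Relaxed maxent: loosen each moment constraint by a width beta_j
\max_{p \in \Delta}\; H(p)
\quad\text{s.t.}\quad
\bigl|\,\mathbb{E}_p[f_j] - \hat{\mathbb{E}}[f_j]\,\bigr| \le \beta_j
\quad \text{for all } j.
% By convex duality this is equivalent to l1-regularized log-loss
% minimization over Gibbs distributions q_lambda(x) ~ exp(lambda . f(x)):
\min_{\lambda}\; -\frac{1}{m}\sum_{i=1}^{m}\log q_\lambda(x_i)
  \;+\; \sum_j \beta_j\,|\lambda_j|.
```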
Efficient L1 regularized logistic regression
In AAAI-06, 2006
"... L1 regularized logistic regression is now a workhorse of machine learning: it is widely used for many classification problems, particularly ones with many features. L1 regularized logistic regression requires solving a convex optimization problem. However, standard algorithms for solving convex opti ..."
Abstract

Cited by 44 (4 self)
 Add to MetaCart
Abstract: L1-regularized logistic regression is now a workhorse of machine learning: it is widely used for many classification problems, particularly ones with many features. L1-regularized logistic regression requires solving a convex optimization problem. However, standard algorithms for solving convex optimization problems do not scale well enough to handle the large datasets encountered in many practical settings. In this paper, we propose an efficient algorithm for L1-regularized logistic regression. Our algorithm iteratively approximates the objective function by a quadratic approximation at the current point, while maintaining the L1 constraint. In each iteration, it uses the efficient LARS (Least Angle Regression) algorithm to solve the resulting L1-constrained quadratic optimization problem. Our theoretical results show that our algorithm is guaranteed to converge to the global optimum. Our experiments show that our algorithm significantly outperforms standard algorithms for solving convex optimization problems. Moreover, our algorithm outperforms four previously published algorithms that were specifically designed to solve the L1-regularized logistic regression problem.
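
A minimal sketch of one such iteration, under the assumption that labels are in {0, 1} and using the penalized (Lagrangian) form of the L1 problem rather than the explicit constraint the paper works with:

```python
import numpy as np
from sklearn.linear_model import LassoLars

def irls_lars_step(theta, X, y, alpha):
    """Build the weighted least-squares (quadratic) approximation of the
    logistic loss at theta, then solve the resulting L1 problem by LARS."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))    # current class-1 probabilities
    w = np.clip(p * (1.0 - p), 1e-8, None)  # IRLS weights
    z = X @ theta + (y - p) / w             # working response
    sw = np.sqrt(w)
    # rescaling turns the weighted problem into an ordinary Lasso
    model = LassoLars(alpha=alpha, fit_intercept=False)
    model.fit(X * sw[:, None], z * sw)
    return model.coef_
```

Iterating this step to a fixed point gives the flavor of the method; the paper proves convergence for its constrained variant.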
A maximum entropy approach to species distribution modeling
In Proceedings of the Twenty-First International Conference on Machine Learning, 2004
"... We study the problem of modeling species geographic distributions, a critical problem in conservation biology. We propose the use of maximumentropy techniques for this problem, specifically, sequentialupdate algorithms that can handle a very large number of features. We describe experiments compar ..."
Abstract

Cited by 35 (6 self)
 Add to MetaCart
Abstract: We study the problem of modeling species geographic distributions, a critical problem in conservation biology. We propose the use of maximum-entropy techniques for this problem, specifically sequential-update algorithms that can handle a very large number of features. We describe experiments comparing maxent with a standard distribution-modeling tool, called GARP, on a dataset containing observation data for North American breeding birds. We also study how well maxent performs as a function of the number of training examples and training time, analyze the use of regularization to avoid overfitting when the number of examples is small, and explore the interpretability of models constructed using maxent.
Multi-Task Feature Learning Via Efficient ℓ2,1-Norm Minimization
2009
"... The problem of joint feature selection across a group of related tasks has applications in many areas including biomedical informatics and computer vision. We consider the ℓ2,1norm regularized regression model for joint feature selection from multiple tasks, which can be derived in the probabilisti ..."
Abstract

Cited by 28 (3 self)
 Add to MetaCart
Abstract: The problem of joint feature selection across a group of related tasks has applications in many areas, including biomedical informatics and computer vision. We consider the ℓ2,1-norm regularized regression model for joint feature selection from multiple tasks, which can be derived in the probabilistic framework by assuming a suitable prior from the exponential family. One appealing feature of the ℓ2,1-norm regularization is that it encourages multiple predictors to share similar sparsity patterns. However, the resulting optimization problem is challenging to solve due to the non-smoothness of the ℓ2,1-norm regularization. In this paper, we propose to accelerate the computation by reformulating it as two equivalent smooth convex optimization problems, which are then solved via Nesterov's method, an optimal first-order black-box method for smooth convex optimization. A key building block in solving the reformulations is the Euclidean projection. We show that the Euclidean projection for the first reformulation can be computed analytically, while the Euclidean projection for the second one can be computed in linear time. Empirical evaluations on several data sets verify the efficiency of the proposed algorithms.
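
The shared-sparsity effect of the ℓ2,1-norm is easiest to see through its row-wise shrinkage operator. The sketch below shows the norm and its proximal operator, a close relative of the Euclidean projections the paper computes; names are mine, and rows of W are assumed to collect one feature's weights across all tasks.

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: sum of the Euclidean norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

def prox_l21(W, tau):
    """Proximal operator of tau * ||.||_{2,1}: rows with norm <= tau are
    zeroed (the feature is dropped for all tasks at once); larger rows
    are shrunk, which yields the shared sparsity pattern."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return W * scale
```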
Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
2006
"... This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likel ..."
Abstract

Cited by 27 (8 self)
 Add to MetaCart
Abstract: This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias. Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and a supervised model selection setting, where the most accurate model on the development set (now with annotations) …
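
For concreteness, the contrastive estimation objective mentioned above can be written as follows (schematically, with a neighborhood function N(·) generating the implicit negative examples):

```latex
% Contrastive estimation: move probability mass within each
% neighborhood N(x_i) onto the observed example x_i itself
\max_{\theta}\; \sum_{i=1}^{m}
  \log \frac{p_\theta(x_i)}{\sum_{x' \in N(x_i)} p_\theta(x')}
```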
Maximum entropy distribution estimation with generalized regularization
Proc. Annual Conf. Computational Learning Theory, 2006
"... Abstract. We present a unified and complete account of maximum entropy distribution estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence pr ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
Abstract: We present a unified and complete account of maximum entropy distribution estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we can easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1 + ℓ2² style regularization. Furthermore, our general approach enables us to use information about the structure of the feature space or about sample selection bias to derive entirely new regularization functions with superior guarantees. We propose an algorithm solving a large and general subclass of generalized maxent problems, including all discussed in the paper, and prove its convergence. Our approach generalizes techniques based on information geometry and Bregman divergences as well as those based more directly on compactness.
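
Schematically, and in notation assumed here rather than quoted from the paper, the dual (regularized log-loss) form of generalized maxent looks like:

```latex
% Generalized maxent in dual form: R is an arbitrary convex regularizer.
% R(lambda) = sum_j beta_j |lambda_j| recovers the l1 case above; a
% quadratic R gives l2^2-style regularization, and sums give l1 + l2^2.
\min_{\lambda}\; -\frac{1}{m}\sum_{i=1}^{m}\log q_\lambda(x_i) \;+\; R(\lambda),
\qquad
q_\lambda(x) \propto e^{\lambda \cdot f(x)}
```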