Results 1-10 of 58
An interior-point method for large-scale l1-regularized logistic regression
Journal of Machine Learning Research, 2007
Cited by 232 (8 self)
Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
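The warm-start idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's interior-point method: it swaps in plain proximal gradient descent (ISTA) because that fits in a few lines. The function names, the fixed step size, the omission of an intercept, and the label convention y in {-1, +1} are all assumptions made for illustration.

```python
import math

def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def l1_logreg_ista(X, y, lam, w0=None, step=0.5, iters=500):
    """Approximately minimize (1/m) * sum_i log(1 + exp(-y_i * w.x_i)) + lam * ||w||_1.

    Assumes labels in {-1, +1}, no intercept, and a step size no larger than
    1/L, where L is the Lipschitz constant of the gradient of the smooth part.
    """
    m, n = len(X), len(X[0])
    w = list(w0) if w0 is not None else [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coeff = -yi * sigmoid(-margin) / m
            for j in range(n):
                grad[j] += coeff * xi[j]
        # Gradient step on the logistic loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w

def regularization_path(X, y, lams):
    """Solve for decreasing lam, starting each solve from the previous solution."""
    path, w = [], None
    for lam in sorted(lams, reverse=True):
        w = l1_logreg_ista(X, y, lam, w0=w)
        path.append((lam, w))
    return path
```

Calling `regularization_path(X, y, [1.0, 0.3, 0.1])` returns one weight vector per value of `lam`, largest first. Starting each solve from the previous solution is what makes the whole path cheaper than independent cold starts, since neighbouring solutions tend to be close.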
Fast Optimization Methods for L1 Regularization: A Comparative Study and Two New Approaches
Cited by 72 (2 self)
L1 regularization is effective for feature selection, but the resulting optimization is challenging due to the non-differentiability of the 1-norm. In this paper we compare state-of-the-art optimization techniques to solve this problem across several loss functions. Furthermore, we propose two new techniques. The first is based on a smooth (differentiable) convex approximation for the L1 regularizer that does not depend on any assumptions about the loss function used. The other technique is a new strategy that addresses the non-differentiability of the L1-regularizer by casting the problem as a constrained optimization problem that is then solved using a specialized gradient projection method. Extensive comparisons show that our newly proposed approaches consistently rank among the best in terms of convergence speed and efficiency by measuring the number of function evaluations required.
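To make the idea of a smooth convex approximation to the L1 regularizer concrete, here is a small sketch of one widely used surrogate for |x|, built from two softplus terms. Whether it matches the exact form used in the paper is an assumption; the function names and the sharpness parameter `alpha` are hypothetical.

```python
import math

def smooth_abs(x, alpha):
    """Smooth surrogate for |x|:
        (1/alpha) * [log(1 + exp(-alpha*x)) + log(1 + exp(alpha*x))]
    rewritten as |x| + (2/alpha) * log1p(exp(-alpha*|x|)) to avoid overflow.
    Larger alpha gives a tighter approximation.
    """
    a = alpha * abs(x)
    return abs(x) + (2.0 / alpha) * math.log1p(math.exp(-a))

def smooth_abs_grad(x, alpha):
    """Derivative of smooth_abs: sigmoid(alpha*x) - sigmoid(-alpha*x) = tanh(alpha*x/2)."""
    return math.tanh(alpha * x / 2.0)

def smooth_l1(w, alpha):
    """Smooth approximation of ||w||_1 for a list of weights."""
    return sum(smooth_abs(wj, alpha) for wj in w)
```

The surrogate overestimates |x| by at most 2·log(2)/alpha, with the largest gap at x = 0. Because it is differentiable everywhere, it can be handed to any general-purpose smooth optimizer, at the cost of solutions that are small rather than exactly zero.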
Structure learning in random fields for heart motion abnormality detection
In CVPR, 2008
Cited by 49 (8 self)
Coronary Heart Disease can be diagnosed by assessing the regional motion of the heart walls in ultrasound images of the left ventricle. Even for experts, ultrasound images are difficult to interpret, leading to high intra-observer variability. Previous work indicates that in order to approach this problem, the interactions between the different heart regions and their overall influence on the clinical condition of the heart need to be considered. To do this, we propose a method for jointly learning the structure and parameters of conditional random fields, formulating these tasks as a convex optimization problem. We consider block-L1 regularization for each set of features associated with an edge, and formalize an efficient projection method to find the globally optimal penalized maximum likelihood solution. We perform extensive numerical experiments comparing the presented method with related methods that approach the structure learning problem differently. We verify the robustness of our method on echocardiograms collected in routine clinical practice at one hospital.
Lange K: Genome-wide Association Analysis by Lasso Penalized Logistic Regression
Bioinformatics
Cited by 32 (2 self)
Motivation: In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations. Method: The present paper evaluates the performance of lasso penalized logistic regression in case-control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher-order interactions can also be examined by lasso penalized logistic regression. Results: This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous single SNP results and shed light on possible interactions among the SNPs. Availability: The software discussed is available in Mendel 9.0 at the
Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Cited by 31 (0 self)
Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion. This learning framework is attractive because it often requires much less training time in practice than batch training algorithms. However, L1 regularization, which is becoming popular in natural language processing because of its ability to produce compact models, cannot be efficiently applied in SGD training, due to the large dimensions of feature vectors and the fluctuations of approximate gradients. We present a simple method to solve these problems by penalizing the weights according to cumulative values for L1 penalty. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
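The cumulative-penalty idea can be sketched as follows. This is an illustrative reading of the abstract, not a verified transcription of the paper's pseudocode. The names `w`, `q`, `u` and the function `apply_cumulative_l1` are assumptions: `u` stands for the total L1 penalty any weight could have received so far, and `q[i]` for the penalty weight `i` has actually received.

```python
def apply_cumulative_l1(w, q, u, i):
    """Bring weight i up to date with the accumulated L1 penalty.

    w : list of weights (modified in place)
    q : list of penalties actually applied to each weight so far (modified in place)
    u : total penalty each weight could have received so far; the caller is
        assumed to increase it by (learning_rate * penalty_strength) per step
    i : index of a weight that was just touched by a gradient update

    The weight is pulled toward zero by the penalty it still owes, but it is
    clipped at zero so that the penalty never flips its sign.
    """
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    # Record how much penalty was really applied on this call.
    q[i] += w[i] - z
```

Only the weights of features that are active in the current example need this call. That keeps each update cheap when feature vectors are sparse, and a weight clipped to zero early on still has its outstanding penalty settled the next time it is touched.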
Learning graphical model structure using L1-regularization paths
In Proceedings of the 21st Conference on Artificial Intelligence (AAAI), 2007
Cited by 30 (2 self)
Sparsity-promoting L1-regularization has recently been successfully used to learn the structure of undirected graphical models. In this paper, we apply this technique to learn the structure of directed graphical models. Specifically, we make three contributions. First, we show how the decomposability of the MDL score, plus the ability to quickly compute entire regularization paths, allows us to efficiently pick the optimal regularization parameter on a per-node basis. Second, we show how to use L1 variable selection to select the Markov blanket, before a DAG search stage. Finally, we show how L1 variable selection can be used inside of an order search algorithm. The effectiveness of these L1-based approaches is compared to current state-of-the-art methods on 10 datasets.
ℓ1 Trend Filtering
2007
Cited by 29 (6 self)
The problem of estimating underlying trends in time series data arises in a variety of disciplines. In this paper we propose a variation on Hodrick-Prescott (HP) filtering, a widely used method for trend estimation. The proposed ℓ1 trend filtering method substitutes a sum of absolute values (i.e., an ℓ1-norm) for the sum of squares used in HP filtering to penalize variations in the estimated trend. The ℓ1 trend filtering method produces trend estimates that are piecewise linear, and therefore is well suited to analyzing time series with an underlying piecewise linear trend. The kinks, knots, or changes in slope, of the estimated trend can be interpreted as abrupt changes or events in the underlying dynamics of the time series. Using specialized interior-point methods, ℓ1 trend filtering can be carried out with not much more effort than HP filtering; in particular, the number of arithmetic operations required grows linearly with the number of data points. We describe the method and some of its basic properties, and give some illustrative examples. We show how the method is related to ℓ1 regularization based methods in sparse signal recovery and feature selection, and list some extensions of the basic method.
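The difference between the two penalties can be shown by evaluating both objectives for a candidate trend. The sketch below only scores a given trend `x` against data `y`; it does not solve the optimization problem. The function names and the 1/2 factor on the data-fit term are assumptions about scaling.

```python
def second_differences(x):
    """Discrete second differences x[t-1] - 2*x[t] + x[t+1] for interior points.

    They are zero wherever the trend is locally linear and nonzero only at kinks.
    """
    return [x[t - 1] - 2 * x[t] + x[t + 1] for t in range(1, len(x) - 1)]

def l1_trend_objective(y, x, lam):
    """Data fit plus lam times the sum of absolute second differences (l1 trend filtering)."""
    fit = 0.5 * sum((yt - xt) ** 2 for yt, xt in zip(y, x))
    return fit + lam * sum(abs(d) for d in second_differences(x))

def hp_objective(y, x, lam):
    """Data fit plus lam times the sum of squared second differences (HP filtering)."""
    fit = 0.5 * sum((yt - xt) ** 2 for yt, xt in zip(y, x))
    return fit + lam * sum(d * d for d in second_differences(x))
```

For the piecewise linear trend `[0, 1, 2, 3, 2, 1, 0]`, `second_differences` is nonzero only at the single kink. The absolute-value penalty therefore charges for the number and size of slope changes, which is why minimizers of the ℓ1 objective tend to be piecewise linear, while the squared penalty favours spreading curvature smoothly.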
Estimation of sparse binary pairwise Markov networks using pseudolikelihood
J
Cited by 27 (0 self)
We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudolikelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudolikelihood solution and then adjusts the pseudolikelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudolikelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.
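For readers unfamiliar with the pseudolikelihood, the sketch below evaluates it for one observation of a binary pairwise Markov network. It replaces the joint likelihood with a product of per-node conditionals, each of which is a logistic regression on the other nodes. The 0/1 coding, the symmetric parameter matrix `theta` with node terms on the diagonal, and the function names are assumptions; the abstract does not fix a parameterization.

```python
import math

def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_pseudolikelihood(x, theta):
    """Sum over nodes j of log P(x_j | all other nodes).

    x     : list of 0/1 values, one per node
    theta : symmetric p-by-p matrix; theta[j][j] is the node parameter and
            theta[j][k] (j != k) is the edge parameter between nodes j and k
    Each conditional has log-odds eta_j = theta[j][j] + sum_{k != j} theta[j][k] * x[k].
    """
    p = len(x)
    total = 0.0
    for j in range(p):
        eta = theta[j][j] + sum(theta[j][k] * x[k] for k in range(p) if k != j)
        total += x[j] * eta - log1pexp(eta)
    return total
```

Summing this quantity over all observations and subtracting an L1 penalty on the off-diagonal entries of `theta` gives a penalized criterion that avoids the partition function of the joint model. This is the usual reason pseudolikelihood methods are much cheaper than exact maximum likelihood.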
Domain Adaptation of Conditional Probability Models via Feature Subsetting
Cited by 26 (0 self)
The goal in domain adaptation is to train a model using labeled data sampled from a domain different from the target domain on which the model will be deployed. We exploit unlabeled data from the target domain to train a model that maximizes likelihood over the training sample while minimizing the distance between the training and target distribution. Our focus is conditional probability models used for predicting a label structure y given input x based on features defined jointly over x and y. We propose practical measures of divergence between the two domains based on which we penalize features with large divergence, while improving the effectiveness of other less deviant correlated features. Empirical evaluation on several real-life information extraction tasks using Conditional Random Fields (CRFs) shows that our method of domain adaptation leads to significant reduction in error.
Exponential Family Sparse Coding with Applications to Self-taught Learning
Cited by 20 (2 self)
Sparse coding is an unsupervised learning algorithm for finding concise, slightly higher-level representations of inputs, and has been successfully applied to self-taught learning, where the goal is to use unlabeled data to help on a supervised learning task, even if the unlabeled data cannot be associated with the labels of the supervised task [Raina et al., 2007]. However, sparse coding uses a Gaussian noise model and a quadratic loss function, and thus performs poorly if applied to binary-valued, integer-valued, or other non-Gaussian data, such as text. Drawing on ideas from generalized linear models (GLMs), we present a generalization of sparse coding to learning with data drawn from any exponential family distribution (such as Bernoulli, Poisson, etc.). This gives a method that we argue is much better suited to model other data types than Gaussian. We present an algorithm for solving the L1-regularized optimization problem defined by this model, and show that it is especially efficient when the optimal solution is sparse. We also show that the new model results in significantly improved self-taught learning performance when applied to text classification and to a robotic perception task.