Results 1 - 10 of 101
An interior-point method for large-scale l1-regularized logistic regression
 Journal of Machine Learning Research
, 2007
Cited by 153 (6 self)
Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
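As a rough illustration of the problem this abstract addresses, the sketch below minimizes the average logistic loss plus an ℓ1 penalty and traces a regularization path with warm starts. It is not the interior-point method of the paper: it swaps in plain proximal gradient descent (ISTA), omits the intercept term, and assumes labels in {-1, +1}. The names `fit_l1_logreg`, `lam_max`, and `regularization_path` are hypothetical.

```python
import math


def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logreg(X, y, lam, w0=None, step=1.0, iters=500):
    """Proximal gradient (ISTA) on (1/m) * sum log(1 + exp(-y_i * w.x_i)) + lam * ||w||_1.

    Assumption: no intercept, labels in {-1, +1}. For guaranteed descent the
    step should be at most 4 / max_i ||x_i||^2.
    """
    m, n = len(X), len(X[0])
    w = list(w0) if w0 is not None else [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coeff = -yi * sigmoid(-margin) / m  # derivative of the loss w.r.t. w.x_i
            for j in range(n):
                grad[j] += coeff * xi[j]
        # Gradient step on the smooth loss, then shrinkage for the l1 term.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w


def lam_max(X, y):
    """Smallest penalty at which w = 0 is optimal for the no-intercept model."""
    m, n = len(X), len(X[0])
    return max(abs(sum(yi * xi[j] for xi, yi in zip(X, y))) for j in range(n)) / (2.0 * m)


def regularization_path(X, y, lams):
    """Solve for each penalty in `lams` (ideally decreasing), warm-starting each
    solve from the previous solution."""
    path, w = [], None
    for lam in lams:
        w = fit_l1_logreg(X, y, lam, w0=w)
        path.append((lam, w))
    return path
```

Starting the path at or above `lam_max(X, y)` gives the all-zero solution, and each smaller penalty then begins from a nearby point rather than from scratch, which is the warm-start idea in its simplest form.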
Efficient structure learning of Markov networks using L1-regularization
 In NIPS
, 2006
Cited by 107 (2 self)
Markov networks are widely used in a wide variety of applications, in problems ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally effective method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: modeling the joint distribution of pixel values in the MNIST data, and modeling the joint distribution of genetic sequence variations in the human HapMap data. We show that our L1-based method achieves considerably higher generalization performance than the more standard L2-based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
Combining SVMs with various feature selection strategies
 Taiwan University
, 2005
Cited by 58 (0 self)
Feature selection is an important issue in many research areas. There are several reasons for selecting important features, such as reducing the learning time and improving the accuracy. This thesis investigates the performance of combining support vector machines (SVM) and various feature selection strategies. The first part of the thesis mainly describes the existing feature selection methods and our experience in using those methods in a competition. The second part studies more feature selection strategies using the SVM.
An efficient earth mover’s distance algorithm for robust histogram comparison
 PAMI
, 2007
Cited by 44 (4 self)
We propose EMD-L1: a fast and exact algorithm for computing the Earth Mover’s Distance (EMD) between a pair of histograms. The efficiency of the new algorithm enables its application to problems that were previously prohibitive due to high time complexities. The proposed EMD-L1 significantly simplifies the original linear programming formulation of EMD. Exploiting the L1 metric structure, the number of unknown variables in EMD-L1 is reduced to $O(N)$ from $O(N^2)$ of the original EMD for a histogram with N bins. In addition, the number of constraints is reduced by half and the objective function of the linear program is simplified. Formally without any approximation, we prove that the EMD-L1 formulation is equivalent to the original EMD with an L1 ground distance. To perform the EMD-L1 computation, we propose an efficient tree-based algorithm, Tree-EMD. Tree-EMD exploits the fact that a basic feasible solution of the simplex algorithm-based solver forms a spanning tree when we interpret EMD-L1 as a network flow optimization problem. We empirically show that this new algorithm has average time complexity of $O(N^2)$, which significantly improves the best reported super-cubic complexity of the original EMD. The accuracy of the proposed methods is evaluated by ...
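A small sketch of why an L1 ground distance simplifies the problem, restricted to the one-dimensional case (an assumption made here for brevity; the paper targets general histograms, and its Tree-EMD solver is not reproduced). The usual intuition is that with an L1 ground distance mass only needs to move between neighboring bins, and in 1D those N - 1 neighbor flows are fixed by the running difference of the two histograms. The function name `emd_l1_1d` is hypothetical.

```python
from itertools import accumulate


def emd_l1_1d(p, q, tol=1e-9):
    """EMD between two 1D histograms with equal total mass and L1 ground distance.

    The k-th running sum of (p - q) is the net mass that must cross the
    boundary between bin k and bin k + 1, so the total cost is the sum of
    the absolute values of those N - 1 boundary flows.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > tol:
        raise ValueError("histograms must have equal total mass")
    flows = list(accumulate(a - b for a, b in zip(p, q)))
    return sum(abs(f) for f in flows[:-1])  # the last running sum is the (zero) mass gap
```

For example, `emd_l1_1d([1, 0, 0], [0, 0, 1])` moves one unit of mass across two bin boundaries, giving a cost of 2.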
Statistical challenges with high dimensionality: Feature selection in knowledge discovery
 Proceedings of the International Congress of Mathematicians
, 2006
Cited by 35 (9 self)
Technological innovations have revolutionized the process of scientific research and knowledge discovery. The availability of massive data and challenges from frontiers of research and development have reshaped statistical thinking, data analysis and theoretical studies. The challenges of high-dimensionality arise in diverse fields of sciences and the humanities, ranging from computational biology and health studies to financial engineering and risk management. In all of these fields, variable selection and feature extraction are crucial for knowledge discovery. We first give a comprehensive overview of statistical challenges with high dimensionality in these diverse disciplines. We then approach the problem of variable selection and feature extraction using a unified framework: penalized likelihood methods. Issues relevant to the choice of penalty functions are addressed. We demonstrate that for a host of statistical problems, as long as the dimensionality is not excessively large, we can estimate the model parameters as well as if the best model is known in advance. The persistence property in risk minimization is also addressed. The applicability of such a theory and method to diverse statistical problems is demonstrated. Other related problems with high-dimensionality are also discussed.
Tracking curved regularized optimization solution paths
 in Advances in Neural Information Processing Systems (NIPS*2004)
, 2004
Cited by 26 (2 self)
Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
Least Squares Linear Discriminant Analysis
Cited by 26 (6 self)
Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification. LDA in the binary-class case has been shown to be equivalent to linear regression with the class label as the output. This implies that LDA for binary-class classifications can be formulated as a least squares problem. Previous studies have shown certain relationship between multivariate linear regression and LDA for the multi-class case. Many of these studies show that multivariate linear regression with a specific class indicator matrix as the output can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case. In this paper, a novel formulation for multivariate linear regression is proposed. The equivalence relationship between the proposed least squares formulation and LDA for multi-class classifications is rigorously established under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data. Several LDA extensions based on the equivalence relationship are discussed.
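The binary-class equivalence mentioned in this abstract can be checked numerically. The sketch below is an illustration only, assuming two-dimensional data and +1/-1 labels; it does not implement the multi-class formulation the paper proposes. It computes the Fisher LDA direction and the least squares regression direction (with centered data, which is equivalent to fitting an intercept) and lets the caller compare them. All function names are hypothetical.

```python
def mean(rows):
    """Column means of a list of 2D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mu):
    """2x2 scatter matrix: sum over points of (x - mu)(x - mu)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def solve2(m, v):
    """Solve the 2x2 linear system m @ x = v by the explicit inverse."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * v[0] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - m[1][0] * v[0]) / det]


def lda_direction(pos, neg):
    """Fisher LDA direction: within-class scatter inverse times mean difference."""
    mp, mn = mean(pos), mean(neg)
    sp, sn = scatter(pos, mp), scatter(neg, mn)
    sw = [[sp[a][b] + sn[a][b] for b in range(2)] for a in range(2)]
    return solve2(sw, [mp[0] - mn[0], mp[1] - mn[1]])


def least_squares_direction(pos, neg):
    """Slope vector of least squares regression of +1/-1 labels on the features."""
    rows = pos + neg
    y = [1.0] * len(pos) + [-1.0] * len(neg)
    mu = mean(rows)
    ybar = sum(y) / len(y)
    st = scatter(rows, mu)  # centered X^T X
    rhs = [sum((r[j] - mu[j]) * (t - ybar) for r, t in zip(rows, y)) for j in range(2)]
    return solve2(st, rhs)
```

The two returned vectors generally differ in length but point the same way, so their 2D cross product should vanish up to rounding.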
Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections
, 2007
Cited by 18 (0 self)
For dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix $A \in \mathbb{R}^{n \times D}$ with a random matrix $R \in \mathbb{R}^{D \times k}$ ($k \ll D$) whose entries are i.i.d. samples of the standard Cauchy $C(0,1)$. Because of the impossibility result, one cannot hope to recover the pairwise l1 distances in A from $B = A \times R \in \mathbb{R}^{n \times k}$ using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that $k = O\left(\frac{\log n}{\epsilon^2}\right)$ suffices with the constants explicitly given. Asymptotically (as $k \to \infty$), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution by an inverse Gaussian. Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) lemma, Cauchy random projections
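A minimal sketch of two of the nonlinear estimators named in this abstract. It relies on the fact that a Cauchy-weighted sum of the coordinate differences of two rows is itself Cauchy with scale equal to their l1 distance. The bias-correction factor in the geometric mean estimator, cos^k(π/(2k)), is the form commonly given for this estimator and should be read as an assumption here, since the abstract does not state it; the MLE is not shown. Function names are hypothetical.

```python
import math
import random
import statistics


def cauchy_matrix(D, k, rng):
    """D x k matrix of i.i.d. standard Cauchy C(0, 1) samples (inverse-CDF sampling)."""
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(k)]
            for _ in range(D)]


def project(row, R):
    """Multiply one data row (length D) by the D x k projection matrix R."""
    k = len(R[0])
    return [sum(row[i] * R[i][j] for i in range(len(row))) for j in range(k)]


def median_estimator(x):
    """Sample median of |x_j|; the median of |Cauchy(0, d)| equals d."""
    return statistics.median(abs(v) for v in x)


def geometric_mean_estimator(x):
    """cos^k(pi / (2k)) * prod |x_j|^(1/k), computed in log space (needs k >= 2)."""
    k = len(x)
    log_gm = sum(math.log(abs(v)) for v in x) / k
    return math.exp(k * math.log(math.cos(math.pi / (2 * k))) + log_gm)
```

To estimate the l1 distance between two rows `u` and `v`, project both with the same `R`, subtract the projections coordinate-wise, and pass the k differences to either estimator.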