## Logistic Regression for Data Mining and High-Dimensional Classification (2004)

Citations: 10 (1 self)

### BibTeX

@TECHREPORT{Komarek04logisticregression,
  author      = {Paul Komarek},
  title       = {Logistic Regression for Data Mining and High-Dimensional Classification},
  institution = {},
  year        = {2004}
}

### Abstract

The focus of this thesis is fast and robust adaptations of logistic regression (LR) for data mining and high-dimensional classification problems. LR is well understood and widely used in the statistics, machine learning, and data analysis communities. Its benefits include a firm statistical foundation and a probabilistic model useful for "explaining" the data. There is, however, a perception that LR is slow, unstable, and unsuitable for large learning or classification tasks. Through fast approximate numerical methods, regularization to avoid numerical instability, and an efficient implementation, we will show that LR can outperform modern algorithms like Support Vector Machines (SVM) on a variety of learning tasks. Our novel implementation, which uses a modified iteratively re-weighted least squares estimation procedure, can compute model parameters for sparse binary datasets with hundreds of thousands of rows and attributes, and millions or tens of millions of nonzero elements, in just a few seconds. Our implementation also handles real-valued dense datasets of similar size.

### Citations

8983 | The nature of statistical learning theory
- Vapnik
- 1995
Citation Context: ...sifier. We didn't actually implement our own SVM, but embedded SVMlight [18] in the software. SVM is a novel type of learning machine which tries to find the largest margin linear classifier. Vapnik [42] shows how training a support vector machine leads to the following quadratic optimization problem: minimize W(α) = -Σ_{i=1}^{l} α_i + (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} y_i y_j α_i α_j k(x_i, x_j) subject to Σ_{i=1}^{l} y_i α_i = 0 and α_i ≥ 0, i = 1...

4937 | C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context: ...is possible that other varieties of tree-based classifiers, such as bagged or boosted decision trees, would perform better. However, we do not have an implementation of decision trees other than C4.5 [38]. Section 6.2 contains experimental results on the synthetic datasets described in Section 5.1.3.3. These datasets help us measure classifier score and speed as a function of the dataset's number of r...

3976 | Computer Architecture – A Quantitative Approach - Hennessy, Patterson - 2003

3921 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...mes are reported in seconds, and details about timing measurements may be found in Section 5.1.5. Before we describe AUC scores, we must first describe Receiver Operating Characteristic (ROC) curves [5]. We will use the ROC and AUC description we published in [20]. To construct an ROC curve, the dataset...
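The ROC/AUC construction referenced in the context above — rank points by classifier score, sweep the decision threshold, and accumulate (false positive, true positive) points — can be sketched compactly. This is our own illustrative Python (scores and labels are made up), not code from the thesis:

```python
# Hypothetical sketch of building an ROC curve and its AUC from classifier
# scores; function names and data are illustrative, not from the thesis.

def roc_points(scores, labels):
    """Sweep the threshold from high to low, emitting (FP, TP) counts."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = [(0, 0)]
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points

def auc(scores, labels):
    """Area under the ROC staircase, normalized to [0, 1] by the
    trapezoid rule over the (FP, TP) points."""
    pts = roc_points(scores, labels)
    pos = sum(labels)
    neg = len(labels) - pos
    area = 0.0
    for (fp0, tp0), (fp1, tp1) in zip(pts, pts[1:]):
        area += (fp1 - fp0) * (tp0 + tp1) / 2.0
    return area / (pos * neg)

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc(scores, labels), 3))  # → 0.889
```

The normalized area equals the fraction of positive-negative pairs the classifier ranks correctly (8 of 9 here), which is why AUC is threshold-free.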

2284 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation Context: ...y_j k(x_i, x_j), this can equivalently be written as: minimize W(α) = -αᵀ1 + (1/2) αᵀQα subject to αᵀy = 0 and 0 ≤ α_i ≤ C, i = 1, ..., l (B.11). Two good tutorials for SVM are [29] and [3]...

2047 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...to overcome singularity when inverting XᵀX to compute the covariance matrix. In this setting the ridge coefficient λ is a perturbation of the diagonal entries of XᵀX to encourage non-singularity [10]. The covariance matrix in a ridge regression setting is (XᵀX + λI)⁻¹. Using this formulation, ridge regression may be seen as belonging to an older family of techniques called regularization [34]...
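The diagonal perturbation described in this snippet can be sketched as a ridge solve of the normal equations, (XᵀX + λI)b = Xᵀy. This is a minimal pure-Python illustration under the conventional +λI formulation; the data and function names are ours, not the thesis's:

```python
# Minimal ridge regression sketch: add lam to the diagonal of X^T X
# before solving the normal equations. Illustrative data only.

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam*I) b = X^T y by Gaussian elimination."""
    n = len(X[0])
    # Build A = X^T X + lam*I and rhs = X^T y.
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    rhs = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    # Back substitution.
    b = [0.0] * n
    for i in reversed(range(n)):
        b[i] = (rhs[i] - sum(A[i][j] * b[j] for j in range(i + 1, n))) / A[i][i]
    return b

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x
print([round(v, 6) for v in ridge_fit(X, y, 0.0)])  # → [1.0, 2.0]
```

With λ = 0 this is ordinary least squares; any λ > 0 shrinks the slope toward zero while keeping XᵀX + λI invertible, which is exactly the non-singularity role the snippet describes.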

1697 | Text Categorization with Support Vector Machines: Learning with Many Relevant Features
- Joachims
- 1998
Citation Context: ...well as a promising future [40; 49; 48]. Even when classifiers capable of handling high-dimensional inputs are used, many authors apply feature selection to reduce the number of attributes. Joachims [15] has argued that feature selection may not be necessary, and may hurt performance. We believe our implementation of LR is suitable for text classification, and could be competitive with the state-of-t...

1441 | Making large-Scale SVM Learning Practical
- Joachims
- 1999
Citation Context: ...caled probability density function (pdf). Note that this pdf is not interpreted probabilistically [10; 34]. We did not implement SVM, and instead chose to use the popular SVMlight package, version 5 [18; 16]. This software uses a variety of tricks to find solutions to QP, defined in Equation 6.8, more quickly than traditional quadratic programming software. SVMlight can function as a linear or RBF SVM,...

1019 | Empirical analysis of predictive algorithms for collaborative filtering
- Breese, Heckerman, et al.
- 1998
Citation Context: ...usses the problem of link completion. In this task we are asked to sort a group of objects according to their likelihood of having been removed from a link. This is equivalent to collaborative filtering [2]. Link completion is a multiclass problem and is somewhat different than the binary classification problems we discuss in this thesis. While it is not common for LR to be used in this domain, our imp...

639 | A re-examination of text categorization methods
- Yang, Liu
- 1999
Citation Context: ...to data mining and high-dimensional classification problems was first discussed in Komarek and Moore [20]. We have made some initial text classification experiments, similar to those of Yang and Liu [47]. These were promising in that our scores were competitive with those of SVMlight, which is often considered state-of-the-art for text classification; and further our IRLS implementation ran nearly...

555 | Stacked generalization
- Wolpert
Citation Context: ...set. The superlearner is then applied to the testing dataset or to the held-out data points in a k-fold cross-validation experiment. This idea has been thoroughly investigated under the name stacking [45]...
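The stacking scheme described in this snippet — sublearners predict, and a superlearner is trained on their held-out predictions rather than the raw attributes — can be sketched with toy components. Everything here (the threshold sublearners and the deliberately trivial "best column" superlearner) is an illustrative stand-in, not the LR superlearner the software supports:

```python
# Toy stacking sketch: two weak "sublearners" make predictions, and a
# "superlearner" is fit on their predictions, not on the raw attributes.

def threshold_learner(feature_index, cutoff):
    """A sublearner that predicts 1 when one attribute exceeds a cutoff."""
    return lambda x: 1 if x[feature_index] > cutoff else 0

def fit_superlearner(meta_rows, labels):
    """Pick the sublearner (meta-feature column) with the best accuracy;
    a minimal stand-in for, e.g., fitting LR over sublearner outputs."""
    n_cols = len(meta_rows[0])
    def acc(col):
        return sum(r[col] == y for r, y in zip(meta_rows, labels))
    best = max(range(n_cols), key=acc)
    return lambda meta_row: meta_row[best]

train = [([0.1, 0.9], 1), ([0.2, 0.8], 1), ([0.9, 0.1], 0), ([0.8, 0.3], 0)]
subs = [threshold_learner(0, 0.5), threshold_learner(1, 0.5)]

# Meta-dataset: each row is the tuple of sublearner predictions.
meta = [[s(x) for s in subs] for x, _ in train]
labels = [y for _, y in train]
superlearner = fit_superlearner(meta, labels)

test_x = [0.15, 0.85]
print(superlearner([s(test_x) for s in subs]))  # → 1
```

In a real stacking setup the meta-dataset would come from held-out (cross-validated) predictions, as the snippet notes, so the superlearner does not just reward sublearners that overfit the training set.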

503 | Artificial Intelligence: A Modern Approach
- Russell, Norvig
- 1995
Citation Context: ...iminative classifier, such as LR, which estimates P(Y|x) directly from the data. The result is that generative classifiers are not directly optimizing the quantity of interest [33]. Russell and Norvig [39] contains a nice description of BC. Computing P(Y = 1) and P(x_j | Y) only needs to be done once, thus BC is always fast. By exploiting...

389 | Learning to Classify Text using Support Vector Machines
- Joachims
- 2002
Citation Context: ...eraged F1 scores were on par with the best we were able to produce using SVM LINEAR and SVM RBF, and were similar to scores reported by the SVMlight author on his version of the Reuters-21578 corpus [17]. Our LR implementation computed these results in less than one-third the time of linear SVM...

369 | On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
- Ng, Jordan
- 2001
Citation Context: ...sts with that of a discriminative classifier, such as LR, which estimates P(Y|x) directly from the data. The result is that generative classifiers are not directly optimizing the quantity of interest [33]. Russell and Norvig [39] contains a nice description of BC. Computing P(Y = 1) and P(x_j | Y) only needs to be done once, thus BC is always fast. By exploiting...

324 | Iterative Methods for Solving Linear Systems
- Greenbaum
- 1997
Citation Context: ...nteresting possibilities for IRLS with CG. It is possible to use CG with a preconditioning matrix, which may contribute to stability and speed if some or all of the covariance matrix can be estimated [41; 31; 7]. We made one experiment with a simple diagonal preconditioner [41]. The result was unsatisfactory but the failure was not investigated. Alternate techniques for IRLS or CG termination could be tried,...

305 | An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- Shewchuk
- 1994
Citation Context: ...ranteed to converge. However there is no guarantee that Steepest Descent will converge quickly. For example, Figure 2.1 shows Steepest Descent's path when started from a worst-case starting point x_0 [41]. The minimum point in Figure 2.1 is x* = (0, 0)ᵀ, which by definition satisfies Ax* = b. Let e_i = x* - x_i represent the error vector at iteration i. We cannot compute e_i since we do not know x...
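The setting in this snippet — minimizing a quadratic form by solving Ax = b for a symmetric positive definite A — is what conjugate gradient handles in at most n steps (in exact arithmetic), unlike steepest descent. The following is a textbook-style sketch, not Shewchuk's or the thesis's code, with an invented 2×2 system:

```python
# Illustrative conjugate gradient for Ax = b with a small SPD matrix.
# The matrix, vectors, and iteration count are made up for the demo.

def matvec(A, v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, x0, iters):
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]   # residual b - Ax
    d = list(r)                                        # first search direction
    for _ in range(iters):
        Ad = matvec(A, d)
        alpha = dot(r, r) / dot(d, Ad)                 # exact line search
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r_new = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        beta = dot(r_new, r_new) / dot(r, r)           # direction update
        d = [rn + beta * di for rn, di in zip(r_new, d)]
        r = r_new
    return x

A = [[3.0, 1.0], [1.0, 2.0]]       # symmetric positive definite
b = [5.0, 5.0]                     # true solution x* = (1, 2)
x = conjugate_gradient(A, b, [0.0, 0.0], iters=2)
print([round(v, 6) for v in x])    # CG converges in n = 2 steps
```

Because each new direction is A-conjugate to the previous ones, the error e_i = x* - x_i is eliminated one independent direction per iteration, which is the contrast with steepest descent that Figure 2.1 in the thesis illustrates.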

174 | Introduction to mathematical statistics - Craig, T - 1978

163 | A Comparison of Classifiers and Document Representations for the Routing Problem, SIGIR
- Schutze, Hull, et al.
- 1995
Citation Context: ...and Oles [49]. The authors discuss the reputation of LR as slow and numerically unstable in text categorization, referring to several papers. They suggest that the failures reported in Schutze et al. [40] could be the result of that paper's lack of regularization in general, and "[a]nother reason could be their choice of the Newton-Raphson method of numerical optimization, which in our experience could...

157 | Fundamentals of Matrix Computations
- Watkins
- 2002
Citation Context: ...of a quadratic form f can be found by setting f′(x) = 0, which is equivalent to solving Ax = b for x. Therefore, one may use Gaussian elimination or compute the inverse or left pseudo-inverse of A [23; 44]. The time complexity of these methods is O(n³), which is infeasible for large values of n. Several possibilities exist for inverting a matrix asymptotically faster than O(n³), with complexi...

137 | All of statistics: a concise course in statistical inference
- Wasserman
Citation Context: ...with λ = σ²/τ², the Bayesian point estimator for β is the same as the least squares estimator given loss function RSS_ridge and an appropriately chosen prior variance τ² [10; 43]. Linear regression is useful for data with linear relations or applications for which a first-order approximation is adequate. There are many applications for which l...

130 | The design and analysis of algorithms - Kozen - 1992

92 | Introduction to radial basis function networks
- Orr
- 1996
Citation Context: ...[10]. The covariance matrix in a ridge regression setting is (XᵀX + λI)⁻¹. Using this formulation, ridge regression may be seen as belonging to an older family of techniques called regularization [34]. Another interpretation of ridge regression is available through Bayesian point estimation. In this setting the belief that β should be small is coded into a prior distribution. In particular, suppos...

81 | Text categorization based on regularized linear classifiers
- Zhang, Oles
- 2001
Citation Context: ...assification results into a deeper analysis. We believe that LR is not widely used for data mining because of an assumption that LR is unsuitably slow for high-dimensional problems. In Zhang and Oles [49], the authors observe that many information retrieval experiments with LR lacked regularization or used too few attributes in the model. Though they address these deficiencies, they still report that...

68 | Linear algebra and its applications
- Lay
- 1997
Citation Context: ...of a quadratic form f can be found by setting f′(x) = 0, which is equivalent to solving Ax = b for x. Therefore, one may use Gaussian elimination or compute the inverse or left pseudo-inverse of A [23; 44]. The time complexity of these methods is O(n³), which is infeasible for large values of n. Several possibilities exist for inverting a matrix asymptotically faster than O(n³), with complexi...

50 | Algorithms for maximum-likelihood logistic regression
- Minka
- 2001
Citation Context: ...numerical methods are typically used to find the MLE β. CG is a popular choice, and by some reports CG provides as good or better results for this task than any other numerical method tested to date [27]. The time complexity of this approach is simply the time complexity of the numerical method used. 4.3 Iteratively Re-weighted Least Squares: An alternative to numerically maximizing the LR maximum lik...

39 | Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic
- Yan, Dodier, et al.
- 2003
Citation Context: ...tion 5.2.1.13. Instead, those experiments suggested that the deviance grew slowly after reaching a minimum. Since optimizing the likelihood does not necessarily correspond to maximizing the AUC score [46], it is not clear that choosing a slightly non-optimal iterate should have any negative effect at all. We have neglected many interesting possibilities for IRLS with CG. It is possible to use CG with...

28 | Efficient exact k-NN and nonparametric classification in high dimensions
- Liu, Moore, et al.
- 2003
Citation Context: ...KNN uses one or more space-partitioning trees to organize the training data according to its geometric structure. In our comparisons, we use the ball tree-based KNS2 algorithm described in Liu et al. [24]. The KNS2 algorithm uses two ball trees to separately store the positive and negative training examples. If the data has a very uneven class distribution, then one of the ball trees will be small...

27 | Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization
- Zhang, Jin, et al.
- 2003
Citation Context: ...lgorithm called Böhning's method, in a short technical report. Minka mentioned the need for regularization, and in two of his three datasets found that CG outperformed other algorithms. Zhang et al. [48] preferred Hestenes-Stiefel direction updates when using CG for nonlinear convex optimization, which is somewhat at odds with the conclusions of this chapter. We do not have an explanation for this co...

23 | Fast robust logistic regression for large sparse datasets with binary outputs
- Komarek, Moore
- 2003
Citation Context: ...datasets using an exact version of Gaussian elimination. Later, we combined the column elimination with the Cholesky decomposition. This Modified Cholesky technique is described in Komarek and Moore [20]. While the latter approach was faster than the former, it was still very slow. We also encountered stability and overfitting problems. In particular, we encountered the model and weight saturation de...

16 | A comparison of statistical and machine learning algorithms on the task of link completion
- Goldenberg, Kubica, et al.
- 2003
Citation Context: ...nt people associated with that work. Again we chose the most frequently appearing attribute, Blanc Mel, for the target. We have previously published related link analysis work using LR. Kubica et al. [22] discusses the problem of link completion. In this task we are asked to sort a group of objects according to their likelihood of having been removed from a link. This is equivalent to collaborative filte...

14 | Using Tarjan's Red Rule for Fast Dependency Tree Construction. NIPS 15
- Pelleg, Moore
- 2002
Citation Context: ...Synthetic Datasets: We use four groups of synthetic datasets for characterization of LR and other learning algorithms. The datasets are described further below. To create these datasets, the method of [36] is employed. A random tree is generated with one node per dataset attribute. Each row of the dataset is generated independently of the other rows. Two parameters, the coupling c and the sparsity s, a...

12 | Generalized linear models, volume 37
- McCullagh, Nelder
- 1989
Citation Context: ...X, b)) (4.14). Since b_i = (XᵀWX)⁻¹XᵀWXb_i, we may rewrite Equation 4.14 as b_{i+1} = (XᵀWX)⁻¹Xᵀ(WXb_i + (y - μ(X, b_i))) (4.15) = (XᵀWX)⁻¹XᵀWz (4.16), where z = Xb_i + W⁻¹(y - μ(X, b_i)) [10; 30; 25]. The elements of vector z are often called the adjusted dependent covariates, since we may view Equation 4.16 as the weighted least squares problem from Section 3.1 with dependent variables, or covar...
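The IRLS update quoted in this context, b_{i+1} = (XᵀWX)⁻¹XᵀWz with W = diag(μ_i(1 - μ_i)) and z = Xb_i + W⁻¹(y - μ), can be sketched directly. This is an illustrative pure-Python version on invented toy data, using an exact dense solve where the thesis substitutes conjugate gradient; it is not the Auton implementation:

```python
# Sketch of iteratively re-weighted least squares for logistic regression,
# following the quoted update. Toy data; exact solve instead of CG.
import math

def solve(A, rhs):
    """Tiny Gaussian elimination (no pivoting; adequate for the small
    symmetric positive definite systems produced here)."""
    n = len(rhs)
    for col in range(n):
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    out = [0.0] * n
    for i in reversed(range(n)):
        out[i] = (rhs[i] - sum(A[i][j] * out[j]
                               for j in range(i + 1, n))) / A[i][i]
    return out

def irls_logistic(X, y, iterations=25):
    """Repeat b = (X^T W X)^-1 X^T W z until (effectively) converged."""
    n = len(X[0])
    b = [0.0] * n
    for _ in range(iterations):
        eta = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m * (1.0 - m) for m in mu]
        # Adjusted dependent covariates: z = X b + W^-1 (y - mu).
        z = [e + (yi - mi) / wi for e, yi, mi, wi in zip(eta, y, mu, w)]
        A = [[sum(wk * row[i] * row[j] for wk, row in zip(w, X))
              for j in range(n)] for i in range(n)]
        rhs = [sum(wk * row[i] * zk for wk, row, zk in zip(w, X, z))
               for i in range(n)]
        b = solve(A, rhs)
    return b

# Tiny non-separable dataset: intercept column plus one attribute.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]
b = irls_logistic(X, y)
p0 = 1.0 / (1.0 + math.exp(-b[0]))                  # P(y=1 | x=0)
p5 = 1.0 / (1.0 + math.exp(-(b[0] + 5.0 * b[1])))   # P(y=1 | x=5)
print(b[1] > 0, p0 < 0.5, p5 > 0.5)
```

For logistic regression this update is exactly Newton-Raphson on the log-likelihood, which is why the thesis's speedups come from approximating the inner linear solve rather than changing the outer iteration.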

11 | The distribution of the distance in a hypersphere
- Hammersley
- 1950
Citation Context: ...case for our synthetic datasets. A further problem occurs for non-clustered high-dimensional data. In such data all data points are equidistant from one another, invalidating the intuition behind KNN [9]. For our experiments, we tried k = 1, k = 9, and k = 129 due to our experience with KNS2 on real-world datasets. We can expect poor predictions with high variance from values of k which are too small...
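The near-equidistance effect cited in this snippet is easy to demonstrate numerically. This quick illustration (our own, not from the thesis) compares the relative spread of pairwise distances for uniform random points in 2 versus 1000 dimensions:

```python
# Quick demo of distance concentration: as dimensionality grows, the gap
# between the nearest and farthest neighbor shrinks relative to the
# distances themselves, so "nearest" loses meaning. Parameters are made up.
import math
import random

def distance_spread(dim, n_points=50, seed=0):
    """Return (max - min) / min over all pairwise Euclidean distances
    of n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

low_d, high_d = distance_spread(2), distance_spread(1000)
print(round(low_d, 2), round(high_d, 2))  # spread collapses in high dimensions
```

In 2 dimensions the farthest pair is many times farther than the closest pair; in 1000 dimensions all pairwise distances cluster around the same value, which is the KNN-defeating behavior the thesis points out.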

9 | The internet movie database. http://www.imdb.com - IMDb

5 | Fitting Linear Models: An Application of Conjugate Gradient Algorithms
- McIntosh
- 1982
Citation Context: ...ln(1 - u_i), where R is the number of rows in the dataset. For logistic regression, IRLS is equivalent to Newton-Raphson [25]. To improve the speed of IRLS, this implementation uses conjugate gradient [31; 7; 26; 27] as an approximate linear solver [20]. This solver is applied to the linear regression (XᵀWX)b_new = XᵀWz (B.7), where W = diag(μ_i(1 - μ_i)) and z = Xb_old + W⁻¹(y - μ). The current estimate of b...

4 | A PowerPoint tutorial on Support Vector Machines. Available from http://www.cs.cmu.edu/∼awm/tutorials/svm.html
- Moore
- 2001
Citation Context: ...Q_ij = y_i y_j k(x_i, x_j), this can equivalently be written as: minimize W(α) = -αᵀ1 + (1/2) αᵀQα subject to αᵀy = 0 and 0 ≤ α_i ≤ C, i = 1, ..., l (B.11). Two good tutorials for SVM are [29] and [3]...

2 | The Auton Lab. http://www.autonlab.org
- AUTON
- 2004
Citation Context: ...ssion model which allows classification of an experiment x_i as positive or negative, that is, belonging to either the positive or negative class. Though LR is applicable to datasets with outcomes in [0, 1], we will restrict our discussion to the binary case. We can think of an experiment in X as a Bernoulli trial with mean parameter μ(x_i). Thus y_i is a Bernoulli random variable with mean μ(x_i) and v...

2 | Generalised Linear Interactive Modeling package. http://www.nag.co.uk/stats/GDGE soft.asp, http://lib.stat.cmu.edu/glim
- GLIM
- 2004
Citation Context: ...ktop computing power on a par with the large mainframe computers of 15 or 20 years ago. McIntosh goes on to extol the limited memory required by CG compared to QR algorithms. The author modified the GLIM [6] statistics software to use CG, and explored its application to several generalized linear models including LR. It appears that McIntosh replaced IRLS entirely, using nonlinear CG to create what we de...

2 | Auton Fast Classifiers. http://www.autonlab.org
- Komarek, Liu, et al.
- 2004
Citation Context: ...ese concluding chapters are several appendices. We acknowledge several important people and organizations in Appendix A. Appendix B reproduces the documentation for the Auton Fast Classifier software [19] used in this thesis. Appendix C contains some light-hearted, miscellaneous information which didn't fit elsewhere. Concluding this thesis is the bibliography...

2 | A PowerPoint tutorial on Probabilistic Machine Learning. Available from http://www.cs.cmu.edu/∼awm/tutorials/prob.html
- Moore
- 2001
Citation Context: ...one or more learners, for example with the super learner. The syntax used for such keywords is detailed in the keyword's description. B.4.1 bc: This is an implementation of a Naive Bayesian Classifier [28; 5] with binary-valued input attributes (all input values must be zero or one). The implementation has been optimized for speed and accuracy on very high dimensional sparse data. P(y = ACT | x_1, x_2, ...
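The Naive Bayes classifier over binary attributes described in this snippet can be sketched as follows. This is a rough illustration in our own style, with our own smoothing choice and invented data, not the optimized `bc` implementation from the Auton package:

```python
# Rough sketch of Naive Bayes over binary (0/1) attributes: estimate
# P(y) and P(x_j = 1 | y) once from counts, then classify by comparing
# log P(y) + sum_j log P(x_j | y). Data and smoothing are illustrative.
import math

def train_nb(rows, labels, alpha=1.0):
    """Estimate P(y) and P(x_j = 1 | y) with Laplace smoothing."""
    n_attrs = len(rows[0])
    model = {}
    for y in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == y]
        prior = (len(idx) + alpha) / (len(labels) + 2 * alpha)
        cond = [(sum(rows[i][j] for i in idx) + alpha) /
                (len(idx) + 2 * alpha) for j in range(n_attrs)]
        model[y] = (prior, cond)
    return model

def predict_nb(model, x):
    """Return the class with the larger posterior log-score."""
    def score(y):
        prior, cond = model[y]
        return math.log(prior) + sum(
            math.log(p if xj else 1.0 - p) for xj, p in zip(x, cond))
    return max((0, 1), key=score)

rows = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
labels = [1, 1, 1, 0, 0]
model = train_nb(rows, labels)
print(predict_nb(model, [1, 0, 1]))  # → 1
```

The counts are computed once at training time, which matches the snippet's point that the Bayes classifier is always fast; for sparse high-dimensional data one would sum only over the nonzero attributes.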