## Text Categorization Based on Regularized Linear Classification Methods (2000)

Venue: Information Retrieval

Citations: 81 (2 self)

### BibTeX

```
@article{Zhang00textcategorization,
  author  = {Tong Zhang and Frank J. Oles},
  title   = {Text Categorization Based on Regularized Linear Classification Methods},
  journal = {Information Retrieval},
  year    = {2000},
  volume  = {4},
  pages   = {5--31}
}
```

### Abstract

A number of linear classification methods such as the linear least squares fit (LLSF), logistic regression, and support vector machines (SVM's) have been applied to text categorization problems. These methods share the characteristic of finding hyperplanes that approximately separate a class of document vectors from its complement. However, support vector machines have so far been considered special in that they have been demonstrated to achieve state-of-the-art performance. It is therefore worthwhile to understand whether such good performance is unique to the SVM design, or if it can also be achieved by other linear classification methods. In this paper, we compare a number of known linear classification methods as well as some variants in the framework of regularized linear systems. We will discuss the statistical and numerical properties of these algorithms, with a focus on text categorization. We will also provide some numerical experiments to illustrate these algorithms on a number of datasets.
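The unifying view in the abstract, that LLSF-style least squares, logistic regression, and the linear SVM all minimize an average convex loss of the margin y_i w^T x_i plus a quadratic regularizer λ‖w‖², can be sketched as follows. This is a minimal illustration only: the (sub)gradient solver, step size, and toy data are my assumptions for demonstration, not the paper's training procedure or experimental setup.

```python
import numpy as np

# Each method minimizes (1/n) * sum_i f(y_i w^T x_i) + lam * ||w||^2;
# only the convex loss f differs between the three linear classifiers.
LOSS_GRADS = {
    "squared":  lambda z: 2.0 * (z - 1.0),           # d/dz (z - 1)^2
    "logistic": lambda z: -1.0 / (1.0 + np.exp(z)),  # d/dz ln(1 + exp(-z))
    "hinge":    lambda z: -(z < 1.0).astype(float),  # subgradient of max(0, 1 - z)
}

def fit_linear(X, y, loss="hinge", lam=0.01, lr=0.1, iters=2000):
    """Minimize (1/n) sum_i f(y_i w^T x_i) + lam ||w||^2 by (sub)gradient descent."""
    grad_f = LOSS_GRADS[loss]
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        z = y * (X @ w)                    # margins y_i w^T x_i
        g = (X.T @ (grad_f(z) * y)) / n    # chain rule through z
        w -= lr * (g + 2.0 * lam * w)
    return w

# Toy separable data: all three losses recover the same sign pattern.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 0.3, (20, 2)), rng.normal(-1, 0.3, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
for name in LOSS_GRADS:
    w = fit_linear(X, y, loss=name)
    print(name, np.mean(np.sign(X @ w) == y))
```

On well-separated data the three hyperplanes agree in their predictions, which is the point the paper then tests on real text collections.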

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...instead of the log-likelihood function ln(1 + exp(−z)) in (6), where both a and c can be absorbed into the regularization parameter λ. The support vector machine is a method originally proposed by Vapnik [3, 19, 21] that has nice properties from the sample complexity theory. It is designed as a modification of the Perceptron algorithm. Slightly different from our approach of forcing threshold = 0, and then compe...

3265 | Convex Analysis
- ROCKAFELLAR
- 1996
Citation Context: ... Σ_i [k(α_i) + α_i(w^T x_i − y_i)] + λ‖w‖². Then we need to show that there exists (ŵ, α̂) such that L(ŵ, α̂) = inf_w sup_α L(w, α) = sup_α inf_w L(w, α). (19) It is well-known (for example, see [18]) that the duality gap G is non-negative, where G = inf_w sup_α L(w, α) − sup_α inf_w L(w, α). Furthermore, equation (19) has a solution if G = 0. We need to find (ŵ, α̂) such that L(ŵ, α̂) = inf ...

2171 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...instead of the log-likelihood function ln(1 + exp(−z)) in (6), where both a and c can be absorbed into the regularization parameter λ. The support vector machine is a method originally proposed by Vapnik [3, 19, 21] that has nice properties from the sample complexity theory. It is designed as a modification of the Perceptron algorithm. Slightly different from our approach of forcing threshold = 0, and then compe...

1958 | Matrix computations
- Golub, Van Loan
- 1996

Citation Context: ... in (12) is convex, thus it has a unique local minimum which is also the global minimum. Methods investigated in this section are based on the following generic relaxation algorithm (for example, see [7]): Algorithm 1 (Gauss-Seidel): let w = 0 and r_i = w^T x_i = 0; for k = 1, 2, ...: for j = 1, ..., d: find Δw_j by approximately minimizing (1/n) Σ_i f(r_i + Δw_j x_ij, y_i) + λ(w_j + Δw_j)^2; update...
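The relaxation scheme quoted above becomes fully concrete for the squared loss, where each coordinate step has a closed form. A hedged sketch: maintaining r_i = w^T x_i incrementally follows the quoted Algorithm 1, but the exact per-coordinate minimizer below is specific to the squared loss and is my illustration, not the paper's treatment of general smooth losses.

```python
import numpy as np

def gauss_seidel_ridge(X, y, lam=0.1, sweeps=100):
    """Coordinate relaxation for (1/n)||Xw - y||^2 + lam ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    r = np.zeros(n)   # r_i = w^T x_i, starts at 0 since w = 0
    for _ in range(sweeps):
        for j in range(d):
            xj = X[:, j]
            # Exact minimizer over dw of (1/n) sum (r + dw*xj - y)^2 + lam*(w_j + dw)^2
            dw = -((xj @ (r - y)) / n + lam * w[j]) / ((xj @ xj) / n + lam)
            w[j] += dw
            r += dw * xj   # keep residuals consistent with w, O(n) per coordinate
    return w

# Sanity check against the closed-form ridge solution (X^T X/n + lam I)^{-1} X^T y/n.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
w_cd = gauss_seidel_ridge(X, y)
w_exact = np.linalg.solve(X.T @ X / 50 + 0.1 * np.eye(5), X.T @ y / 50)
print(np.allclose(w_cd, w_exact, atol=1e-6))
```

Because each update touches only one column, this style of solver handles the large sparse systems that arise from document-term matrices without any matrix factorization.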

1697 | Text Categorization with Support Vector Machines: Learning with Many Relevant Features
- Joachims
- 1998
Citation Context: ...statistical methods for text categorization in [24]. The best performances previously reported in the literature are from weighted resampled decision trees in [23] and (linear) support vector machines in [12, 4]. Integral parts of all these approaches are tokenization, feature selection, and creating numeric vector representations of documents. The first step, tokenization, is laid out in detail in Figure 1. T...

1320 | Generalized Additive Models
- Hastie, Tibshirani
- 1990
Citation Context: ...Unfortunately, the Gaussian noise assumption, which is continuous, can only be satisfied approximately for classification problems, since y_i ∈ {−1, 1} is discrete. In statistics, logistic regression (cf. [8], Section 4.5) has often been employed to remedy this problem. As we have pointed out before, even though there has been considerable interest in applying logistic regression for text categorization, ...

954 | A Comparative Study on Feature Selection in Text Categorization
- Yang, Pedersen
- 1997
Citation Context: ...o be taken by themselves to be the sole features of interest.) We will not take up the specifics of feature selection, but a number of methods of varying degrees of sophistication have been studied in [27]. Feature selection might be done only once for the entire data set, but experience indicates that better results will be obtained by doing feature selection separately for each category, reflecting th...
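One widely used feature-selection score from the literature the context points to is the information gain of a binary term-presence feature with respect to category membership. A small sketch; the contingency counts in the demo are illustrative assumptions, not data from the paper.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) count pair, in bits."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(tp, fp, fn, tn):
    """IG = H(category) - H(category | term present?).
    tp: docs with term, in category;  fp: with term, out of category;
    fn: without term, in category;    tn: without term, out of category."""
    n = tp + fp + fn + tn
    h_cat = entropy(tp + fn, fp + tn)
    h_cond = ((tp + fp) / n) * entropy(tp, fp) + ((fn + tn) / n) * entropy(fn, tn)
    return h_cat - h_cond

print(information_gain(40, 10, 10, 40))  # term correlated with the category
print(information_gain(25, 25, 25, 25))  # term independent of category -> 0.0
```

Ranking terms by this score per category, and keeping the top-scoring ones, is the per-category selection strategy the context recommends.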

758 | A comparison of event models for naive Bayes text classification
- McCallum, Nigam
Citation Context: ... as Logistic Reg, the (linear) support vector machine (8) denoted as SVM, and the modified SVM corresponding to (18) denoted as Mod SVM. For comparison purposes, we also include results of Naive Bayes [14] as a baseline method. 4.1 Some Implementation Issues. A number of feature selection methods for text categorization were compared in [27]. In our experiments, we employ a method similar to the informa...

701 | Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods
- Platt
- 1999
Citation Context: ...nts, we adjust the threshold after the hyperplane is computed to compensate for this problem (we do this by finding a value that minimizes the training error). This phenomenon has also been observed in [16], where Platt found that fitting a sigmoid to the output of SVM can enhance the performance on Reuters; the effect of such a sigmoid fitting is equivalent to using a different threshold. It is unclear what...
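The threshold adjustment described in the context, scan candidate thresholds after the hyperplane is fixed and keep the one minimizing training error, can be sketched generically. This is a plausible realization consistent with the snippet; the exact search the authors use is not specified there, and the scores below are toy numbers.

```python
import numpy as np

def tune_threshold(scores, y):
    """Pick t minimizing training errors of the rule: predict +1 iff score >= t."""
    # Candidate thresholds: midpoints between sorted scores, plus the extremes.
    s = np.sort(scores)
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    best_t, best_err = 0.0, np.inf
    for t in candidates:
        pred = np.where(scores >= t, 1, -1)
        err = np.mean(pred != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

scores = np.array([-2.0, -1.0, -0.5, 0.2, 0.9, 1.5])   # hyperplane outputs w^T x_i
y      = np.array([-1,   -1,   -1,   1,   1,   1])
t, err = tune_threshold(scores, y)
print(err)  # 0.0: some threshold between -0.5 and 0.2 separates perfectly
```

Shifting the threshold is a monotone transform of the classifier output, which is why it is equivalent in effect to Platt's sigmoid fitting for a fixed decision rule.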

639 | A re-examination of text categorization methods
- Yang, Liu
- 1999
Citation Context: ...te a tradeoff between precision and recall, then one can compute the break-even point (BEP), where precision equals recall, as an evaluation criterion for the performance of a classification algorithm [26]. Another widely used single-number metric is the F1 metric, defined as the harmonic mean of the precision and the recall [24]. The standard dataset for comparing text categorization algorithms is the ...
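The metrics named in the context are straightforward to compute from a contingency of predictions. A minimal sketch of precision, recall, and their harmonic mean F1; the BEP itself additionally requires sweeping a decision threshold until precision equals recall, which is omitted here.

```python
def precision_recall_f1(pred, gold):
    """pred, gold: sequences of +1/-1 labels; +1 marks category members."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == -1)
    fn = sum(1 for p, g in zip(pred, gold) if p == -1 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 2 true positives, 1 false positive, 1 false negative.
p, r, f1 = precision_recall_f1([1, 1, -1, 1, -1], [1, -1, -1, 1, 1])
print(p, r, f1)  # precision 2/3, recall 2/3, so F1 is also 2/3
```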

504 | Inductive learning algorithms and representations for text categorization
- Dumais, Platt, et al.
- 1998
Citation Context: ...statistical methods for text categorization in [24]. The best performances previously reported in the literature are from weighted resampled decision trees in [23] and (linear) support vector machines in [12, 4]. Integral parts of all these approaches are tokenization, feature selection, and creating numeric vector representations of documents. The first step, tokenization, is laid out in detail in Figure 1. T...

495 | Ridge regression: Biased estimation for nonorthogonal problems
- Hoerl, Kennard
- 1970

Citation Context: ... in order to handle large sparse systems, we need to use iterative algorithms which do not rely on matrix factorization techniques. Therefore in this paper, we use the standard ridge regression method [9] that adds a regularization term to (3): ŵ = arg inf_w (1/n) Σ_{i=1}^n (w^T x_i y_i − 1)^2 + λ w^2, (4) where λ is an appropriately chosen regularization parameter. The solution is given by ŵ = (Σ_{i=...
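The ridge estimator quoted above has the familiar closed form when y_i ∈ {−1, +1}, since then (w^T x_i y_i − 1)^2 = (w^T x_i − y_i)^2. A small sketch of that dense solution; as the context notes, for large sparse text matrices an iterative solver is preferred over this factorization-based route, so this is for illustration only.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve argmin_w (1/n)||Xw - y||^2 + lam||w||^2, i.e.
    (X^T X + lam*n*I) w = X^T y, valid for y_i in {-1, +1}."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # sum_i x_i x_i^T + lam*n*I
    b = X.T @ y                          # sum_i x_i y_i
    return np.linalg.solve(A, b)

# Tiny check: one feature, two documents.
w = ridge_closed_form(np.array([[1.0], [2.0]]), np.array([1.0, -1.0]), 0.5)
print(w)  # solves (5 + 0.5*2) w = -1, i.e. w = -1/6
```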

491 | An evaluation of statistical approaches to text categorization
- Yang
Citation Context: ...t categorization problem. In [25], Yang and Chute proposed a linear least squares fit algorithm to train linear classifiers. Yang also compared a number of statistical methods for text categorization in [24]. The best performances previously reported in the literature are from weighted resampled decision trees in [23] and (linear) support vector machines in [12, 4]. Integral parts of all these approaches...

474 | A sequential algorithm for training text classifiers
- Lewis, Gale
- 1994

255 | Automated learning of decision rules for text categorization
- Apté, Damerau, et al.
- 1994
Citation Context: ...sed learning problems have been widely studied in the past. Recently, many methods developed for classification problems have been applied to text categorization. For example, Apte, Damerau, and Weiss [1] applied an inductive rule learning algorithm, SWAP1, to the text categorization problem. In [25], Yang and Chute proposed a linear least squares fit algorithm to train linear classifiers. Yang also comp...

193 | A discriminative framework for detecting remote protein homologies
- Jaakkola, Diekhans
- 2000
Citation Context: ...2 of [19], although he dismissed it as inferior to a standard SVM. We shall later present experimental evidence contrary to Platt's opinion. In fact, the method has also been successfully employed in [11]. To distinguish this procedure from a standard SVM, we shall call it modified SVM in this paper. The update (18) with = 1 corresponds to the exact minimization of the dual objective functional in (...

150 | Support vector machines, reproducing kernel hilbert spaces and randomized gacv
- Wahba
- 1998
Citation Context: ...2), and suggest a class of numerical algorithms to solve the dual formulation. By using the dual formulation, we can obtain a representation of w in a so-called Reproducing Kernel Hilbert Space (RKHS) [22]. Such a representation allows a linear classifier to learn non-linear functions in the original input space, and thus is considered a major advantage of kernel methods including recently popularized s...

115 | An example-based mapping method for text categorization and retrieval
- Yang, Chute
- 1994

Citation Context: ... classification problems have been applied to text categorization. For example, Apte, Damerau, and Weiss [1] applied an inductive rule learning algorithm, SWAP1, to the text categorization problem. In [25], Yang and Chute proposed a linear least squares fit algorithm to train linear classifiers. Yang also compared a number of statistical methods for text categorization in [24]. The best performances previ...

72 | Maximizing Text-Mining Performance
- Weiss, Apte, et al.
- 1999
Citation Context: ...classifiers. Yang also compared a number of statistical methods for text categorization in [24]. The best performances previously reported in the literature are from weighted resampled decision trees in [23] and (linear) support vector machines in [12, 4]. Integral parts of all these approaches are tokenization, feature selection, and creating numeric vector representations of documents. The first step, to...

68 | Pattern Recognition and Neural Networks
- Ripley
- 1996

Citation Context: ...advanced over the years. In the early statistical literature, the weight was obtained by using linear discriminant analysis, which makes the assumption that each class has a Gaussian distribution (cf. [17], Chapter 3). Similar to linear discriminant analysis, an approach widely used in statistics (usually for regression rather than classification) is the least squares fit algorithm. Least squares fit has be...

61 | Text categorization of low quality images
- Ittner, Lewis, et al.
- 1995

Citation Context: ...support vector machines, which have recently gained much popularity. There has been a long history of using logistic regression in information retrieval, as can be seen from the following partial list [2, 5, 6, 10, 13, 20]. However, for a number of reasons, the method was not used in an effective way for text categorization. As a result, the comparison in [20] suggested negative opinions on the performance of logistic...

41 | Inferring probability of relevance using the method of logistic regression
- Gey
- 1994
Citation Context: ...support vector machines, which have recently gained much popularity. There has been a long history of using logistic regression in information retrieval, as can be seen from the following partial list [2, 5, 6, 10, 13, 20]. However, for a number of reasons, the method was not used in an effective way for text categorization. As a result, the comparison in [20] suggested negative opinions on the performance of logistic...

39 | Probabilistic retrieval based on staged logistic regression
- Cooper, Dabney

Citation Context: ...support vector machines, which have recently gained much popularity. There has been a long history of using logistic regression in information retrieval, as can be seen from the following partial list [2, 5, 6, 10, 13, 20]. However, for a number of reasons, the method was not used in an effective way for text categorization. As a result, the comparison in [20] suggested negative opinions on the performance of logistic...

12 | Combining model-oriented and description-oriented approaches for probabilistic indexing
- Fuhr, Pfeifer
- 1991

6 | A comparison of classifiers and document representations for the routing problem
- Schütze, Hull, et al.
- 1995