## A Scalability Analysis of Classifiers in Text Categorization (2003)

### Download Links

- [nyc.lti.cs.cmu.edu]
- [www-2.cs.cmu.edu]
- [www.cs.cmu.edu]
- [www.stat.purdue.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval

Citations: 38 (3 self)

### BibTeX

@INPROCEEDINGS{Yang03ascalability,
  author    = {Yiming Yang and Jian Zhang and Bryan Kisiel},
  title     = {A Scalability Analysis of Classifiers in Text Categorization},
  booktitle = {Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval},
  year      = {2003},
  pages     = {96--103},
  publisher = {ACM Press}
}

### Abstract

Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbor, ridge regression, linear least squares fit and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation of the usage of different classifiers in a hierarchical setting of categorization, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we are able to obtain tight bounds for the complexities by using the power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported as concrete examples.

### Citations

9002 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context ...ous that NLv = VLf. Table 1 summarizes the time/space complexities for the algorithms that we analyze in the following sub-sections. 2.2 SVM SVM is a promising classification method developed by Vapnik [14]. It applies Structural Risk Minimization, which aims to minimize the generalization error instead of the empirical error on training data alone. Multiple variants of SVM have been developed [14, 6]; ...

1703 | Text categorization with support vector machines: Learning with many relevant features
- Joachims
- 1998
Citation Context ...as evaluated using the full domain of MeSH categories in OHSUMED, the k-nearest neighbor approach reported in [19, 17]. Evaluations of other methods used much smaller subsets (23, 28 or 49 categories) [9, 6] until the Text Retrieval Conference (TREC-9) in 2000, wherein a subset of 4904 categories from OHSUMED were selected for the filtering track [13]. However, only three systems were able to submit compl...

1253 | On power-law relationships of the internet topology
- Faloutsos, Faloutsos, et al.
- 1999
Citation Context ...been observed in multiple domains, including an early discovery in cognitive science regarding human learning rate through repetitive tasks [11], and the recent observations about the Internet topology [4]. The application of Zipf’s law to word distribution over documents is another well-known example, which has been commonly observed in Information Retrieval. The power law has an exponential form: y =...
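
The power-law form the quoted passage refers to, y = a·x^b, is typically fitted by linear least squares in log-log space. A minimal sketch with synthetic data (not the paper's category distributions):

```python
import numpy as np

# Sketch (synthetic data, not the paper's measurements): estimate the exponent
# b of a power law y = a * x**b by least squares on log(y) = log(a) + b*log(x).
def fit_power_law(x, y):
    X = np.log(np.asarray(x, dtype=float))
    Y = np.log(np.asarray(y, dtype=float))
    b, log_a = np.polyfit(X, Y, 1)  # slope = exponent b, intercept = ln(a)
    return np.exp(log_a), b

# Synthetic "category size vs. rank" data that follows y = 100 * x**-1.5 exactly.
x = np.arange(1, 50)
a, b = fit_power_law(x, 100.0 * x ** -1.5)
print(round(a, 3), round(b, 3))  # → 100.0 -1.5
```

A slope near -1 would correspond to the Zipf's-law special case mentioned in the text; the fit above recovers whatever exponent the data actually has.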

958 | A comparative study on feature selection in text categorization
- Yang, Pedersen
- 1997
Citation Context ...5 article abstracts in medical journals with 14,321 unique category labels (a large subset of the Medical Subject Headings (MeSH)), has become an evaluation benchmark in text categorization since 1994 [19, 17]. To our knowledge, only one TC method was evaluated using the full domain of MeSH categories in OHSUMED, the k-nearest neighbor approach reported in [19, 17]. Evaluations of other methods used much sm...

640 | A re-examination of text categorization methods
- Yang, Liu
- 1999
Citation Context ...pirical error on training data alone. Multiple variants of SVM have been developed [14, 6]; here we limit our discussion to linear SVM due to its popularity and high performance in text categorization [6, 18, 8]. The optimization of SVM (dual form) is to minimize: α* = argmin_α { −Σ_{i=1}^n α_i + (1/2) Σ_{i=1}^n Σ_{j=1}^n y_i y_j α_i α_j ⟨x_i, x_j⟩ }, subject to: Σ_{i=1}^n α_i y_i = 0; 0 ≤ α_i ≤ C. The prediction is given by: f(x) = ...
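
The dual prediction in the quoted passage, f(x) = Σᵢ yᵢ αᵢ ⟨xᵢ, x⟩ + b, collapses for a linear kernel into a single weight vector w = Σᵢ yᵢ αᵢ xᵢ. A minimal numpy check with assumed toy values (not a trained model from the paper):

```python
import numpy as np

# Sketch with assumed toy values (not the paper's trained model): the linear-SVM
# dual prediction f(x) = sum_i y_i * alpha_i * <x_i, x> + b equals the primal
# form <w, x> + b with w = sum_i y_i * alpha_i * x_i.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])  # training points
y = np.array([1.0, 1.0, -1.0, -1.0])                               # labels
alpha = np.array([0.3, 0.1, 0.25, 0.15])                           # assumed dual weights
b = -0.2

w = (y * alpha) @ X                          # primal weight vector
x_new = np.array([0.5, 1.0])
f_dual = np.sum(y * alpha * (X @ x_new)) + b
f_primal = w @ x_new + b
print(np.isclose(f_dual, f_primal))  # → True
```

This collapse is also why linear-SVM prediction costs O(Lv) per document rather than O(N·Lv), which matters for the scalability analysis the paper develops.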

492 | An evaluation of statistical approaches to text categorization
- Yang
- 1999
Citation Context ...5 article abstracts in medical journals with 14,321 unique category labels (a large subset of the Medical Subject Headings (MeSH)), has become an evaluation benchmark in text categorization since 1994 [19, 17]. To our knowledge, only one TC method was evaluated using the full domain of MeSH categories in OHSUMED, the k-nearest neighbor approach reported in [19, 17]. Evaluations of other methods used much sm...

293 | Sequential minimal optimization: A fast algorithm for training support vector machines
- Platt
- 1998
Citation Context ...empirical observations about the super-linear time complexity of SVM in training with respect to N, the number of training documents: O(N^1.2) on a web page collection [12], and O(N^1.5) on the OHSUMED corpus. The complexity analysis above is for the training time on one category. Since the algorithm in SVM-Light (and other SVM algorithms applied to text categorization...
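
Empirical exponents like the quoted O(N^1.2) and O(N^1.5) can be estimated from timing measurements: assuming t = c·N^p, two measurements give p directly. A sketch with made-up timings (not the paper's data):

```python
import math

# Sketch (made-up timings, not the paper's measurements): if training time
# scales as t = c * N**p, two measurements give p = log(t2/t1) / log(N2/N1).
def scaling_exponent(n1, t1, n2, t2):
    return math.log(t2 / t1) / math.log(n2 / n1)

# Synthetic timings generated from t = 0.001 * N**1.2:
p = scaling_exponent(10_000, 0.001 * 10_000 ** 1.2, 40_000, 0.001 * 40_000 ** 1.2)
print(round(p, 3))  # → 1.2
```

With real, noisy timings one would instead fit the slope over several (N, t) pairs in log-log space rather than use only two points.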

254 | An improved training algorithm for support vector machines
- Osuna, Freund, et al.
- 1997
Citation Context ...ore efficiently by utilizing the sparseness of the text data. Here we discuss the algorithm used in SVM-Light [7], which is one of the most popular SVM packages in text categorization. The basic idea [3] is to iteratively decompose the big QP problem into smaller ones (called “working set”) and solve them sequentially until convergence is obtained. The training-time complexity for each iteration is: ...
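
The decomposition idea can be illustrated with a deliberately tiny working set of size one, i.e. box-constrained coordinate descent on a dual-style objective. This is only a sketch: SVM-Light selects larger working sets and also maintains the equality constraint Σᵢ αᵢ yᵢ = 0, which is omitted here:

```python
import numpy as np

# Toy "working set of size one" decomposition (an illustration only; SVM-Light
# uses larger working sets and also keeps sum_i alpha_i*y_i = 0, omitted here):
# minimize 0.5 * a^T Q a - 1^T a  subject to 0 <= a <= C, one coordinate at a
# time while all other coordinates stay fixed.
def decompose_solve(Q, C, sweeps=200):
    n = Q.shape[0]
    a = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):                      # working set = {i}
            rest = Q[i] @ a - Q[i, i] * a[i]    # contribution of the fixed part
            a[i] = np.clip((1.0 - rest) / Q[i, i], 0.0, C)
    return a

Q = np.array([[2.0, 0.5], [0.5, 1.0]])  # small positive-definite "kernel" matrix
a = decompose_solve(Q, C=10.0)
print(np.allclose(Q @ a, np.ones(2), atol=1e-6))  # interior optimum: Q a = 1
```

Each subproblem touches only one row of Q, which is the point of decomposition: the full Q (N×N) never has to be formed or solved at once.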

244 | Training algorithms for linear text classifiers
- Lewis, Schapire, et al.
- 1996
Citation Context ...as evaluated using the full domain of MeSH categories in OHSUMED, the k-nearest neighbor approach reported in [19, 17]. Evaluations of other methods used much smaller subsets (23, 28 or 49 categories) [9, 6] until the Text Retrieval Conference (TREC-9) in 2000, wherein a subset of 4904 categories from OHSUMED were selected for the filtering track [13]. However, only three systems were able to submit compl...

194 | Mechanisms of skill acquisition and the law of practice
- Newell, Rosenbloom
- 1981
Citation Context ...al phenomena. As an interesting phenomenon, the power law has been observed in multiple domains, including an early discovery in cognitive science regarding human learning rate through repetitive tasks [11], and the recent observations about the Internet topology [4]. The application of Zipf’s law to word distribution over documents is another well-known example, which has been commonly observed in Infor...

191 | Lanczos algorithms for large symmetric eigenvalue computations
- Cullum, Willoughby
- 1985
Citation Context ...to categories. In this sense, this method is quite efficient when M is very large, as compared to training binary classifiers M times repeatedly and independently. The Lanczos algorithm introduced in [2] and thoroughly analyzed by [1] is particularly efficient for solving LLSF on very large and sparse matrices, and has a good convergence property. The step-wise complexities are given below: Step 1. C...
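
Step 1 of the LLSF procedure computes a rank-k truncated SVD. A small illustration, with numpy's dense SVD standing in for the Lanczos-based sparse solvers the passage refers to:

```python
import numpy as np

# Illustration (dense numpy SVD stands in for the Lanczos-based sparse solvers
# the passage describes): a rank-k truncated SVD X ~= U_k S_k V_k^T keeps only
# the k largest singular triplets.
def truncated_svd(X, k):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank <= 4
U, s, Vt = truncated_svd(X, k=4)
print(np.allclose(U @ np.diag(s) @ Vt, X))  # rank-4 truncation is exact here
```

On genuinely large sparse term-document matrices one would use an iterative solver (e.g. a Lanczos-type routine such as `scipy.sparse.linalg.svds`) so that only matrix-vector products with X are needed.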

153 | Expert network: effective and efficient learning from human decisions in text categorisation and retrieval
- Yang
- 1994
Citation Context ... that it does not have an off-line learning phase. The so-called “training” phase in kNN is simply to index the training data for later use. Building the inverted index of documents for classification [15] is a mature technique with a complexity of O(NLd). If we consider Ld, the average length of documents, as an application-specific constant, then the training complexity of kNN is O(N), i.e., linear i...
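
The indexing step described above is a single pass over the N training documents, touching roughly Ld terms per document, hence O(NLd). A minimal sketch with assumed toy documents:

```python
from collections import defaultdict

# Sketch of the indexing step the passage describes: one pass over all N
# documents, visiting each of the ~Ld distinct terms per document once, so the
# build cost is O(N * Ld). (The toy documents below are assumptions.)
def build_inverted_index(docs):
    index = defaultdict(list)  # term -> list of doc ids containing it
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            index[term].append(doc_id)
    return index

docs = ["cats and dogs", "dogs bark", "cats purr"]
index = build_inverted_index(docs)
print(sorted(index["dogs"]), sorted(index["cats"]))  # → [0, 1] [0, 2]
```

At classification time, kNN then scores only the documents appearing in the postings lists of the query's terms, which is what makes the "no off-line learning" design practical.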

89 | A Study of Approaches to Hypertext Categorization
- Yang, Slattery, et al.
- 2002
Citation Context ...nterpretation could be made. Regardless, we have found that category distributions in data from real-world applications are often highly skewed, including news stories, journal articles and web pages [18, 20]. The power law is therefore a natural candidate for the characterization of those skewed distributions - more appropriate than using Zipf’s law (assuming a fixed slope of -1) or a uniform distributio...

81 | Text Categorization Based on Regularized Linear Classification Methods
- Zhang, Oles
Citation Context ...ession. We limit our discussion to a binary version of the LR algorithm, which has been reported in the literature with a performance competitive with SVM and linear regression on benchmark data sets [21, 10]. The classification problem is defined as minimizing the following objective function: w* = argmin_w { (1/n) Σ_{i=1}^N (y_i − ⟨w, x_i⟩)² + ... }. It has a closed-form solution as: w* = (Σ_{i=1}^N x_i x_iᵀ + ...
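
Reading the truncated objective as squared loss plus an ℓ2 penalty λ‖w‖² (the regularizer is cut off in the snippet, so this is an assumption), the closed form is the familiar ridge estimator, easy to verify numerically:

```python
import numpy as np

# Sketch assuming the truncated objective is ridge regression with an L2
# penalty lam * ||w||^2 (the regularizer is cut off in the quoted snippet):
#   w* = argmin_w (1/n) * sum_i (y_i - <w, x_i>)^2 + lam * ||w||^2
# has the closed form  w* = (X^T X + lam*n*I)^{-1} X^T y.
def ridge_closed_form(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
w = ridge_closed_form(X, y, lam=0.1)
grad = 2 / 50 * X.T @ (X @ w - y) + 2 * 0.1 * w  # objective gradient at w*
print(np.allclose(grad, 0))  # → True
```

The d×d system here is what makes the method attractive when M (the number of categories) is large: the same factorization can be reused for every category's y vector.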

58 | Noise reduction in a statistical approach to text categorization
- Yang
- 1995
Citation Context ...ough validation. In our experiments with LLSF on benchmark collections (Reuters news stories, MEDLINE documents, etc.), we observed the optimal ranges of k_s to be between a few hundred and one thousand [16]. Step 2. Compute the pseudo-inverse X⁺ = V S⁻¹ Uᵀ = Xᵀ U S⁻² Uᵀ. The time complexity here is O(k_s N²), dominated by the computation of U S⁻² Uᵀ. The space complexity is O(NV), for storing matr...
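
The identity X⁺ = V S⁻¹ Uᵀ = Xᵀ U S⁻² Uᵀ follows from substituting X = U S Vᵀ into the second form, and can be checked numerically in the full-column-rank case:

```python
import numpy as np

# Numerical check of the identity quoted above: with the thin SVD X = U S V^T,
# the pseudo-inverse satisfies X+ = V S^-1 U^T = X^T U S^-2 U^T, since
# X^T U S^-2 U^T = (V S U^T) U S^-2 U^T = V S^-1 U^T  (using U^T U = I).
rng = np.random.default_rng(2)
X = rng.standard_normal((8, 5))  # full column rank with probability 1
U, s, Vt = np.linalg.svd(X, full_matrices=False)

form1 = Vt.T @ np.diag(1.0 / s) @ U.T          # V S^-1 U^T
form2 = X.T @ U @ np.diag(1.0 / s**2) @ U.T    # X^T U S^-2 U^T
print(np.allclose(form1, form2), np.allclose(form1, np.linalg.pinv(X)))  # → True True
```

The second form is the one the quoted step actually computes, because it avoids materializing V when only U and S are produced by the truncated decomposition.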

26 | The Maximum Margin Approach to Learning Text Classifiers: Methods, Theory, and Algorithms
- Joachims
- 2000
Citation Context ...n general quadratic programming. Furthermore, there exist algorithms that can solve this more efficiently by utilizing the sparseness of the text data. Here we discuss the algorithm used in SVM-Light [7], which is one of the most popular SVM packages in text categorization. The basic idea [3] is to iteratively decompose the big QP problem into smaller ones (called “working set”) and solve them sequen...

23 | A Loss Function Analysis of Classification Methods in Text Categorization
- Li, Yang
- 2003
Citation Context ...ession. We limit our discussion to a binary version of the LR algorithm, which has been reported in the literature with a performance competitive with SVM and linear regression on benchmark data sets [21, 10]. The classification problem is defined as minimizing the following objective function: w* = argmin_w { (1/n) Σ_{i=1}^N (y_i − ⟨w, x_i⟩)² + ... }. It has a closed-form solution as: w* = (Σ_{i=1}^N x_i x_iᵀ + ...

8 | The Reuters Corpus Volume I as a text categorization test collection
- Lewis, Li, et al.
- 2003
Citation Context ...pirical error on training data alone. Multiple variants of SVM have been developed [14, 6]; here we limit our discussion to linear SVM due to its popularity and high performance in text categorization [6, 18, 8]. The optimization of SVM (dual form) is to minimize: α* = argmin_α { −Σ_{i=1}^n α_i + (1/2) Σ_{i=1}^n Σ_{j=1}^n y_i y_j α_i α_j ⟨x_i, x_j⟩ }, subject to: Σ_{i=1}^n α_i y_i = 0; 0 ≤ α_i ≤ C. The prediction is given by: f(x) = ...


6 | Microsoft Cambridge at TREC-9 - Robertson, Walker - 2001

2 | Large-scale singular value computations
- Berry
Citation Context ...his method is quite efficient when M is very large, as compared to training binary classifiers M times repeatedly and independently. The Lanczos algorithm introduced in [2] and thoroughly analyzed by [1] is particularly efficient for solving LLSF on very large and sparse matrices, and has a good convergence property. The step-wise complexities are given below: Step 1. Compute the truncated SVD X = US...

2 | Microsoft Cambridge at TREC-9
- Robertson, Walker
- 2001
Citation Context ... used much smaller subsets (23, 28 or 49 categories) [9, 6] until the Text Retrieval Conference (TREC-9) in 2000, wherein a subset of 4904 categories from OHSUMED were selected for the filtering track [13]. However, only three systems were able to submit complete results on that subset of categories; the remaining systems used a subset of that subset, consisting of 500 categories. Since 2001, the Opera...