Results 1 -
5 of
5
A Re-Examination of Text Categorization Methods
, 1999
"... This paper reports a controlled study with statistical significance tests on five text categorization methods: the Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Leastsquares Fit (LLSF) mapping and a NaiveBayes (NB) classifier. We f ..."
Abstract
-
Cited by 533 (15 self)
- Add to MetaCart
This paper reports a controlled study with statistical significance tests on five text categorization methods: the Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Leastsquares Fit (LLSF) mapping and a NaiveBayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and their performance as function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category are small (less than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).
An evaluation of statistical approaches to text categorization
- Journal of Information Retrieval
, 1999
"... Abstract. This paper focuses on a comparative evaluation of a wide-range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine th ..."
Abstract
-
Cited by 413 (16 self)
- Add to MetaCart
Abstract. This paper focuses on a comparative evaluation of a wide-range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, mading those results difficult to interpret and leading to considerable confusions in the literature. Using the results evaluated on the other versions of Reuters which exclude the unlabelled documents, the performance of twelve methods are compared directly or indirectly. For indirect compararions, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.
A Study of Approaches to Hypertext Categorization
- Journal of Intelligent Information Systems
, 2002
"... . Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related web sites all provide rich information for classifying hypertext documents. How to appropriately represent that informatio ..."
Abstract
-
Cited by 78 (3 self)
- Add to MetaCart
. Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our experimental results suggest that a naive use of linked pages, such as treating the words in the linked neighborhood of a page as local to that page, can be more harmful than helpful when the linked neighborhood is highly "noisy". This is especially true if the classifier is not sufficiently robust in discriminating informative words from noisy ones. It is also evident in our results that extracting meta data (when available) from related web sites can be extremely useful for improving classification accuracy. Finally, the relative performance of the classifiers being tested provides insights into their strengths and limitations for solving classification problems involving diverse and often noisy web pages. Keywords: hypertext classification, machine learning, web mining 1.
Text categorization of low quality images
- In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval
, 1995
"... Categorization of text images into content-oriented classes would be a useful capability in a variety of document handling systems. Many methods can be usedtocategorize texts once their words are known, but OCR can garble a large proportion of words, particularly when low quality images are used. De ..."
Abstract
-
Cited by 52 (2 self)
- Add to MetaCart
Categorization of text images into content-oriented classes would be a useful capability in a variety of document handling systems. Many methods can be usedtocategorize texts once their words are known, but OCR can garble a large proportion of words, particularly when low quality images are used. Despite this, we show for one data set that fax quality images can be categorized with nearly the same accuracy as the original text. Further, the categorization system can be trained on noisy OCR output, without need for the true text of any image, or for editing of OCR output. The useofavector space classi er and training method robust to large feature sets, combined with discarding of low frequency OCR output strings are the key to our approach. 1
LEARNING TO SURF: MULTIAGENT SYSTEMS FOR ADAPTIVE WEB PAGE RECOMMENDATION
"... that I have read this dissertation and that in my ..."

