• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

A Re-Examination of Text Categorization Methods (1999)

by Yiming Yang, Xin Liu
Add To MetaCart

Tools

Sorted by:
Results 11 - 20 of 355
Next 10 →

A Study on Thresholding Strategies for Text Categorization

by Yiming Yang - Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval , 2001
"... Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluatio ..."
Abstract - Cited by 69 (8 self) - Add to MetaCart
Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the testbets, three common thresholding methods were investigated, including rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut); in addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the "optimal" strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off in recall versus precision, but is not suitable for online decision making. RCut is most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strength of category ranking and scoring, outperforms both PCut and RCut significantly. 1.

Text Categorization Based on Regularized Linear Classification Methods

by Tong Zhang, Frank J. Oles - Information Retrieval , 2000
"... A number of linear classification methods such as the linear least squares fit (LLSF), logistic regression, and support vector machines (SVM's) have been applied to text categorization problems. These methods share the similarity by finding hyperplanes that approximately separate a class of document ..."
Abstract - Cited by 67 (2 self) - Add to MetaCart
A number of linear classification methods such as the linear least squares fit (LLSF), logistic regression, and support vector machines (SVM's) have been applied to text categorization problems. These methods share the similarity by finding hyperplanes that approximately separate a class of document vectors from its complement. However, support vector machines are so far considered special in that they have been demonstrated to achieve the state of the art performance. It is therefore worthwhile to understand whether such good performance is unique to the SVM design, or if it can also be achieved by other linear classification methods. In this paper, we compare a number of known linear classification methods as well as some variants in the framework of regularized linear systems. We will discuss the statistical and numerical properties of these algorithms, with a focus on text categorization. We will also provide some numerical experiments to illustrate these algorithms on a number of datasets.

Novelty Detection: A Review - Part 1: Statistical Approaches

by Markos Markou, Sameer Singh - Signal Processing , 2003
"... Novelty detection is the identification of new or unknown data or signal that a machine learning system is not aware of during training. Novelty detection is one of the fundamental requirements of a good classification or identification system since sometimes the test data contains information abou ..."
Abstract - Cited by 67 (0 self) - Add to MetaCart
Novelty detection is the identification of new or unknown data or signal that a machine learning system is not aware of during training. Novelty detection is one of the fundamental requirements of a good classification or identification system since sometimes the test data contains information about objects that were not known at the time of training the model. In this paper we provide stateof -the-art review in the area of novelty detection based on statistical approaches. The second part paper details novelty detection using neural networks. As discussed, there are a multitude of applications where novelty detection is extremely important including signal processing, computer vision, pattern recognition, data mining, and robotics.

Feature generation for text categorization using world knowledge

by Evgeniy Gabrilovich, Shaul Markovitch - In IJCAI’05 , 2005
"... We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies a ..."
Abstract - Cited by 62 (13 self) - Add to MetaCart
We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field. 1

Learning approaches for Detecting and Tracking News Events

by Yiming Yang, Jaime Carbonell, Ralf Brown, Tom Pierce, Brian T. Archibald, Xin Liu - IEEE Intelligent Systems , 1999
"... This paper studies the effective use of information retrieval and machine learning techniques in a new task, event detection and tracking. The objective is to automatically detect novel events from chronologically-ordered streams of news stories, and track events of interest over time. We extended e ..."
Abstract - Cited by 58 (6 self) - Add to MetaCart
This paper studies the effective use of information retrieval and machine learning techniques in a new task, event detection and tracking. The objective is to automatically detect novel events from chronologically-ordered streams of news stories, and track events of interest over time. We extended existing supervised learning and unsupervised clustering algorithms to allow document classification based on both information content and temporal aspects of events. A task-oriented evaluation was conducted using Reuters and CNN news stories. We found agglomerative document clustering highly effective (82% in the F 1 measure) for retrospective event detection, and single-pass clustering with time windowing a better choice for on-line alerting of novel events. We also observed robust learning behavior for k-nearest neighbor (kNN) classification and a decision-tree approach in event tracking, under the difficult condition when the number of positive training examples is extremely small. 1 Intr...

Improving Multi-class Text Classification with Naive Bayes

by Tommi Jaakkola, Jason D. M. Rennie, Jason D. M. Rennie , 2001
"... There are numerous text documents available in electronic form. More and more are becoming available every day. Such documents represent a massive amount of information that is easily accessible. Seeking value in this huge collection requires organization; much of the work of organizing documents ca ..."
Abstract - Cited by 58 (1 self) - Add to MetaCart
There are numerous text documents available in electronic form. More and more are becoming available every day. Such documents represent a massive amount of information that is easily accessible. Seeking value in this huge collection requires organization; much of the work of organizing documents can be automated through text classification. The accuracy and our understanding of such systems greatly influences their usefulness. In this paper, we seek 1) to advance the understanding of commonly used text classification techniques, and 2) through that understanding, improve the tools that are available for text classification. We begin by clarifying the assumptions made in the derivation of Naive Bayes, noting basic properties and proposing ways for its extension and improvement. Next, we investigate the quality of Naive Bayes parameter estimates and their impact on classification. Our analysis leads to a theorem which gives an explanation for the improvements that can be found in multiclass classification with Naive Bayes using Error-Correcting Output Codes. We use experimental evidence on two commonly-used data sets to exhibit an application of the theorem. Finally, we show fundamental flaws in a commonly-used feature selection algorithm and develop a statistics-based framework for text feature selection. Greater understanding of Naive Bayes and the properties of text allows us to make better use of it in text classification.

Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization

by George Karypis, Eui-Hong (Sam) Han - IN CIKM’00 , 2000
"... In recent years, we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing meth-ods that can efficiently categorize and retrieve relevant information. Re ..."
Abstract - Cited by 58 (2 self) - Add to MetaCart
In recent years, we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing meth-ods that can efficiently categorize and retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computa-tional requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limits its applicability. In this paper we present a fast dimensionality reduction algorithm, called concept indexing (CI) that is equally effective for unsupervised and supervised dimensionality reduction. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. Experimental results show that the dimensionality reduction computed by CI achieves comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less time. Moreover, when CI is used to compute the dimensionality reduction in a supervised setting, it greatly improves the performance of traditional classification algorithms such as C4.5 and kNN.

QProber: A system for automatic classification of hidden-web databases

by Panagiotis G. Ipeirotis, Luis Gravano, Mehran Sahami - ACM TOIS , 2003
"... The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers. ” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. ..."
Abstract - Cited by 53 (11 self) - Add to MetaCart
The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers. ” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.

Automatic text categorization in terms of genre and author

by Efstathios Stamatatos, Nikos Fakotakis, George Kokkinakis - COMPUTATIONAL LINGUISTICS , 2001
"... The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. In contrast to previous stylometric approaches, we attempt to tak ..."
Abstract - Cited by 53 (5 self) - Add to MetaCart
The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. In contrast to previous stylometric approaches, we attempt to take full advantage of existing natural language processing (NLP) tools. To this end, we propose a set of style markers including analysis-level measures that represent the way in which the input text has been analyzed and capture useful stylistic information without additional cost. We present a set of small-scale but reasonable experiments in text genre detection, author identification, and author verification tasks and show that the proposed method performs better than the most popular distributional lexical measures, i.e., functions of vocabulary richness and frequencies of occurrence of the most frequent words. All the presented experiments are based on unrestricted text downloaded from the World Wide Web without any manual text preprocessing or text sampling. Various performance issues regarding the training set size and the significance of the proposed style markers are discussed. Our system can be used in any application that requires fast and easily adaptable text categorization in terms of stylistically homogeneous categories. Moreover, the procedure of defining analysis-level markers can be followed in order to extract useful stylistic information using existing text processing tools.

Question classification using support vector machines

by Dell Zhang - In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval , 2003
"... Question classification is very important for question answering. This paper presents our research work on automatic question classification through machine learning approaches. We have experimented with five machine learning algorithms: Nearest ..."
Abstract - Cited by 48 (1 self) - Add to MetaCart
Question classification is very important for question answering. This paper presents our research work on automatic question classification through machine learning approaches. We have experimented with five machine learning algorithms: Nearest
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University