Text Categorisation: A Survey (1999)

by K Aas, L Eikvil

Results 1 - 10 of 34

Efficient phrase-based document indexing for Web document clustering

by Khaled M. Hammouda, Mohamed S. Kamel - IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 31 (1 self)
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.

Use of K-Nearest Neighbor Classifier for Intrusion Detection

by Yihua Liao, V. Rao Vemuri, 2002
Cited by 26 (2 self)
A new approach, based on the k-Nearest Neighbor (kNN) classifier, is used to classify program behavior as normal or intrusive. Program behavior, in turn, is represented by frequencies of system calls. Each system call is treated as a word and the collection of system calls over each program execution as a document. These documents are then classified using the kNN classifier, a popular method in text categorization. This method seems to offer some computational advantages over those that seek to characterize program behavior with short sequences of system calls and generate individual program profiles. Preliminary experiments with 1998 DARPA BSM audit data show that the kNN classifier can effectively detect intrusive attacks and achieve a low false positive rate. Key words: k-Nearest Neighbor classifier, intrusion detection, system calls, text categorization, program profile.

A Large Benchmark Dataset for Web Document Clustering

by Mark Sinka, David Corne - Soft Computing Systems: Design, Management and Applications, Volume 87 of Frontiers in Artificial Intelligence and Applications, 2002
Cited by 21 (1 self)
Targeting useful and relevant information on the WWW is a topical and highly complicated research area. A thriving research effort that feeds into this area is document clustering, which overlaps closely with areas usually known as text classification and text categorisation. A foundational aspect of such research (which has been proven over and over again in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked and assessed in comparison to each other. We note herein that, so far in this broad area of research, as many datasets have been used as research papers written, thus making it difficult to reason about the relative performance of different categorisation/clustering techniques used in different papers. In this paper we propose a standard dataset with a variety of properties suitable for a wide range of clustering and related experiments. We describe how the dataset was generated, provide a pointer to it, and encourage its access and use. We also illustrate the use of part of the dataset by establishing benchmark results for simple k-means clustering, comparing the relative performance of k-means on a pair of `close' categories and a pair of `distant' categories. We naturally find that performance is better on the pair of `distant' categories; however, the experiments reveal that although stop-word removal is confirmed as helpful, word-stemming is (perhaps counter to intuition) not necessarily always recommended on `distant' categories.

Web Page Classification: Features and Algorithms

by Xiaoguang Qi, Brian D. Davison, 2007
Cited by 16 (0 self)
Classification of web page content is essential to many tasks in web information retrieval such as maintaining web directories and focused crawling. The uncontrolled nature of web content presents additional challenges to web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in web page classification, we note the importance of these web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.

Phrase-based document similarity based on an index graph model

by Khaled M. Hammouda, Mohamed S. Kamel - In Proceedings of the 2002 IEEE Int'l Conf. on Data Mining (ICDM'02), 2002
Cited by 7 (1 self)
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than single terms only. The semi-structured web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, enhances web document clustering quality significantly.

Document Classification via Structure Synopses

by Liping Ma, John Shepherd, Anh Nguyen, 2003
Cited by 7 (0 self)
Information available on the Internet is frequently supplied simply as plain ASCII text, structured according to orthographic and semantic conventions. Traditional document classification is typically formulated as a learning problem where each instance is a whole document that is represented by a feature vector. Such feature vectors are often generated based on the appearance and frequencies of words in the documents. The high dimensionality of these feature vectors causes some problems: important clues might be missed, and the classification might be misled by some trivial elements. In this paper, we propose a method which makes use of structuring conventions to reduce the size of the feature vector without affecting the accuracy of the classification process. Effectively, a synopsis of document structure is extracted, which contains only the most informative features; then a succinct feature vector is generated to represent the instance. Finally, a decision tree machine learning algorithm is used to classify the document based on its succinct feature vector.

A novel refinement approach for text categorization

by Songbo Tan, Xueqi Cheng, Moustafa M. Ghanem, Bin Wang, Hongbo Xu - In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management, 2005
Cited by 7 (2 self)
In this paper we present a novel strategy, DragPushing, for improving the performance of text classifiers. The strategy is generic and takes advantage of training errors to successively refine the classification model of a base classifier. We describe how it is applied to generate two new classification algorithms: a Refined Centroid Classifier and a Refined Naïve Bayes Classifier. We present an extensive experimental evaluation of both algorithms on three English collections and one Chinese corpus. The results indicate that in each case, the refined classifiers achieve significant performance improvement over the base classifiers used. Furthermore, the performance of the implemented Refined Centroid Classifier is comparable to, if not better than, that of a state-of-the-art support vector machine (SVM)-based classifier, at a much lower computational cost.

Document Space Adapted Ontology: Application in Query Enrichment

by Stein L. Tomassen, Jon Atle Gulla, Darijus Strasunskas - 11th International Conference on Applications of Natural Language to Information Systems (NLDB 2006), LNCS 3999, 2006
Cited by 6 (4 self)
Retrieval of correct and precise information at the right time is essential in knowledge-intensive tasks requiring quick decision-making. In this paper, we propose a method for utilizing ontologies to enhance the quality of information retrieval (IR) by query enrichment. We explain how a retrieval system can be tuned by adapting ontologies to provide both an in-depth understanding of the user's needs and an easy integration with standard vector-space retrieval systems. The ontology concepts are adapted to the domain terminology by computing a feature vector for each concept. The feature vector is used to enrich a provided query. The ontology and the whole retrieval system are under development as part of a Semantic Web standardization project for the Norwegian oil and gas industry.

Hierarchical text categorization using fuzzy relational thesaurus

by Domonkos Tikk, Jae Dong Yang, Sun Lee Bang
Cited by 5 (5 self)
Text categorization is the classification to assign a text document to an appropriate category in a predefined set of categories. We present a new approach for the text categorization by means of Fuzzy Relational Thesaurus (FRT). FRT is a multilevel category

Comparing Natural Language Identification Methods based on Markov Processes

by Peter Vojtek, Mária Bieliková - In: Slovko, International Seminar on Computer Treatment of Slavic and East European Languages, 2007
Cited by 3 (0 self)
We discover and experiment with categorization-based methods for natural language identification. Two approaches to language identification based on Markov processes are compared; both methods treat the incoming text on the character level. We performed a series of experiments with the aim of confirming the high precision of the selected methods in the language identification task, and also with the objective of comparing them against each other. Experimental evaluation was based on the large-scale Multilingual Reuters Corpus with various European and Slavic languages. Our research results showed that both methods are comparable in the task of natural language identification, achieving recall as high as 99.75%.