‘Machine Learning in Automated Text Categorisation’, 2001

by F Sebastiani

Results 1 - 10 of 13

Support Vector Machine Active Learning with Applications to Text Classification

by Simon Tong, Daphne Koller - Journal of Machine Learning Research, 2001
Cited by 338 (3 self)
Support vector machines have met with significant success in numerous real-world learning tasks. However, like most machine learning algorithms, they are generally applied using a randomly selected training set classified in advance. In many settings, we also have the option of using pool-based active learning. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels for some number of them. We introduce a new algorithm for performing active learning with support vector machines, i.e., an algorithm for choosing which instances to request next. We provide a theoretical motivation for the algorithm using the notion of a version space. We present experimental results showing that employing our active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.

Using Latent Semantic Indexing to Filter Spam

by Kevin R. Gee - In Proceedings of the 2003 ACM Symposium on Applied Computing, 2003
Cited by 16 (0 self)
Past research has explored the effectiveness of a Naive Bayesian classifier when filtering unsolicited bulk email (or "spam"). Results have shown that the degree of precision of this approach is generally superior to the degree of recall. This study evaluates the effectiveness of a classifier incorporating Latent Semantic Indexing ("LSI") to filter spam email, using a corpus used in previous studies. Results show that using LSI as the basis for an email classifier to filter out spam enjoys a very high degree of recall as well as a high degree of precision, no matter if the corpus is treated using a stop list or a lemmatizer. While using LSI leads to precision roughly equal to that of using a Naive Bayesian approach, the LSI technique has a substantially higher recall and is generally more effective under certain conditions.

Multi-classification of Patent Applications with Winnow

by Cornelis H. A. Koster, Marc Seutter, Jean Beney, 2003
Cited by 6 (4 self)
The Winnow family of learning algorithms can cope well with large numbers of features and is tolerant to variations in document length, which makes it suitable for classifying large collections of large documents, like patent applications.

Automating hierarchical document classification for construction management information systems

by Carlos H. Caldas, Lucio Soibelman, 2003
Cited by 6 (0 self)
The widespread use of information technologies for construction is considerably increasing the number of electronic text documents stored in construction management information systems. Consequently, automated methods for organizing and improving the access to the information contained in these types of documents become essential to construction information management. This paper describes a methodology developed to improve information organization and access in construction management information systems based on automatic hierarchical classification of construction project documents according to project components. A prototype system for document classification is presented, as well as the experiments conducted to verify the feasibility of the proposed approach.

Improving Text Classification with LSI Using Background Knowledge

by Sarah Zelikovitz, Haym Hirsh - IJCAI01 Workshop Notes on Text Learning: Beyond Supervision, 2001
Cited by 5 (0 self)
We present work in progress that uses Latent Semantic Indexing (LSI) in conjunction with background knowledge and unlabeled examples to improve text classification accuracy. The singular value decomposition (SVD) that is performed by LSI is done on an expanded term-by-document matrix that includes the labeled training examples as well as the unlabeled examples. We report classification accuracy on different data sets both with and without the inclusion of background knowledge and compare it to other known work.

Multinomial Mixture Modelling for Bilingual Text Classification

by Jorge Civera, Alfons Juan - Departament De Sistemes Informàtics, 2005
Cited by 3 (3 self)
Mixture modelling of class-conditional densities is a standard pattern classification technique. In text classification, the use of class-conditional multinomial mixtures can be seen as a generalisation of the Naive Bayes text classifier relaxing its (class-conditional feature) independence assumption. In this paper, we describe and compare several extensions of the class-conditional multinomial mixture-based text classifier for bilingual texts.

Classifying Patent Applications with Winnow

by C. H. A. Koster, M. Seutter, J. Beney
Cited by 2 (0 self)
The Winnow family of learning algorithms can cope well with large numbers of features and is tolerant to variations in document length, which makes it suitable for classifying large collections of large documents, like patent applications. This note

Combining Multiclass Maximum Entropy Text Classifiers with Neural Network Voting

by Philipp Koehn, 2002
Cited by 1 (0 self)
We improve a high-accuracy maximum entropy classifier by combining an ensemble of classifiers with neural network voting. In our experiments we demonstrate significantly superior performance both over a single classifier as well as over the use of the traditional weighted-sum voting approach. Specifically, we apply this to a maximum entropy classifier on a large scale multi-class text categorization task: the online job directory Flipdog with over half a million jobs in 65 categories.

Towards a Comprehensive Topic Hierarchy for News

by Vijay Boyapati, 2000
To date, a comprehensive, Yahoo-like hierarchy of topics has yet to be offered for the domain of news. The Yahoo approach of managing such a hierarchy --- hiring editorial staff to read documents and correctly assign them to topics --- is simply not practical in the domain of news. Far too many stories are written and made available online every day. While many Machine Learning methods exist for organising documents into topics, these methods typically require a large number of labelled training examples before performing accurately. When managing a large and ever-changing topic hierarchy, it is unlikely that there would be enough time to provide many examples per topic. For this reason, it would be useful to identify extra information within the domain of news that could be harnessed to minimise the number of labelled examples required to achieve reasonable accuracy. To this end, the notion of a semi-labelled document is introduced. These documents, which are partially labelled by th...

Recognize, Categorize, and Retrieve

by Kazem Taghva, Thomas A. Nartker, Julie Borsack - Laboratory for Language and Media Processing, University of Maryland, 2001
A successful text categorization experiment divides a textual collection into pre-defined classes. A true representative for each class is generally obtained during training of the categorizer.