A New Family of Online Algorithms for Category Ranking
- Journal of Machine Learning Research
, 2002
Cited by 49 (11 self)
We describe a new family of topic-ranking algorithms for multi-labeled documents. The motivation for the algorithms stems from recent advances in online learning algorithms. The algorithms we present are simple to implement and are time and memory efficient. We evaluate the algorithms on the Reuters-21578 corpus and the new corpus released by Reuters in 2000. On both corpora the algorithms we present outperform adaptations to topic-ranking of Rocchio's algorithm and the Perceptron algorithm. We also outline the formal analysis of the algorithm in the mistake bound model. To our knowledge, this work is the first to report performance results with the entire new Reuters corpus.
A Family of Additive Online Algorithms for Category Ranking
- Journal of Machine Learning Research
, 2003
Cited by 47 (0 self)
We describe a new family of topic-ranking algorithms for multi-labeled documents. The motivation for the algorithms stems from recent advances in online learning algorithms. The algorithms are simple to implement and are also time and memory efficient. We provide a unified analysis of the family of algorithms in the mistake bound model. We then discuss experiments with the proposed family of topic-ranking algorithms on the Reuters-21578 corpus and the new corpus released by Reuters in 2000. On both corpora, the algorithms we present achieve state-of-the-art results and outperform topic-ranking adaptations of Rocchio's algorithm and of the Perceptron algorithm.
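For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of one possible way to adapt the Perceptron to topic ranking. It is not the family of algorithms proposed in the paper, and the function names, the sparse-dictionary document representation, and the update rule are all assumptions made for illustration.

```python
def score_topics(weights, doc):
    """Dot product of a sparse document (term -> value) with each topic's weights."""
    return {
        topic: sum(w.get(term, 0.0) * value for term, value in doc.items())
        for topic, w in weights.items()
    }


def rank_topics(weights, doc):
    """Topics sorted by decreasing score (ties broken alphabetically)."""
    scores = score_topics(weights, doc)
    return sorted(scores, key=lambda t: (-scores[t], t))


def perceptron_rank_update(weights, doc, relevant, lr=1.0):
    """One online step: update only if some irrelevant topic is ranked at or
    above some relevant topic. Returns True when an update was made.

    Hypothetical update rule: promote every relevant topic that is not
    strictly above all irrelevant ones, and demote every irrelevant topic
    that is not strictly below all relevant ones.
    """
    scores = score_topics(weights, doc)
    rel = [t for t in weights if t in relevant]
    irr = [t for t in weights if t not in relevant]
    if not rel or not irr:
        return False  # nothing to order against
    lowest_rel = min(scores[t] for t in rel)
    highest_irr = max(scores[t] for t in irr)
    if lowest_rel > highest_irr:
        return False  # ranking is already perfect for this document
    for topic in rel:
        if scores[topic] <= highest_irr:
            for term, value in doc.items():
                weights[topic][term] = weights[topic].get(term, 0.0) + lr * value
    for topic in irr:
        if scores[topic] >= lowest_rel:
            for term, value in doc.items():
                weights[topic][term] = weights[topic].get(term, 0.0) - lr * value
    return True
```

Starting from empty weight dictionaries, one per topic, the update is called once per incoming document; `rank_topics` can be queried at any point, which is what makes the procedure online.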
Authorship Attribution with Support Vector Machines
- APPLIED INTELLIGENCE
, 2000
Cited by 45 (0 self)
In this paper we explore the use of text-mining methods for the identification of the author of a text. For the first time we apply the support vector machine (SVM) to this problem. As it is able to cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors, and it detected the target author in 60-80% of the cases. In a second experiment we ignored nouns, verbs and adjectives and replaced them with grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVM on full word forms was remarkably robust even if the author wrote about different topics.
Hierarchical Neural Networks for Text Categorization
- In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1999
Cited by 44 (1 self)
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide and conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons with traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers are provided. The results show that the use of the hierarchical structure improves text categorization performance significantly.
1 Introduction
Text categorization, also known as automatic indexing, is the process of algorithmically analyzing an electronic document to assign a set of categories (or index terms) that succinctly describe the content of the document. This assignment can be used for classic...
Experiments on the use of feature selection and negative evidence in automated text categorization
- Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries
, 2000
Cited by 35 (7 self)
In this work we tackle two different problems of text categorization (TC), namely feature selection and classifier induction. Feature selection refers to the activity of selecting, from the set of r distinct features (i.e. words) occurring in the collection, the subset of r′ ≪ r features that are most useful for compactly representing the meaning of the documents. We propose a novel feature selection technique, based on a simplified variant of the χ² statistics. Classifier induction refers instead to the problem of automatically building a text classifier by learning from a set of documents pre-classified under the categories of interest. We propose a novel variant, based on the exploitation of negative evidence, of the well-known k-NN method. We report the results of systematic experimentation of these two methods performed on the standard Reuters-21578 benchmark.
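As background for the feature selection step described in this abstract, here is a minimal sketch of the standard χ² term-category statistic computed from a 2×2 document-count table. It is not the simplified variant the paper proposes; the function names and the set-of-terms document representation are assumptions made for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square for a 2x2 term/category table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a margin is empty, so the statistic is undefined; use 0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, category, k):
    """Keep the k terms with the highest chi-square score for one category.

    docs: list of sets of terms; labels: list of sets of category names.
    """
    n_pos = sum(1 for cats in labels if category in cats)
    n_neg = len(docs) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for terms, cats in zip(docs, labels):
        (pos_df if category in cats else neg_df).update(set(terms))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Because the difference `a * d - c * b` is squared, the plain statistic scores a term that strongly indicates absence from the category as highly as one that indicates membership.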
A maximum entropy approach to information extraction from semi-structured and free text
- In Proceedings of the Eighteenth National Conference on Artificial Intelligence
, 2002
Cited by 35 (0 self)
In this paper, we present a classification-based approach towards single-slot as well as multi-slot information extraction (IE). For single-slot IE, we worked on the domain of Seminar Announcements, where each document contains information on only one seminar. For multi-slot IE, we worked on the domain of Management Succession. For this domain, we restrict ourselves to extracting information sentence by sentence, in the same way as (Soderland 1999). Each sentence can contain information on several management succession events. By using a classification approach based on a maximum entropy framework, our system achieves higher accuracy than the best previously published results in both domains.
A tutorial on automated text categorisation
- Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pages 7--35, Buenos Aires, AR
, 1999
Cited by 27 (0 self)
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late ’80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert knowledge on how to classify documents. In the ’90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest. A newer paradigm based on machine learning has superseded the previous approach. Within this paradigm, a general inductive process automatically builds a classifier by “learning”, from a set of previously classified documents, the characteristics of one or more categories; the advantages are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues of document indexing, classifier construction, and classifier evaluation will be touched upon.
Proofs in Context
- In Principles of Knowledge Representation and Reasoning
, 1994
Cited by 26 (4 self)
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organise documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material, and therefore it will be necessary to resort to techniques for automatic classification of documents. Classification is traditionally performed by extracting information for indexing a document from the document itself. The paper describes the technique of categorisation by context, which exploits the context perceivable from the structure of HTML documents to extract useful information for classifying the documents they refer to. We present the results of experiments with a preliminary implementation of the technique.
Text categorization
- Text Mining and its Applications to Intelligence, CRM and Knowledge Management
, 2005
Cited by 24 (1 self)
Text categorization (also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, survey coding, and even automated essay grading. Automated text classification is attractive because it frees organizations from the need of manually organizing document bases, which can be too expensive, or simply not feasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology. This chapter will outline the fundamental traits of the technologies involved, of the applications that can feasibly be tackled through text classification, and of the tools and resources that are available to the researcher and developer wishing to take up these technologies for deploying real-world applications.
Text categorization using compression models
- In Proceedings of DCC-00, IEEE Data Compression Conference, Snowbird, US
, 2000
Cited by 21 (1 self)
Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a “supervised learning” approach to categorization, where already-classified articles—which

