Machine Learning in Automated Text Categorization
ACM Computing Surveys, 2002
Cited by 839 (13 self)
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
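As a rough illustration of the inductive process this abstract describes, the sketch below builds a classifier from a handful of preclassified documents. The multinomial Naive Bayes learner, the whitespace tokenizer, and the toy training set are all assumptions chosen for brevity; the survey covers many learners and does not single out this one.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def train(labeled_docs):
    """Learn per-category term counts from (text, category) pairs."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        term_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return term_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the category with the highest Laplace-smoothed log posterior."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        denom = sum(term_counts[category].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # ignore terms never seen in training
                score += math.log((term_counts[category][token] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical preclassified documents, for illustration only.
TRAINING_SET = [
    ("stocks fell as markets closed", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sport"),
    ("the striker scored a late goal", "sport"),
]
MODEL = train(TRAINING_SET)
print(classify(MODEL, "interest rates and markets"))
```

The three problems named in the abstract map loosely onto `tokenize` (document representation) and `train` (classifier construction); classifier evaluation would compare `classify` output against held-out labels.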
Hierarchical Text Categorization Using Neural Networks
Information Retrieval, 2002
Cited by 63 (0 self)
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide and conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons with an optimized version of the traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers are provided. The results show that the use of the hierarchical structure improves text categorization performance with respect to an equivalent flat model. The optimized Rocchio algorithm achieves a performance comparable with that of the hierarchical neural networks.
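For readers unfamiliar with the Rocchio baseline mentioned in this abstract, here is a minimal sketch of a plain Rocchio-style prototype classifier. It is not the paper's optimized variant: the raw term-frequency weighting, the `beta` and `gamma` values, and the toy medical documents are assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Raw term-frequency vector (the weighting scheme is an assumption)."""
    return Counter(text.lower().split())


def mean_vector(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio_prototype(positives, negatives, beta=16.0, gamma=4.0):
    """beta * mean(positives) - gamma * mean(negatives), keeping weights > 0."""
    pos_mean = mean_vector(positives)
    neg_mean = mean_vector(negatives) if negatives else {}
    prototype = {}
    for term in set(pos_mean) | set(neg_mean):
        weight = beta * pos_mean.get(term, 0.0) - gamma * neg_mean.get(term, 0.0)
        if weight > 0:
            prototype[term] = weight
    return prototype


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_docs):
    """Build one prototype per category from (text, category) pairs."""
    prototypes = {}
    for category in {c for _, c in labeled_docs}:
        pos = [vectorize(t) for t, c in labeled_docs if c == category]
        neg = [vectorize(t) for t, c in labeled_docs if c != category]
        prototypes[category] = rocchio_prototype(pos, neg)
    return prototypes


def classify(prototypes, text):
    """Assign the category whose prototype is most similar to the document."""
    vector = vectorize(text)
    return max(prototypes, key=lambda c: cosine(vector, prototypes[c]))


# Hypothetical training documents, not taken from OHSUMED.
DOCS = [
    ("heart attack and cardiac arrest", "cardiology"),
    ("cardiac surgery for heart valve", "cardiology"),
    ("lung infection and pneumonia", "pulmonology"),
    ("chronic lung disease and asthma", "pulmonology"),
]
PROTOTYPES = train(DOCS)
print(classify(PROTOTYPES, "heart valve disease"))
```

This is a flat classifier: every category competes directly with every other. A hierarchical method of the kind the abstract describes would instead split the decision along the category tree.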
Machine Learning in Automated Text Categorisation
1999
Cited by 16 (1 self)
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early ’60s. Until the late ’80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. In the ’90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest, prompted by which the machine learning paradigm to automatic classifier construction has emerged and definitely superseded the knowledge-engineering approach. Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by “learning”, from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are a very good effectiveness, a considerable savings in terms of expert manpower, and domain independence. In this survey we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues pertaining to document indexing, classifier construction, and classifier evaluation, will be discussed in detail. A final section will be devoted to the techniques that have specifically been devised for an emerging application such as the automatic classification of Web pages into “Yahoo!-like” hierarchically structured sets of categories.
MeSH Up: effective MeSH text classification for improved document retrieval
Bioinformatics, 2009
Cited by 5 (1 self)
Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared to a limited number of other systems. Results: We compare the performance of 6 MeSH classification systems (MetaMap, EAGL, a language and a vector space model based approach, a K-Nearest Neighbor approach and MTI) in terms of reproducing and complementing manual MeSH annotations. A K-Nearest Neighbor system clearly outperforms the other published approaches and scales well with large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement can be obtained in information retrieval (IR) when the text of a user’s query is automatically annotated with MeSH concepts, compared to using the original textual query alone. Conclusions: The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable and it generates improvements in IR comparable to those observed for manual annotations.
Robust Statistical Techniques for the Categorization of Images Using Associated Text
2003
Cited by 2 (0 self)
The field of text categorization, which aids applications such as browsing, filtering, and search, has experienced a revival due to the vast amounts of unlabeled data available on line and as part of digital collections. Almost all of the literature in the field, however, deals with the categorization of text-only documents. Many of the same techniques can be applied to text associated with multimedia documents to label the multimedia component. My dissertation provides an in-depth exploration of the automatic categorization of images using associated text. This research takes advantage of a corpus I have created containing news documents with embedded captioned images and multiple sets of categories. It turns out that the text and categories associated with images tend to have different properties than those associated with full-length text documents such as e-mails, articles, and web pages. Also, images provide us with an additional type of information; namely, low-level image features. For these reasons, I have achieved success in several areas of research that have previously been problematic, such as combining systems and using NLP techniques to improve performance. Some benefits of this work
Inverted files and dynamic signature files for optimisation of Web Directories
International Journal of High Performance Computing Applications, 2002
Web directories are taxonomies for the classification of Web documents. This kind of IR system presents a specific type of search where the document collection is restricted to one area of the category graph. This paper introduces a specific data architecture for Web directories which improves the performance of restricted searches. That architecture is based on a hybrid data structure composed of an inverted file with multiple embedded signature files. Two variants based on the proposed model are presented: hybrid architecture with total information and hybrid architecture with partial information. The validity of this architecture has been analysed by developing both variants and comparing them with a basic model. The performance of the restricted queries was clearly improved, especially with the hybrid model with partial information, which yielded a positive response under any load of the search system.
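To make the idea of an inverted file with embedded signatures more concrete, here is a minimal sketch of a restricted search. The layout is an assumption, not the paper's actual architecture: each posting carries a superimposed-coding signature of the document's categories, and the `HybridIndex` class, the signature width, and the hashing scheme are all hypothetical.

```python
import hashlib
from collections import defaultdict

SIGNATURE_BITS = 64    # assumed signature width
BITS_PER_CATEGORY = 3  # assumed number of bits set per category


def category_signature(category):
    """Superimposed-coding signature: set a few hash-chosen bits."""
    signature = 0
    for i in range(BITS_PER_CATEGORY):
        digest = hashlib.md5(f"{category}:{i}".encode()).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % SIGNATURE_BITS)
    return signature


class HybridIndex:
    """Inverted file whose postings embed a category signature (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> [(doc_id, signature)]
        self.doc_categories = {}           # doc_id -> set of categories

    def add(self, doc_id, text, categories):
        """`categories` lists the document's category and all its ancestors."""
        signature = 0
        for category in categories:
            signature |= category_signature(category)
        self.doc_categories[doc_id] = set(categories)
        for term in set(text.lower().split()):
            self.postings[term].append((doc_id, signature))

    def search(self, term, category):
        """Return documents containing `term` within the given category area."""
        query_sig = category_signature(category)
        hits = []
        for doc_id, signature in self.postings.get(term.lower(), []):
            if (signature & query_sig) != query_sig:
                continue  # the signature rules this document out cheaply
            if category in self.doc_categories[doc_id]:  # discard false drops
                hits.append(doc_id)
        return hits


index = HybridIndex()
index.add(1, "python tutorial", ["computers", "computers/programming"])
index.add(2, "python snake care", ["animals", "animals/reptiles"])
print(index.search("python", "animals"))
```

The signature test is only a filter: superimposed codes can collide, so each surviving candidate is checked against the stored category set before being returned.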
A framework for automatic combination of media contents by minimising information redundancy Case: Integrated publishing in multimedia networks
2001
Information redundancy becomes a crucial problem in the Web when contents from different resources are automatically combined to produce a new WWW-publication.
A Categorization Method for French Legal Documents
This paper briefly describes ongoing work on categorizing French legal documents. We used documents from the French official publication Journal Officiel de la République française, édition Lois et Décrets (J.O.), which gathers laws, decrees, and decisions from various administrations. These documents are published on an internet site, http://droit.org, which intends to be a web portal for French law. The principal aim of this text categorization system is to determine which subfields of law a given legal document deals with. As a result, thematic access to subfields of law will be provided and the text retrieval effectiveness of legal documents improved, as reported by [3]
A Multilevel Semantic Document Classifier Based On SVM Integrated With Domain Ontologies
A multilevel semantic document classification system based on Support Vector Machine (SVM) in association with domain ontologies has been developed. Documents from scientific domains such as computer science and chemistry are used as the test source. The classification results are more precise and fine-grained than those of conventional methodologies. The sharpness of the classification has been found to be enhanced when domain knowledge in the form of ontologies is integrated with SVM procedures. The developed system thus provides the advantages of high generalization performance, prevention of overfitting, lower computational complexity, high accuracy, and robustness. The automated identification of semantic components derived from the domain ontologies enables the system to provide semantically rich classification results.

