Results 1 - 10
of
37
Machine Learning in Automated Text Categorization
- ACM Computing Surveys
, 2002
"... The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this p ..."
Abstract
-
Cited by 839 (13 self)
- Add to MetaCart
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
A Case Study in Using Linguistic Phrases for Text Categorization on the WWW
- In Working Notes of the AAAI/ICML Workshop on Learning for Text Categorization
, 1998
"... Most learning algorithms that are applied to text categorization problems rely on a bag-of-words document representation, i.e., each word occurring in the document is considered as a separate feature. In this paper, we investigate the use of linguistic phrases as input features for text categorizati ..."
Abstract
-
Cited by 41 (9 self)
- Add to MetaCart
Most learning algorithms that are applied to text categorization problems rely on a bag-of-words document representation, i.e., each word occurring in the document is considered as a separate feature. In this paper, we investigate the use of linguistic phrases as input features for text categorization problems. These features are based on information extraction patterns that are generated and used by the AutoSlog-TS system. We present experimental results on using such features as background knowledge for two machine learning algorithms on a classification task on the WWW. The results show that phrasal features can improve the precision of learned theories at the expense of coverage.
The Role of Information Extraction for Textual CBR
- In Proceedings of the 4th International Conference on Case-Based Reasoning, ICCBR ’01
, 2001
"... The benefits of CBR methods in domains where cases are text depend on the underlying text representation. Today, most TCBR approaches are limited to the degree that they are based on efficient, but weak IR methods. These do not allow for reasoning about the similarities between cases, which is manda ..."
Abstract
-
Cited by 28 (5 self)
- Add to MetaCart
The benefits of CBR methods in domains where cases are text depend on the underlying text representation. Today, most TCBR approaches are limited to the degree that they are based on efficient, but weak IR methods. These do not allow for reasoning about the similarities between cases, which is mandatory for many CBR tasks beyond text retrieval, including adaptation or argumentation. In order to carry out more advanced CBR that compares complex cases in terms of abstract indexes, NLP methods are required to derive a better case representation. This paper discusses how state-of-the-art NLP/IE methods might be used for automatically extracting relevant factual information, preserving information captured in text structure and ascertaining negation. It also presents our ongoing research on automatically deriving abstract indexing concepts from legal case texts. We report progress toward integrating IE techniques and ML for generalizing from case texts to our CBR case represen...
NLP for Term Variant Extraction: Synergy between Morphology, Lexicon, and Syntax
, 1999
"... . We present a natural language processing (NLP) approach to automatic indexing over controlled vocabulary which accounts for term variation. The approach combines a part of speech tagger, a generator of morphologically related forms, and a shallow transformational parser. The system is applied to t ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
. We present a natural language processing (NLP) approach to automatic indexing over controlled vocabulary which accounts for term variation. The approach combines a part of speech tagger, a generator of morphologically related forms, and a shallow transformational parser. The system is applied to the French language; it is trained on newspaper articles and tested on scientific literature. Precision rate of indexing on term and variants is 97.2%. It is only slightly lower than indexing without accounting for term variation (99.7%). Recall rate of indexing on term and variants (93.4%) is much higher than recall of indexing on term occurrences only (72.4%). Conflation of term variants increases indexing coverage up to 30%. The system is a convincing example of the potential synergy between full-fledged morphological analysis and local syntactic analysis. Many details are provided on the implementation of the system. Illustrative examples of syntactic transformations for the French language are given together with the theoretical and empirical methods for their formulation. 2 CHRISTIAN JACQUEMIN AND EVELYNE TZOUKERMANN 1.
A Large Benchmark Dataset for Web Document Clustering
- Soft Computing Systems: Design, Management and Applications, Volume 87 of Frontiers in Artificial Intelligence and Applications
, 2002
"... Targeting useful and relevant information on the WWW is a topical and highly complicated research area. A thriving research effort that feeds into this area is document clustering, which overlaps closely with areas usually known as text classification and text categorisation. A foundational aspect o ..."
Abstract
-
Cited by 21 (1 self)
- Add to MetaCart
Targeting useful and relevant information on the WWW is a topical and highly complicated research area. A thriving research effort that feeds into this area is document clustering, which overlaps closely with areas usually known as text classification and text categorisation. A foundational aspect of such research (which has been proven over and over again in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked andassessedincomparisontoeachother. Wenotehereinthat,sofarinthisbroad area of research, as many datasets have been used as research papers written, thus making it difficult to reason about the relative performance of different categorisation/clustering techniques used in different papers. In this paper we propose a standard dataset with a variety of properties suitable for a wide range of clustering and related experiments. We describe how the dataset was generated, and provide a pointer to it, and encourage its access and use. We also illustrate the use of part of the dataset by establishing benchmark results for simple k-means clustering, comparing the relative performance of k-means on a pair of `close' categories and a pair of `distant' categories. We naturally find that performance is better on the pair of `distant' categories, however the experiments reveal that although stop-word removal is confirmed as helpful, word-stemming is, (perhaps counter to intuition), not necessarily always recommended on `distant' categories.
A Study Using n-gram Features for Text Categorization
, 1998
"... In this paper, we study the effect of using-grams (sequences of words of length) for text categorization. We use an efficient algorithm for generating such-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algo ..."
Abstract
-
Cited by 20 (1 self)
- Add to MetaCart
In this paper, we study the effect of using-grams (sequences of words of length) for text categorization. We use an efficient algorithm for generating such-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or 3 are most useful. Using longer sequences reduces classification performance.
Effective Use of Natural Language Processing Techniques for Automatic Conflation of Multi-Word Terms: The Role of Derivational Morphology, Part of Speech Tagging, and Shallow Parsing
- In Research and Development in Information Retrieval
"... We present a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The system has been applied to French. The unique contribution of the research is in using these linguistically based ..."
Abstract
-
Cited by 20 (3 self)
- Add to MetaCart
We present a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The system has been applied to French. The unique contribution of the research is in using these linguistically based tools with safety filters in order to avoid the problems of degradation typically associated with derivational analysis and generation. The successful expansion and thus conflation of terms, increases indexing coverage up to 30% with precision of nearly 90% for correct identification of related terms. The fully implemented system is described with particular attention on the role of derivational morphology and phrasal relations. Results and evaluation are presented in terms of precision and recall, with an analysis and discussion of errors. This paper illustrates how natural language processing tools, when combined effectively for tasks to which they are especially suited, indicates the pote...
Cross-Lingual Text Categorization
, 2003
"... This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-e#ective solutions for automatic Cross-Lingual Text Categorization, bot ..."
Abstract
-
Cited by 17 (2 self)
- Add to MetaCart
This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-e#ective solutions for automatic Cross-Lingual Text Categorization, both in case a sufficient number of training examples is available for each new language and in the case that for some language no training examples are available. Experimental
Information Retrieval: A Survey
, 2000
"... Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or which may be structured, e.g., a boolean expression. T ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or which may be structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both internal, corporate document collections, and the immense and growing number of document sources on the Internet. This report is a tutorial and survey of the state of the art, both research and commercial, in this dynamic field. The topics covered include: formulation of structured and unstructured queries and topic statements, indexing (including term weighting) of document collections, methods for computing the similarity of queries and documents, classification and routing of documents in an incoming stream to users on the basis of topic or nee...

