A Comparative Study on Feature Selection in Text Categorization
, 1997
Cited by 739 (11 self)
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary redu...
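The DF thresholding named in the abstract above can be sketched roughly as follows. This is a minimal illustration, assuming documents are already tokenized into lists of terms; the function names and the `min_df` cutoff are hypothetical and are not taken from the paper.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for doc in docs:
        # set() ensures a term is counted once per document,
        # however many times it appears in that document.
        df.update(set(doc))
    return df


def df_threshold(docs, min_df):
    """Keep only the terms whose document frequency is at least min_df.

    min_df is a hypothetical cutoff chosen by the caller.
    """
    df = document_frequency(docs)
    return {term for term, count in df.items() if count >= min_df}


docs = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
vocabulary = df_threshold(docs, min_df=2)
```

Raising `min_df` shrinks the vocabulary further; the abstract reports results only for its own experimental setup, so any particular cutoff here is purely illustrative.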
Noise Reduction in a Statistical Approach to Text Categorization
, 1995
Cited by 50 (5 self)
This paper studies noise reduction for computational efficiency improvements in a statistical learning method for text categorization, the Linear Least Squares Fit (LLSF) mapping. Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of “non-informative words” from texts before training; the use of a truncated singular value decomposition to cut off noisy “latent semantic structures” during training; the elimination of non-influential components in the LLSF solution (a word-concept association matrix) after training. Text collections in different domains were used for evaluation. Significant improvements in computational efficiency without losing categorization accuracy were evident in the testing results.
Quantifying Query Ambiguity
Cited by 31 (5 self)
We develop a measure of a query with respect to a collection of documents with the aim of quantifying the query's ambiguity with respect to those documents. This measure, the clarity score, is the relative entropy between a query language model and the corresponding collection language model. We substantiate that the clarity score measures the coherence and specificity of the language used in documents likely to satisfy the query. We also argue that it provides a suitable quantification of the (lack of) ambiguity of a query with respect to a collection of documents and has potential applications throughout the field of information retrieval. In particular, the clarity score is shown to correlate positively with average precision in evaluations using TREC test collections. Hence, as one example, the clarity score could serve as a predictor of query performance. Systems would then be able to identify vague information requests and respond differently than they would to clear and specific requests.
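The clarity score described above, a relative entropy between two language models, can be sketched as below. This is a minimal illustration, assuming both models are unigram distributions stored as dictionaries and that every word with nonzero query probability also has nonzero collection probability. The abstract does not say how the query language model is estimated, so it is simply passed in; the function names and the base-2 logarithm are assumptions.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def clarity_score(query_model, collection_model):
    """Relative entropy (KL divergence, in bits) of the query model
    from the collection model.

    Assumes every word with nonzero query probability is present in
    collection_model with nonzero probability.
    """
    score = 0.0
    for word, p_query in query_model.items():
        if p_query == 0.0:
            continue  # zero-probability terms contribute nothing
        p_collection = collection_model[word]
        score += p_query * math.log2(p_query / p_collection)
    return score


collection_model = unigram_model("a b c d".split())
focused_query_model = {"a": 0.5, "b": 0.5}  # hypothetical query model
score = clarity_score(focused_query_model, collection_model)
```

A query model identical to the collection model scores zero under this definition, while a model concentrated on a few words scores higher, which matches the abstract's reading of the score as a measure of specificity.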
An Evaluation on Feature Selection for Text Clustering
- In ICML
, 2003
Cited by 17 (2 self)
Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, we first give empirical evidence that feature selection methods can improve the efficiency and performance of text clustering algorithms. Then we propose a new feature selection method called “Term Contribution (TC)” and perform a comparative study on a variety of feature selection methods for text clustering, including Document Frequency (DF), Term Strength (TS), Entropy-based (En), Information Gain (IG) and χ² statistic (CHI). Finally, we propose an “Iterative Feature Selection (IF)” method that addresses the label-unavailability problem by utilizing an effective supervised feature selection method to iteratively select features and perform clustering. Detailed experimental results on Web Directory data are provided in the paper.
A Comparative Study for Domain Ontology Guided Feature Extraction
, 2003
Cited by 10 (0 self)
We introduced a novel method employing a hierarchical domain ontology structure to extract features representing documents in our previous publication (Wang 2002). All raw words in the training documents are mapped to concepts in a concept hierarchy derived from the domain ontology. Based on these concepts, a concept hierarchy is established for the training document space, using is-a relationships defined in the domain ontology. An optimum concept set may be obtained by searching the concept hierarchy with an appropriate heuristic function. This may be used as the feature space to represent the training dataset. The proposed method aims to solve some drawbacks suffered by text classification algorithms and feature selection algorithms. In this paper, we conducted a series of experiments to compare our approach with other comparable feature-selection and feature-extraction methods. The results indicated that our approach has advantages in many aspects.
Feature Weighting for Co-occurrence-based Classification of Words
- Proc. 20th Int’l Conf. Computational Linguistics, article No. 799
, 2004
Cited by 4 (0 self)
The paper comparatively studies methods of feature weighting in application to the task of co-occurrence-based classification of words according to their meaning. We explore parameter optimization of several weighting methods frequently used for similar problems such as text classification. We find that successful application of all the methods crucially depends on a number of parameters; only a carefully chosen weighting procedure yields a consistent improvement over a classifier learned from non-weighted data.
Weighting Distributional Features for Automatic Semantic Classification of Words
- In International Conference on Recent Advances In Natural Language Processing
, 2003
Cited by 1 (0 self)
The paper is concerned with weighting distributional features of words with the aim of improving their automatic semantic classification, a task relevant to a number of NLP applications such as lexicon acquisition or named entity recognition. The purpose of the paper is to bring attention to differences between two major weighting strategies: Discriminative Feature Weighting and Characteristic Feature Weighting. The comparative study includes three popular discriminative weighting functions (Mutual Information, Information Gain, and Gain Ratio), and three characteristic weighting functions (Term Strength, and the two newly introduced Local Term Strength and Confidence). We find that the two strategies, on the one hand, are characterized by their own optimal settings, and, on the other hand, similarly interact with the parameter optimization of the learning algorithm.
Associative Feature Selection for Text Mining
Cited by 1 (0 self)
With the exponential growth of the number of documents available on the Internet, automatic feature selection approaches are increasingly important for the preprocessing of textual documents for data mining. Feature selection, which focuses on identifying relevant data, can help reduce the workload of processing huge amounts of data as well as increase the accuracy for the subsequent data mining tasks. In this paper, we propose a new feature selection approach for text mining based on association rules. An evaluation on the performance of the proposed associative feature selection approach based on a dataset of
Libraries of Medicine
Cited by 1 (0 self)
Further information about the programs described in this administrative report is available from: Office of Public Information
Reducing the Loss of Information through Annealing Text Distortion
Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper we take a step towards understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words in such a way that the complexity of a document is slowly reduced helps the compression-based text clustering and improves its accuracy. In fact, we show how the non-distorted text clustering can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent across different data sets and different compression algorithms belonging to the most important compression families: Lempel-

