## Noise Reduction in a Statistical Approach to Text Categorization (1995)

Citations: 58 (7 self)

### BibTeX

@TECHREPORT{Yang95noisereduction,
  author      = {Yiming Yang},
  title       = {Noise Reduction in a Statistical Approach to Text Categorization},
  institution = {},
  year        = {1995}
}


### Abstract

This paper studies noise reduction for computational efficiency improvements in a statistical learning method for text categorization, the Linear Least Squares Fit (LLSF) mapping. Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of “non-informative words” from texts before training; the use of a truncated singular value decomposition to cut off noisy “latent semantic structures” during training; and the elimination of non-influential components in the LLSF solution (a word-concept association matrix) after training. Text collections in different domains were used for evaluation. The testing results show significant improvements in computational efficiency without loss of categorization accuracy.
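The truncated-SVD strategy from the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the matrix sizes, the cutoff `k`, and the random data are all assumptions made for illustration.

```python
# Minimal sketch of an LLSF-style mapping solved via a truncated SVD.
# All sizes, the cutoff k, and the data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 200))   # text-word matrix: 50 training texts, 200 words
B = rng.random((50, 10))    # text-category matrix: 10 categories

# Full SVD of the input matrix A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values, cutting off the noisy
# "latent semantic structures" associated with small singular values.
k = 20
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Solve A @ W ~= B in the least-squares sense via the truncated
# pseudo-inverse; W is a word-category association matrix.
W = Vt_k.T @ np.diag(1.0 / s_k) @ U_k.T @ B
print(W.shape)  # (200, 10)
```

Shrinking `k` discards more structures: training gets cheaper and noisier directions are dropped, at the risk of losing informative ones.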

### Citations

2721 | Indexing by Latent Semantic Analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context: ...other set of singular vectors has dependent variables as the dimensions. These singular vectors are interpreted as the "orthogonal factors", "artificial concepts" or "latent semantic structures" (LSS) [20]. The word "latent" is used because it is often difficult to give an intuitive interpretation about the meanings of these structures. It would be helpful, however, to have some understanding of the LS...

1213 | Automatic text processing: the transformation, analysis, and retrieval of information by computer
- Salton
- 1989
Citation Context: ...arly, word strengths are domain specific or application specific. An important difference between word strength and other corpus-dependent word weighting schemes such as the Inverse Document Frequency [17] is that word strength is not computed based on word occurrences in documents, but based on word co-occurrences in pairs of related documents. Word strength measures how informative a word is in ide...

153 | Expert network: effective and efficient learning from human decisions in text categorization and retrieval
- Yang
- 1994
Citation Context: ...ary gap problem has been recognized, and statistical learning of text-to-categories mapping based on human assignments has been a major focus in recent research in text categorization [3] [4] [5] [6] [7]. The LLSF mapping is a successful learner relying on past human relevance judgments, and can be used for both text retrieval [2] and text categorization. Significant improvements of LLSF mapping have...

115 | An example-based mapping method for text categorization and retrieval
- Yang, Chute
- 1994
Citation Context: ...trolled indexing language of a particular database is usually large. Consequently, search methods based on shared words between free text and category names typically exhibit poor performance [1] [2] [3]. The importance of using human knowledge to solve the vocabulary gap problem has been recognized, and statistical learning of text-to-categories mapping based on human assignments has been a major fo...

97 | Latent Semantic Indexing (LSI) and TREC-2
- Dumais
- 1994
Citation Context: ...additional cost of the SVD computation is far more expensive than baseline text matching. Evaluation results, however, have shown no reliable improvement of LSI over baseline text matching [20] [21] [22]. My hypothesis about truncated SVD is different from the hypothesis in LSI. I do not claim an improvement of synonym representation in a document matrix by using such an approach. Whether such a hypot...

92 | Classifying news stories using memory based reasoning
- Masand, Linoff, et al.
- 1992
Citation Context: ...abulary gap problem has been recognized, and statistical learning of text-to-categories mapping based on human assignments has been a major focus in recent research in text categorization [3] [4] [5] [6] [7]. The LLSF mapping is a successful learner relying on past human relevance judgments, and can be used for both text retrieval [2] and text categorization. Significant improvements of LLSF mapping...

75 | Automatic Retrieval With Locality Information Using SMART
- Buckley, Salton, et al.
- 1993
Citation Context: ...hile the generic stop words are relatively "safe" to remove in the sense that their removal rarely causes a significant accuracy loss, the chance of significant accuracy improvement is also small [3] [15]. Since a generic stop-word list is often much smaller than the vocabularies of real-world document collections, only a limited number of words can be removed from texts, and the improvement in comput...

54 | Automatic indexing based on Bayesian inference networks
- Tzeras, Hartmann
- 1993
Citation Context: ...vocabulary gap problem has been recognized, and statistical learning of text-to-categories mapping based on human assignments has been a major focus in recent research in text categorization [3] [4] [5] [6] [7]. The LLSF mapping is a successful learner relying on past human relevance judgments, and can be used for both text retrieval [2] and text categorization. Significant improvements of LLSF mapp...

51 | Air/x - a rule-based multistage indexing system for large subject fields
- Fuhr, Hartmann, et al.
- 1991

24 | A Linear Least Squares Fit Mapping Method for Information Retrieval from Natural Language Texts
- Yang, Chute
- 1992
Citation Context: ...the controlled indexing language of a particular database is usually large. Consequently, search methods based on shared words between free text and category names typically exhibit poor performance [1] [2] [3]. The importance of using human knowledge to solve the vocabulary gap problem has been recognized, and statistical learning of text-to-categories mapping based on human assignments has been a...

24 | The automatic identification of stop words
- Wilbur, Sirotkin
- 1992
Citation Context: ...to using generic stop words, Wilbur and Sirotkin developed a novel stop-word identification method which allows a far more aggressive removal of words from documents without losing retrieval accuracy [16]. This method uses a collection of training documents to estimate word importance using a score, namely "word strength". Clearly, word strengths are domain specific or application specific. An importa...

16 | An application of least squares fit mapping to text information retrieval
- Yang, Chute
- 1993
Citation Context: ...controlled indexing language of a particular database is usually large. Consequently, search methods based on shared words between free text and category names typically exhibit poor performance [1] [2] [3]. The importance of using human knowledge to solve the vocabulary gap problem has been recognized, and statistical learning of text-to-categories mapping based on human assignments has been a majo...

12 | LINPACK user's guide
- Dongarra, Bunch, et al.
- 1979
Citation Context: ...solving a LLSF problem employs a singular value decomposition (SVD) [8] of the input matrix A (the text-word matrix) as a part of the computation. The standard SVD algorithm, as implemented in LINPACK [9], has a time complexity O(m^2 n), where m is the number of texts in the training corpus and n is the number of distinct words in these texts (assuming m ≤ n). If m and n are large, this cubic complexity...

7 | Latent Semantic Indexing of medical diagnoses using UMLS semantic structures. Proc Annu Symp Comput Appl Med Care
- Chute, Yang, et al.
Citation Context: ...t the additional cost of the SVD computation is far more expensive than baseline text matching. Evaluation results, however, have shown no reliable improvement of LSI over baseline text matching [20] [21] [22]. My hypothesis about truncated SVD is different from the hypothesis in LSI. I do not claim an improvement of synonym representation in a document matrix by using such an approach. Whether such a...

3 | Using Corpus Statistics to Remove Redundant Words
- Yang
- 1996
Citation Context: ...t. Yang and Wilbur have applied the aggressive word removal method to document categorization to remove non-informative words from documents before applying a categorization method to these documents [18]. The effects on several categorization methods on different document collections have been studied and the effectiveness has been evident in the experiments. For all the methods tested, including two...

1 | Matrix Computations, 2nd Edition
- Golub, Van Loan
- 1989
Citation Context: ...ial question about LLSF is the computational complexity when applying LLSF to very large text collections. A conventional method for solving a LLSF problem employs a singular value decomposition (SVD) [8] of the input matrix A (the text-word matrix) as a part of the computation. The standard SVD algorithm, as implemented in LINPACK [9], has a time complexity O(m^2 n), where m is the number of texts in...

1 | Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. 1: Theory
- Cullum, Willoughby
- 1985
Citation Context: ...exts (assuming m ≤ n). If m and n are large, this cubic complexity can become a computational bottleneck. Alternate algorithms were developed for very large and sparse matrices. The Lanczos methods [8] [10] [11], for example, are particularly efficient in situations where relatively few singular values are desired and relatively few orthogonalizations are needed. To take full advantage of Lanczos, howev...

1 | Large-Scale Sparse Singular Value Computations
- Berry
- 1992
Citation Context: ...(assuming m ≤ n). If m and n are large, this cubic complexity can become a computational bottleneck. Alternate algorithms were developed for very large and sparse matrices. The Lanczos methods [8] [10] [11], for example, are particularly efficient in situations where relatively few singular values are desired and relatively few orthogonalizations are needed. To take full advantage of Lanczos, however, t...

1 | Evaluating Text Categorization
- Lewis
- 1991
Citation Context: ...f binary decisions over categories [3] [4] [5] [6] [7]. A ranked list, of course, can be used to obtain binary decisions by setting a threshold. There are discussions about using alternative measures [12] [3]; these issues, however, are open research questions, and are not the focus of this paper. To measure efficiency improvement, the time savings in SVD is used because it is the major part of the tr...

1 | Review of "Methods for statistical data analysis of multivariate observations"
- JE
- 1978
Citation Context: ...reduce the noise level in training documents is inspired by the Latent Structure Analysis theory, a well-known statistical method for analyzing causal factors or hidden reasons behind observed events [19]. A latent structure is defined as a linear combination of the independent variables, and can be computed using the SVD of a matrix which represents correspondences between independent and dependent v...

1 | ACM Guide to Computing Literature. Baltimore, MD: Association for Computing Machinery
- Harman, Ed
- 1984
Citation Context: ...and “keys” without distinction, and call these words together a document. The categories are defined in the Classification System for Computing Reviews (CSCR) and were assigned by humans to documents [14]. We used a subset of the 3703 documents in our experiments. We eliminated documents with an empty category field as obviously unsuitable for the study. We also eliminated documents with empty abstract...
