Using Unlabeled Data to Improve Text Classification (2001)

by Kamal Paul Nigam
Citations:41 - 0 self

BibTeX

@TECHREPORT{Nigam01usingunlabeled,
    author = {Kamal Paul Nigam},
    title = {Using Unlabeled Data to Improve Text Classification},
    institution = {},
    year = {2001}
}


Abstract

One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data -- labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
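
The basic approach in the abstract (fit a generative model to labeled and unlabeled documents with EM) can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model with add-alpha smoothing standing in for the prior; the abstract does not fix these details here, and the function names (`train_nb`, `posterior`, `em_nb`, `classify`) and the toy documents are hypothetical, not taken from the dissertation.

```python
import math
from collections import Counter

def train_nb(docs, resp, n_classes, vocab, alpha=1.0):
    """M-step: smoothed parameter estimates from (possibly fractional) class weights.
    docs is a list of word-count Counters; resp[i][c] is the weight of doc i in class c."""
    prior = [alpha] * n_classes
    word = [{w: alpha for w in vocab} for _ in range(n_classes)]
    for doc, r in zip(docs, resp):
        for c in range(n_classes):
            prior[c] += r[c]
            for w, n in doc.items():
                word[c][w] += r[c] * n
    z = sum(prior)
    log_prior = [math.log(p / z) for p in prior]
    log_word = []
    for c in range(n_classes):
        total = sum(word[c].values())
        log_word.append({w: math.log(v / total) for w, v in word[c].items()})
    return log_prior, log_word

def posterior(doc, log_prior, log_word):
    """E-step for one document: P(class | doc) under the current model."""
    scores = [lp + sum(n * lw[w] for w, n in doc.items())
              for lp, lw in zip(log_prior, log_word)]
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def em_nb(labeled, labels, unlabeled, n_classes, iters=10):
    """Initialize from labeled docs only, then alternate E- and M-steps on all docs."""
    labeled = [Counter(d.split()) for d in labeled]
    unlabeled = [Counter(d.split()) for d in unlabeled]
    vocab = set().union(*labeled, *unlabeled)
    # Labeled documents keep fixed one-hot class weights throughout.
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    model = train_nb(labeled, hard, n_classes, vocab)
    for _ in range(iters):
        soft = [posterior(d, *model) for d in unlabeled]  # probabilistic labels
        model = train_nb(labeled + unlabeled, hard + soft, n_classes, vocab)
    return model, vocab

def classify(text, model, vocab):
    """Return the index of the most probable class, ignoring out-of-vocabulary words."""
    doc = Counter(w for w in text.split() if w in vocab)
    probs = posterior(doc, *model)
    return max(range(len(probs)), key=probs.__getitem__)

# Toy usage: one labeled document per class, four unlabeled documents.
model, vocab = em_nb(
    ["ball goal team", "vote election senate"], [0, 1],
    ["goal team striker", "team striker coach",
     "election senate ballot", "senate ballot poll"],
    n_classes=2,
)
print(classify("striker coach", model, vocab), classify("ballot poll", model, vocab))
```

In this toy setting the words "striker", "coach", "ballot", and "poll" never occur in a labeled document, so a classifier trained on the labeled data alone has no evidence about them; EM picks up their class association through co-occurrence in the unlabeled documents. Because EM only reaches a local maximum, the result depends on the initialization from the labeled data, which is the sensitivity the abstract describes.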
