## Employing EM in Pool-Based Active Learning for Text Classification (1998)

Citations: 269 (9 self)

### BibTeX

@MISC{Mccallum98employingem,
  author = {Andrew McCallum and Kamal Nigam},
  title = {Employing EM in Pool-Based Active Learning for Text Classification},
  year = {1998}
}

### Abstract

This paper shows how a text classifier's need for labeled training data can be reduced by combining active learning with Expectation-Maximization (EM) on a pool of unlabeled data. Query-by-Committee is used to actively select documents for labeling; EM with a naive Bayes model then further improves classification accuracy by concurrently estimating probabilistic labels for the remaining unlabeled documents and using them to improve the model. We also present a metric that better measures disagreement among committee members by accounting for both the strength of their disagreement and the distribution of the documents. Experimental results show that our method of combining EM and active learning requires only half as many labeled training examples to achieve the same accuracy as either EM or active learning alone.

Keywords: text classification, active learning, unsupervised learning, information retrieval
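The improved disagreement metric the abstract alludes to can be sketched as the mean KL divergence of each committee member's class posterior from the committee's mean posterior, so that it grows with how *strongly* members disagree rather than just with the vote split. A minimal illustrative sketch; the function names are ours, not from the paper:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def committee_disagreement(member_posteriors):
    """Mean KL divergence of each committee member's class distribution
    from the committee's mean distribution. Unlike raw vote counting,
    this increases with the strength of the members' disagreement."""
    k = len(member_posteriors)
    n = len(member_posteriors[0])
    mean = [sum(d[j] for d in member_posteriors) / k for j in range(n)]
    return sum(kl_divergence(d, mean) for d in member_posteriors) / k
```

In a pool-based setting, the documents with the highest (optionally density-weighted) disagreement are the ones queried for labels.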

### Citations

8906 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

1865 | Text categorization with support vector machines: Learning with many relevant features - Joachims - 1998
Citation Context: ...st set. The ‘ModApte’ train/test split of the Reuters 21578 Distribution 1.0 data set consists of 12902 Reuters newswire articles in 135 overlapping topic categories. Following several other studies [Joachims 1998; Liere & Tadepalli 1997] we build binary classifiers for each of the 10 most populous classes. We ignore words on a stoplist, but do not use stemming. The resulting vocabulary has 19371 words. Result...

828 | A comparison of event models for naive Bayes text classification - McCallum, Nigam - 1998
Citation Context: ...ormulation of naive Bayes assumes a multinomial event model for documents; this generally produces better text classification accuracy than another formulation that assumes a multi-variate Bernoulli [McCallum & Nigam 1998]. 3 EM and Unlabeled Data: When naive Bayes is given just a small set of labeled training data, classification accuracy will suffer because variance in the parameter estimates of the generative model ...

637 | On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29:103–130 - Domingos, Pazzani - 1997
Citation Context: ...4; Craven et al. 1998; Joachims 1997]. This paradox is explained by the fact that classification estimation is only a function of the sign (in binary cases) of the function estimation [Friedman 1997; Domingos & Pazzani 1997]. Also note that our formulation of naive Bayes assumes a multinomial event model for documents; this generally produces better text classification accuracy than another formulation that assumes a mu...

565 | Distributional clustering of English words - Pereira, Tishby, et al. - 1993

556 | Active learning with statistical models - Cohn, Ghahramani, et al. - 1995

505 | A sequential algorithm for training text classifiers - Lewis, Gale - 1994
Citation Context: ...ith a Normal distribution. Several other studies have investigated active learning for text categorization. Lewis and Gale examine uncertainty sampling and relevance sampling in a pool-based setting [Lewis & Gale 1994; Lewis 1995]. These techniques select queries based on only a single classifier instead of a committee, and thus cannot approximate classification variance. Liere and Tadepalli [1997] use committees ...

374 | A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization - Joachims - 1997
Citation Context: ... each class would have generated the test document in question, and selecting the most probable class. Our parametric model is naive Bayes, which is based on commonly used assumptions [Friedman 1997; Joachims 1997]. First we assume that text documents are generated by a mixture model (parameterized by θ), and that there is a one-to-one correspondence between the (observed) class labels and the mixture componen...

359 | Learning to extract symbolic knowledge from the World Wide Web - Craven, DiPasquo, et al. - 1998
Citation Context: ... Note that our assumptions about the generation of text documents are all violated in practice, and yet empirically, naive Bayes does a good job of classifying text documents [Lewis & Ringuette 1994; Craven et al. 1998; Joachims 1997]. This paradox is explained by the fact that classification estimation is only a function of the sign (in binary cases) of the function estimation [Friedman 1997; Domingos & Pazzani 19...

357 | Selective sampling using the Query by Committee algorithm - Freund, Seung, et al. - 1997
Citation Context: ...fication error and variance over the distribution of examples [Cohn, Ghahramani, & Jordan 1996]. When calculating this in closed form is prohibitively complex, the Query-by-Committee (QBC) algorithm [Freund et al. 1997] can be used to select documents that have high classification variance themselves. QBC measures the variance indirectly, by examining the disagreement among class labels assigned by a set of classif...

313 | Beyond independence: conditions for the optimality of the simple Bayesian classifier - Domingos, Pazzani

289 | Comparison of Two Learning Algorithms for Text Categorization - Lewis, Ringuette - 1994
Citation Context: ... arg max_j P(c_j | d_i; θ̂). Note that our assumptions about the generation of text documents are all violated in practice, and yet empirically, naive Bayes does a good job of classifying text documents [Lewis & Ringuette 1994; Craven et al. 1998; Joachims 1997]. This paradox is explained by the fact that classification estimation is only a function of the sign (in binary cases) of the function estimation [Friedman 1997; D...

204 | On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1:55–77 - Friedman - 1997
Citation Context: ...robability that each class would have generated the test document in question, and selecting the most probable class. Our parametric model is naive Bayes, which is based on commonly used assumptions [Friedman 1997; Joachims 1997]. First we assume that text documents are generated by a mixture model (parameterized by θ), and that there is a one-to-one correspondence between the (observed) class labels and the m...

193 | Supervised learning from incomplete data via an EM approach - Ghahramani, Jordan - 1994
Citation Context: ...antly improved classification accuracy on the test set when the pool of labeled examples is small [Nigam et al. 1998]. This use of EM is a special case of a more general missing-values formulation [Ghahramani & Jordan 1994]. In implementation, EM is an iterative two-step process. The E-step calculates probabilistically weighted class labels, P(c_j | d_i; θ̂), for every unlabeled document using a current estimate of θ and ...
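The iterative two-step process quoted in the context above can be sketched with a multinomial naive Bayes model: the E-step assigns probabilistic labels P(c_j | d_i; θ̂) to the unlabeled documents, and the M-step re-estimates θ from all documents weighted by those labels. A self-contained toy sketch, assuming documents are lists of word ids; the function names and Laplace smoothing choice are ours:

```python
import math

def m_step(docs, weights, vocab_size, n_classes):
    # Re-estimate class priors and per-class word probabilities from
    # probabilistically weighted documents (Laplace smoothing).
    priors = [1.0] * n_classes
    counts = [[1.0] * vocab_size for _ in range(n_classes)]
    for doc, w in zip(docs, weights):
        for j in range(n_classes):
            priors[j] += w[j]
            for word in doc:
                counts[j][word] += w[j]
    z = sum(priors)
    priors = [p / z for p in priors]
    log_wp = []
    for row in counts:
        total = sum(row)
        log_wp.append([math.log(c / total) for c in row])
    return priors, log_wp

def e_step(doc, priors, log_wp):
    # Posterior class distribution P(c_j | d; theta) for one document,
    # under the naive Bayes word-independence assumption.
    log_post = [math.log(priors[j]) + sum(log_wp[j][w] for w in doc)
                for j in range(len(priors))]
    m = max(log_post)
    exps = [math.exp(lp - m) for lp in log_post]
    z = sum(exps)
    return [e / z for e in exps]

def em(labeled, labels, unlabeled, vocab_size, n_classes, iters=10):
    # Labeled documents keep fixed one-hot weights; unlabeled documents
    # get probabilistic labels that are re-estimated every iteration.
    fixed = [[1.0 if j == y else 0.0 for j in range(n_classes)] for y in labels]
    soft = [[1.0 / n_classes] * n_classes for _ in unlabeled]
    for _ in range(iters):
        priors, log_wp = m_step(labeled + unlabeled, fixed + soft,
                                vocab_size, n_classes)
        soft = [e_step(d, priors, log_wp) for d in unlabeled]
    return priors, log_wp
```

On a toy two-word vocabulary, the unlabeled documents pull the word estimates toward the class whose labeled examples they resemble, which is the word co-occurrence effect the paper exploits.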

171 | Learning to classify text from labeled and unlabeled documents - Nigam, McCallum, et al. - 1998
Citation Context: ...ctive learning with Expectation-Maximization (EM) in order to take advantage of the word co-occurrence information contained in the many documents that remain in the unlabeled pool. In previous work [Nigam et al. 1998] we show that combining the evidence of labeled and unlabeled documents via EM can reduce text classification error by one-third. We treat the absent labels as “hidden variables” and use EM to fill t...

132 | Neural networks exploration using optimal experiment design - Cohn - 1994
Citation Context: ..., QBC should pick more informative documents to label. The complete active learning algorithm, both with and without EM, is summarized in Table 1. Unlike settings in which queries must be generated [Cohn 1994], and previous work in which the unlabeled data is available as a stream [Dagan & Engelson 1995; Liere & Tadepalli 1997; Freund et al. 1997], our assumption about the availability of a pool of unlabe...

121 | Committee-based sampling for training probabilistic classifiers - Dagan, Engelson - 1995
Citation Context: ...lassifier k times, resulting in k committee members. Individual committee members are denoted by m. We consider two metrics for measuring committee disagreement. The previously employed vote entropy [Dagan & Engelson 1995] is the entropy of the class label distribution resulting from having each committee member “vote” with probability mass 1/k for its winning class. One disadvantage of vote entropy is that it does no...
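The vote entropy of Dagan and Engelson, as described in the context above, takes only a few lines: each of the k members votes with mass 1/k for its winning class, and the entropy of the resulting label distribution measures disagreement. A minimal sketch with our own naming:

```python
import math
from collections import Counter

def vote_entropy(winning_classes):
    """Entropy of the class-label distribution formed by letting each of
    the k committee members 'vote' with probability mass 1/k for its
    winning class. Note it ignores the members' confidence in their
    votes -- the disadvantage the citation context goes on to mention."""
    k = len(winning_classes)
    counts = Counter(winning_classes)
    return -sum((c / k) * math.log(c / k) for c in counts.values())
```

Unanimous committees score zero; an evenly split binary committee scores log 2, the maximum for two classes.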

109 | The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon - Shahshahani, Landgrebe - 1994

107 | A mixture of experts classifier with learning based on both labelled and unlabelled data - Miller, Uyar - 1997
Citation Context: ...ces text classification error by one-third [Nigam et al. 1998]. Two other studies have used EM to combine labeled and unlabeled data without active learning for classification, but on non-text tasks [Miller & Uyar 1997; Shahshahani & Landgrebe 1994]. Ghahramani and Jordan [1994] use EM with mixture models to fill in missing feature values. 6 Experimental Results: This section provides evidence that using a combinati...

89 | Active learning with committees for text categorization - Liere, Tadepalli - 1997
Citation Context: ...can be a computational convenience.) We consider three ways of selecting documents: stream-based, pool-based, and density-weighted pool-based. Some previous applications of QBC [Dagan & Engelson 1995; Liere & Tadepalli 1997] use a simulated stream of unlabeled documents. When a document is produced by the stream, this approach measures the classification disagreement among the committee members, and decides, based on th...

76 | Similarity-based estimation of word cooccurrence probabilities - Dagan, Lee, et al. - 1994

18 | Estimations of dependences based on statistical data - Vapnik - 1982