## Semi-Supervised Learning of Mixture Models (2003)

### Download Links

- [www.aaai.org]
- [www-connex.lip6.fr]
- [www-poleia.lip6.fr]

### Other Repositories/Bibliography

- CiteULike
- DBLP

Venue: ICML-03, 20th International Conference on Machine Learning

Citations: 41 (5 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Cozman03semi-supervisedlearning,
  author    = {Fabio Gagliardi Cozman and Ira Cohen and Marcelo Cesar Cirelo},
  title     = {Semi-Supervised Learning of Mixture Models},
  booktitle = {ICML-03, 20th International Conference on Machine Learning},
  year      = {2003},
  pages     = {99--106}
}
```

### Abstract

This paper analyzes the performance of semi-supervised learning of mixture models. We show that unlabeled data can lead to an increase in classification error even in situations where additional labeled data would decrease classification error. We present a mathematical analysis of this "degradation" phenomenon and show that it occurs because unlabeled data can adversely affect the estimation bias. We discuss the impact of these theoretical results on practical situations.
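The setting analyzed in the paper pools labeled and unlabeled samples into a single maximum-likelihood problem over a mixture model, typically solved with EM. A minimal sketch of that scheme (the two-component 1-D Gaussian model, the sample sizes, and the fixed unit variances are illustrative assumptions, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: class c gives x ~ N(2c - 1, 1), i.e. means -1/+1.
n_l, n_u = 20, 500
y_l = rng.integers(0, 2, n_l)
x_l = rng.normal(2.0 * y_l - 1.0, 1.0)
x_u = rng.normal(2.0 * rng.integers(0, 2, n_u) - 1.0, 1.0)

# Parameters: mixing weight pi = P(C = 1) and the two component means
# (shared unit variance, kept fixed for brevity).
pi, mu = 0.5, np.array([-0.5, 0.5])

def resp(x, pi, mu):
    """Posterior P(C = 1 | x) under the current parameters."""
    p1 = pi * np.exp(-0.5 * (x - mu[1]) ** 2)
    p0 = (1.0 - pi) * np.exp(-0.5 * (x - mu[0]) ** 2)
    return p1 / (p0 + p1)

for _ in range(100):
    # E-step: labeled samples keep their observed labels; unlabeled
    # samples get soft responsibilities from the current model.
    r = np.concatenate([y_l.astype(float), resp(x_u, pi, mu)])
    x = np.concatenate([x_l, x_u])
    # M-step: weighted maximum-likelihood updates over the pooled sample.
    pi = r.mean()
    mu = np.array([np.sum((1.0 - r) * x) / np.sum(1.0 - r),
                   np.sum(r * x) / np.sum(r)])

# Classify by thresholding the posterior at 1/2.
pred = (resp(x_l, pi, mu) > 0.5).astype(int)
print(mu, float(np.mean(pred == y_l)))
```

When the model family matches the data-generating distribution, the unlabeled points sharpen the estimates; the paper's point is that under model misspecification this same pooling can increase asymptotic classification error.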

### Citations

1004 | A Probabilistic Theory of Pattern Recognition
- Devroye, Györfi, et al.
- 1996

Citation Context: ...If we knew exactly the joint distribution p(C, X), the optimal rule would be to choose class c' when the probability of C = c' given x is larger than 1/2, and to choose class c'' otherwise (Devroye et al., 1996). This classification rule attains the minimum possible classification error, called the Bayes error. We take that the probabilities of (C, X), or functions of these probabilities, are estimated fr...
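The rule quoted from Devroye et al. thresholds the posterior at 1/2 and attains the Bayes error. A small numerical check for two equally likely unit-variance Gaussian classes (an assumed example, not the paper's setting):

```python
import math

# Two equally likely classes with unit-variance Gaussian class-conditionals
# centered at mu0 and mu1 (illustrative assumption).
mu0, mu1 = -1.0, 1.0

def posterior_c1(x):
    """P(C = 1 | x) by Bayes' rule with equal priors."""
    p1 = math.exp(-0.5 * (x - mu1) ** 2)
    p0 = math.exp(-0.5 * (x - mu0) ** 2)
    return p1 / (p0 + p1)

def bayes_rule(x):
    """Choose class 1 exactly when P(C = 1 | x) > 1/2."""
    return 1 if posterior_c1(x) > 0.5 else 0

# The decision boundary is the midpoint of the means, and the Bayes error
# is Phi(-|mu1 - mu0| / 2), here Phi(-1) ~ 0.1587.
bayes_error = 0.5 * math.erfc(abs(mu1 - mu0) / (2 * math.sqrt(2)))
print(bayes_rule(0.2), round(bayes_error, 4))
```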

808 | Text Classification from Labeled and Unlabeled Documents Using EM
- Nigam, McCallum, et al.

Citation Context: ...sh (1995) use unlabeled samples to estimate decision regions (by estimating p(X)), and labeled samples are used solely to determine the labels of each region (Ratsaby and Venkatesh refer to this proce... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

598 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997

Citation Context: ...tic and theoretical tone, we would like to mention some positive practical experience with semi-supervised learning. We have observed that semi-supervised learning of Naive Bayes and TAN classifiers (Friedman et al., 1997), using the EM algorithm to handle unlabeled samples, can be quite successful in classification problems with very large numbers of features and not so large labeled datasets. Text classification and...

472 | Mixture Model Inference and Application to Clustering - McLachlan, Basford - 1988

464 | The Behavior of Maximum Likelihood Estimates under Non-Standard Conditions, pp. 221-233 - Huber - 1967

437 | Unsupervised models for named entity classification
- Collins, Singer
- 1999

Citation Context: ...ortant positive theoretical results concerning unlabeled data. Castelli and Cover (1996) and Ratsaby and Venkatesh (1995) use unlabeled samples to estimate decision regions (by estimating p(X)), and label... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

422 | Maximum likelihood estimation of misspecified models
- White
- 1982

Citation Context: ...(matrices are formed by running through the indices i and j): A_Y(θ) = E[∂² log p(Y | θ) / ∂θ_i ∂θ_j], B_Y(θ) = E[(∂ log p(Y | θ) / ∂θ_i)(∂ log p(Y | θ) / ∂θ_j)]. We use the following known result (Berk, 1966; Huber, 1967; White, 1982). Consider a parametric model F(Y | θ) with the properties discussed in previous sections, and a sequence of maximum likelihood estimates θ̂_N, obtained by maximization of Σ_{i=1}^{N} log p(y_i | θ), w...
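In this notation, the limiting covariance of the maximum-likelihood estimator under misspecification is the "sandwich" A_Y(θ*)⁻¹ B_Y(θ*) A_Y(θ*)⁻¹ from Berk, Huber, and White. A one-parameter numerical sketch (the normal model with known variance and the exponential data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Misspecified setting: the model is N(mu, 1) with unknown mu, but the
# data actually come from an exponential distribution (variance != 1).
y = rng.exponential(scale=2.0, size=200_000)

# MLE of mu under the (wrong) normal model is the sample mean.
mu_hat = y.mean()

# For the N(mu, 1) model: score = y - mu, hessian = -1, so
# A = E[d^2 log p / d mu^2] = -1 and B = E[(y - mu)^2].
A = -1.0
B = np.mean((y - mu_hat) ** 2)

# Sandwich (robust) asymptotic variance: A^{-1} B A^{-1}.
sandwich_var = B / A**2
print(sandwich_var)
```

Here the robust variance recovers Var(Y) ≈ 4, while the naive inverse-Hessian value would report 1; the gap is the signature of misspecification.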

386 | Discriminant Analysis and Statistical Pattern Recognition
- McLachlan
- 1992

Citation Context: ...no guarantees concerning the supposedly superior effect of labeled data. ...modeling errors. First, we have avoided the possibility that labeled and unlabeled data are sampled from different distributions (McLachlan, 1992, pages 42-43); second, we have avoided the possibility that more classes are represented in the unlabeled data than in the labeled data, perhaps due to the scarcity of labeled samples (Nigam et al., ...

259 | Employing EM and pool-based active learning for text classification
- McCallum, Nigam
- 1998

Citation Context: ...elli and Cover (1996) and Ratsaby and Venkatesh (1995) use unlabeled samples to estimate decision regions (by estimating p(X)), and labeled samples are used solely to determine the labels of each region (... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

122 | Maximum entropy discrimination
- Jaakkola, Meila, et al.
- 1999

Citation Context: ...of modeling errors (Kharin, 1996). Finally, it would be important to investigate performance degradation in other frameworks, such as support vector machines, co-training, or entropy based solutions (Jaakkola et al., 1999). We conjecture that any approach that incorporates unlabeled data, so as to improve performance when the model is correct, may suffer from performance degradation when the model is incorrect (this f...

121 | Enhancing supervised learning with unlabeled data
- Goldman, Zhou

Citation Context: ...g unlabeled data. Castelli and Cover (1996) and Ratsaby and Venkatesh (1995) use unlabeled samples to estimate decision regions (by estimating p(X)), and labeled samples are used solely to determine the... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

113 | Learning with mixtures of trees - Meilǎ, Jordan - 2001

103 | A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data
- Miller, Uyar
- 1997

Citation Context: ...d Ratsaby and Venkatesh (1995) use unlabeled samples to estimate decision regions (by estimating p(X)), and labeled samples are used solely to determine the labels of each region (Ratsaby and Venkatesh... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

100 | The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter
- Castelli, Cover
- 1996

Citation Context: ...ily of distributions p(C | X, θ) ∈ F. In view of Theorem 1, it is perhaps not surprisin... Footnote 3: We have to handle a difficulty with e(θ_u): given only unlabeled data, there is no information to decide the labels for decision regions, and the classification error is 1/2 (Castelli, 1994). To simplify the discussion, we assume that, when λ = 0, an "oracle" will be available to indicate the labels of the decision regions.

100 | The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon
- Shahshahani, Landgrebe
- 1994

Citation Context: ...led samples to estimate decision regions (by estimating p(X)), and labeled samples are used solely to determine the labels of each region (Ratsaby and Venkatesh refer to this procedure as "Algorithm M"). Castell... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

90 | A probability analysis on the value of unlabeled data for classification problems
- Zhang, Oles
- 2000

Citation Context: ...parts from the generative scheme is to focus only on p(C | X, θ) and to take the marginal p(X) to be independent of θ. Such a strategy produces a diagnostic model (for example, logistic regression (Zhang & Oles, 2000)). In this narrow sense of diagnostic models, maximum likelihood cannot process unlabeled data for any given dataset (see Zhang and Oles (2000) for a discussion). In this paper we adopt maximum likel...

76 | On the Exponential Value of Labeled Samples - Castelli, Cover - 1995

51 | Using unlabeled data to improve text classification - Nigam - 2001

46 | Probabilistic modeling for face orientation discrimination: Learning from labeled and unlabeled data
- Baluja
- 1998

Citation Context: ...e. There have also been important positive theoretical results concerning unlabeled data. Castelli and Cover (1996) and Ratsaby and Venkatesh (1995) use unlabeled samples to estimate decision regi... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

42 | Limiting behavior of posterior distributions when the model is incorrect
- Berk
- 1966

Citation Context: ...ces are formed by running through the indices i and j): A_Y(θ) = E[∂² log p(Y | θ) / ∂θ_i ∂θ_j], B_Y(θ) = E[(∂ log p(Y | θ) / ∂θ_i)(∂ log p(Y | θ) / ∂θ_j)]. We use the following known result (Berk, 1966; Huber, 1967; White, 1982). Consider a parametric model F(Y | θ) with the properties discussed in previous sections, and a sequence of maximum likelihood estimates θ̂_N, obtained by maximizatio...

38 | Normal discrimination with unclassified observations - O'Neill - 1978

36 | Unlabeled data can degrade classification performance of generative classifiers
- Cozman, Cohen

Citation Context: ...). However, a more detailed analysis of current empirical results does reveal some puzzling aspects of unlabeled data. Footnote 2: We have reviewed descriptions of performance degradation in the literature in (Cozman & Cohen, 2002); here we just mention the relevant references. Four results are particularly interesting: Shahshahani and Landgrebe (1994b) and Baluja (1998) describe degradation in image understanding, while Nigam...

36 | Learning from a Mixture of Labeled and Unlabeled Examples with Parametric Side - Ratsaby, Venkatesh - 1995

32 | Learning with Labeled and Unlabeled Data, technical report, Univ
- Seeger

Citation Context: ...tical situations. 1. Introduction. Semi-supervised learning has received considerable attention in the machine learning literature due to its potential in reducing the need for expensive labeled data (Seeger, 2001). Applications such as text classification, genetic research and machine vision are examples where cheap unlabeled data can be added to a pool of labeled samples. The literature seems to hold a rathe...

29 | Combining labeled and unlabeled data for text classification with a large number of categories - Ghani - 2001

26 | Learning Bayesian network classifiers for facial expression recognition using both labelled and unlabeled data
- Cohen, Sebe, et al.
- 2003

Citation Context: ...SSS) that essentially performs Metropolis-Hastings runs in the space of Bayesian networks; we have observed that this method, while demanding huge computational effort, can improve on TAN classifiers (Cohen et al., 2003). To illustrate these statements, take the Shuttle dataset from the UCI repository. With 43500 labeled samples, a Naive Bayes classifier has classification error of 0.07% (on independent test set wi...
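The Shuttle experiment mentioned above uses a Naive Bayes classifier. As a hedged stand-in (synthetic Gaussian features, not the UCI Shuttle data or the paper's code), a Gaussian Naive Bayes fits one prior and per-feature Gaussian per class, assuming feature independence given the class:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic labeled dataset: two classes, three features (illustrative).
n, d = 1000, 3
y = rng.integers(0, 2, n)
X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, d))

# Per-class priors and per-feature Gaussian parameters.
classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
stds = np.array([X[y == c].std(axis=0) for c in classes])

def predict(Xnew):
    # Log joint: log p(c) + sum_f log N(x_f; mean_cf, std_cf) per class.
    ll = np.stack([
        np.log(priors[i])
        - 0.5 * np.sum(((Xnew - means[i]) / stds[i]) ** 2
                       + 2.0 * np.log(stds[i]), axis=1)
        for i in range(len(classes))
    ])
    return classes[np.argmax(ll, axis=0)]

print("training accuracy:", float(np.mean(predict(X) == y)))
```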

22 | Continuation Methods for Mixing Heterogeneous Sources
- Corduneanu, Jaakkola
- 2002

Citation Context: ...nection, as we have argued, comes from an understanding of asymptotic bias. We have on purpose not dealt with two types of modeling errors. First, we have avoided the possibility that labeled and unlabeled data a... Footnote 5: Some authors have argued that labeled data should be given more weight (Corduneanu & Jaakkola, 2002), but this example shows that there are no guarantees concerning the supposedly superior effect of labeled data.

19 | The efficiency of a linear discriminant function based on unclassified initial samples - Ganesalingam, McLachlan - 1978

15 | Positive and unlabeled examples help learning
- Comité, Denis, et al.
- 1999

Citation Context: ...cal results concerning unlabeled data. Castelli and Cover (1996) and Ratsaby and Venkatesh (1995) use unlabeled samples to estimate decision regions (by estimating p(X)), and labeled samples are used s... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

14 | Semi-supervised learning using prior probabilities and EM
- Bruce
- 2001

Citation Context: ...also been important positive theoretical results concerning unlabeled data. Castelli and Cover (1996) and Ratsaby and Venkatesh (1995) use unlabeled samples to estimate decision regions (by estim... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

10 | Classification of multi-spectral data by joint supervised-unsupervised learning (Technical
- Shahshahani, Landgrebe
- 1994

Citation Context: ...led samples to estimate decision regions (by estimating p(X)), and labeled samples are used solely to determine the labels of each region (Ratsaby and Venkatesh refer to this procedure as "Algorithm M"). Castell... Footnote 1, relevant references: (Baluja, 1998; Bruce, 2001; Collins & Singer, 2000; Comité et al., 1999; Goldman & Zhou, 2000; McCallum & Nigam, 1998; Miller & Uyar, 1996; Nigam et al., 2000; Shahshahani & Landgrebe, 1994b).

9 | On the asymptotic improvement in the outcome of supervised learning provided by additional nonsupervised learning
- Cooper, Freeman
- 1970

Citation Context: ...e variance of estimates, and the smaller the classification error. Several reports in the literature seem to corroborate this informal reasoning. Investigations in the seventies are quite optimistic (Cooper & Freeman, 1970; Jr., 1973; O'Neill, 1978). More recently, there has been plenty of applied work with semi-supervised learning, with some notable successes. There have also been workshops on semi-supervised learni...

8 | Robustness in Statistical Pattern Recognition
- Kharin
- 1996

Citation Context: ...led data. Also, the analysis of bias should be much enlarged, with the addition of finite sample results. Another possible avenue is to look for optimal estimators in the presence of modeling errors (Kharin, 1996). Finally, it would be important to investigate performance degradation in other frameworks, such as support vector machines, co-training, or entropy based solutions (Jaakkola et al., 1999). We conje...

5 | A comparison of iterative maximum likelihood estimates of the parameters of a mixture of two normal distributions under three different types of sample - Jr - 1973

4 | The Effect of Modeling Errors in Semi-Supervised Learning of Mixture Models: How Unlabeled Data Can Degrade Performance of Generative Classifiers (http://www.poli.usp.br/p/fabio.cozman/Publications/Report/lul.ps.gz)
- Cozman, Cohen
- 2003

Citation Context: ...n measurable spaces, all functions are twice differentiable and all functions and their derivatives are measurable and dominated by integrable functions. A formal list of assumptions can be found in (Cozman & Cohen, 2003). In semi-supervised learning, classifiers are built from a combination of Nl labeled and Nu unlabeled samples. We assume that the samples are independent and ordered so that the first Nl samples are...
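Under the pooled-sample setup described above (labeled pairs contribute the joint density, unlabeled points contribute the marginal), the objective maximized in semi-supervised learning takes the standard two-term form; this is a reconstruction from the surrounding definitions, not a verbatim equation from the paper:

```latex
L(\theta) \;=\; \sum_{i=1}^{N_l} \log p(x_i, c_i \mid \theta)
\;+\; \sum_{j=N_l+1}^{N_l+N_u} \log p(x_j \mid \theta),
\qquad
p(x \mid \theta) \;=\; \sum_{c} p(x, c \mid \theta).
```

The second sum marginalizes over the missing labels, which is exactly the term EM handles and the term through which unlabeled data can shift the asymptotic bias when the model is misspecified.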