## Combining labeled and unlabeled data with co-training (1998)

Citations: 1318 (28 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Blum98combininglabeled,
  author    = {Avrim Blum and Tom Mitchell},
  title     = {Combining labeled and unlabeled data with co-training},
  booktitle = {Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT)},
  year      = {1998},
  pages     = {92--100},
  publisher = {Morgan Kaufmann Publishers}
}
```

### Abstract

We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a setting in which the description of each example can be partitioned into two distinct views, motivated by the task of learning to classify web pages. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in hyperlinks that point to that page. We assume that either view of the example would be sufficient for learning if we had enough labeled data, but our goal is to use both views together to allow inexpensive unlabeled data to augment a much smaller set of labeled examples. Specifically, the presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately on each view, and then each algorithm's predictions on new unlabeled examples are used to enlarge the training set of the other. Our goal in this paper is to provide a PAC-style analysis for this setting, and, more broadly, a PAC-style framework for the general problem of learning from both labeled and unlabeled data. We also provide empirical results on real web-page data indicating that this use of unlabeled examples can lead to significant improvement of hypotheses in practice. As part of our analysis, we provide new results on learning with lopsided misclassification noise, which we believe may be of independent interest.
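The two-view bootstrapping loop described in the abstract can be illustrated with a toy implementation. Everything below is hypothetical: the 1-D threshold "classifiers", the confidence heuristic, and the synthetic two-view data are far simpler than the naive-Bayes setup used in the paper's experiments, and serve only to show the shape of the loop (each view's learner labels confident unlabeled examples for the shared pool).

```python
# Toy co-training sketch: each example has two views; two simple classifiers
# take turns pseudo-labeling their most confident unlabeled examples.

def train_threshold(xs, ys):
    """Fit a 1-D threshold: midpoint between the two class means.
    Assumes both classes are present in the labeled data."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(t, x):
    return 1 if x >= t else 0

def co_train(labeled, unlabeled, rounds=5, k=1):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2).
    Each round, each view's classifier labels its k most confident examples."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                break
            xs = [ex[view] for ex, _ in labeled]
            ys = [y for _, y in labeled]
            t = train_threshold(xs, ys)
            # distance from the threshold serves as a crude confidence score
            unlabeled.sort(key=lambda ex: -abs(ex[view] - t))
            for ex in unlabeled[:k]:
                labeled.append((ex, predict(t, ex[view])))
            unlabeled = unlabeled[k:]
    return labeled

seed = [((1.0, 1.2), 0), ((9.0, 8.8), 1)]                 # two labeled seeds
pool = [(0.5, 0.8), (9.5, 9.1), (1.5, 1.1), (8.5, 9.4)]   # unlabeled examples
result = co_train(seed, pool, rounds=2)
```

On this cleanly separable toy data every pooled example ends up correctly pseudo-labeled; the interesting cases analyzed in the paper are precisely those where one view is confident when the other is not.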

### Citations

9033 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977
Citation Context: ...her methods that have been used for combining labeled and unlabeled data. One standard approach to learning with missing values (e.g., such as when some of the labels are unknown) is the EM algorithm [3]. The EM algorithm is typically analyzed under the assumption that the data is generated according to some simple known parametric model. For instance, a common assumption is that the positive example...

4172 | Pattern Classification and Scene Analysis - Duda, Hart - 1973
Citation Context: ...t F30602-97-1-0215 and by NSF National Young Investigator grant CCR-9357793. 1 INTRODUCTION In many machine learning settings, unlabeled examples are significantly easier to come by than labeled ones [4, 15]. One example of this is web-page classification. Suppose that we want a program to electronically visit some web site and download all the web pages of interest to us, such as all the CS faculty memb...

517 | Unsupervised Word Sense Disambiguation Rivaling Supervised Methods - Yarowsky - 1995
Citation Context: ...t F30602-97-1-0215 and by NSF National Young Investigator grant CCR-9357793. 1 INTRODUCTION In many machine learning settings, unlabeled examples are significantly easier to come by than labeled ones [4, 15]. One example of this is web-page classification. Suppose that we want a program to electronically visit some web site and download all the web pages of interest to us, such as all the CS faculty memb...

362 | Learning to Extract Symbolic Knowledge from the World Wide Web - Craven, DiPasquo, et al. - 1998
Citation Context: ...that we want a program to electronically visit some web site and download all the web pages of interest to us, such as all the CS faculty member pages, or all the course home pages at some university [1]. To train such a system to automatically classify web pages, one would typically rely on hand labeled web pages. These labeled examples are fairly expensive to obtain because they require human effor...

301 | Efficient noise-tolerant learning from statistical queries - Kearns - 1999
Citation Context: ...abeled data only, given an initial weakly-useful predictor h(x1). Thus, for instance, the conditional independence assumption implies that any concept class learnable in the Statistical Query model [11] is learnable from unlabeled data and an initial weakly-useful predictor. Before proving the theorem, it will be convenient to define a variation on the standard classification noise model where the n...

292 | A comparison of two learning algorithms for text categorization - Lewis, Ringuette - 1994
Citation Context: ...efer to these as the page-based and the hyperlink-based classifiers, respectively. This naive Bayes algorithm has been empirically observed to be successful for a variety of text-categorization tasks [12]. The co-training algorithm we used is described in Table 1. Given a set L of labeled examples and a set U of unlabeled examples, the algorithm first creates a smaller pool U′ containing u unlabeled ...

195 | Supervised learning from incomplete data via an EM approach - Ghahramani, Jordan - 1994
Citation Context: ...xt page, and vice-versa. We call this type of bootstrapping co-training, and it has a close connection to bootstrapping from incomplete data in the Expectation-Maximization setting; see, for instance, [5, 13]. The question this raises is: is there any reason to believe co-training will help? Our goal is to address this question by developing a PAC-style theoretical framework to better understand the issue...

104 | The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter - Castelli, Cover - 1996
Citation Context: ...ne standard setting in which this problem has been analyzed is to assume that the data is generated according to some simple known parametric model. Under assumptions of this form, Castelli and Cover [1, 2] precisely quantify relative values of labeled and unlabeled data for Bayesian optimal learners. The EM algorithm, widely used in practice for learning from data with missing information, can also be ...

103 | On the complexity of teaching - Goldman, Kearns - 1991
Citation Context: ...ion noise. In terms of other PAC-style models, we can think of our setting as somewhat in between the uniform distribution model, in which the distribution is particularly neutral, and teacher models [6, 8] in which examples are being supplied by a helpful oracle. 2.1 A BIPARTITE GRAPH REPRESENTATION One way to look at the co-training problem is to view the distribution D as a weighted bipartite graph, ...

94 | Informedia: News-on-demand multimedia information acquisition and retrieval - Hauptmann, Witbrock - 1997
Citation Context: ...ecture that there are many practical learning problems that fit or approximately fit the co-training model. For example, consider the problem of learning to classify segments of television broadcasts [7, 14]. We might be interested, say, in learning to identify televised segments containing the US President. Here X1 could be the set of possible video images, X2 the set of possible audio signals, and X ...

81 | On the exponential value of labeled samples - Castelli, Cover - 1995
Citation Context: ...ne standard setting in which this problem has been analyzed is to assume that the data is generated according to some simple known parametric model. Under assumptions of this form, Castelli and Cover [1, 2] precisely quantify relative values of labeled and unlabeled data for Bayesian optimal learners. The EM algorithm, widely used in practice for learning from data with missing information, can also be ...

75 | Random sampling in cut, flow, and network design problems - Karger - 1994
Citation Context: ...nt $c_j$ of $S$. If $m \le |S|$, the above formula is approximately $\sum_{c_j \in G_S} \frac{s_j}{|S|}\left(1 - \frac{s_j}{|S|}\right)^m$, in analogy to Equation 1. In fact, we can use recent results in the study of random graph processes [9] to describe quantitatively how 1 To make this more plausible in the context of web pages, think of x1 as not the document itself but rather some small set of attributes of the document. we expect the...
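The coverage approximation quoted in this citation context is easy to evaluate numerically. A minimal sketch (the component sizes below are made up for illustration):

```python
def expected_uncovered_mass(sizes, m):
    """Approximate expected fraction of S's mass left uncovered after m
    uniform draws: sum over components c_j of (s_j/|S|) * (1 - s_j/|S|)**m."""
    total = sum(sizes)
    return sum((s / total) * (1 - s / total) ** m for s in sizes)

# Two equal components of a 100-element S: zero draws leave everything
# uncovered, and coverage improves monotonically with more draws.
uncovered = expected_uncovered_mass([50, 50], 10)  # -> 0.0009765625
```

Each term is the probability mass of a component times the chance that all m draws miss it, so the sum is the expected uncovered mass by linearity of expectation.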

37 | Learning from a mixture of labeled and unlabeled examples with parametric side information - Ratsaby, Venkatesh - 1995
Citation Context: ...xt page, and vice-versa. We call this type of bootstrapping co-training, and it has a close connection to bootstrapping from incomplete data in the Expectation-Maximization setting; see, for instance, [5, 13]. The question this raises is: is there any reason to believe co-training will help? Our goal is to address this question by developing a PAC-style theoretical framework to better understand the issue...

26 | A computational model of teaching - Jackson, Tomkins - 1992
Citation Context: ...ion noise. In terms of other PAC-style models, we can think of our setting as somewhat in between the uniform distribution model, in which the distribution is particularly neutral, and teacher models [6, 8] in which examples are being supplied by a helpful oracle. 2.1 A BIPARTITE GRAPH REPRESENTATION One way to look at the co-training problem is to view the distribution D as a weighted bipartite graph, ...

10 | Improving Acoustic Models by Watching Television - Witbrock, Hauptmann - 1997
Citation Context: ...ecture that there are many practical learning problems that fit or approximately fit the co-training model. For example, consider the problem of learning to classify segments of television broadcasts [7, 14]. We might be interested, say, in learning to identify televised segments containing the US President. Here X1 could be the set of possible video images, X2 the set of possible audio signals, and X ...

9 | PAC learning with constant-partition classification noise and applications to decision tree induction - Decatur - 1997
Citation Context: ...nd β are known can be viewed as a probability distribution over these m + 1 experiments. The (α, β) classification noise model can be thought of as a kind of constant-partition classification noise [2]. However, the results in [2] require that each noise rate be less than 1/2. We will need the stronger statement presented here, namely that it suffices to assume only that the sum of α and β is les...

1 | Random sampling in cut, flow, and network design problems. Journal version draft - Karger - 1997
Citation Context: ...s in component $c_j$ of $S$. If $m \le |S|$, the above formula is approximately $\sum_{c_j \in G_S} \frac{s_j}{|S|}\left(1 - \frac{s_j}{|S|}\right)^m$, in analogy to Equation 1. In fact, we can use recent results in the study of random graph processes [9] to describe quantitatively how 1 To make this more plausible in the context of web pages, think of x1 as not the document itself but rather some small set of attributes of the document. we expect the...
