## Active Learning with Statistical Models (1995)

Citations: 561 (10 self)

### BibTeX

@MISC{Cohn95activelearning,
  author = {David A. Cohn and Zoubin Ghahramani and Michael I. Jordan},
  title = {Active Learning with Statistical Models},
  year = {1995}
}

### Abstract

For many types of learners one can compute the statistically "optimal" way to select data. We review how these techniques have been used with feedforward neural networks [MacKay, 1992; Cohn, 1994]. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate.
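
As an illustration of one of the two architectures named in the abstract, the following is a minimal sketch of how a mixture of Gaussians can produce a prediction: fit a joint density over input and output, then condition on the input and take the expected output. It assumes scalar input and output and already-fitted parameters; the function name `mog_predict` and its argument names are hypothetical and do not come from the paper, and the data-selection criterion itself is not shown.

```python
import numpy as np

def mog_predict(x, priors, mu_x, mu_y, var_x, cov_xy):
    """Expected y given scalar x under a fitted joint mixture of Gaussians.

    All parameter arguments are per-component sequences (hypothetical names):
    priors -- mixing proportions, mu_x / mu_y -- component means,
    var_x -- input variances, cov_xy -- input/output covariances.
    """
    priors = np.asarray(priors, dtype=float)
    mu_x = np.asarray(mu_x, dtype=float)
    mu_y = np.asarray(mu_y, dtype=float)
    var_x = np.asarray(var_x, dtype=float)
    cov_xy = np.asarray(cov_xy, dtype=float)

    # Likelihood of x under each component's marginal density over the input.
    lik = priors * np.exp(-0.5 * (x - mu_x) ** 2 / var_x) / np.sqrt(2.0 * np.pi * var_x)
    # Responsibility of each component for this x.
    h = lik / lik.sum()
    # Conditional mean of y given x within each Gaussian component.
    y_i = mu_y + cov_xy / var_x * (x - mu_x)
    # Overall prediction: responsibility-weighted average of component means.
    return float(np.dot(h, y_i))
```

With a single component the sketch reduces to an ordinary linear regression line through the component mean, which is a convenient sanity check.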

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ... N (see Figure 1). In the context of learning from random examples, one begins by producing a joint density estimate over the input/output space $X \times Y$ based on the training set $D$. The EM algorithm (Dempster, Laird, & Rubin, 1977) can be used to efficiently find a locally optimal fit of the Gaussians to the data. It is then straightforward to compute $\hat{y}$ given $x$ by conditioning the joint distribution on $x$ and taking the expected val...

7493 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...s the derivation of variance- (and bias-) minimizing techniques for other statistical learning models. Of particular interest is the class of models known as "belief networks" or "Bayesian networks" (Pearl, 1988; Heckerman, Geiger, & Chickering, 1994). These models have the advantage of allowing inclusion of domain knowledge and prior constraints while still adhering to a statistically sound framework. Curre...

953 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ...ion of variance- (and bias-) minimizing techniques for other statistical learning models. Of particular interest is the class of models known as "belief networks" or "Bayesian networks" (Pearl, 1988; Heckerman, Geiger, & Chickering, 1994). These models have the advantage of allowing inclusion of domain knowledge and prior constraints while still adhering to a statistically sound framework. Current research in belief networks focuses ...

724 | Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985
Citation Context ...at are much more amenable to optimal data selection. 3. Mixtures of Gaussians. The mixture of Gaussians model is a powerful estimation and prediction technique with roots in the statistics literature (Titterington, Smith, & Makov, 1985); it has, over the last few years, been adopted by researchers in machine learning (Cheeseman et al., 1988; Nowlan, 1991; Specht, 1991; Ghahramani & Jordan, 1994). The model assumes that the data are...

678 | Queries and concept learning
- Angluin
- 1987
Citation Context ...anding problem. When actions/queries are selected properly, the data requirements for some problems decrease drastically, and some NP-complete learning problems become polynomial in computation time (Angluin, 1988; Baum & Lang, 1991). In practice, active learning offers its greatest rewards in situations where data are expensive or difficult to obtain, or when the environment is complex or dangerous. In industria...

479 | Empirical Model-Building and Response Surfaces
- Box, Draper
- 1987
Citation Context ...h knowledge; the latter case is generally of greater interest to machine learning practitioners. The favored technique for this kind of optimization is usually a form of response surface methodology (Box & Draper, 1987), which performs experiments that guide hill-climbing through the input space. A related problem exists in the field of adaptive control, where one must learn a control policy by taking actions. In con...

436 | Improving generalization with active learning
- Cohn, Atlas, et al.
- 1994

416 | Theory of optimal experiments
- Fedorov
- 1972
Citation Context ...amples. 2.2 Example: Active Learning with a Neural Network. In this section we review the use of techniques from Optimal Experiment Design (OED) to minimize the estimated variance of a neural network (Fedorov, 1972; MacKay, 1992; Cohn, 1994). We will assume we have been given a learner $\hat{y} = f_{\hat{w}}()$, a training set $D = \{(x_i, y_i)\}_{i=1}^{m}$ and a parameter vector estimate $\hat{w}$ that maximizes some likelihood measure given ...

343 | Information-based objective functions for active data selection
- MacKay
- 1992
Citation Context ...in a statistically "optimal" manner for some classes of machine learning algorithms. We first briefly review how the statistical approach can be applied to neural networks, as described in earlier work (MacKay, 1992; Cohn, 1994). Then, in Sections 3 and 4 we consider two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. Section 5 presents the empirica...

258 | Applied Linear Regression
- Weisberg
- 1985
Citation Context ... model, one would select examples so as to maximize discriminability between Gaussians; for locally weighted regression, one would use a logistic regression instead of the linear one considered here (Weisberg, 1985). Our future work will proceed in several directions. The most important is active bias minimization. As noted in Section 2, the learner's error is composed of both bias and variance. The variance-mi...

195 | Supervised learning from incomplete data via an EM approach
- Ghahramani, Jordan
- 1994
Citation Context ...n the statistics literature (Titterington, Smith, & Makov, 1985); it has, over the last few years, been adopted by researchers in machine learning (Cheeseman et al., 1988; Nowlan, 1991; Specht, 1991; Ghahramani & Jordan, 1994). The model assumes that the data are produced by a mixture of $N$ multivariate Gaussians $g_i$, for $i = 1, \ldots, N$ (see Figure 1). In the context of learning from random examples, one begins by producing a ...

184 | A general regression neural network
- Specht
- 1991
Citation Context ...e with roots in the statistics literature (Titterington, Smith, & Makov, 1985); it has, over the last few years, been adopted by researchers in machine learning (Cheeseman et al., 1988; Nowlan, 1991; Specht, 1991; Ghahramani & Jordan, 1994). The model assumes that the data are produced by a mixture of $N$ multivariate Gaussians $g_i$, for $i = 1, \ldots, N$ (see Figure 1). In the context of learning from random examples...

133 | Neural network exploration using optimal experiment design
- Cohn
- 1996
Citation Context ...ally "optimal" manner for some classes of machine learning algorithms. We first briefly review how the statistical approach can be applied to neural networks, as described in earlier work (MacKay, 1992; Cohn, 1994). Then, in Sections 3 and 4 we consider two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. Section 5 presents the empirical results of...

96 | Robot juggling: An implementation of memory-based learning. Control Systems Magazine 14
- Schaal, Atkeson
- 1994
Citation Context ... only training data that are "local" to that point. One recent study demonstrated that LWR was suitable for real-time control by constructing an LWR-based system that learned a difficult juggling task (Schaal & Atkeson, 1994). Figure 2: In locally weighted regression, points are weighted by proximity to the current $x$ in question using a kernel. A regression is then computed using the weighted ...

84 | Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. Carnegie Mellon University Doctoral thesis
- Nowlan
- 1991
Citation Context ...ction technique with roots in the statistics literature (Titterington, Smith, & Makov, 1985); it has, over the last few years, been adopted by researchers in machine learning (Cheeseman et al., 1988; Nowlan, 1991; Specht, 1991; Ghahramani & Jordan, 1994). The model assumes that the data are produced by a mixture of $N$ multivariate Gaussians $g_i$, for $i = 1, \ldots, N$ (see Figure 1). In the context of learning from r...

69 | Training connectionist networks with queries and selective sampling
- Cohn, Atlas, et al.
- 1990
Citation Context ...choosing places where we don't have data (Whitehead, 1991), where we perform poorly (Linden & Weber, 1993), where we have low confidence (Thrun & Möller, 1992), where we expect it to change our model (Cohn, Atlas, & Ladner, 1990, 1994), and where we previously found data that resulted in learning (Schmidhuber & Storck, 1993). In this paper we will consider how one may select $\tilde{x}$ in a statistically "optimal" manner for some cl...

64 | Active exploration in dynamic environments
- Thrun, Möller, et al.
- 1992
Citation Context ...try next. There are many heuristics for choosing $\tilde{x}$, including choosing places where we don't have data (Whitehead, 1991), where we perform poorly (Linden & Weber, 1993), where we have low confidence (Thrun & Möller, 1992), where we expect it to change our model (Cohn, Atlas, & Ladner, 1990, 1994), and where we previously found data that resulted in learning (Schmidhuber & Storck, 1993). In this paper we will consider...

52 | Selecting concise training sets from clean data
- Franco, Plutowski, et al.
- 1993

51 | Regression by Local Fitting
- Cleveland, Devlin, et al.
- 1988
Citation Context ... $\rangle = \langle \tilde{\sigma}^2_{y,i} \rangle - \langle \tilde{\sigma}^2_{xy,i} \rangle / \tilde{\sigma}^2_{x,i}$. 4 LOCALLY WEIGHTED REGRESSION. We consider here two forms of locally weighted regression (LWR): kernel regression and the LOESS model [Cleveland et al, 1988]. Kernel regression computes $\hat{y}$ as an average of the $y_i$ in the data set, weighted by a kernel centered at $x$. The LOESS model performs a linear regression on points in the data set, weighted by a kern...

49 | Reinforcement driven information acquisition in non-deterministic environments
- Storck, Hochreiter, et al.
- 1995
Citation Context ...r, 1993), where we have low confidence (Thrun & Möller, 1992), where we expect it to change our model (Cohn, Atlas, & Ladner, 1990, 1994), and where we previously found data that resulted in learning (Schmidhuber & Storck, 1993). In this paper we will consider how one may select $\tilde{x}$ in a statistically "optimal" manner for some classes of machine learning algorithms. We first briefly review how the statistical approach can be ap...

46 | Bayesian Classification
- Cheeseman, Kelly, et al.
- 1988
Citation Context ...ful estimation and prediction technique with roots in the statistics literature (Titterington, Smith, & Makov, 1985); it has, over the last few years, been adopted by researchers in machine learning (Cheeseman et al., 1988; Nowlan, 1991; Specht, 1991; Ghahramani & Jordan, 1994). The model assumes that the data are produced by a mixture of $N$ multivariate Gaussians $g_i$, for $i = 1, \ldots, N$ (see Figure 1). In the context o...

40 | Optimal Control Systems
- Feldbaum
- 1965
Citation Context ...on), one is usually concerned with the performing well during the learning task and must trade off exploitation of the current policy for exploration which may improve it. The subfield of dual control (Fe'ldbaum, 1965) is specifically concerned with finding an optimal balance of exploration and control while learning. In this paper, we will restrict ourselves to examining the problem of supervised learning: based on ...

19 | Bayesian query construction for neural network models
- Paas, Kindermann
- 1995

10 | Implementing inner drive by competence reflection
- Linden, Weber
- 1992

6 | Bayesian classification
- Cheeseman, Self, et al.
- 1988
Citation Context ...ful estimation and prediction technique with roots in the statistics literature (Titterington, Smith, & Makov, 1985); it has, over the last few years, been adopted by researchers in machine learning (Cheeseman et al., 1988; Nowlan, 1991; Specht, 1991; Ghahramani & Jordan, 1994). The model assumes that the data are produced by a mixture of $N$ multivariate Gaussians $g_i$, for $i = 1, \ldots, N$ (see Figure 1). In the context of l...

3 | Neural network algorithms that learn in polynomial time from examples and queries
- Baum, Lang
- 1991
Citation Context ... When actions/queries are selected properly, the data requirements for some problems decrease drastically, and some NP-complete learning problems become polynomial in computation time (Angluin, 1988; Baum & Lang, 1991). In practice, active learning offers its greatest rewards in situations where data are expensive or difficult to obtain, or when the environment is complex or dangerous. In industrial settings each tra...

3 | Regression by local fitting
- Cleveland, Devlin, et al.
- 1988

2 | Implementing inner drive by competence reflection
- Linden, Weber
- 1993
Citation Context ... will be concerned with is how to choose which $\tilde{x}$ to try next. There are many heuristics for choosing $\tilde{x}$, including choosing places where we don't have data (Whitehead, 1991), where we perform poorly (Linden & Weber, 1993), where we have low confidence (Thrun & Möller, 1992), where we expect it to change our model (Cohn, Atlas, & Ladner, 1990, 1994), and where we previously found data that resulted in learning (Schmidh...

1 | Minimizing statistical bias with queries. AI Lab memo AIM1552, Massachusetts Institute of Technology. Available by anonymous ftp from publications.ai.mit.edu
- Cohn
- 1995
Citation Context ...ed here ignores the bias component, which can lead to significant errors when the learner's bias is non-negligible. Work in progress examines effective ways of measuring and optimally eliminating bias (Cohn, 1995); future work will examine how to jointly minimize both bias and variance to produce a criterion that truly minimizes the learner's expected error. Another direction for future research is the deriva...

1 | Neural networks and the bias/variance dilemma
- Geman, Bienenstock, et al.
- 1992

1 | Regression by local fitting. Journal of Econometrics
- Cleveland, Devlin, et al.
- 1988
Citation Context ...mity to the current $x$ in question using a kernel. A regression is then computed using the weighted points. We consider here a form of locally weighted regression that is a variant of the LOESS model (Cleveland, Devlin, & Grosse, 1988). The LOESS model performs a linear regression on points in the data set, weighted by a kernel centered at $x$ (see Figure 2). The kernel shape is a design parameter for which there are many possible ch...
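
The LOESS-style prediction described in the citation context above (a linear regression on the data, weighted by a kernel centered at the query point) can be sketched roughly as follows. This is an illustrative sketch only: the Gaussian kernel, the `bandwidth` parameter, and the function name `lwr_predict` are assumptions made here, since the excerpt notes only that the kernel shape is a design parameter.

```python
import numpy as np

def lwr_predict(x_query, xs, ys, bandwidth=1.0):
    """Locally weighted linear regression prediction at a scalar query point.

    The Gaussian kernel and `bandwidth` are illustrative choices, not taken
    from the source text.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)

    # Kernel weights: training points near x_query count more.
    w = np.exp(-((xs - x_query) ** 2) / (2.0 * bandwidth ** 2))

    # Weighted least squares for y = a + b * x, solved by scaling each
    # row of the design matrix and target by the square root of its weight.
    design = np.column_stack([np.ones_like(xs), xs])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], ys * sw, rcond=None)

    # Evaluate the local line at the query point.
    return float(coef[0] + coef[1] * x_query)
```

On data that are exactly linear, the local fit recovers the same line for any bandwidth; differences from a global fit appear only when the underlying function is curved.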