## Active Sampling for Class Probability Estimation and Ranking (2004)


Venue: Machine Learning

Citations: 64 (9 self)

### BibTeX

@ARTICLE{Saar-tsechansky04activesampling,
  author  = {Maytal Saar-Tsechansky and Foster Provost},
  title   = {Active Sampling for Class Probability Estimation and Ranking},
  journal = {Machine Learning},
  volume  = {54},
  year    = {2004},
  pages   = {153--178}
}


### Abstract

In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class probability estimates; however, it often is very costly to obtain training data with class labels. Active learning acquires data incrementally, at each phase identifying especially useful additional data for labeling, and can be used to economize on examples needed for learning. We outline the critical features of an active learner and present a sampling-based active learning method for estimating class probabilities and class-based rankings. BOOTSTRAP-LV identifies particularly informative new data for learning based on the variance in probability estimates, and uses weighted sampling to account for a potential example's informative value for the rest of the input space. We show empirically that the method reduces the number of data items that must be obtained and labeled, across a wide variety of domains. We investigate the contribution of the components of the algorithm and show that each provides valuable information to help identify informative examples. We also compare BOOTSTRAP-LV with UNCERTAINTY SAMPLING, an existing active learning method designed to maximize classification accuracy. The results show that BOOTSTRAP-LV uses fewer examples to exhibit a given estimation accuracy, and they provide insights into the behavior of the algorithms. Finally, we experiment with another new active sampling algorithm, drawing from both UNCERTAINTY SAMPLING and BOOTSTRAP-LV, and show that it is far more competitive with BOOTSTRAP-LV than UNCERTAINTY SAMPLING is. The analysis suggests more general implications for improving existing active sampling ...
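The core of BOOTSTRAP-LV as the abstract describes it — score each unlabeled example by the variance of its class probability estimates across bootstrap models, then weight-sample the next batch — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `fit` interface and both function names are hypothetical, and the paper additionally normalizes each variance by the average minority-class probability estimate, which this sketch omits.

```python
import random
import statistics

def bootstrap_lv_weights(labeled, unlabeled, fit, n_boot=10, rng=None):
    """Weight unlabeled examples by the variance of bootstrapped CPEs.

    labeled   : list of (x, y) pairs, y in {0, 1}
    unlabeled : list of x values
    fit       : hypothetical interface -- takes a list of (x, y) pairs and
                returns a function x -> estimated P(y = 1 | x)
    """
    rng = rng or random.Random(0)
    # Train one estimator per bootstrap resample of the labeled set.
    estimators = []
    for _ in range(n_boot):
        resample = [rng.choice(labeled) for _ in labeled]
        estimators.append(fit(resample))
    # Local variance: disagreement of the bootstrap CPEs on each candidate.
    variances = [statistics.pvariance([est(x) for est in estimators])
                 for x in unlabeled]
    total = sum(variances)
    if total == 0:  # no disagreement anywhere: fall back to uniform weights
        return [1.0 / len(unlabeled)] * len(unlabeled)
    return [v / total for v in variances]

def weighted_batch(unlabeled, weights, k, rng=None):
    """Sample k examples to label next, without replacement, by weight."""
    rng = rng or random.Random(1)
    pool, w = list(unlabeled), list(weights)
    chosen = []
    for _ in range(min(k, len(pool))):
        if sum(w) == 0:
            i = rng.randrange(len(pool))
        else:
            i = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(i))
        w.pop(i)
    return chosen
```

With these weights, the examples whose class probability estimates the bootstrap models disagree about most are the most likely to be labeled next.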

### Citations

5438 | C4.5: Programs for Machine Learning - Quinlan - 1993

Citation Context ... produce class probability estimates.³ In particular, for the experiments presented here, the underlying probability estimator is a Probability Estimation Tree (PET), an unpruned C4.5 decision tree [Quinlan, 1993] for which the Laplace correction [Cestnik, 1990] is applied at the leaves. Not pruning and using the Laplace correction had been shown to improve the CPEs produced by PETs [Bauer and Kohavi, 1999; P...
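The Laplace correction mentioned in this context replaces a leaf's raw frequency estimate k/n with a smoothed one: for C classes, (k + 1)/(n + C). A one-line sketch with a hypothetical helper name:

```python
def laplace_cpe(k, n, num_classes=2):
    """Laplace-corrected class probability estimate at a tree leaf:
    k positives out of n examples -> (k + 1) / (n + num_classes).
    Keeps estimates away from 0 and 1 at small or pure leaves."""
    return (k + 1) / (n + num_classes)
```

An empty two-class leaf gets 0.5 rather than an undefined 0/0, and a pure 5-of-5 leaf gets 6/7 rather than an overconfident 1.0.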

3389 | An Introduction to the Bootstrap - Efron, Tibshirani - 1993

Citation Context ... examples. Given enough data, random sampling eventually catches up. We introduce a new sampling-based active learning technique, BOOTSTRAP-LV, for learning CPEs. BOOTSTRAP-LV uses bootstrap samples [Efron and Tibshirani, 1993] of available labeled data to examine the variance in the probability estimates for not-yet-labeled data, and employs a weight-sampling procedure to select particularly informative examples for labeli...

3085 | UCI repository of machine learning databases - Blake, Merz - 1998

Citation Context ...stigation that provide further insight into the elements of the BOOTSTRAP-LV algorithm. 4.1. Experimental setting We applied BOOTSTRAP-LV to 20 data sets, 17 from the UCI machine learning repository (Blake & Merz, 1998) and 3 used previously to evaluate rule-learning algorithms (Cohen & Singer, 1999). Data sets with more than two classes were mapped into two-class problems. For these data sets the minority class wa...

2765 | Bagging predictors - Breiman - 1996

1774 | Experiments with a new boosting algorithm - Freund, Schapire - 1996

1754 | A theory of the learnable - Valiant - 1984

678 | Queries and concept learning - Angluin - 1988

Citation Context ... instances that miss being class members for only a few reasons. Subsequently, theoretical results showed that the number of training data can be reduced substantially if they are selected carefully [Angluin, 1988]. The term active learning was coined later to describe induction where the algorithm controls the selection of potential unlabeled training examples [Cohn et al., 1994]. A generic algorithm for acti...

582 | An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants - Bauer, Kohavi - 1999

Citation Context ...ision tree [Quinlan, 1993] for which the Laplace correction [Cestnik, 1990] is applied at the leaves. Not pruning and using the Laplace correction had been shown to improve the CPEs produced by PETs [Bauer and Kohavi, 1999; Provost et al., 1998; Provost & Domingos, 2000; Perlich et al., 2001]. As models are learned from more data, performance improves typically as a learning curve; BOOTSTRAP-LV aims to obtain comparabl...

561 | Active learning with statistical models - Cohn, Ghahramani, et al. - 1996

Citation Context ...ated variance is divided by the average value of the minority-class probability estimates p_{i,min}. The minority class is determined once from the initial random sample. 3. Related Work Cohn et al. [Cohn et al., 1996] propose an active learning approach for statistical learning models, generating queries (i.e., training examples) from the input space to be used as inputs to the learning algorithm. This approach d...

507 | A sequential algorithm for training text classifiers - Lewis, Gale - 1994

Citation Context ...aluate the expected effect an example may have on the generalization error. Random sampling is often referred to in the active learning literature as “noninformed” learning (e.g., [Cohn et al., 1994, Lewis and Gale, 1994]). Nevertheless, random sampling is powerful because it allows the incorporation of information about the distribution of examples even when this information is not known explicitly. For example, con...

473 | The use of the area under the ROC curve in the evaluation of machine learning algorithms - Bradley - 1997

Citation Context ... Criteria We also evaluated BOOTSTRAP-LV using alternative performance measures: the mean squared error measure used by Bauer and Kohavi [1999], as well as the area under the ROC curve (denoted AUC) [Bradley, 1997], which specifically evaluates ranking accuracy. The results for these measures agree with those obtained with BMAE. For example, BOOTSTRAP-LV generally leads to fatter ROC curves with fewer examples...

436 | Improving generalization with active learning - Cohn, Atlas, et al. - 1994

Citation Context ...ly if they are selected carefully [Angluin, 1988]. The term active learning was coined later to describe induction where the algorithm controls the selection of potential unlabeled training examples [Cohn et al., 1994]. A generic algorithm for active learning is shown in Figure 2. A learner first is applied to an initial set L of labeled examples (usually selected at random or provided by an expert). Subsequently,...

352 | The case against accuracy estimation for comparing induction algorithms - Provost, Fawcett, et al. - 1998

Citation Context ... generation or the obtaining of training examples. Figure 1 shows the desired behavior of an active learner.¹ [¹ Classification accuracy has been criticized previously as a metric for machine learning research (Provost et al., 1998).] [learning-curve figure: training set size on the x-axis, random vs. active sampling] The horizontal axis represents the information nee...

343 | Query by committee - Seung, Opper, et al. - 1992

Citation Context ...mputation of the error or incremental model updating is not possible, various active learning approaches compute alternative effectiveness scores. For example, the QUERY BY COMMITTEE (QBC) algorithm [Seung et al., 1992] was proposed to select training examples actively for training a binary classifier. Examples are sampled at random, generating a “stream” of potential training examples, and each example is consider...

328 | Learning Structural Descriptions from Examples - Winston - 1975

Citation Context ...g and the Bootstrap-LV Algorithm The fundamental notion of active sampling has a long history in machine learning. To our knowledge, the first to discuss it explicitly were [Simon and Lea, 1974] and [Winston, 1975]. Simon and Lea describe how machine learning is different from other types of problem solving, because learning involves the simultaneous search of two spaces: the hypothesis space and the instance ...

270 | Employing EM in pool-based active learning for text classification - McCallum, Nigam - 1998 |

268 | Toward optimal active learning through sampling estimation of error reduction - Roy, McCallum - 2001

Citation Context ... , until some predefined condition is met (e.g., the labeling budget is exhausted). If UL is very large, a subset of randomly sampled examples from UL may be used as a substitute for the complete set [Roy and McCallum, 2001]. In each phase, each candidate example x_i ∈ UL is assigned an effectiveness score ES_i based on an objective function, reflecting its contribution to subsequent learning. Examples then are selected ...
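The generic pool-based loop this context describes — score every candidate in UL, select a batch, label it, retrain, repeat until the budget is exhausted — might look like the following sketch. Function names and interfaces are illustrative, not the paper's.

```python
def active_learn(L, UL, label, fit, score, batch_size, budget):
    """Generic pool-based active learning loop.

    L          : labeled (x, y) pairs; UL : unlabeled pool of x values
    label(x)   : oracle returning the (costly) true label of x
    fit(L)     : trains a model on the labeled data
    score(m,x) : effectiveness score ES_i of candidate x under model m
    """
    model = fit(L)
    while budget > 0 and UL:
        scores = {x: score(model, x) for x in UL}
        # Greedy variant: take the highest-scoring candidates this phase.
        picked = sorted(UL, key=scores.get, reverse=True)
        picked = picked[:min(batch_size, budget, len(UL))]
        for x in picked:
            L.append((x, label(x)))  # pay for a label
            UL.remove(x)
            budget -= 1
        model = fit(L)  # retrain on the augmented labeled set
    return model
```

The sketch assumes candidates in UL are hashable and distinct; BOOTSTRAP-LV differs from this greedy variant in that it samples the batch by weight instead of always taking the top scorers.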

207 | On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery - Friedman - 1997

Citation Context ...our inductive learning setting we typically do not know the class probability, f(x), for an input x, even when we do know the true class of a particular instance described by x. A common formulation (Friedman, 1997) of the estimation error decomposes the expected squared estimation error into the sum of two terms: E_T[(f(x) − f̂(x|T))²] = E_T[(f̂(x|T) − E_T f̂(x|T))²] + (f(x) − E_T f̂(x|T))² ...
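Written out in clean notation, the decomposition quoted from Friedman (1997) is the standard variance-plus-squared-bias split of the expected squared estimation error; the cross term vanishes because the deviation of the estimate from its mean has zero expectation over training sets T:

```latex
E_T\!\left[\big(f(x) - \hat f(x \mid T)\big)^2\right]
  = \underbrace{E_T\!\left[\big(\hat f(x \mid T) - E_T \hat f(x \mid T)\big)^2\right]}_{\text{variance}}
  + \underbrace{\big(f(x) - E_T \hat f(x \mid T)\big)^2}_{\text{squared bias}}
```

BOOTSTRAP-LV targets the first term: the bootstrap estimates' spread is an observable proxy for the variance of the CPE at each candidate x.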

176 | Estimating probabilities: A crucial task in machine learning - Cestnik - 1990

Citation Context ...ular, for the experiments presented here, the underlying probability estimator is a Probability Estimation Tree (PET), an unpruned C4.5 decision tree (Quinlan, 1993) for which the Laplace correction (Cestnik, 1990) is applied at the leaves. Not pruning and using the Laplace correction had been shown to improve the CPEs produced by PETs (Bauer & Kohavi, 1999; Provost, Fawcett, & Kohavi, 1998; Provost & Domingos...

100 | Learning and making decisions when costs and probabilities are both unknown - Zadrozny, Elkan - 2001

Citation Context ...to incorporate costs/benefits for evaluating alternatives. For example, in targeted marketing the estimated probability that a customer will respond to an offer is combined with the estimated profit (Zadrozny & Elkan, 2001) to evaluate various offer propositions. Other applications require ranking cases by the likelihood of class membership, to improve the response rate to offer propositions, or to add flexibility for ...

99 | A Simple, Fast, and Effective Rule Learner - Cohen, Singer - 1999

Citation Context ...gorithm. 4.1 Experimental Setting We applied BOOTSTRAP-LV to 20 data sets, 17 from the UCI machine learning repository (Blake et al., 1998) and 3 used previously to evaluate rule-learning algorithms (Cohen and Singer, 1999). Data sets with more than two classes were mapped into two-class problems. For these experiments we use tree induction to produce class probability estimates.² In particular, for the experiments pre...

98 | Query learning strategies using boosting and bagging - Abe, Mamitsuka - 1998

Citation Context ... each example and thereby obtain a ranking of the examples’ informative values. Subsequently the example(s) with the highest effectiveness score(s) is (are) selected. For instance, Abe and Mamitsuka [Abe and Mamitsuka, 1998] use bagging and boosting to generate a committee of classifiers and quantify disagreement as the margin (i.e., the difference in weight assigned to either class). Examples with the minimum margin ar...
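Abe and Mamitsuka's margin, as described in this context, is the difference in committee weight assigned to the two classes; the examples with the smallest margin are those the committee disagrees on most. A minimal sketch with hypothetical helper names, assuming each committee member outputs an estimate of P(y = 1 | x):

```python
def committee_margin(member_probs):
    """Margin of a two-class committee: |weight on class 1 - weight on class 0|.

    member_probs: each member's estimate of P(y = 1 | x).
    0.0 means a maximally split committee, 1.0 a unanimous one.
    """
    p1 = sum(member_probs) / len(member_probs)
    return abs(p1 - (1.0 - p1))

def most_disputed(unlabeled, committee, k=1):
    """Pick the k candidates with minimum margin to query next."""
    return sorted(unlabeled,
                  key=lambda x: committee_margin([m(x) for m in committee]))[:k]
```

In the paper's setup the committee would come from bagging or boosting the base learner; any list of probability-outputting models works for the sketch.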

71 | Tree Induction vs. Logistic Regression: A Learning-Curve Analysis - Perlich, Provost, et al. - 2003

Citation Context ...90] is applied at the leaves. Not pruning and using the Laplace correction had been shown to improve the CPEs produced by PETs [Bauer and Kohavi, 1999; Provost et al., 1998; Provost & Domingos, 2000; Perlich et al., 2001]. As models are learned from more data, performance improves typically as a learning curve; BOOTSTRAP-LV aims to obtain comparable performance with fewer labeled data (recall figure 1). To evaluate t...

57 | Committee-based sample selection for probabilistic classifiers - Argamon-Engelson, Dagan - 1999

57 | Problem solving and rule induction: A unified view - Simon, Lea - 1974 |

41 | Active learning using adaptive resampling - Iyengar, Apte, et al. - 2000

Citation Context ... examples more likely to be informative regarding other examples in the space. Note that weight sampling is also employed in the AdaBoost algorithm [Freund and Schapire, 1996], on which Iyengar et al. [Iyengar et al., 2000] base their active learning approach. Their algorithm results in an ensemble of classifiers where weight sampling is used both to select examples from which successive classifiers in the ensemble are...

41 | Well-Trained PETs: Improving Probability Estimation Trees. CeDER Working Paper #IS-00-04 - Provost, Domingos - 2000

Citation Context ...ce correction [Cestnik, 1990] is applied at the leaves. Not pruning and using the Laplace correction had been shown to improve the CPEs produced by PETs [Bauer and Kohavi, 1999; Provost et al., 1998; Provost & Domingos, 2000; Perlich et al., 2001]. As models are learned from more data, performance improves typically as a learning curve; BOOTSTRAP-LV aims to obtain comparable performance with fewer labeled data (recall fi...

19 | Experimental goal regression: A method for learning problem-solving heuristics - Porter, Kibler - 1986

Citation Context ...the simultaneous search of two spaces: the hypothesis space and the instance space. The results of searching the hypothesis space can affect how the instance space will be sampled. Porter and Kibler [Porter and Kibler, 1986] address the symbiosis between learning and problem solving, and propose a learning apprentice system that learns problem-solving rules. Their method reduces reliance on the teacher to provide exampl...

6 | Heterogeneous uncertainty sampling - Lewis, Catlett - 1994

Citation Context ...sts, such as when the primary concern is to obtain accurate CPE or ranking with minimal costly labeling. BOOTSTRAP-LV also does not address computational concerns explicitly, as do Lewis and Catlett [Lewis and Catlett, 1994]. However, while UNCERTAINTY SAMPLING is computationally simpler, its performance is significantly inferior to that of BOOTSTRAP-LV and in the initial sampling phases is often inferior to random samp...

6 | Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning - Turney - 2000
