Results 1 - 10 of 16
Ensemble Methods in Machine Learning
- Multiple Classifier Systems, LNCS 1857, 2000
Cited by 339 (2 self)
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.
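The (weighted) vote described in this abstract can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name `weighted_vote` and the example labels and weights are assumptions chosen for the sketch.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Return the label with the largest total weight.

    predictions: one predicted label per classifier in the ensemble.
    weights: one non-negative weight per classifier (hypothetical values);
             if omitted, every classifier gets weight 1 (plain majority vote).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")

    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Ties are broken in favour of the label that was seen first.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Unweighted: two of three classifiers say "spam".
    print(weighted_vote(["spam", "ham", "spam"]))
    # Weighted: one heavily weighted classifier outvotes two weaker ones.
    print(weighted_vote(["spam", "ham", "ham"], weights=[0.7, 0.2, 0.3]))
```

With unit weights this reduces to a plain majority vote, as used in Bagging; boosting methods such as AdaBoost would supply the weights from each classifier's training error.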
Object Detection in Images by Components
- 1999
Cited by 186 (10 self)
In this paper we present a component based person detection system that is capable of detecting frontal, rear and near side views of people, and partially occluded persons in cluttered scenes. The framework that is described here for people is easily applied to other objects as well. The motivation for developing a component based approach is twofold: first, to enhance the performance of person detection systems on frontal and rear views of people and second, to develop a framework that directly addresses the problem of detecting people who are partially occluded or whose body parts blend in with the background. The data classification is handled by several support vector machine classifiers arranged in two layers. This architecture is known as Adaptive Combination of Classifiers (ACC). The system performs very well and is capable of detecting people even when all components of a person are not found. The performance of the system is significantly better than a full body
Tree Induction for Probability-based Ranking
- 2002
Cited by 97 (4 self)
Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class membership. Probability estimation trees (PETs) have the same attractive features as classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on large data sets). Unfortunately, decision trees have been found to provide poor probability estimates. Several techniques have been proposed to build more accurate PETs, but, to our knowledge, there has not been a systematic experimental analysis of which techniques actually improve the probability-based rankings, and by how much. In this paper we first discuss why the decision-tree representation is not intrinsically inadequate for probability estimation. Inaccurate probabilities are partially the result of decision-tree induction algorithms that focus on maximizing classification accuracy and minimizing tree size (for example via reduced-error pruning). Larger trees can be better for probability estimation, even if the extra size is superfluous for accuracy maximization. We then present the results of a comprehensive set of experiments, testing some straightforward methods for improving probability-based rankings. We show that using a simple, common smoothing method--the Laplace correction--uniformly improves probability-based rankings. In addition, bagging substantially improves the rankings, and is even more effective for this purpose than for improving accuracy. We conclude that PETs, with these simple modifications, should be considered when rankings based on class-membership probability are required.
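To illustrate the Laplace correction named in this abstract, the sketch below compares the raw leaf frequency with the smoothed estimate. It assumes the common form (k + 1) / (n + C), where k is the number of training cases of the class at a leaf, n is the total number of cases at the leaf, and C is the number of classes; the function name and the example counts are hypothetical and not taken from the paper.

```python
def leaf_probability(k, n, num_classes=2, laplace=True):
    """Estimate P(class | leaf) from training counts at a tree leaf.

    k: number of training cases of the class of interest at the leaf.
    n: total number of training cases at the leaf.
    num_classes: number of classes C (assumed 2 by default).
    laplace: if True, use the smoothed estimate (k + 1) / (n + C);
             otherwise use the raw frequency k / n.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    if laplace:
        return (k + 1) / (n + num_classes)
    if n == 0:
        raise ValueError("raw frequency is undefined for an empty leaf")
    return k / n


if __name__ == "__main__":
    # Two pure leaves: the raw frequency is 1.0 for both, so they tie in a
    # ranking. The smoothed estimate ranks the better-supported leaf higher.
    for k, n in [(3, 3), (30, 30)]:
        print(k, n, leaf_probability(k, n, laplace=False), leaf_probability(k, n))
```

The smoothing pulls estimates from small leaves toward the uniform prior 1/C, which breaks ties between pure leaves with different amounts of supporting data.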
Learning and Making Decisions When Costs and Probabilities are Both Unknown
- In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 2001
Cited by 73 (8 self)
In many machine learning domains, misclassification costs are different for different examples, in the same way that class membership probabilities are example-dependent. In these domains, both costs and probabilities are unknown for test examples, so both cost estimators and probability estimators must be learned. This paper first discusses how to make optimal decisions given cost and probability estimates, and then presents decision tree learning methods for obtaining well-calibrated probability estimates. The paper then explains how to obtain unbiased estimators for example-dependent costs, taking into account the difficulty that in general, probabilities and costs are not independent random variables, and the training examples for which costs are known are not representative of all examples. The latter problem is called sample selection bias in econometrics. Our solution to it is based on Nobel prize-winning work due to the economist James Heckman. We show that the methods we propose are s...
Linear programming boosting via column generation
- Machine Learning, 2002
Cited by 69 (0 self)
Recent papers [20] have shown that boosting, arcing, and related ensemble methods (hereafter summarized as boosting) can be viewed as margin maximization in function space. By changing the cost function, different ...
Tree induction vs. logistic regression: A learning-curve analysis
- CEDER Working Paper #IS-01-02, Stern School of Business, 2001
Cited by 50 (16 self)
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.
Boosting Applied to Word Sense Disambiguation
- In Proceedings of the 12th European Conference on Machine Learning, 2000
Cited by 47 (8 self)
In this paper Schapire and Singer's AdaBoost.MH boosting algorithm is applied to the Word Sense Disambiguation (WSD) problem. Initial experiments on a set of 15 selected polysemous words show that the boosting approach surpasses Naive Bayes and Exemplar-based approaches, which represent state-of-the-art accuracy on supervised WSD. In order to make boosting practical for a real learning domain of thousands of words, several ways of accelerating the algorithm by reducing the feature space are studied. The best variant, which we call LazyBoosting, is tested on the largest sense-tagged corpus available containing 192,800 examples of the 191 most frequent and ambiguous English words. Again, boosting compares favourably to the other benchmark algorithms.
Lightweight Rule Induction
- 2000
Cited by 11 (1 self)
A lightweight rule induction method is described that generates compact Disjunctive Normal Form (DNF) rules. Each class has an equal number of unweighted rules. A new example is classified by applying all rules and assigning the example to the class with the most satisfied rules. The induction method attempts to minimize the training error with no pruning. An overall design is specified by setting limits on the size and number of rules. During training, cases are adaptively weighted using a simple cumulative error method. The induction method is nearly linear in time relative to an increase in the number of induced rules or the number of cases. Experimental results on large benchmark data sets demonstrate that predictive performance can rival the best reported results in the literature.
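The classification step described in this abstract (apply every rule, pick the class with the most satisfied rules) might look roughly like the sketch below. It covers only the voting step, not the induction procedure. The rule representation (a DNF rule as a list of conjunctions, each a list of predicate functions), the tie-breaking behaviour, and the toy rules are all assumptions made for illustration.

```python
def satisfies(rule, example):
    """A DNF rule holds if any of its conjunctions has all conditions true."""
    return any(all(cond(example) for cond in conjunction) for conjunction in rule)


def classify(rules_by_class, example):
    """Assign the class whose rules are satisfied most often.

    rules_by_class: dict mapping a class label to its list of DNF rules;
                    each class is assumed to have the same number of rules.
    Ties go to the class listed first (an assumption of this sketch).
    """
    counts = {
        label: sum(satisfies(rule, example) for rule in rules)
        for label, rules in rules_by_class.items()
    }
    return max(counts, key=counts.get)


if __name__ == "__main__":
    # Hypothetical toy rules over examples with numeric features "a" and "b".
    rules = {
        "pos": [
            [[lambda x: x["a"] > 1]],
            [[lambda x: x["b"] > 1], [lambda x: x["a"] > 5]],
        ],
        "neg": [
            [[lambda x: x["a"] <= 1]],
            [[lambda x: x["b"] <= 1]],
        ],
    }
    print(classify(rules, {"a": 2, "b": 3}))
    print(classify(rules, {"a": 0, "b": 0}))
```

Because the rules are unweighted and every class has the same number of them, the vote counts are directly comparable across classes without normalization.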
Distributed Learning on Very Large Data Sets
- In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000
Cited by 10 (4 self)
One approach to learning from intractably large data sets is to utilize all the training data by learning models on tractably sized subsets of the data. The subsets of data may be disjoint or partially overlapping. The individual learned models may be combined into a single model or a voting approach may be used to combine the classifications of a set of models. An approach to learning models in parallel from arbitrarily large training data sets and combining them into a classifier is described. The training sets are disjoint in the work described here. A parallel implementation on the DOE's ASCI Red parallel supercomputer is described. Results with data sets small enough to be handled by a single processor show that data sets can be divided into a moderate number of distinct subsets without degrading classifier accuracy. Speedup results are shown for a parallel implementation on the ASCI Red with data sets too large to be handled on a single processor. Training sets of size 3 to 50 millio...

