Results 1 -
7 of
7
Learning ensembles from bites: A scalable and accurate approach
"... Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive datasets, as the size of the dataset can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") ..."
Abstract
-
Cited by 19 (2 self)
- Add to MetaCart
Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive datasets, as the size of the dataset can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive datasets, one that can utilize the power of boosting and bagging. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable.
A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech
- Computer Speech and Language
, 2006
"... Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves
Comparing Pure Parallel Ensemble Creation Techniques Against Bagging
- IN THE THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MININ
, 2003
"... We experimentally evaluate randomization-based approaches to creating an ensemble of decision-tree classifiers. Unlike methods related to boosting, all of the eight approaches considered here create each classifier in an ensemble independently of the other classifiers. Experiments were performed on ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
We experimentally evaluate randomization-based approaches to creating an ensemble of decision-tree classifiers. Unlike methods related to boosting, all of the eight approaches considered here create each classifier in an ensemble independently of the other classifiers. Experiments were performed on 28 publicly available datasets, using C4.5 release 8 as the base classifier. While each of the other seven approaches has some strengths, we find that none of them is consistently more accurate than standard bagging when tested for statistical significance.
Ensembles of classifiers from spatially disjoint data
- In Proceedings of the Sixth International Conference on Multiple Classifier Systems
, 2005
"... Abstract. We describe an ensemble learning approach that accurately learns from data which has been partitioned according to the arbitrary spatial requirements of a large-scale simulation wherein classifiers may be trained only the data local to a given partition. As a result, the class statistics c ..."
Abstract
-
Cited by 7 (5 self)
- Add to MetaCart
Abstract. We describe an ensemble learning approach that accurately learns from data which has been partitioned according to the arbitrary spatial requirements of a large-scale simulation wherein classifiers may be trained only the data local to a given partition. As a result, the class statistics can vary from partition to partition; some classes may even be missing from some partitions. In order to learn from such data, we combine a fast ensemble learning algorithm with Bayesian decision theory to generate an accurate working model of the simulation. Results from a simulation of an impactor bar crushing a storage canister and from region recognition in face images show that regions of interest are successfully identified. 1
A Comparison of Ensemble Creation Techniques
- IN THE FIFTH INTERNATIONAL CONFERENCE ON MULTIPLE CLASSIFIER SYSTEMS
, 2004
"... We experimentally evaluate bagging and six other randomization-based approaches to creating an ensemble of decision-tree classifiers. Bagging uses randomization to create multiple training sets. Other approaches, such as Randomized C4.5 apply randomization in selecting a test at a given node of a ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
We experimentally evaluate bagging and six other randomization-based approaches to creating an ensemble of decision-tree classifiers. Bagging uses randomization to create multiple training sets. Other approaches, such as Randomized C4.5 apply randomization in selecting a test at a given node of a tree. Then there are approaches, such as random forests and random subspaces, that apply randomization in the selection of attributes to be used in building the tree. On the other hand boosting, as compared here, incrementally builds classifiers by focusing on examples misclassified by existing classifiers. Experiments were performed on 34 publicly available data sets. While each of the other six approaches has some strengths, we find that none of them is consistently more accurate than standard bagging when tested for statistical significance.
1 2 Using classifier ensembles to label spatially disjoint data
, 2007
"... 12 September 2007 Disk Used 11 We describe an ensemble approach to learning from arbitrarily partitioned data. The partitioning comes from the distributed process-12 ing requirements of a large scale simulation. The volume of the data is such that classifiers can train only on data local to a given ..."
Abstract
- Add to MetaCart
12 September 2007 Disk Used 11 We describe an ensemble approach to learning from arbitrarily partitioned data. The partitioning comes from the distributed process-12 ing requirements of a large scale simulation. The volume of the data is such that classifiers can train only on data local to a given par-13 tition. As a result of the partition reflecting the needs of the simulation, the class statistics can vary from partition to partition. Some 14 classes will likely be missing from some partitions. We combine a fast ensemble learning algorithm with probabilistic majority voting 15 in order to learn an accurate classifier from such data. Results from simulations of an impactor bar crushing a storage canister and from 16 facial feature recognition show that regions of interest are successfully identified in spite of the class imbalance in the individual training 17 sets.

