## Online ensemble learning (2001)


Citations: 40 (0 self)

### BibTeX

```bibtex
@TECHREPORT{Oza01onlineensemble,
  author      = {Nikunj Chandrakant Oza},
  title       = {Online ensemble learning},
  institution = {},
  year        = {2001}
}
```


### Citations

3447 | Probability and Measure
- Billingsley
- 1979

Citation Context: …The following is Scheffé’s Theorem (Billingsley, 1995). We state the version for discrete distributions because that is all we need.…

3059 | UCI Repository of machine learning databases
- Blake, Merz
- 1998

Citation Context: …new hypothesis updated to reflect the new example with the supplied weight. This algorithm was tested on four machine learning datasets, three of which are part of the UCI Machine Learning Repository (Blake, Keogh, & Merz, 1999), and several branch prediction problems from computer architecture. The main goal of their work was to apply ensembles to branch prediction and similar resource-constrained online domains.…

2983 | Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986

Citation Context: …network learning performs nonlinear regression given a training set. The most widely used method for setting the weights in a neural network is the backpropagation algorithm (Bryson & Ho, 1969; Rumelhart, Hinton, & Williams, 1986). For each training example in the training set, its inputs are presented to the input layer of the network and the predicted outputs are calculated. The difference between each predicted output and…
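The forward-then-backward pass described in this context can be sketched minimally. The network below is a toy one-hidden-layer sketch with sigmoid units and squared error; the layer sizes, learning rate, and XOR data are illustrative assumptions, not details from the thesis.

```python
import numpy as np

# A minimal backpropagation sketch, assuming sigmoid units and squared
# error; layer sizes, learning rate, and the XOR data are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, y, hidden=4, lr=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    for _ in range(epochs):
        # Forward pass: present the inputs and compute predicted outputs.
        h = sigmoid(X @ W1)
        out = sigmoid(h @ W2)
        # Backward pass: propagate the output error back to each weight.
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * (h.T @ d_out)
        W1 -= lr * (X.T @ d_h)
    return W1, W2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, y)
error = float(np.mean((sigmoid(sigmoid(X @ W1) @ W2) - y) ** 2))
print("mean squared error after training:", error)
```

The two lines computing `d_out` and `d_h` are the step the context describes: the difference between predicted and true outputs, scaled by the activation derivative, is propagated backward through the weights.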

2730 | Bagging predictors
- Breiman
- 1996
Citation Context: …ensemble’s performance will be perfect because if one network misclassifies an example, then the remaining two networks will correct the error. Two of the most popular ensemble algorithms are bagging (Breiman, 1994) and boosting (Freund & Schapire, 1996). Given a training set, bagging generates multiple bootstrapped training sets and calls the base model learning algorithm with each of them to yield a set of…

2472 | A decision-theoretic generalization of on-line learning and an application to boosting
- Freund, Schapire
- 1995
Citation Context: …hope is that subsequent base models correct the mistakes of the previous models. Bagging and boosting are popular among ensemble methods because of their strong theoretical motivations (Breiman, 1994; Freund & Schapire, 1997) and the good experimental results obtained with them (Freund & Schapire, 1996; Bauer & Kohavi, 1999; Dietterich, 2000). Most ensemble learning algorithms, including bagging and boosting, are batch algorithms…

2267 | Principal Component Analysis
- Jolliffe
- 2002

Citation Context: …diversity in the pool of base models. A very different approach is used by Merz (1998, 1999): given a set of base models, a combining scheme based on Principal Component Analysis (PCA) (Jolliffe, 1986) is used to try to achieve the best performance that can be achieved from that set of models. This method reduces the weights of base models that are redundant even though they may perform well and…

1751 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context: …perfect because if one network misclassifies an example, then the remaining two networks will correct the error. Two of the most popular ensemble algorithms are bagging (Breiman, 1994) and boosting (Freund & Schapire, 1996). Given a training set, bagging generates multiple bootstrapped training sets and calls the base model learning algorithm with each of them to yield a set of base models. Given a training set of size…
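The bagging procedure this context describes (bootstrap-sample the training set, train one base model per sample, combine by voting) can be sketched directly. The one-feature nearest-neighbour base learner and the toy data below are made-up illustrations, not the base models used in the thesis.

```python
import random
from collections import Counter

# Sketch of bagging: draw bootstrap samples, train one base model on
# each, and combine predictions by unweighted majority vote. The
# 1-nearest-neighbour base learner and toy data are illustrative.
def bootstrap(data, rng):
    # Sample |data| examples with replacement.
    return [rng.choice(data) for _ in data]

def bag(data, learn, n_models=11, seed=0):
    rng = random.Random(seed)
    return [learn(bootstrap(data, rng)) for _ in range(n_models)]

def predict(models, x):
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

def learn_1nn(dataset):
    # Toy base learner: predict the label of the nearest training point.
    pts = list(dataset)
    return lambda x: min(pts, key=lambda p: abs(p[0] - x))[1]

data = [(0.1, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b")]
models = bag(data, learn_1nn)
print(predict(models, 0.15), predict(models, 0.95))
```

Each bootstrap sample has the same size as the original training set, so on average each base model sees about 63% of the distinct examples, which is what induces the diversity among base models.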

768 | Boosting the margin: a new explanation for the effectiveness of voting methods
- Schapire, Freund, et al.
- 1998
Citation Context: …more and more base models even after the training error has gone to zero continues to reduce the test error (Drucker & Cortes, 1996; Quinlan, 1996; Breiman, 1998). There has been some more recent work (Schapire, Freund, Bartlett, & Lee, 1997, 1998) that attempts to explain this phenomenon in terms of the distribution of margins of the training examples, where the margin of an example is the total weighted vote for the correct class minus…
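The margin quantity this context defines (the total weighted vote for the correct class minus the largest weighted vote for any other class) is simple to compute; the votes and weights below are invented for illustration.

```python
# Margin of one example under an ensemble's weighted vote: the total
# weight voting for the correct class minus the largest total weight
# voting for any other class. Votes and weights below are made up.
def margin(votes, weights, correct):
    totals = {}
    for label, w in zip(votes, weights):
        totals[label] = totals.get(label, 0) + w
    correct_vote = totals.get(correct, 0)
    wrong_vote = max((v for c, v in totals.items() if c != correct), default=0)
    return correct_vote - wrong_vote

# Three base models vote "a", "a", "b" with (unnormalized) weights 5, 3, 2.
print(margin(["a", "a", "b"], [5, 3, 2], correct="a"))  # 6
```

A positive margin means the example is classified correctly, and a larger margin means a more confident ensemble vote, which is the quantity Schapire et al. argue boosting keeps increasing even after training error reaches zero.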

759 | Hierarchical mixtures of experts and the EM algorithm
- Jordan, Jacobs
- 1994
Citation Context: …of the classifiers have been examined (Benediktsson, Sveinsson, Ersoy, & Swain, 1994; Hashem & Schmeiser, 1993; Jacobs, 1995; Jordan & Jacobs, 1994; Lincoln & Skrzypek, 1990; Merz, 1999). Boosting uses a weighted averaging method where each base model’s weight is proportional to its classification accuracy. The combining schemes described so far…

700 | The strength of weak learnability
- Schapire
- 1990
Citation Context: …We then derive our online boosting algorithm. Finally, we compare the performances of the two algorithms theoretically and experimentally. 4.1 Earlier Boosting Algorithms: The first boosting algorithm (Schapire, 1990) was designed to convert a weak PAC-learning algorithm into a strong PAC-learning algorithm (see (Kearns & Vazirani, 1994) for a detailed explanation of the PAC learning model and Schapire’s original…

699 | The weighted majority algorithm
- Littlestone, Warmuth
- 1994
Citation Context: …in order for ensembles to achieve better performance. We also discuss online learning, including the motivation for it and the various online learning algorithms. We discuss the Weighted Majority (Littlestone & Warmuth, 1994) and Winnow (Littlestone, 1988) algorithms in more detail because, like online ensemble learning algorithms, they maintain several models and update them in an online manner. However, Weighted Majority…
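The Weighted Majority scheme discussed in this context can be sketched in a few lines: predict by weighted vote and multiply the weight of every mistaken expert by a factor beta. The choice beta = 0.5 and the three toy experts below are illustrative; Littlestone and Warmuth analyze a family of such update factors.

```python
# Sketch of the Weighted Majority algorithm: keep one weight per expert,
# predict by weighted vote, and multiply the weight of every mistaken
# expert by beta. The factor beta = 0.5 and the toy experts below are
# illustrative choices, not prescribed by the citation.
def weighted_majority(expert_predictions, labels, beta=0.5):
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    mistakes = 0
    for preds, y in zip(expert_predictions, labels):
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0
        if guess != y:
            mistakes += 1
        # Demote every expert that predicted incorrectly on this example.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return weights, mistakes

# Expert 0 is always right; experts 1 and 2 are always wrong.
preds = [(1, 0, 0), (0, 1, 1), (1, 0, 0), (0, 1, 1)]
labels = [1, 0, 1, 0]
weights, mistakes = weighted_majority(preds, labels)
print(weights, mistakes)  # [1.0, 0.0625, 0.0625] 2
```

The multiplicative update is what drives the mistake bound: bad experts lose weight exponentially fast, so the combined predictor is soon dominated by the best expert in the pool.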

682 | Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm
- Littlestone
- 1988
Citation Context: …better performance. We also discuss online learning, including the motivation for it and the various online learning algorithms. We discuss the Weighted Majority (Littlestone & Warmuth, 1994) and Winnow (Littlestone, 1988) algorithms in more detail because, like online ensemble learning algorithms, they maintain several models and update them in an online manner. However, Weighted Majority and Winnow are designed to…

616 | An Introduction to Computational Learning Theory
- Kearns, Vazirani
- 1994

Citation Context: …theoretically and experimentally. 4.1 Earlier Boosting Algorithms: The first boosting algorithm (Schapire, 1990) was designed to convert a weak PAC-learning algorithm into a strong PAC-learning algorithm (see (Kearns & Vazirani, 1994) for a detailed explanation of the PAC learning model and Schapire’s original boosting algorithm). In the PAC (Probably Approximately Correct) model of learning, a learner has access to a set of labeled…

592 | Solving multiclass learning problems via error-correcting output codes
- Dietterich, Bakiri
- 1995
Citation Context: …Ghosh, 1996) promote diversity by presenting each base model with a different subset of training examples or different weight distributions over the examples. The method of error-correcting output codes (Dietterich & Bakiri, 1995) presents each base model with the same training inputs but different labels: for each base model, the algorithm constructs a random partitioning of the labels into two new labels. The training data…
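The label-partitioning step this context describes, where each base model sees the same inputs but a random two-way relabeling of the classes, can be sketched as below. The class names and data are invented for illustration.

```python
import random

# Sketch of the relabeling step of error-correcting output codes as
# described above: split the class set into two groups and give each
# base model the resulting binary problem. Class names are made up.
def random_label_partition(classes, rng):
    shuffled = sorted(classes)      # sort first so the split is reproducible
    rng.shuffle(shuffled)
    group_one = set(shuffled[: len(shuffled) // 2])
    return {c: (0 if c in group_one else 1) for c in classes}

def relabel(dataset, mapping):
    return [(x, mapping[y]) for x, y in dataset]

rng = random.Random(0)
data = [([0.1], "cat"), ([0.7], "dog"), ([0.4], "bird"), ([0.9], "fish")]
mapping = random_label_partition({"cat", "dog", "bird", "fish"}, rng)
binary_data = relabel(data, mapping)
# Every example now carries a binary label for this one base model.
print(sorted(set(y for _, y in binary_data)))  # [0, 1]
```

Repeating this with a fresh partition per base model yields an ensemble whose members solve different binary problems, which is the source of diversity the surrounding text is cataloguing.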

579 | Stacked generalization
- Wolpert
- 1992
Citation Context: …base models (the component models of the ensemble) if the base models perform well on novel examples and tend to make errors on different examples (e.g., (Breiman, 1993; Oza & Tumer, 1999; Tumer & Oza, 1999; Wolpert, 1992)). To see why, consider the three neural networks in the previous example and a new example…

577 | An empirical comparison of voting classification algorithms: Bagging, boosting, and variants
- Bauer, Kohavi
- 1999
Citation Context: …popular among ensemble methods because of their strong theoretical motivations (Breiman, 1994; Freund & Schapire, 1997) and the good experimental results obtained with them (Freund & Schapire, 1996; Bauer & Kohavi, 1999; Dietterich, 2000). Most ensemble learning algorithms, including bagging and boosting, are batch algorithms. That is, they repeatedly read and process the entire set of training examples. They typically…

545 | Neural network ensembles
- Hansen, Salamon
- 1990

Citation Context: …their methods of training the base models. We can also distinguish methods by the way they combine their base models. Majority voting is one of the most basic methods of combining (Battiti & Colla, 1994; Hansen & Salamon, 1990) and is the method used in bagging. If the classifiers provide probability values, simple averaging is an effective combining method and has received a lot of attention (Lincoln & Skrzypek, 1990; Perrone & Cooper, 1993;…
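The simple-averaging combiner mentioned here, for classifiers that output class probabilities, can be sketched as follows. The three probability vectors are invented, and the example is chosen so that averaging and plain majority voting disagree.

```python
# Combine probabilistic classifiers by simple averaging: average the
# class-probability vectors and predict the arg-max class. The three
# probability vectors below are invented for illustration.
def average_combine(prob_vectors):
    n = len(prob_vectors)
    k = len(prob_vectors[0])
    avg = [sum(p[j] for p in prob_vectors) / n for j in range(k)]
    return avg.index(max(avg)), avg

# Two of three models lean toward class 0, but the third is confident
# in class 1, so averaging picks class 1 while majority vote would not.
probs = [[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]]
label, avg = average_combine(probs)
print(label)  # 1
```

This illustrates why averaging gets separate attention from majority voting in the text: it lets a confident minority outvote a lukewarm majority.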

468 | An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization
- Dietterich
- 2000
Citation Context: …ensemble methods because of their strong theoretical motivations (Breiman, 1994; Freund & Schapire, 1997) and the good experimental results obtained with them (Freund & Schapire, 1996; Bauer & Kohavi, 1999; Dietterich, 2000). Most ensemble learning algorithms, including bagging and boosting, are batch algorithms. That is, they repeatedly read and process the entire set of training examples. They typically require at least…

440 | Boosting a weak learning algorithm by majority
- Freund
- 1995

413 | Neural network ensembles, cross validation, and active learning
- Krogh, Vedelsby
- 1995
Citation Context: …generate a pool of base models that make errors that are as uncorrelated as possible. Methods such as bagging (Breiman, 1994), boosting (Freund & Schapire, 1996), and cross-validation partitioning (Krogh & Vedelsby, 1995; Tumer & Ghosh, 1996) promote diversity by presenting each base model with a different subset of training examples or different weight distributions over the examples. The method of error-correcting…

399 | Methods of combining multiple classifiers and their applications to handwriting recognition
- Xu, Krzyzak, et al.
- 1992

Citation Context: …1993; Tumer & Ghosh, 1996). There are also non-linear ensemble schemes, including rank-based combining (Al-Ghoneim & Vijaya Kumar, 1995; Ho, Hull, & Srihari, 1994), belief-based methods (Rogova, 1994; Xu, Krzyzak, & Suen, 1992; Yang & Singh, 1994), and order-statistic combiners (Tumer & Ghosh, 1998; Tumer, 1996). In this thesis, the ensemble methods that we use are bagging and boosting, which we explain now. Bagging: Bootstrap…

341 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984

Citation Context: …demonstrated that ensembles often outperform their base models (the component models of the ensemble) if the base models perform well on novel examples and tend to make errors on different examples (e.g., (Breiman, 1993; Oza & Tumer, 1999; Tumer & Oza, 1999; Wolpert, 1992)). To see why, consider the three neural networks in the previous example and a new example…

331 | Decision Combination in Multiple Classifier Systems
- Ho, Hull, et al.
- 1994

Citation Context: …depth (Breiman, 1994; Hashem & Schmeiser, 1993; Perrone & Cooper, 1993; Tumer & Ghosh, 1996). There are also non-linear ensemble schemes, including rank-based combining (Al-Ghoneim & Vijaya Kumar, 1995; Ho, Hull, & Srihari, 1994), belief-based methods (Rogova, 1994; Xu, Krzyzak, & Suen, 1992; Yang & Singh, 1994), and order-statistic combiners (Tumer & Ghosh, 1998; Tumer, 1996). In this thesis, the ensemble methods that we use…

323 | WebWatcher: A Learning Apprentice for the World Wide Web
- Armstrong, Freitag, et al.
- 1995
Citation Context: …predictors. Each predictor that makes a mistake on that example has its weight reduced in half. Weighted Majority and Winnow have shown promise in the few empirical tests that have been performed (e.g., (Armstrong, Freitag, Joachims, & Mitchell, 1995; Blum, 1997)). These algorithms have also been proven to perform not much worse than the best individual predictor. For example, given a sequence of training examples and a pool of predictors, if…

307 | When networks disagree: Ensemble methods for hybrid neural networks
- Perrone, Cooper
- 1993
Citation Context: …1990) and is the method used in bagging. If the classifiers provide probability values, simple averaging is an effective combining method and has received a lot of attention (Lincoln & Skrzypek, 1990; Perrone & Cooper, 1993; Tumer & Ghosh, 1996). Weighted averaging has also been proposed and different methods for computing the weights of the classifiers have been examined (Benediktsson, Sveinsson, Ersoy, & Swain, 1994;…

294 | Arcing classifiers
- Breiman
- 1998
Citation Context: …several experiments have demonstrated that adding more and more base models even after the training error has gone to zero continues to reduce the test error (Drucker & Cortes, 1996; Quinlan, 1996; Breiman, 1998). There has been some more recent work (Schapire, Freund, Bartlett, & Lee, 1997, 1998) that attempts to explain this phenomenon in terms of the distribution of margins of the training examples, where…

285 | Bagging, Boosting, and C4.5
- Quinlan
- 1996
Citation Context: …model. However, several experiments have demonstrated that adding more and more base models even after the training error has gone to zero continues to reduce the test error (Drucker & Cortes, 1996; Quinlan, 1996; Breiman, 1998). There has been some more recent work (Schapire, Freund, Bartlett, & Lee, 1997, 1998) that attempts to explain this phenomenon in terms of the distribution of margins of the training…

231 | Experience with a Learning Personal Assistant
- Mitchell, Caruana, et al.
- 1994
Citation Context: …next section. 3.5.4 Online Dataset: In this section, we discuss our experiments with a dataset that represents a true online learning scenario. This dataset is from the Calendar APprentice (CAP) project (Mitchell, Caruana, Freitag, McDermott, & Zabowski, 1994). The members of this project designed a personal calendar program that helps users keep track of their meeting schedules. The software provides all the usual functionality of calendar software such…

162 | Error correlation and error reduction in ensemble classifiers
- Tumer, Ghosh
- 1996
Citation Context: …examples. This is because the class of piecewise linear classifiers has more expressive power than the class of single linear classifiers. The intuition that we have just described has been formalized (Tumer & Ghosh, 1996; Tumer, 1996). Ensemble learning can be justified in terms of the bias and variance of the learned model. It has been shown that, as the correlations of the errors made by the base models decrease,…

159 | UCI KDD archive
- Bay
- 2000

Citation Context: …600 MegaHertz Pentium III processors. 3.5.1 The Data: We tested our algorithms on nine UCI datasets (Blake et al., 1999), two datasets (Census Income and Forest Covertype) from the UCI KDD archive (Bay, 1999), and three synthetic datasets. These are batch datasets, i.e., there is no natural order in the data. With these datasets, we use our learning algorithms to generate a hypothesis using a training set…

147 | Methods for combining experts’ probability assessments
- Jacobs
- 1995

Citation Context: …computing the weights of the classifiers have been examined (Benediktsson, Sveinsson, Ersoy, & Swain, 1994; Hashem & Schmeiser, 1993; Jacobs, 1995; Jordan & Jacobs, 1994; Lincoln & Skrzypek, 1990; Merz, 1999). Boosting uses a weighted averaging method where each base model’s weight is proportional to its classification accuracy. The combining…

144 | Prediction games and arcing algorithms
- Breiman
- 1999
Citation Context: …boosting continues to increase the margins, thereby increasing the separation between the examples in the different classes. However, this explanation has been shown experimentally to be incomplete (Breiman, 1997). A theoretical explanation for boosting’s seeming immunity to overfitting has not yet been obtained and is an active area of research. However, this immunity and boosting’s good performance in experiments…

137 | Universal Prediction
- Merhav, Feder
- 1998

Citation Context: …than the good predictors, leading to much lower weights for the bad predictors. Eventually only the good predictors would influence the prediction of the entire model. Work in universal prediction (Merhav & Feder, 1998; Singer & Feder, 1999) has yielded algorithms that produce combined predictors that are also proven in the worst case to perform not much worse than the best individual predictor. Additionally, Singer…

128 | Empirical Support for Winnow and Weighted-Majority based algorithms
- Blum
- 1995
Citation Context: …that example has its weight reduced in half. Weighted Majority and Winnow have shown promise in the few empirical tests that have been performed (e.g., (Armstrong, Freitag, Joachims, & Mitchell, 1995; Blum, 1997)). These algorithms have also been proven to perform not much worse than the best individual predictor. For example, given a sequence of training examples and a pool of predictors, if there is a…

126 | Decision tree induction based on efficient tree restructuring
- Utgoff, Berkman, et al.
- 1997
Citation Context: …Some researchers have developed online algorithms for learning traditional machine learning models such as decision trees; in this thesis, we use the lossless online decision tree learning algorithm of Utgoff, Berkman, and Clouse (1997). Given an existing decision tree and a new example, this algorithm adds the example to the example sets at the appropriate nonterminal and leaf nodes and then confirms that all the attributes at the…
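The example-routing step this context describes (adding a new example to the example sets along its path through an existing tree) can be sketched minimally. The `Node` class and the tiny weather-style tree are invented stand-ins for the representation used by Utgoff, Berkman, and Clouse; their full algorithm additionally re-verifies the attribute tests and restructures the tree, which is omitted here.

```python
# Sketch of the incremental-update idea described above: route a new
# example down an existing decision tree, appending it to the example
# set stored at every node on its path. The tree and attribute names
# are invented stand-ins, not the thesis's actual data structures.
class Node:
    def __init__(self, attr=None, children=None):
        self.attr = attr            # attribute tested here (None at a leaf)
        self.children = children or {}
        self.examples = []          # example set maintained at this node

def add_example(node, example, label):
    # The lossless algorithm stores the example at every node it passes,
    # so attribute tests can later be re-verified or restructured.
    node.examples.append((example, label))
    if node.attr is not None:
        child = node.children[example[node.attr]]
        add_example(child, example, label)

# A stub tree: the root tests "outlook"; both branches are leaves.
root = Node(attr="outlook",
            children={"sunny": Node(), "rainy": Node()})
add_example(root, {"outlook": "sunny", "wind": "low"}, "play")
add_example(root, {"outlook": "rainy", "wind": "high"}, "stay")
print(len(root.examples), len(root.children["sunny"].examples))  # 2 1
```

Keeping the full example set at each node is what makes the algorithm "lossless": the incrementally built tree can always be restructured into exactly the tree a batch learner would build from the same examples.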

112 | Combining the results of several neural network classifiers
- Rogova
- 1994

Citation Context: …Perrone & Cooper, 1993; Tumer & Ghosh, 1996). There are also non-linear ensemble schemes, including rank-based combining (Al-Ghoneim & Vijaya Kumar, 1995; Ho, Hull, & Srihari, 1994), belief-based methods (Rogova, 1994; Xu, Krzyzak, & Suen, 1992; Yang & Singh, 1994), and order-statistic combiners (Tumer & Ghosh, 1998; Tumer, 1996). In this thesis, the ensemble methods that we use are bagging and boosting, which we explain…

97 | Bias, variance and arcing classifiers
- Breiman
- 1996

Citation Context: …are enough to induce noticeable differences among the base models while leaving their performances reasonably good, then the ensemble will probably perform better than the base models individually. Breiman (1996a) defines models as unstable if differences in their training…

92 | Boosting decision trees
- Drucker, Cortes
- 1996

Citation Context: …size of the ensemble model. However, several experiments have demonstrated that adding more and more base models even after the training error has gone to zero continues to reduce the test error (Drucker & Cortes, 1996; Quinlan, 1996; Breiman, 1998). There has been some more recent work (Schapire, Freund, Bartlett, & Lee, 1997, 1998) that attempts to explain this phenomenon in terms of the distribution of margins of…

86 | Democracy in neural nets: voting schemes for classification
- Battiti, Colla
- 1994

Citation Context: …distinguished by their methods of training the base models. We can also distinguish methods by the way they combine their base models. Majority voting is one of the most basic methods of combining (Battiti & Colla, 1994; Hansen & Salamon, 1990) and is the method used in bagging. If the classifiers provide probability values, simple averaging is an effective combining method and has received a lot of attention (Lincoln…

61 | Synergy of clustering multiple back propagation networks
- Lincoln, Skrzypek
- 1990

Citation Context: …1994; Hansen & Salamon, 1990) and is the method used in bagging. If the classifiers provide probability values, simple averaging is an effective combining method and has received a lot of attention (Lincoln & Skrzypek, 1990; Perrone & Cooper, 1993; Tumer & Ghosh, 1996). Weighted averaging has also been proposed and different methods for computing the weights of the classifiers have been examined (Benediktsson, Sveinsson,…

59 | On-line algorithms in machine learning
- Blum
- 1998
Citation Context: …the online learning literature are the Weighted Majority Algorithm (Littlestone & Warmuth, 1994) and the Winnow Algorithm (Littlestone, 1988) (see (Blum, 1996) for a brief review of these algorithms). Both the Weighted Majority and Winnow algorithms maintain weights on several predictors and increase or decrease their weights depending on whether the individual…

39 | Universal linear prediction by model order weighting
- Singer, Feder
- 1999
Citation Context: …predictors, leading to much lower weights for the bad predictors. Eventually only the good predictors would influence the prediction of the entire model. Work in universal prediction (Merhav & Feder, 1998; Singer & Feder, 1999) has yielded algorithms that produce combined predictors that are also proven in the worst case to perform not much worse than the best individual predictor. Additionally, Singer and Feder (1999)…

32 | Pasting small votes for classification in large databases and online
- Breiman
- 1999

30 | A Principal Components Approach to Combining Regression Estimates
- Merz, Pazzani
- 1997
Citation Context: …(Benediktsson, Sveinsson, Ersoy, & Swain, 1994; Hashem & Schmeiser, 1993; Jacobs, 1995; Jordan & Jacobs, 1994; Lincoln & Skrzypek, 1990; Merz, 1999). Boosting uses a weighted averaging method where each base model’s weight is proportional to its classification accuracy. The combining schemes described so far are linear combining techniques, which…

27 | Online ensemble learning: An empirical study
- Fern, Givan
- 2000

27 | Input Decimation Ensembles: Decorrelation through Dimensionality Reduction
- Oza, Tumer
- 2001

24 | Approximating a function and its derivatives using MSE-optimal linear combinations of trained feedforward neural networks
- Hashem, Schmeiser
- 1993
Citation Context: …different methods for computing the weights of the classifiers have been examined (Benediktsson, Sveinsson, Ersoy, & Swain, 1994; Hashem & Schmeiser, 1993; Jacobs, 1995; Jordan & Jacobs, 1994; Lincoln & Skrzypek, 1990; Merz, 1999). Boosting uses a weighted averaging method where each base model’s weight is proportional to its classification accuracy.…

19 | Dimensionality Reduction Through Classifier Ensembles
- Oza, Tumer
- 1999

Citation Context: …ensembles often outperform their base models (the component models of the ensemble) if the base models perform well on novel examples and tend to make errors on different examples (e.g., (Breiman, 1993; Oza & Tumer, 1999; Tumer & Oza, 1999; Wolpert, 1992)). To see why, consider the three neural networks in the previous example and a new example…

16 | An evidential reasoning approach for multiple-attribute decision making with uncertainty
- Yang, Singh
- 1994

Citation Context: …There are also non-linear ensemble schemes, including rank-based combining (Al-Ghoneim & Vijaya Kumar, 1995; Ho, Hull, & Srihari, 1994), belief-based methods (Rogova, 1994; Xu, Krzyzak, & Suen, 1992; Yang & Singh, 1994), and order-statistic combiners (Tumer & Ghosh, 1998; Tumer, 1996). In this thesis, the ensemble methods that we use are bagging and boosting, which we explain now. Bagging: Bootstrap Aggregating (bagging)…

14 | Boosting regression estimators
- Avnimelech, Intrator
- 1999
Citation Context: …However, online learning algorithms are especially important for time series data. Some work has been done on applying batch boosting to time series classification (Diez & Gonzalez, 2000) and regression (Avnimelech & Intrator, 1998). The work on time series classification assumes that each training example is a time series example of one class. There are other possible time series classification problems. For example, there may…