## Diversity creation methods: A survey and categorisation (2005)

### Download Links

- [www.cs.man.ac.uk]
- [www.cs.bham.ac.uk]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Information Fusion

Citations: 88 (22 self)

### BibTeX

```bibtex
@ARTICLE{Brown05diversitycreation,
  author  = {Gavin Brown and Jeremy Wyatt and Rachel Harris and Xin Yao},
  title   = {Diversity creation methods: A survey and categorisation},
  journal = {Journal of Information Fusion},
  year    = {2005},
  volume  = {6},
  pages   = {5--20}
}
```

### Abstract

Ensemble approaches to classification and regression have attracted a great deal of interest in recent years. These methods can be shown both theoretically and empirically to outperform single predictors on a wide range of tasks. One of the elements required for accurate prediction when using an ensemble is recognised to be error “diversity”. However, the exact meaning of this concept is not clear from the literature, particularly for classification tasks. In this paper we first review the varied attempts to provide a formal explanation of error diversity, including several heuristic and qualitative explanations in the literature. For completeness of discussion we include not only the classification literature but also some excerpts of the rather more mature regression literature, which we believe can still provide some insights. We proceed to survey the various techniques used for creating diverse ensembles, and categorise them, forming a preliminary taxonomy of diversity creation methods. As part of this taxonomy we introduce the idea of implicit and explicit diversity creation methods, and three dimensions along which these may be applied. Finally we propose some new directions that may prove fruitful in understanding classification error diversity.

### Citations

3625 | Neural networks: A comprehensive foundation, 2nd edition
- Haykin
- 1999

Citation context: “...1] and artificial [50] datasets. Liao and Moody [52] demonstrate an information-theoretic technique for feature selection, where all input variables are first grouped based on their mutual information [53, p. 492]. Statistically similar variables are assigned to the same group, and each member’s input set is then formed by input variables extracted from different gro...”

2864 | Genetic Programming: On the Programming of Computers by Means of Natural Selection
- Koza
- 1992

Citation context: “...c Negative Correlation Learning (RTQRT), based on an alternative penalty term. The term used was \( p_i = \sqrt[4]{\frac{1}{M}\sum_{i=1}^{M}(f_i - d)^4} \) (26). The RTQRT-NC technique was applied to a Genetic Programming [71] system, and shown to outperform standard NC on larger ensembles; it is yet to be explained exactly why this is the case. The standard mean squared error function presents a certain error landscape to...”

2492 | Bagging predictors
- Breiman
- 1996

Citation context: “...od approximation to the mean, \( E_{T,W}\{f\} \). If we have a smaller ensemble, we cannot expect this: our sample mean may be upward or downward biased. In order to correct this, some methods, such as Bagging [10], construct our networks from different training datasets, allowing us to sample a more representative portion of the space. This illustration assumes that the expected value of our estimator is equal...”
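The bootstrap-resampling scheme Bagging uses (one resampled training set per ensemble member, combined by averaging) can be sketched as below. This is a minimal illustration, not the paper's code; the `train` callable and toy dataset are placeholders.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement from the dataset."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def bagged_ensemble(data, train, n_members=10, seed=0):
    """Train each member on its own bootstrap resample.

    This is 'implicit' diversity creation: at no point is any diversity
    measurement taken; variation comes only from the random resampling.
    """
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(n_members)]

def predict_mean(members, x):
    """Simple-average combination, as in a linearly weighted ensemble."""
    return sum(f(x) for f in members) / len(members)
```

For example, with `train` fitting a constant predictor (the mean of its sample), the bagged average stays within the range of the data.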

1631 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996

Citation context: “...Bagging [10] is an implicit method: it randomly samples the training patterns to produce different sets for each network; at no point is a measurement taken to ensure diversity will emerge. Boosting [31] is an explicit method: it directly manipulates the training data distributions to ensure some form of diversity in the base set of classifiers (although it is obviously not guaranteed to be the ‘righ...”

608 | Neural networks and the bias/variance dilemma
- Geman, Bienenstock, et al.
- 1992

Citation context: “...of the original proof from [2] were omitted for the authors’ space considerations. However, this can in fact be shown more simply by the same manipulations as used in the bias-variance decomposition [11], reflecting a strong relationship between the two decompositions. We present this alternative version here: \( \sum_i w_i(f_i - d)^2 = \sum_i w_i(f_i - f_{ens} + f_{ens} - d)^2 = \sum_i w_i(f_i - f_{ens})^2 + (f_{ens} - d)^2 \), using \( \sum_i w_i f_i = f_{ens} \)...”

503 | Neural network ensembles
- Hansen, Salamon
- 1990

Citation context: “...) is progressing. What seems to be a sticking point for ensemble research is the non-ordinal output case, as we will now illustrate. 2.3.2 Non-Ordinal Outputs: In ensemble research, Hansen and Salamon [18] is seen by many as the seminal work on diversity in neural network classification ensembles. They stated that a necessary and sufficient condition for a majority voting ensemble of classifiers to b...”

383 | Neural network ensembles, cross validation, and active learning
- Krogh, Vedelsby
- 1995

Citation context: “...rally tackled by linearly weighted ensembles. These types of ensembles have a much clearer framework for explaining the role of diversity than voting methods. In particular the Ambiguity decomposition [2] and bias-variance-covariance decomposition [3] provide a solid quantification of diversity for linearly weighted ensembles by connecting it back to an objective error criterion: mean squared error. W...”
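The Ambiguity decomposition referred to here states that, for a convex combination \( f_{ens} = \sum_i w_i f_i \), the ensemble squared error equals the weighted average individual error minus the Ambiguity (spread) term. A numerical sketch of the identity (variable names are mine, not the paper's):

```python
def ambiguity_decomposition(outputs, weights, d):
    """Check the Krogh & Vedelsby Ambiguity decomposition on one data point:

        (f_ens - d)^2 == sum_i w_i (f_i - d)^2  -  sum_i w_i (f_i - f_ens)^2

    for convex weights (non-negative, summing to one).
    Returns both sides of the identity for comparison.
    """
    f_ens = sum(w * f for w, f in zip(weights, outputs))
    avg_err = sum(w * (f - d) ** 2 for w, f in zip(weights, outputs))
    ambiguity = sum(w * (f - f_ens) ** 2 for w, f in zip(weights, outputs))
    return (f_ens - d) ** 2, avg_err - ambiguity
```

Since the Ambiguity term is non-negative, the ensemble error is never worse than the weighted average member error, which is the formal sense in which spread around the ensemble output quantifies useful diversity.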

278 | Arcing classifiers
- Breiman
- 1998

Citation context: “...c loss function we used for the regression context). A number of authors have attempted to define a bias-variance decomposition for zero-one loss functions [56, 79, 80, 81], each with their own assumptions and shortcomings. Most recently Domingos [82] and James [83] propose general definitions which include the original quadratic loss function as a special case. This le...”

198 | Combining forecasts: A review and annotated bibliography
- Clemen
- 1989

Citation context: “...st study in the Machine Learning literature, but the topic has been covered in other research communities for several years, for example in financial forecasting: Bates and Granger [7, 8], and Clemen [9]. As a consequence, the understanding of diversity here is quite mature, as we will now show. First, as an illustrative scenario, consider a single neural network approximating a sine wave; our networ...”

186 | Popular ensemble methods: An empirical study
- Opitz, Maclin
- 1999

Citation context: “...most common way of generating an ensemble, but is now generally accepted as the least effective method of achieving good diversity; many authors use this as a default benchmark for their own methods [35]. We will first discuss implicit instances of this axis, where weights are generated randomly, and then discuss explicit diversity for this, where networks are directly placed in different parts of th...”

173 | Bias plus variance decomposition for zero-one loss functions
- Kohavi, Wolpert
- 1996

Citation context: “...c loss function we used for the regression context). A number of authors have attempted to define a bias-variance decomposition for zero-one loss functions [56, 79, 80, 81], each with their own assumptions and shortcomings. Most recently Domingos [82] and James [83] propose general definitions which include the original quadratic loss function as a special case. This le...”

155 | Error correlation and error reduction in ensemble classifiers
- Tumer, Ghosh

Citation context: “...a regression one by choosing to approximate the class posterior probabilities; this allows the theory we have already discussed to apply, and work is progressing in this area, notably Tumer and Ghosh [13] and Roli and Fumera [14, 15]. For the regression context discussed in the previous section, the question can be clearly phrased as “how can we quantify diversity when our predictors output real-value...”

147 | Error-Correcting Output Coding Corrects Bias and Variance
- Kong, Dietterich
- 1995

Citation context: “...the string produced by the predictor does not exactly match the string representing one of the classes, the Hamming distance is measured to each class, and the closest is chosen. Kong and Dietterich [56] investigate why this technique works. They find that, like Bagging, ECOC reduces the variance of the ensemble, but in addition can correct the bias component. An important point to note for this resu...”

121 | Algorithms for optimal linear combinations of neural networks
- Hashem
- 1997

Citation context: “...many ad-hoc diversity creation techniques. 2.1 In a Regression Context: The first major study on combining regression estimators was by Perrone [5] (in fact at the same time, and independently, Hashem [6] developed many of the same results). This was the first study in the Machine Learning literature, but the topic has been covered in other research communities for several years, for example in financ...”

107 | Pruning adaptive boosting
- Margineantu, Dietterich
- 1997

Citation context: “...een the diversity of the ensemble and its accuracy. Diversity alone is a poor predictor of the ensemble accuracy.” [20] An alternative measure of diversity was advocated by Margineantu and Dietterich [30]: the kappa statistic, κ. Using the coincidence matrix as before, kappa is defined as \( \kappa = \frac{2(ad - bc)}{(a+b)(c+d) + (a+c)(b+d)} \). Margineantu and Dietterich [30] produced kappa-error plots for Ada...”
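One common form of the pairwise kappa statistic can be computed directly from the 2x2 coincidence counts of two classifiers. The cell labelling below (a: both correct, b: only the first correct, c: only the second correct, d: both wrong) is an assumption for illustration; low kappa indicates high pairwise diversity.

```python
def pairwise_kappa(a, b, c, d):
    """Interrater-agreement kappa for two classifiers.

    a = patterns both classify correctly, b = only the first is correct,
    c = only the second is correct, d = both are wrong (assumed labelling).
    kappa = 1 for identical behaviour, 0 for statistically independent
    classifiers, negative when they disagree more than chance predicts.
    """
    return 2.0 * (a * d - b * c) / ((a + b) * (c + d) + (a + c) * (b + d))
```

For instance, two classifiers that are right and wrong on exactly the same patterns get kappa = 1, while a uniform coincidence matrix gives kappa = 0.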

105 | Generating accurate and diverse members of a neural-network ensemble
- Opitz, Shavlik
- 1996

Citation context: “...der though, that it may be the case that the problem of choosing “compatible” network topologies to place together in an ensemble is simply too hard for a human. Opitz and Shavlik’s Addemup algorithm [58] used an evolutionary algorithm to optimise the network topologies composing the ensemble. Addemup trains with standard backpropagation, then selects groups of networks with a good error diversity ac...”

88 | Error-correcting output codes: A general method for improving multiclass inductive learning programs
- Dietterich, Bakiri
- 1991

Citation context: “...4] combines information from the fossil record with sunspot time series data to predict future sunspot fluctuations. Most of the methods we have discussed manipulate input data. Dietterich and Bakiri [55] manipulate the output targets with Error-Correcting Output Coding. Each output class in the problem is represented by a binary string, chosen such that it is orthogonal (or as close as possible) from...”
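The ECOC decoding step described above (measure the Hamming distance from the predicted bit string to each class codeword and pick the closest class) can be sketched as follows; the codewords in the example are made up for illustration, not taken from the paper.

```python
def ecoc_decode(pred_bits, codewords):
    """Error-Correcting Output Coding decoder.

    pred_bits: bit string produced by the binary predictors.
    codewords: dict mapping class label -> its binary codeword.
    Returns the class whose codeword is nearest in Hamming distance,
    so individual bit errors can be corrected if codewords are far apart.
    """
    def hamming(u, v):
        return sum(x != y for x, y in zip(u, v))
    return min(codewords, key=lambda cls: hamming(pred_bits, codewords[cls]))
```

With codewords chosen pairwise at least 3 bits apart, any single-bit predictor error still decodes to the right class.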

81 | Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization
- Perrone
- 1993

Citation context: “...following section propose a novel way to categorise the many ad-hoc diversity creation techniques. 2.1 In a Regression Context: The first major study on combining regression estimators was by Perrone [5] (in fact at the same time, and independently, Hashem [6] developed many of the same results). This was the first study in the Machine Learning literature, but the topic has been covered in other rese...”

75 | Analysis of decision boundaries in linearly combined neural classifiers
- Tumer, Ghosh
- 1996

Citation context: “...work by Tumer and Ghosh, on combining posterior probability estimates (ordinal values), and then turn to considering the harder question of non-ordinal outputs. 2.3.1 Ordinal Outputs: Tumer and Ghosh [16, 13] provided a theoretical framework for analysing the simple averaging combination rule when our predictor outputs are estimates of the posterior probabilities of each class, as in figure 3. For a one d...”

74 | Making use of population information in evolutionary artificial neural networks
- Yao, Liu
- 1998

Citation context: “...the maximum amount of the hypothesis space is being explored in order to find the best single individual. In spite of these differences, some researchers have found interesting parallels. Yao and Liu [75] evolve a population of neural networks, using fitness sharing techniques to encourage diversity, then combine the entire population as an ensemble instead of just picking the best individual. Khare...”

70 | Measures of diversity in classifier ensembles
- Kuncheva, Whitaker

Citation context: “...iple classifier systems. Yet this is not easy, and our understanding of classifier error diversity is still incomplete. While we have many measures of diversity from the numerical taxonomy literature [1], we do not yet have a complete grounded framework; neither do we have a useful guide through the myriad of techniques by which we could create such error diversity. While ensemble approaches to class...”

68 | Ensemble learning using decorrelated neural networks
- Rosen
- 1996

59 | Bootstrapping with noise: An effective regularization technique
- Raviv, Intrator
- 1996

Citation context: “...y applies a random transformation to the features, yet Sharkey shows an ensemble using this technique can outperform an ensemble of classifiers using only the non-transformed data. Intrator and Raviv [46] report that simply adding Gaussian noise to the input data can help. They create a bootstrap resample, like Bagging, but then add a small amount of noise to the input vector. Several ensembles are th...”

56 | Diversity versus Quality in Classification Ensembles Based on Feature Selection
- Cunningham, Carney

Citation context: “...ur. This suggested improvement does not, however, give any indication as to which members of the ensemble are responsible for which proportions of the different levels of error. Carney and Cunningham [22] suggested an entropy-based measure, though this does not allow calculation of an individual’s contribution to overall diversity. Zenobi and Cunningham [23] proposed a measure of classification Ambigu...”

56 | Feature selection for ensembles
- Opitz
- 1999

Citation context: “...y term as a penalty in the error function of each network. This means we can optimise ensemble performance by tuning the emphasis on diversity in the error function via the strength parameter. Opitz [78] selected feature subsets for the ensemble members to train on, using a Genetic Algorithm (GA) with an Ambiguity-based fitness function; this showed gains over Bagging and Adaboost on several classifi...”

52 | On bias, variance, 0/1 loss and the curse of dimensionality
- Friedman
- 1997

Citation context: “...c loss function we used for the regression context). A number of authors have attempted to define a bias-variance decomposition for zero-one loss functions [56, 79, 80, 81], each with their own assumptions and shortcomings. Most recently Domingos [82] and James [83] propose general definitions which include the original quadratic loss function as a special case. This le...”

49 | Combining Artificial Neural Nets: Ensemble and Modular Multi-net Systems
- Sharkey
- 1999

Citation context: “...it methods deterministically choose different paths in the space. In addition to this high level dichotomy, there are several other possible dimensions for ensuring diversity in the ensemble. Sharkey [32] proposed that a neural network ensemble could be made to exhibit diversity by influencing one of four things: the initial weights, the training data used, the architecture of the networks, and the tr...”

49 | Constructing diverse classifier ensembles using artificial training examples
- Melville, Mooney
- 2003

Citation context: “...ding to a pre-defined threshold, or an increase in overall ensemble error. In this case a new feature subset is generated and another predictor trained. The DECORATE algorithm, by Melville and Mooney [49], utilises the same metric to decide whether to accept or reject predictors to be added to the ensemble. Predictors here are generated by training on the original data, plus a ‘diversity set’ of artifi...”

48 | Engineering multiversion neural-net systems
- Partridge, Yates
- 1996

Citation context: “...propagation, seems not to be an effective stand-alone method for generating error diversity in an ensemble of neural networks. These observations are supported by a number of other studies. Partridge [37, 38] conducted several experiments on large (> 150,000 patterns) synthetic data sets, and concludes that after network type, training set structure, and number of hidden units, the random initialization...”

44 | Experiments with Classifier Combination Rules
- Duin, Tax
- 2000

Citation context: “...iques which transform features are termed distortion methods [42]. [Figure 4: Space of possible training sets for an ensemble (N training patterns × K features, within the space of all possible features).] Duin and Tax [43] find that combining the results of one type of classifier on different feature sets is far more effective than combining the results of different classifiers on one feature set. They conclude that th...”

44 | A Unified Bias-Variance Decomposition and its Applications
- Domingos
- 2000

Citation context: “...ext). A number of authors have attempted to define a bias-variance decomposition for zero-one loss functions [56, 79, 80, 81], each with their own assumptions and shortcomings. Most recently Domingos [82] and James [83] propose general definitions which include the original quadratic loss function as a special case. This leads us naturally to ask the question, does there exist an analogue to the bias-...”

41 | Speciation as automatic categorical modularization
- Darwen, Yao
- 1997

Citation context: “....3.2 Evolutionary Methods: The term “diversity” is also often used in the evolutionary computation literature, in the context of maintaining diversity in the population of individuals you are evolving [72, 73, 74]. This has a very different meaning to the concept of diversity as discussed in this article. In evolutionary algorithms, we wish to maintain diversity in order to ensure we have explored a large area...”

40 | Generalization error of ensemble estimators
- Ueda, Nakano
- 1996

Citation context: “...ether an ensemble is performing well. In addition, we suggested directions to take in understanding a more formal grounding for diversity, around studies of the bias-variance-covariance decomposition [3] and the generalised bias-variance decomposition [83] for zero-one loss. This is the subject of our current research. The main contribution of this article has been a thorough survey and categorisatio...”

40 | A constructive algorithm for training cooperative neural network ensembles
- Islam, Yao, et al.

Citation context: “...nsemble. Addemup trains with standard backpropagation, then selects groups of networks with a good error diversity according to the measurement of diversity. Another recently proposed algorithm, CNNE [59], constructively builds an ensemble, monitoring diversity during the process. CNNE simultaneously designs the ensemble architecture along with training of individual NNs, whilst directly encouraging e...”

38 | Relationships between combination methods and measures of diversity in combining classifiers
- Shipp, Kuncheva
- 2002

Citation context: “...th the ensemble output, and 0 otherwise. The overall ensemble Ambiguity is \( \bar{A} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{M}\sum_{i=1}^{M} a_i(x_n) \) (18). The vast majority of empirical evidence examining classifier diversity is due to Kuncheva [24, 25, 20, 1, 26, 27, 28, 29]. These studies have explored several measures of diversity from the numerical taxonomy literature. Kuncheva’s work emphasizes the existence of two styles of measuring diversity, pairwise and non-pair...”

38 | Combining the predictions of multiple classifiers: using competitive learning to initialize neural networks
- Maclin, Shavlik
- 1995

Citation context: “...rting point in hypothesis space. There are very few explicit methods for this, where randomisation of weights does not occur; the literature on this topic is disappointingly small. Maclin and Shavlik [40] present an approach to initializing neural network weights that uses competitive learning to create networks that are initialised far from the origin of weight space, thereby potentially increasing t...”

36 | Combining Forecasts: Twenty Years Later
- Granger
- 1989

Citation context: “...). This was the first study in the Machine Learning literature, but the topic has been covered in other research communities for several years, for example in financial forecasting: Bates and Granger [7, 8], and Clemen [9]. As a consequence, the understanding of diversity here is quite mature, as we will now show. First, as an illustrative scenario, consider a single neural network approximating a sine...”

36 | Diversity in Neural Network Ensembles
- Brown
- 2004

Citation context: “...stand what portions of the bias-variance-covariance decomposition correspond to the Ambiguity term and which portions to the ’average individual error’ term. After some manipulations (for details see [12]) we can show: \( E\{\frac{1}{M}\sum_i (f_i - \langle d\rangle)^2\} = (E\{\bar{f}\} - \langle d\rangle)^2 + \frac{1}{M}\sum_i E\{(f_i - E\{\bar{f}\})^2\} = \mathrm{bias}(\bar{f})^2 + \Omega \) (9), and \( E\{\frac{1}{M}\sum_i (f_i - \bar{f})^2\} = \frac{1}{M}\sum_i E\{(f_i - E\{\bar{f}\})^2\} - E\{(\bar{f} - E\{\bar{f}\})^2\} = ...\)”

33 | Using diversity in preparing ensembles of classifiers based on different feature subsets to minimize generalization error
- Zenobi, Cunningham
- 2001

Citation context: “...rent levels of error. Carney and Cunningham [22] suggested an entropy-based measure, though this does not allow calculation of an individual’s contribution to overall diversity. Zenobi and Cunningham [23] proposed a measure of classification Ambiguity. The Ambiguity of the ith classifier, averaged over N patterns, is \( A_i = \frac{1}{N}\sum_{n=1}^{N} a_i(x_n) \) (17), where \( a_i(x_n) \)...”
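The classification Ambiguity measure described here counts disagreements between each member and the ensemble output (eq. 17), then averages over members (eq. 18). A sketch, assuming majority voting as the ensemble output with ties broken arbitrarily:

```python
from collections import Counter

def classification_ambiguity(member_preds):
    """Zenobi & Cunningham-style classification Ambiguity.

    member_preds[i][n] is classifier i's label for pattern n.
    a_i(x_n) = 1 iff classifier i disagrees with the majority-vote output.
    Returns (per-classifier ambiguities A_i, overall ensemble ambiguity).
    """
    M, N = len(member_preds), len(member_preds[0])
    # Majority-vote ensemble output for each pattern (assumed combiner).
    ensemble = [Counter(member_preds[i][n] for i in range(M)).most_common(1)[0][0]
                for n in range(N)]
    A = [sum(member_preds[i][n] != ensemble[n] for n in range(N)) / N
         for i in range(M)]
    return A, sum(A) / M
```

Unlike the entropy measure, this exposes each individual's contribution: A_i is exactly classifier i's rate of disagreement with the ensemble.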

29 | Theoretical foundations of linear and order statistics combiners for neural pattern classifiers
- Tumer, Ghosh
- 1995

Citation context: “...\( E^{ens}_{add} = E_{add}\,\frac{1 + \delta(M - 1)}{M} \) (14), where M is the number of classifiers. \( E_{add} \) is the expected added error of the individual classifiers: they are assumed to have the same error. The δ is a correlation coefficient (see [17] for details) measuring the correlation between errors in approximating the posterior probabilities; therefore this is a direct measure of divers...”
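Eq. (14) can be evaluated directly: with uncorrelated errors (δ = 0) the ensemble's added error falls by a factor of M, while with fully correlated errors (δ = 1) averaging buys nothing. A one-function sketch (the name is mine):

```python
def ensemble_added_error(e_add, delta, M):
    """Tumer & Ghosh expected added error of a simple-averaging ensemble:

        E_add_ens = e_add * (1 + delta * (M - 1)) / M

    e_add: common added error of the individual classifiers,
    delta: correlation coefficient between their errors,
    M:     number of classifiers.
    """
    return e_add * (1 + delta * (M - 1)) / M
```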

29 | Randomizing outputs to increase prediction accuracy
- Breiman
- 2000

Citation context: “...termine the optimal level. On test data, they show significant improvements on synthetic and medical datasets. So far we have discussed how the input patterns could be resampled or distorted; Breiman [47] proposed adding noise to the outputs in the training data. This technique showed significant improvements over Bagging on 15 natural and artificial datasets; however when comparing to AdaBoost [31],...”

29 | Ensemble learning via negative correlation
- Liu, Yao
- 1999

Citation context: “...a regularisation term \( p_i = (f_i - \bar{f})\sum_{j \neq i}(f_j - \bar{f}) \) (25), where \( \bar{f} \) is the average output of the whole ensemble of M networks at the previous timestep. NC has seen a number of empirical successes [33, 66, 67], consistently outperforming a simple ensemble system. In previous work [68] we formalised certain aspects of the NC algorithm, showing it could be applied to any learning machine that could minimise...”
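The penalty in eq. (25) can be computed directly. Note that because the deviations from the ensemble mean sum to zero, p_i algebraically reduces to -(f_i - fbar)^2, so the penalty explicitly rewards members for moving away from the ensemble output; the sketch below (function name mine) checks that identity numerically.

```python
def nc_penalty(outputs, i):
    """Negative Correlation learning penalty for ensemble member i:

        p_i = (f_i - fbar) * sum_{j != i} (f_j - fbar)

    where fbar is the ensemble's average output on the current pattern.
    Since the deviations (f_j - fbar) sum to zero over all j, this
    equals -(f_i - fbar)^2: negative whenever member i deviates at all.
    """
    fbar = sum(outputs) / len(outputs)
    others = sum(outputs[j] - fbar for j in range(len(outputs)) if j != i)
    return (outputs[i] - fbar) * others
```

In NC training this p_i is added (scaled by a strength parameter) to each member's squared-error term, explicitly trading individual accuracy against decorrelation.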

29 | Variance and bias for general loss functions
- James
- 2003

Citation context: “...of authors have attempted to define a bias-variance decomposition for zero-one loss functions [56, 79, 80, 81], each with their own assumptions and shortcomings. Most recently Domingos [82] and James [83] propose general definitions which include the original quadratic loss function as a special case. This leads us naturally to ask the question, does there exist an analogue to the bias-variance-covari...”

26 | Ten measures of diversity in classifier ensembles: limits for two classifiers
- Kuncheva, Whitaker
- 2001

Citation context: “...th the ensemble output, and 0 otherwise. The overall ensemble Ambiguity is \( \bar{A} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{M}\sum_{i=1}^{M} a_i(x_n) \) (18). The vast majority of empirical evidence examining classifier diversity is due to Kuncheva [24, 25, 20, 1, 26, 27, 28, 29]. These studies have explored several measures of diversity from the numerical taxonomy literature. Kuncheva’s work emphasizes the existence of two styles of measuring diversity, pairwise and non-pair...”

25 | Improving committee diagnosis with resampling techniques
- Parmanto, Munro, et al.
- 1996

Citation context: “...des that after network type, training set structure, and number of hidden units, the random initialization of weights is the least effective method for generating diversity. Parmanto, Munro and Doyle [39] used one synthetic dataset and two medical diagnosis datasets to compare 10-fold cross-validation, Bagging, and random weight initializations; again the random weights method comes in last place. The...”

25 | Input decimation ensembles: decorrelation through dimensionality reduction
- Oza, Tumer
- 2001

Citation context: “...error is not reduced, a new diversity set is produced and a new predictor trained. The algorithm terminates after a desired ensemble size or a specified number of iterations is reached. Oza and Tumer [50] present Input Decimation Ensembles, which seeks to reduce the correlations among individual estimators by using different subsets of the input features. Feature selection is achieved by calculating t...”

25 | Negatively correlated neural networks can produce best ensembles
- Liu, Yao
- 1997

Citation context: “...a regularisation term \( p_i = (f_i - \bar{f})\sum_{j \neq i}(f_j - \bar{f}) \) (25), where \( \bar{f} \) is the average output of the whole ensemble of M networks at the previous timestep. NC has seen a number of empirical successes [33, 66, 67], consistently outperforming a simple ensemble system. In previous work [68] we formalised certain aspects of the NC algorithm, showing it could be applied to any learning machine that could minimise...”

21 | The “test and select” approach to ensemble combination
- Sharkey, Sharkey
- 2000

Citation context: “...of the features. This can be viewed in our diagram as using a different plane, moving in the space of all possible features. The data techniques which transform features are termed distortion methods [42]. [Figure 4: Space of possible training sets for an ensemble (N training patterns × K features, within the space of all possible features).] Duin and Tax [43] find that combining the results of one type of classifier on...”

20 | That elusive diversity in classifier ensembles
- Kuncheva

Citation context: “...outputs. Although the bound is not very tight, Kuncheva comments that it can be regarded as a piece of “that yet missing more general theory of diversity” [20]. We will comment further on the nature of this problem in section 4. From this point onwards, when referring to “classification error diversity”, it can be assumed that we are referring to this diffi...”

19 | Multi-net Systems
- Sharkey
- 1999

Citation context: “...fier combination techniques into coverage optimisation and decision optimisation; the diversity creation methods we have described would seem to come under the branch of coverage optimisation. Sharkey [77] proposed a categorisation scheme for multi-network architectures. An architecture is categorised on whether it is competitive or cooperative, and whether it is top-down or bottom-up. A competitive ar...”