## A Study of Ensemble of Hybrid Networks

### BibTeX

@MISC{Regularization_astudy,
  author = {Shimon Cohen and Nathan Intrator},
  title = {A Study of Ensemble of Hybrid Networks},
  year = {}
}

### Abstract

We study various ensemble methods for hybrid neural networks. The hybrid networks are composed of radial and projection units and are trained using a deterministic algorithm that completely defines the parameters of the network for a given data set. Thus, there is no random selection of the initial (and final) parameters as in other training algorithms. Network independence is achieved by using bootstrap and boosting methods as well as random input sub-space sampling. The fusion methods are evaluated on several classification benchmark data-sets. A novel MDL-based fusion method appears to reduce the variance of the classification scheme and is sometimes superior in its overall performance.

### Citations

9231 |
Elements of Information Theory
- Cover, Thomas
- 1990
Citation Context ...ℓ(M,D) = ℓ(D|M) + ℓ(M). (2) According to Shannon’s theory, to encode a random variable X with a known distribution by the minimum number of bits, a realization of x has to be encoded by −log(p(x)) bits [18,6]. Thus, the description length is: ℓ(M,D) = −log(p(D|M)) − log(p(M)), (3) where p(D|M) is the probability of the output data given the model, and p(M) is an a priori model probability. Typically the MD...
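The two-part code length in the context above is easy to make concrete. A minimal sketch (the function name is mine, not from the paper), computing ℓ(M,D) = −log p(D|M) − log p(M) in bits:

```python
import math

def description_length(p_data_given_model: float, p_model: float) -> float:
    """Two-part MDL code length l(M,D) = -log p(D|M) - log p(M), in bits (Eq. 3)."""
    return -math.log2(p_data_given_model) - math.log2(p_model)

# A model that fits the data well but is itself improbable can lose to a
# simpler model with a slightly worse fit.
complex_model = description_length(0.9, 0.01)  # good fit, costly model
simple_model = description_length(0.6, 0.50)   # worse fit, cheap model
assert simple_model < complex_model
```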

7146 |
A mathematical theory of communication
- Shannon
- 1948
Citation Context ...ℓ(M,D) = ℓ(D|M) + ℓ(M). (2) According to Shannon’s theory, to encode a random variable X with a known distribution by the minimum number of bits, a realization of x has to be encoded by −log(p(x)) bits [18,6]. Thus, the description length is: ℓ(M,D) = −log(p(D|M)) − log(p(M)), (3) where p(D|M) is the probability of the output data given the model, and p(M) is an a priori model probability. Typically the MD...

2765 | Bagging predictors
- Breiman
- 1996
Citation Context ...le one. Fusion of experts has been studied extensively recently. One of the main results is that experts have to be partially independent for the fusion to be effective [13,14]. The bagging algorithm [1] can be used to de-correlate between classifiers as well as to obtain some performance measure on the accuracy of the classifiers using the “out of bag” sub-set of the data. Another technique, Arcing –...
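The bootstrap resampling behind bagging, and the “out of bag” subset it leaves behind, can be sketched as follows (a minimal illustration assuming NumPy; names are mine):

```python
import numpy as np

def bootstrap_split(n: int, rng: np.random.Generator):
    """Draw a bootstrap sample of n indices with replacement; the indices
    never drawn form the "out of bag" subset, usable as a held-out set to
    estimate a classifier's accuracy."""
    in_bag = rng.integers(0, n, size=n)       # n draws with replacement
    oob = np.setdiff1d(np.arange(n), in_bag)  # indices never drawn
    return in_bag, oob

rng = np.random.default_rng(0)
in_bag, oob = bootstrap_split(100, rng)
# On average about 1/e (roughly 37%) of the points end up out of bag,
# so each bagged classifier sees a different, partially independent sample.
```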

1664 | Random forests
- Breiman
Citation Context ...ch the errors on the training data-sets are used to train more specific classifiers. Sub-sampling of the input space as well as of the training patterns is extensively used in the random forest algorithm [3]. A different flavor of classifier combination uses dynamic classifier selection (DCS) [11] and Classifiers Local Accuracy (CLA) in order to select the best classifier when making a prediction. ⋆ Cor...

1212 |
Pattern Recognition and Neural Networks
- Ripley
- 1996
Citation Context ...d by criminological investigation. At the scene of the crime, the glass left can be used as evidence if it is correctly identified! Ripley’s best result on this data-set is 80% correct classification [15]. The Iris data-set [8] contains three classes, each with 50 instances. The classes refer to a type of iris plant. Each pattern is composed of four attributes. We used ten folds of cross validation in...

1093 |
The use of multiple measurements in taxonomic problems
- Fisher
- 1936
Citation Context ...stigation. At the scene of the crime, the glass left can be used as evidence if it is correctly identified! Ripley’s best result on this data-set is 80% correct classification [15]. The Iris data-set [8] contains three classes, each with 50 instances. The classes refer to a type of iris plant. Each pattern is composed of four attributes. We used ten folds of cross validation in order to estimate the ...

435 |
A universal prior for integers and estimation by minimum description length
- Rissanen
- 1983
Citation Context ...for weighting the different experts for optimal combination. In the MDL formulation, the coding of the data is combined with the coding of the model itself to provide the full description of the data [16]. MDL can be formulated for an imaginary communication channel, in which a sender observes the data D and thus can estimate its distribution, and form an appropriate model for that distribution. The s...

308 | When Networks Disagree: Ensemble Methods for Hybrid
- Perrone, Cooper
- 1993
Citation Context ...nce the performance over a single one. Fusion of experts has been studied extensively recently. One of the main results is that experts have to be partially independent for the fusion to be effective [13,14]. The bagging algorithm [1] can be used to de-correlate between classifiers as well as to obtain some performance measure on the accuracy of the classifiers using the “out of bag” sub-set of the data....

132 | Keeping neural networks simple by minimizing the description length of the weights
- Hinton, Camp
- 1993
Citation Context ...shorter description length. In this work we combine the experts by using the description length as a weight for the convex combination in Eq. (1). Hinton and Camp [12] used a zero-mean Gaussian distribution for the neural network weights. We follow this idea, and define the simplest Gaussian model prior, p(M) = (2πβ²)^(−d/2) exp(−∑_{i=1}^{d} w_i²/(2β²)), (4) where d is th...
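The model-code-length term implied by a zero-mean Gaussian prior over the weights can be sketched as the negative log of Eq. (4); larger weights yield longer codes, so simpler networks are preferred. A minimal illustration (assuming NumPy and natural-log units; names are mine):

```python
import numpy as np

def weight_code_length(w: np.ndarray, beta: float) -> float:
    """-log p(M) for a zero-mean Gaussian prior with standard deviation beta
    over the d network weights: larger weights cost more to encode."""
    d = w.size
    return 0.5 * d * np.log(2 * np.pi * beta**2) + np.sum(w**2) / (2 * beta**2)

small = weight_code_length(np.array([0.1, -0.2, 0.05]), beta=1.0)
large = weight_code_length(np.array([3.0, -4.0, 2.5]), beta=1.0)
assert small < large  # smaller-weight models get shorter descriptions
```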

61 | Bootstrapping with Noise: An Effective Regularization Technique
- Raviv, Intrator
- 1996
Citation Context ...nce the performance over a single one. Fusion of experts has been studied extensively recently. One of the main results is that experts have to be partially independent for the fusion to be effective [13,14]. The bagging algorithm [1] can be used to de-correlate between classifiers as well as to obtain some performance measure on the accuracy of the classifiers using the “out of bag” sub-set of the data....

26 |
Speaker Normalisation for Automatic Speech Recognition
- Deterding
- 1990
Citation Context ... iris plant. Each pattern is composed of four attributes. We used ten folds of cross validation in order to estimate the performance of the different classifiers. The Deterding vowel recognition data [7,9] is a widely studied benchmark. This problem may be more indicative of a real-world modeling problem. The data consists of auditory features of steady state vowels spoken by British English speakers. ...

17 | Square unit augmented, radially extended, multilayer perceptrons
- Flake
- 1998
Citation Context ... iris plant. Each pattern is composed of four attributes. We used ten folds of cross validation in order to estimate the performance of the different classifiers. The Deterding vowel recognition data [7,9] is a widely studied benchmark. This problem may be more indicative of a real-world modeling problem. The data consists of auditory features of steady state vowels spoken by British English speakers. ...

17 | Analysis of linear and order statistics combiners for fusion of imbalanced classifiers
- Roli, Fumera
- 2002
Citation Context ...selection of the class with the maximum number of votes in the ensemble. II. Convex Combination: The second strategy relies on a convex combination using the error values from the first stage of training [17]. Let e_i be the classification error of the i-th classifier. We set the weight of this classifier as follows: a_k = (1/e_k) / ∑_{i=1}^{M} (1/e_i), (11) where M is the number of classifiers in the ensemble. The o...
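The error-based weighting in Eq. (11) normalizes the inverse errors into a convex combination. A minimal sketch (assuming NumPy; the function name is mine):

```python
import numpy as np

def error_based_weights(errors):
    """Eq. (11): a_k = (1/e_k) / sum_i (1/e_i). Lower-error classifiers get
    proportionally larger weights, and the weights sum to 1."""
    inv = 1.0 / np.asarray(errors, dtype=float)
    return inv / inv.sum()

w = error_based_weights([0.10, 0.20, 0.40])
# inverse errors 10, 5, 2.5 -> weights [4/7, 2/7, 1/7]
assert abs(w.sum() - 1.0) < 1e-12
```

The resulting weights can then be used directly in the convex combination of classifier outputs in Eq. (1).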

14 | Automatic Model Selection in a Hybrid Perceptron/ Radial Network
- Cohen, Intrator
Citation Context ...sification scheme and sometimes be superior in its overall performance. 1 Introduction Hybrid neural networks that are composed of radial basis functions and perceptrons have been recently introduced [5,4]. Such networks employ a deterministic algorithm that computes the initial parameters from the training data. Thus, networks that have been trained on the same data-set produce the same solution and t...

14 | Dynamic classifier selection
- Giacinto, Roli
Citation Context ...ion using the error values from the first stage of training [17]. Let e_i be the classification error of the i-th classifier. We set the weight of this classifier as follows: a_k = (1/e_k) / ∑_{i=1}^{M} (1/e_i), (11) where M is the number of classifiers in the ensemble. The output of the ensemble is defined as in Eq. (1). III. Convex Combination: The third strategy relies on a convex combination using the error va...

13 | A Hybrid Projection Based and Radial Basis Function Architecture
- Cohen, Intrator
- 2000
Citation Context ...sification scheme and sometimes be superior in its overall performance. 1 Introduction Hybrid neural networks that are composed of radial basis functions and perceptrons have been recently introduced [5,4]. Such networks employ a deterministic algorithm that computes the initial parameters from the training data. Thus, networks that have been trained on the same data-set produce the same solution and t...

5 |
Arcing classifiers. Annals of Statistics, 26(3):801–849
- Breiman
- 1998
Citation Context ... of the classifiers using the “out of bag” sub-set of the data. Another technique, Arcing – adaptive re-weighting and combining – refers to reusing or selecting data in order to improve classification [2]. One popular arcing procedure is AdaBoost [10], in which the errors on the training data-sets are used to train more specific classifiers. Sub-sampling of the input space as well as the training patt...

1 |
A decision-theoretic generalization of on-line learning and an application to boosting
- Freund, Schapire
- 1995
Citation Context ...-set of the data. Another technique, Arcing – adaptive re-weighting and combining – refers to reusing or selecting data in order to improve classification [2]. One popular arcing procedure is AdaBoost [10], in which the errors on the training data-sets are used to train more specific classifiers. Sub-sampling of the input space as well as of the training patterns is extensively used in the random forest al...