## Theoretical Foundations Of Linear And Order Statistics Combiners For Neural Pattern Classifiers (1996)

Venue: IEEE Transactions on Neural Networks

Citations: 30 (5 self)

### BibTeX

@ARTICLE{Tumer96theoreticalfoundations,

author = {Kagan Tumer and Joydeep Ghosh},

title = {Theoretical Foundations of Linear and Order Statistics Combiners for Neural Pattern Classifiers},

journal = {IEEE Transactions on Neural Networks},

year = {1996}

}

### Abstract

Several researchers have experimentally shown that substantial improvements can be obtained in difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This paper provides an analytical framework to quantify the improvements in classification results due to combining. The results apply to both linear combiners and the order statistics combiners introduced in this paper. We show that combining networks in output space reduces the variance of the actual decision region boundaries around the optimum boundary. For linear combiners, we show that in the absence of classifier bias, the added classification error is proportional to the boundary variance. For non-linear combiners, we show analytically that the selection of the median, the maximum, and in general the i-th order statistic improves classifier performance. The analysis presented here facilitates the understanding of the relationships among error rates, classifier boundary distributions...
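The abstract's central claim for linear combiners, that averaging reduces the variance of the decision boundary around the optimum, can be illustrated with a minimal simulation (a sketch with hypothetical numbers, not the paper's experimental setup): if each classifier's boundary estimate is modeled as the true boundary plus independent zero-mean Gaussian noise, the averaged boundary has variance sigma^2 / N.

```python
import random
import statistics

random.seed(0)

TRUE_BOUNDARY = 0.5   # hypothetical optimum decision boundary
SIGMA = 0.1           # assumed per-classifier boundary noise
N_CLASSIFIERS = 5
TRIALS = 20000

# Single classifier: one noisy boundary estimate per trial.
single = [random.gauss(TRUE_BOUNDARY, SIGMA) for _ in range(TRIALS)]

# Linear (averaging) combiner: mean of N independent noisy estimates.
combined = [
    statistics.fmean(random.gauss(TRUE_BOUNDARY, SIGMA) for _ in range(N_CLASSIFIERS))
    for _ in range(TRIALS)
]

var_single = statistics.pvariance(single)      # ~ SIGMA**2
var_combined = statistics.pvariance(combined)  # ~ SIGMA**2 / N_CLASSIFIERS
print(var_single, var_combined)
```

Under this idealized noise model the combined boundary's variance shrinks by a factor of N, which is the mechanism behind the paper's proportionality between added classification error and boundary variance.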

### Citations

5374 |
C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context: ...tterns, of which 350 are used for training. An MLP with one hidden layer of 10 units and an RBF network with 8 kernels are used with this data. The CARD1 data set consists of credit approval decisions [35, 36]. 51 inputs are used to determine whether or not to approve the credit card application of a customer. There are 690 examples in this set, and 345 are used for training. The MLP has one hidden layer w...

4142 |
Pattern Classification and Scene Analysis
- Duda, Hart
Citation Context: ...ximate the Bayesian decision boundaries to arbitrary precision, therefore providing similar generalizations [14]. However, often only a limited portion of the pattern space is available or observable [11, 12]. Given a finite and noisy data set, different classifiers typically provide different generalizations by realizing different decision boundaries [16]. For example, when classification is performed us...

3612 | Induction of decision trees
- Quinlan
- 1986
Citation Context: ...tterns, of which 350 are used for training. An MLP with one hidden layer of 10 units and an RBF network with 8 kernels are used with this data. The CARD1 data set consists of credit approval decisions [35, 36]. 51 inputs are used to determine whether or not to approve the credit card application of a customer. There are 690 examples in this set, and 345 are used for training. The MLP has one hidden layer w...

2889 |
Introduction to Statistical Pattern Recognition, second edition
- Fukunaga
- 1990
Citation Context: ...ximate the Bayesian decision boundaries to arbitrary precision, therefore providing similar generalizations [14]. However, often only a limited portion of the pattern space is available or observable [11, 12]. Given a finite and noisy data set, different classifiers typically provide different generalizations by realizing different decision boundaries [16]. For example, when classification is performed us...

2727 | Bagging predictors
- Breiman
- 1996
Citation Context: ...ducing the improvements due to each extra classifier. This has been recently observed by some researchers, such as Breiman, who developed bootstrap methods for achieving independence among estimators [8, 9], and by Jacobs [24]. The number of classifiers that yield the best results depends on a number of factors, including the number of feature sets extracted from the data, their dimensionality, and the ...

643 |
Neural networks and the bias/variance dilemma
- Geman, Bienenstock, et al.
- 1992
Citation Context: ...ue gauge of performance [26, 48]. Given infinite training data, consistent classifiers approximate the Bayesian decision boundaries to arbitrary precision, therefore providing similar generalizations [14]. However, often only a limited portion of the pattern space is available or observable [11, 12]. Given a finite and noisy data set, different classifiers typically provide different generalizations b...

579 | Stacked generalization
- Wolpert
- 1992
Citation Context: ...dimensional patterns. The concept of combining appeared in the neural network literature as early as 1965 [30], and has subsequently been studied in several forms, including "stacked generalization" [49]. Combining has also been studied in other fields such as econometrics, under the name "forecast combining" [18], or machine learning, where it is called "evidence combination" [3, 13]. The overall arc...

429 |
Order Statistics
- David, Nagaraja
- 2003
Citation Context: ...th order statistic, the cumulative distribution function gives the probability that exactly i of the chosen X's are less than or equal to x. The probability density function of X_{i:N} is then given by [10]: f_{X_{i:N}}(x) = N! / [(i-1)! (N-i)!] [F_X(x)]^(i-1) [1 - F_X(x)]^(N-i) f_X(x) (26). This general form, however, cannot always be computed in closed form. Therefore, obtaini...
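The order statistic density in Equation 26 can be checked numerically (a sketch; the Uniform(0,1) distribution and N = 5 are illustrative choices, since then X_{i:N} follows a Beta(i, N-i+1) law with known density and mean):

```python
import math
import random

def order_stat_pdf(x, i, N, F, f):
    """Density of the i-th of N i.i.d. order statistics (Equation 26)."""
    c = math.factorial(N) / (math.factorial(i - 1) * math.factorial(N - i))
    return c * F(x) ** (i - 1) * (1 - F(x)) ** (N - i) * f(x)

# Uniform(0,1): F(x) = x, f(x) = 1, so X_{i:N} ~ Beta(i, N - i + 1).
N, i = 5, 3  # the median of 5 draws
pdf = lambda x: order_stat_pdf(x, i, N, F=lambda t: t, f=lambda t: 1.0)

# Monte Carlo cross-check against the empirical median of 5 uniforms.
random.seed(1)
samples = [sorted(random.random() for _ in range(N))[i - 1] for _ in range(50000)]
mean_emp = sum(samples) / len(samples)
print(pdf(0.5), mean_emp)  # Beta(3,3) density at 0.5 is 1.875; mean is 0.5
```

The closed-form density at x = 0.5 matches the Beta(3,3) value exactly, and the simulated mean agrees with the analytic mean, which is the kind of moment computation the snippet says is used in place of intractable closed forms.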

399 |
Methods of combining multiple classifiers and their applications to handwriting recognition
- Xu, Krzyzak, et al.
- 1992
Citation Context: ...lid lines leading to f_ind represent the decision of a specific classifier, while the dashed lines lead to f_comb, the output of the combiner. Beliefs in the Dempster-Shafer sense are also available [37, 39, 50, 51]. Combiners have also been successfully applied to a multitude of real world problems [5, 7, 17, 25, 41, 52]. A survey of leading combining techniques, along with experimental results, is given in [15, 17...

340 |
Stacked regressions
- Breiman
- 1996
Citation Context: ...ducing the improvements due to each extra classifier. This has been recently observed by some researchers, such as Breiman, who developed bootstrap methods for achieving independence among estimators [8, 9], and by Jacobs [24]. The number of classifiers that yield the best results depends on a number of factors, including the number of feature sets extracted from the data, their dimensionality, and the ...

330 | Decision combination in multiple classifier systems
- Ho, Hull, et al.
- 1994
Citation Context: ...mbining techniques have been mathematically analyzed for regression problems [21, 32], but not for classification. Some researchers have investigated non-linear combiners using rank-based information [1, 23], or voting schemes [4]. Methods for combining ... (Figure 1: Combining Strategy.)

306 | When Networks Disagree: Ensemble Method for Neural Networks
- Perrone, Cooper
- 1993
Citation Context: ...ly trained on different feature sets, provide the combined output f_comb. Currently, the most popular way of combining multiple classifiers is via simple averaging of the corresponding output values [19, 27, 33, 46]. Weighted averaging has also been proposed, along with different methods of computing the proper classifier weights [6, 21, 24, 27]. Such linear combining techniques have been mathematically analyzed...

282 | Neural network classifiers estimate Bayesian a posteriori probabilities
- Richard, Lippmann
- 1991
Citation Context: ...at are trained to minimize a cross-entropy or mean square error (MSE) function, given "one-of-L" desired output patterns, approximate the a posteriori probability densities of the corresponding class [38]. In particular, the MSE is shown to be equivalent to: MSE = K_1 + Σ_i ∫ D_i(x) (p(C_i|x) - f_i(x))² dx, where K_1 and D_i(x) depend on the class distributions only, f_i(x) is the output of the node representing class i given an input x, ...

167 |
A First Course in Order Statistics
- Arnold, Balakrishnan, et al.
- 1992
Citation Context: ...e, obtaining the expected value of a function of x using Equation 26 is not always possible. However, the first two moments of the density function are widely available for a variety of distributions [2]. These moments can be used to compute the expected values of certain specific functions, e.g. polynomials of order less than two. 4.2 Combining Unbiased Classifiers through OS: Now, let us turn our at...

147 |
Methods for combining experts’ probability assessments
- Jacobs
- 1995
Citation Context: ...sifiers is via simple averaging of the corresponding output values [19, 27, 33, 46]. Weighted averaging has also been proposed, along with different methods of computing the proper classifier weights [6, 21, 24, 27]. Such linear combining techniques have been mathematically analyzed for regression problems [21, 32], but not for classification. Some researchers have investigated non-linear combiners using rank-ba...

138 |
Multisurface method of pattern separation for medical diagnosis applied to breast cytology
- Wolberg, Mangasarian
- 1990
Citation Context: ...on the test set, and the standard deviation of those values based on 20 runs. CANCER1 is based on breast cancer data, obtained from the University of Wisconsin Hospitals, from Dr. William H. Wolberg [28, 47]. This set has 9 inputs, 2 outputs and 699 examples. (Footnote 9: available at ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.Z. Footnote 10: Proben1 results reported here correspond to the "pivot" and "no-shortcut...)

132 | Optimal linear combinations of neural networks
- Hashem
- 1997
Citation Context: ...assifiers are used, because in some cases the differences in performance can compromise the results. In such cases, using weighted averaging can provide an alternative to the combiners discussed here [20, 24]. Another method that aims at reducing correlations consists of training classifiers on different features extracted from the same underlying data. Although the same network type is used, the network ...

115 | Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis
- Michalski, Chilausky
- 1980
Citation Context: ...hidden layer of 15 units, and RBF networks with 20 kernels are selected for this data set. The SOYBEAN1 data set consists of 19 classes of soybean, which have to be classified using 82 input features [29]. There are 683 patterns in this set, of which 342 are used for training. MLPs with one hidden layer with 40 units, and RBF networks with 40 kernels are selected. Table 12: Combining Results for GENE1...

112 |
Combining the results of several neural network classifiers
- Rogova
Citation Context: ...lid lines leading to f_ind represent the decision of a specific classifier, while the dashed lines lead to f_comb, the output of the combiner. Beliefs in the Dempster-Shafer sense are also available [37, 39, 50, 51]. Combiners have also been successfully applied to a multitude of real world problems [5, 7, 17, 25, 41, 52]. A survey of leading combining techniques, along with experimental results, is given in [15, 17...

105 |
PROBEN1 – A Set of Benchmarks and Benchmarking Rules for Neural Network Training Algorithms, Universitaet Karlsruhe
- Prechelt
- 1994
Citation Context: ...sed, which are also the cases with the lowest classifier correlations. 6.2 Proben1 Benchmarks: In this section, examples from the Proben1 benchmark set are used to study the benefits of combining [34]. Table 8 shows the test set error rate for both the MLP and the RBF classifiers on six different data sets taken from the Proben1 benchmarks. Table 8: Performance of Individual Classifiers on the...

92 |
Using the adap learning algorithm to forecast the onset of diabetes mellitus
- Smith, Everhart, et al.
- 1988
Citation Context: ...h 20 units, and the RBF network has 20 kernels. The DIABETES1 data set is based on personal data of the Pima Indians obtained from the National Institute of Diabetes and Digestive and Kidney Diseases [43]. The binary output determines whether or not the subjects show signs of diabetes according to the World Health Organization. The input consists of 8 attributes, and there are 768 examples in this set...

86 |
Democracy in neural nets: voting schemes for classification
- Battiti, Colla
- 1994
Citation Context: ...n mathematically analyzed for regression problems [21, 32], but not for classification. Some researchers have investigated non-linear combiners using rank-based information [1, 23], or voting schemes [4]. Methods for combining ... (Figure 1: Combining Strategy.)

80 |
Learning Machines: Foundations of Trainable Pattern Classifying Systems
- Nilsson
- 1965
Citation Context: ...hose that involve a large amount of noise, a limited number of training data, or unusually high dimensional patterns. The concept of combining appeared in the neural network literature as early as 1965 [30], and has subsequently been studied in several forms, including "stacked generalization" [49]. Combining has also been studied in other fields such as econometrics, under the name "forecast combining"...

77 |
Hybrid System for Protein Secondary Structure Prediction
- Zhang, Mesirov, et al.
- 1992
Citation Context: ...lead to f_comb, the output of the combiner. Beliefs in the Dempster-Shafer sense are also available [37, 39, 50, 51]. Combiners have also been successfully applied to a multitude of real world problems [5, 7, 17, 25, 41, 52]. A survey of leading combining techniques, along with experimental results, is given in [15, 17]. Combining techniques such as majority voting can generally be applied to any type of classifier, while...

73 | Pattern recognition via linear programming: Theory and application to medical diagnosis
- Mangasarian, Setiono, et al.
- 1990
Citation Context: ...on the test set, and the standard deviation of those values based on 20 runs. CANCER1 is based on breast cancer data, obtained from the University of Wisconsin Hospitals, from Dr. William H. Wolberg [28, 47]. This set has 9 inputs, 2 outputs and 699 examples. (Footnote 9: available at ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.Z. Footnote 10: Proben1 results reported here correspond to the "pivot" and "no-shortcut...)

70 |
Training knowledge-based neural networks to recognize genes in DNA sequences
- Noordewier, Towell, et al.
- 1991
Citation Context: ...en layer with 10 units, and RBF networks with 10 kernels are selected for this data set. The GENE1 data set is based on intron/exon boundary detection, or the detection of splice junctions in DNA sequences [31, 44]. 120 inputs are used to determine whether a DNA section is a donor, an acceptor, or neither. There are 3175 examples, of which 1588 are used for training. The MLP architecture consists of a single hid...

69 | Computational methods for a mathematical theory of evidence
- Barnett
- 1981
Citation Context: ...ked generalization" [49]. Combining has also been studied in other fields such as econometrics, under the name "forecast combining" [18], or machine learning, where it is called "evidence combination" [3, 13]. The overall architecture of the combiner form studied in this paper is shown in Figure 1. The output of an individual classifier using a single feature set is given by f_ind. Multiple classifiers, ...

61 |
Synergy of clustering multiple back propagation networks
- Lincoln, Skrzypek
- 1990
Citation Context: ...ly trained on different feature sets, provide the combined output f_comb. Currently, the most popular way of combining multiple classifiers is via simple averaging of the corresponding output values [19, 27, 33, 46]. Weighted averaging has also been proposed, along with different methods of computing the proper classifier weights [6, 21, 24, 27]. Such linear combining techniques have been mathematically analyzed...

57 | Interpretation of Artificial Neural Networks
- Towell, Shavlik
- 1992
Citation Context: ...en layer with 10 units, and RBF networks with 10 kernels are selected for this data set. The GENE1 data set is based on intron/exon boundary detection, or the detection of splice junctions in DNA sequences [31, 44]. 120 inputs are used to determine whether a DNA section is a donor, an acceptor, or neither. There are 3175 examples, of which 1588 are used for training. The MLP architecture consists of a single hid...

56 |
A statistical approach to learning and generalization in layered neural networks
- Levin, Tishby, et al.
- 1989
Citation Context: ...mine the classification performance. This ability to meaningfully respond to novel patterns, or generalize, is an important aspect of a classifier system and, in essence, the true gauge of performance [26, 48]. Given infinite training data, consistent classifiers approximate the Bayesian decision boundaries to arbitrary precision, therefore providing similar generalizations [14]. However, often only a limi...

45 | An inference technique for integrating knowledge from disparate sources
- Garvey, Lowrance, et al.
Citation Context: ...ked generalization" [49]. Combining has also been studied in other fields such as econometrics, under the name "forecast combining" [18], or machine learning, where it is called "evidence combination" [3, 13]. The overall architecture of the combiner form studied in this paper is shown in Figure 1. The output of an individual classifier using a single feature set is given by f_ind. Multiple classifiers, ...

37 |
Improving the accuracy of an artificial neural network using multiple differently trained networks
- Baxt
- 1992
Citation Context: ...lead to f_comb, the output of the combiner. Beliefs in the Dempster-Shafer sense are also available [37, 39, 50, 51]. Combiners have also been successfully applied to a multitude of real world problems [5, 7, 17, 25, 41, 52]. A survey of leading combining techniques, along with experimental results, is given in [15, 17]. Combining techniques such as majority voting can generally be applied to any type of classifier, while...

33 | Structural adaptation and generalization in supervised feedforward networks
- Ghosh, Tumer
- 1994
Citation Context: ...the pattern space is available or observable [11, 12]. Given a finite and noisy data set, different classifiers typically provide different generalizations by realizing different decision boundaries [16]. For example, when classification is performed using a multilayered, feed-forward artificial neural network, different weight initializations, or different architectures (number of hidden units, hidd...

30 | Secondary structure prediction: combination of three different methods
- Biou, Gibrat, et al.
- 1988
Citation Context: ...lead to f_comb, the output of the combiner. Beliefs in the Dempster-Shafer sense are also available [37, 39, 50, 51]. Combiners have also been successfully applied to a multitude of real world problems [5, 7, 17, 25, 41, 52]. A survey of leading combining techniques, along with experimental results, is given in [15, 17]. Combining techniques such as majority voting can generally be applied to any type of classifier, while...

24 | Approximating a function and its derivatives using MSE-optimal linear combinations of trained feedforward neural networks
- Hashem, Schmeiser
- 1993
Citation Context: ...sifiers is via simple averaging of the corresponding output values [19, 27, 33, 46]. Weighted averaging has also been proposed, along with different methods of computing the proper classifier weights [6, 21, 24, 27]. Such linear combining techniques have been mathematically analyzed for regression problems [21, 32], but not for classification. Some researchers have investigated non-linear combiners using rank-ba...

23 |
Combining Forecasts-Twenty Years Later
- Granger
- 1989
Citation Context: ...and has subsequently been studied in several forms, including "stacked generalization" [49]. Combining has also been studied in other fields such as econometrics, under the name "forecast combining" [18], or machine learning, where it is called "evidence combination" [3, 13]. The overall architecture of the combiner form studied in this paper is shown in Figure 1. The output of an individual classifie...

21 |
A Mathematical Theory of Generalization
- Wolpert
- 1989
Citation Context: ...mine the classification performance. This ability to meaningfully respond to novel patterns, or generalize, is an important aspect of a classifier system and, in essence, the true gauge of performance [26, 48]. Given infinite training data, consistent classifiers approximate the Bayesian decision boundaries to arbitrary precision, therefore providing similar generalizations [14]. However, often only a limi...

19 |
Estimation of location and scale parameters by order statistics from singly and doubly censored samples
- Sarhan, Greenberg
- 1958
Citation Context: ...distribution of b. For most distributions, α can be found in tabulated form [2]. For example, Table 1 provides α values for all three OS combiners, up to 15 classifiers, for a Gaussian distribution [2, 40]. Returning to the error calculation, we have M_1^os = 0 and M_2^os = σ²_{b_os}, providing: E_add^os = s M_2^os / 2 = s σ²_{b_os} / 2 = s α σ²_b / 2 = α E_add (33). Equation 33 shows that the reduction ...
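The scaling in Equation 33, where the order-statistics combiner's added error is the single-classifier error times a distribution-dependent factor α, can be probed empirically for the median combiner (a sketch; the Gaussian noise model and N = 3 are illustrative): the variance of the median of three independent Gaussian boundary offsets is roughly 0.449 σ², i.e. α is roughly 0.449.

```python
import random
import statistics

random.seed(2)

SIGMA = 1.0       # assumed standard deviation of the boundary offset b
N = 3             # number of combined classifiers
TRIALS = 100000

# Each classifier's boundary offset b_m ~ N(0, SIGMA^2); the median
# combiner's offset is the middle order statistic of the N offsets.
medians = [
    sorted(random.gauss(0.0, SIGMA) for _ in range(N))[N // 2]
    for _ in range(TRIALS)
]

# Empirical alpha: variance of the combined offset over the single-classifier
# variance, which by Equation 33 scales the added error E_add.
alpha = statistics.pvariance(medians) / SIGMA**2
print(alpha)
```

Since alpha comes out well below 1, the simulation is consistent with the snippet's conclusion that the median combiner reduces the added error relative to a single classifier.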

18 | Evidence combination techniques for robust classification of short-duration oceanic signals
- Ghosh, Beck, et al.
- 1992
Citation Context: ...50, 51]. Combiners have also been successfully applied to a multitude of real world problems [5, 7, 17, 25, 41, 52]. A survey of leading combining techniques, along with experimental results, is given in [15, 17]. Combining techniques such as majority voting can generally be applied to any type of classifier, while others rely on specific outputs, or specific interpretations of the output. For example, the co...

16 |
An evidential reasoning approach for multiple-attribute decision making with uncertainty
- Yang, Singh
- 1994
Citation Context: ...lid lines leading to f_ind represent the decision of a specific classifier, while the dashed lines lead to f_comb, the output of the combiner. Beliefs in the Dempster-Shafer sense are also available [37, 39, 50, 51]. Combiners have also been successfully applied to a multitude of real world problems [5, 7, 17, 25, 41, 52]. A survey of leading combining techniques, along with experimental results, is given in [15, 17...

15 | Integration of neural classifiers for passive sonar signals
- Ghosh, Tumer, et al.
- 1996

14 |
Multiple binary decision tree classifiers
- Shlien
- 1990

13 | Least squares learning and approximation of posterior probabilities on classification problems by neural network models
- Shoemaker, Carlin, et al.
- 1991
Citation Context: ...) depend on the class distributions only, f_i(x) is the output of the node representing class i given an input x, p(C_i|x) denotes the posterior probability, and the summation is over all classes [42]. Thus minimizing the MSE is equivalent to a weighted least squares fit of the network outputs to the corresponding posterior probabilities. Despite the increasing body of experimental results showing...

12 | Learning ranks with neural networks (invited paper)
- Al-Ghoneim, Kumar
- 1995
Citation Context: ...mbining techniques have been mathematically analyzed for regression problems [21, 32], but not for classification. Some researchers have investigated non-linear combiners using rank-based information [1, 23], or voting schemes [4]. Methods for combining ... (Figure 1: Combining Strategy.)

12 |
Learning from what’s been learned: Supervised learning in multi-neural network systems
- Perrone, Cooper
Citation Context: ...also been proposed, along with different methods of computing the proper classifier weights [6, 21, 24, 27]. Such linear combining techniques have been mathematically analyzed for regression problems [21, 32], but not for classification. Some researchers have investigated non-linear combiners using rank-based information [1, 23], or voting schemes [4]. Methods for combining ... (Figure 1: Combining Strategy.)

10 |
Integration of neural networks and decision tree classifiers for automated cytology screening
- Lee, Hwang, et al.
- 1991

9 |
The Meta-PI Network: building distributed representations for robust multisource pattern recognition
- Hampshire, Waibel
- 1992
Citation Context: ...ly trained on different feature sets, provide the combined output f_comb. Currently, the most popular way of combining multiple classifiers is via simple averaging of the corresponding output values [19, 27, 33, 46]. Weighted averaging has also been proposed, along with different methods of computing the proper classifier weights [6, 21, 24, 27]. Such linear combining techniques have been mathematically analyzed...

9 | Probabilistic interpretations for MYCIN's certainty factors
- Heckerman
- 1986
Citation Context: ...ations of the output. For example, the confidence factors method found in the machine learning literature relies on the interpretation of the outputs as the belief that a pattern belongs to a given class [22]. The rationale for averaging, on the other hand, is based on the result that the outputs of parametric classifiers that are trained to minimize a cross-entropy or mean square error (MSE) function, gi...

7 |
Parallel consensual neural networks with optimally weighted outputs
- Benediktsson, Sveinsson, et al.
- 1994
Citation Context: ...sifiers is via simple averaging of the corresponding output values [19, 27, 33, 46]. Weighted averaging has also been proposed, along with different methods of computing the proper classifier weights [6, 21, 24, 27]. Such linear combining techniques have been mathematically analyzed for regression problems [21, 32], but not for classification. Some researchers have investigated non-linear combiners using rank-ba...

6 | Boundary variance reduction for improved classification through hybrid networks (invited paper)
- Tumer, Ghosh
- 1995