Wrappers for feature subset selection
- Artificial Intelligence
, 1997
Cited by 775 (3 self)
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and
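The wrapper idea in this abstract, scoring each candidate feature subset by running the learning algorithm itself, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the greedy forward search, the nearest-centroid learner, the interleaved cross-validation folds, and the names `forward_select`, `cv_accuracy`, and `make_toy_data` are not taken from the paper, which studies its own search strategies with decision-tree and other induction algorithms.

```python
def majority_label(labels):
    """Most frequent label; ties go to the smallest label for determinism."""
    return max(sorted(set(labels)), key=labels.count)


def centroid_accuracy(train, test, features):
    """Accuracy of a nearest-centroid learner that sees only `features`.

    `train` and `test` are lists of (row, label) pairs.
    """
    train_labels = [label for _, label in train]
    if not features:
        # With no features, fall back to predicting the majority class.
        guess = majority_label(train_labels)
        return sum(label == guess for _, label in test) / len(test)
    centroids = {}
    for label in sorted(set(train_labels)):
        rows = [row for row, lab in train if lab == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]

    def predict(row):
        return min(
            centroids,
            key=lambda lab: sum(
                (row[f] - c) ** 2 for f, c in zip(features, centroids[lab])
            ),
        )

    return sum(predict(row) == label for row, label in test) / len(test)


def cv_accuracy(X, y, features, folds=5):
    """Mean held-out accuracy over `folds` interleaved folds."""
    data = list(zip(X, y))
    scores = []
    for k in range(folds):
        test = [d for i, d in enumerate(data) if i % folds == k]
        train = [d for i, d in enumerate(data) if i % folds != k]
        scores.append(centroid_accuracy(train, test, features))
    return sum(scores) / folds


def forward_select(X, y, folds=5):
    """Greedy forward search: add the feature that most improves CV accuracy."""
    selected, best = [], cv_accuracy(X, y, [], folds)
    while True:
        candidates = [f for f in range(len(X[0])) if f not in selected]
        if not candidates:
            break
        scored = [(cv_accuracy(X, y, selected + [f], folds), f) for f in candidates]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best:  # no candidate improves the estimate: stop
            break
        selected.append(feature)
        best = score
    return selected, best


def make_toy_data(n=40):
    """Hypothetical data: feature 0 separates the classes, 1 and 2 do not."""
    y = [i % 2 for i in range(n)]
    X = [
        [10.0 * y[i] + 0.1 * (i % 5), float(i * 7 % 11), float(i * 3 % 5)]
        for i in range(n)
    ]
    return X, y


X, y = make_toy_data()
selected, score = forward_select(X, y)
print(selected, score)
```

The point of the sketch is only that the subset is judged by the learner's own estimated accuracy rather than by a learner-independent relevance score, which is what distinguishes a wrapper from a filter such as Relief.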
Popular ensemble methods: an empirical study
- Journal of Artificial Intelligence Research
, 1999
Cited by 151 (3 self)
An ensemble consists of a set of individually trained classifiers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble. Bagging (Breiman, 1996c) and Boosting (Freund & Schapire, 1996; Schapire, 1990) are two relatively new but popular methods for producing ensembles. In this paper we evaluate these methods on 23 data sets using both neural networks and decision trees as our classification algorithms. Our results clearly indicate a number of conclusions. First, while Bagging is almost always more accurate than a single classifier, it is sometimes much less accurate than Boosting. On the other hand, Boosting can create ensembles that are less accurate than a single classifier – especially when using neural networks. Analysis indicates that the performance of the Boosting methods is dependent on the characteristics of the data set being examined. In fact, further results show that Boosting ensembles may overfit noisy data sets, thus decreasing their performance. Finally, consistent with previous studies, our work suggests that most of the gain in an ensemble’s performance comes in the first few classifiers combined; however, relatively large gains can be seen up to 25 classifiers when Boosting decision trees.
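As a rough illustration of the Bagging procedure evaluated here, the sketch below trains each ensemble member on a bootstrap resample of the training set and combines predictions by plurality vote. The base learner is an assumption: a one-feature decision stump stands in for the neural networks and decision trees used in the paper, and the names `train_stump` and `bagging` are hypothetical.

```python
import random
from collections import Counter


def train_stump(X, y):
    """Fit a one-feature threshold classifier by exhaustive search."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(lab != l_lab for lab in left)
            errors += sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # Every row is identical, so no split exists: predict the majority.
        majority = Counter(y).most_common(1)[0][0]
        return lambda row: majority
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab


def bagging(X, y, n_models=25, seed=0):
    """Train `n_models` stumps on bootstrap resamples; predict by plurality vote."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))

    def predict(row):
        votes = Counter(model(row) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Toy one-dimensional problem: values below 10 are class 0, the rest class 1.
X = [[float(i)] for i in range(20)]
y = [0] * 10 + [1] * 10
predict = bagging(X, y)
print(predict([0.0]), predict([19.0]))
```

Boosting differs in that later members are trained on reweighted data that emphasizes earlier mistakes, which is one reason it can be more sensitive to noisy labels; that reweighting is not shown in this sketch.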
Bias plus variance decomposition for zero-one loss functions
- In Machine Learning: Proceedings of the Thirteenth International Conference
, 1996
Cited by 144 (3 self)
We present a bias-variance decomposition of expected misclassification rate, the most commonly used loss function in supervised classification learning. The bias-variance decomposition for quadratic loss functions is well known and serves as an important tool for analyzing learning algorithms, yet no decomposition was offered for the more commonly used zero-one (misclassification) loss functions until the recent work of Kong & Dietterich (1995) and Breiman (1996). Their decomposition suffers from some major shortcomings though (e.g., potentially negative variance), which our decomposition avoids. We show that, in practice, the naive frequency-based estimation of the decomposition terms is by itself biased and show how to correct for this bias. We illustrate the decomposition on various algorithms and datasets from the UCI repository.
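The abstract does not state the decomposition itself. The sketch below uses a commonly cited form of a zero-one-loss decomposition at a single test point, in which squared bias, variance, and noise are each half of a sum over class probabilities; treat the formulas and the names `zero_one_decomposition`, `p_target`, and `p_hypothesis` as assumptions for illustration, not as a quotation of the paper. The bias correction for frequency-based estimates mentioned in the abstract is not included.

```python
def zero_one_decomposition(p_target, p_hypothesis):
    """Decompose expected zero-one loss at one test point.

    p_target[c]     : probability that the true label is c
    p_hypothesis[c] : probability, over training sets, that the learner predicts c
    Returns (bias_squared, variance, noise, expected_loss).
    """
    labels = sorted(set(p_target) | set(p_hypothesis))
    pf = [p_target.get(c, 0.0) for c in labels]
    ph = [p_hypothesis.get(c, 0.0) for c in labels]
    bias_squared = 0.5 * sum((a - b) ** 2 for a, b in zip(pf, ph))
    variance = 0.5 * (1.0 - sum(b * b for b in ph))
    noise = 0.5 * (1.0 - sum(a * a for a in pf))
    # With target and prediction independent given the test point, the
    # probability of a mismatch is one minus the probability of agreement.
    expected_loss = 1.0 - sum(a * b for a, b in zip(pf, ph))
    return bias_squared, variance, noise, expected_loss


# Noise-free target "a"; the learner predicts "a" on 75% of training sets.
b2, var, noise, loss = zero_one_decomposition({"a": 1.0}, {"a": 0.75, "b": 0.25})
print(b2, var, noise, loss)
```

Under these definitions the three terms add up to the expected loss, and the variance term cannot be negative because the squared prediction probabilities sum to at most one.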
Error-Correcting Output Coding Corrects Bias and Variance
- In Proceedings of the Twelfth International Conference on Machine Learning
, 1995
Cited by 131 (5 self)
Previous research has shown that a technique called error-correcting output coding (ECOC) can dramatically improve the classification accuracy of supervised learning algorithms that learn to classify data points into one of k ≫ 2 classes. This paper presents an investigation of why the ECOC technique works, particularly when employed with decision-tree learning algorithms. It shows that the ECOC method, like any form of voting or committee, can reduce the variance of the learning algorithm. Furthermore, unlike methods that simply combine multiple runs of the same learning algorithm, ECOC can correct for errors caused by the bias of the learning algorithm. Experiments show that this bias correction ability relies on the non-local behavior of C4.5. 1 Introduction Error-correcting output coding (ECOC) is a method for applying binary (two-class) learning algorithms to solve k-class supervised learning problems. It works by converting the k-class supervised learning problem into a la...
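The decoding step of ECOC can be sketched briefly. Each class is assigned a binary codeword, one binary learner is trained per bit position, and a test point is assigned to the class whose codeword is closest in Hamming distance to the predicted bit string. The 4-class, 7-bit code matrix and the names `CODEWORDS`, `hamming`, and `decode` are assumptions chosen for illustration; the binary learners themselves are omitted and their outputs are passed in directly.

```python
from itertools import combinations

# Hypothetical code matrix: one 7-bit codeword per class.
CODEWORDS = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (0, 0, 0, 0, 1, 1, 1),
    "C": (0, 0, 1, 1, 0, 0, 1),
    "D": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions at which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codewords):
    """Smallest pairwise Hamming distance between codewords."""
    return min(hamming(a, b) for a, b in combinations(codewords.values(), 2))


def decode(bits, codewords=CODEWORDS):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(codewords, key=lambda label: hamming(bits, codewords[label]))


# Suppose the seven binary learners output the codeword for "C"
# except that the first learner is wrong.
predicted = (1, 0, 1, 1, 0, 0, 1)
print(min_distance(CODEWORDS), decode(predicted))
```

A code with minimum distance d can recover from up to floor((d - 1) / 2) wrong bits, which is the voting-like behavior the abstract connects to variance reduction.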
Error Reduction through Learning Multiple Descriptions
, 1996
Cited by 114 (3 self)
Learning multiple descriptions for each class in the data has been shown to reduce generalization error but the amount of error reduction varies greatly from domain to domain. This paper presents a novel empirical analysis that helps to understand this variation. Our hypothesis is that the amount of error reduction is linked to the "degree to which the descriptions for a class make errors in a correlated manner." We present a precise and novel definition for this notion and use twenty-nine data sets to show that the amount of observed error reduction is negatively correlated with the degree to which the descriptions make errors in a correlated manner. We empirically show that it is possible to learn descriptions that make less correlated errors in domains in which many ties in the search evaluation measure (e.g. information gain) are experienced during learning. The paper also presents results that help to understand when and why multiple descriptions are a help (irrelevant attribute...
The Lack of A Priori Distinctions Between Learning Algorithms
, 1996
Cited by 103 (5 self)
This is the first of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. This first paper discusses the senses in which there are no a priori distinctions between learning algorithms. (The second paper discusses the senses in which there are such distinctions.) In this first paper it is shown, loosely speaking, that for any two algorithms A and B, there are "as many" targets (or priors over targets) for which A has lower expected OTS error than B as vice-versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is "anti-cross-validation" (choose the learning algorithm with largest cross-validation error). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if empirical misclassification rate is low; the Vapnik-Chervonenkis dimension of your generalizer is small; and the trainin...
An empirical evaluation of bagging and boosting
- In Proceedings of the Fourteenth National Conference on Artificial Intelligence
, 1997
Cited by 80 (6 self)
An ensemble consists of a set of independently trained classifiers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble as a whole is often more accurate than any of the single classifiers in the ensemble. Bagging (Breiman 1996a) and Boosting (Freund & Schapire 1996) are two relatively new but popular methods for producing ensembles. In this paper we evaluate these methods using both neural networks and decision trees as our classification algorithms. Our results clearly show two important facts. The first is that even though Bagging almost always produces a better classifier than any of its individual component classifiers and is relatively impervious to overfitting, it does not generalize any better than a baseline neural-network ensemble method. The second is that Boosting is a powerful technique that can usually produce better ensembles than Bagging; however, it is more susceptible to noise and can quickly overfit a data set.
Making Use of Population Information in Evolutionary Artificial Neural Networks
, 1998
Cited by 65 (22 self)
This paper is concerned with the simultaneous evolution of artificial neural network (ANN) architectures and weights. The current practice in evolving ANNs is to choose the best ANN in the last generation as the final result. This paper proposes a different approach to form the final result by combining all the individuals in the last generation in order to make best use of all the information contained in the whole population. This approach regards a population of ANNs as an ensemble and uses a combination method to integrate them. Although there has been some work on integrating ANN modules [2], [3], little has been done in evolutionary learning to make best use of its population information. Four linear combination methods have been investigated in this paper to illustrate our ideas. Three real world data sets have been used in our experimental studies, which show that the recursive least square (RLS) algorithm always produces an integrated system that outperforms the best individua...
Ensemble Learning using Decorrelated Neural Networks
- Connection Science
, 1996
Cited by 63 (0 self)
We describe a decorrelation network training method for improving the quality of regression learning in "ensemble" neural networks that are composed of linear combinations of individual neural networks. In this method, individual networks are trained by backpropagation to not only reproduce a desired output, but also to have their errors be linearly decorrelated with the other networks. Outputs from the individual networks are then linearly combined to produce the output of the ensemble network. We demonstrate the performance of decorrelated network training on learning the "3 Parity" logic function, a noisy sine function, and a one dimensional nonlinear function, and compare the results with the ensemble networks composed of independently trained individual networks (without decorrelation training). Empirical results show that when individual networks are forced to be decorrelated with one another the resulting ensemble neural networks have lower mean squared errors than the ensembl...
Diversity creation methods: A survey and categorisation
- Journal of Information Fusion
, 2005
Cited by 63 (18 self)
Ensemble approaches to classification and regression have attracted a great deal of interest in recent years. These methods can be shown both theoretically and empirically to outperform single predictors on a wide range of tasks. One of the elements required for accurate prediction when using an ensemble is recognised to be error “diversity”. However, the exact meaning of this concept is not clear from the literature, particularly for classification tasks. In this paper we first review the varied attempts to provide a formal explanation of error diversity, including several heuristic and qualitative explanations in the literature. For completeness of discussion we include not only the classification literature but also some excerpts of the rather more mature regression literature, which we believe can still provide some insights. We proceed to survey the various techniques used for creating diverse ensembles, and categorise them, forming a preliminary taxonomy of diversity creation methods. As part of this taxonomy we introduce the idea of implicit and explicit diversity creation methods, and three dimensions along which these may be applied. Finally we propose some new directions that may prove fruitful in understanding classification error diversity.

