A tutorial on learning with Bayesian networks
Learning in Graphical Models, 1995
Cited by 710 (4 self)
A companion set of lecture slides is available at
An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants
Machine Learning, 1999
Cited by 449 (2 self)
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer.
The purpose of the study is to improve our understanding of why and
when these algorithms, which use perturbation, reweighting, and
combination techniques, affect classification error. We provide a
bias and variance decomposition of the error to show how different
methods and variants influence these two terms. This allowed us to
determine that Bagging reduced variance of unstable methods, while
boosting methods (AdaBoost and Arc-x4) reduced both the bias and
variance of unstable methods but increased the variance for Naive-Bayes,
which was very stable. We observed that Arc-x4 behaves differently
than AdaBoost if reweighting is used instead of resampling,
indicating a fundamental difference. Voting variants, some of which
are introduced in this paper, include: pruning versus no pruning,
use of probabilistic estimates, weight perturbations (Wagging), and
backfitting of data. We found that Bagging improves when
probabilistic estimates in conjunction with no-pruning are used, as
well as when the data was backfit. We measure tree sizes and show
an interesting positive correlation between the increase in the
average tree size in AdaBoost trials and its success in reducing the
error. We compare the mean-squared error of voting methods to
non-voting methods and show that the voting methods lead to large
and significant reductions in the mean-squared errors. Practical
problems that arise in implementing boosting algorithms are
explored, including numerical instabilities and underflows. We use
scatterplots that graphically show how AdaBoost reweights instances,
emphasizing not only "hard" areas but also outliers and noise.
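As an illustration of the reweighting the abstract refers to, here is a minimal sketch of one textbook AdaBoost reweighting step. It is not taken from the paper: the function name `adaboost_reweight` is hypothetical, and doing the update in log space is only one assumed way of guarding against the underflow the abstract mentions.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One textbook AdaBoost reweighting step (illustrative sketch).

    weights: positive instance weights that sum to 1.
    misclassified: booleans, True where the current classifier erred.
    Returns (new_weights, alpha).
    """
    # Weighted training error of the current classifier.
    err = sum(w for w, m in zip(weights, misclassified) if m)
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")

    # Vote weight of the current classifier.
    alpha = 0.5 * math.log((1.0 - err) / err)

    # Update in log space (an assumption of this sketch): misclassified
    # instances are scaled up by exp(alpha), the rest down by exp(-alpha).
    log_w = [math.log(w) + (alpha if m else -alpha)
             for w, m in zip(weights, misclassified)]

    # Subtract the largest log-weight before exponentiating, then normalise.
    top = max(log_w)
    unnormalised = [math.exp(lw - top) for lw in log_w]
    total = sum(unnormalised)
    return [u / total for u in unnormalised], alpha
```

After this update the misclassified instances together carry half of the total weight, so instances that are repeatedly misclassified, including outliers and noisy ones, gain weight quickly.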
Classification by Pairwise Coupling
1998
Cited by 210 (0 self)
We discuss a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together. The coupling model is similar to the Bradley-Terry method for paired comparisons. We study the nature of the class probability estimates that arise, and examine the performance of the procedure in real and simulated datasets. Classifiers used include linear discriminants, nearest neighbors, adaptive nonlinear methods, and the support vector machine.
Department of Statistics, Sequoia Hall, Stanford University, Stanford, California 94305; trevor@playfair.stanford.edu
Department of Preventive Medicine and Biostatistics, and Department of Statistics; tibs@utstat.toronto.edu
1 Introduction
We consider the discrimination problem with K classes and N training observations. The training observations consist of predictor measurements $x = (x_1, x_2, \ldots, x_p)$ on $p$ predictors and the known class memberships. Our goal is...
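The coupling step named in the abstract can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes a Bradley-Terry-style model in which the fitted pairwise probability is p_i / (p_i + p_j), equal weight for every pair, and a simple fixed-point update. The function name `couple_pairwise` and its parameters are hypothetical.

```python
def couple_pairwise(r, max_sweeps=1000, tol=1e-12):
    """Combine pairwise estimates into one class-probability vector.

    r[i][j] is an estimate of P(class i | class i or class j) for i != j,
    with r[i][j] + r[j][i] == 1; diagonal entries are ignored.
    Returns a list p of class probabilities summing to 1.
    """
    k = len(r)
    p = [1.0 / k] * k  # start from the uniform distribution
    for _ in range(max_sweeps):
        previous = list(p)
        for i in range(k):
            others = [j for j in range(k) if j != i]
            # Pairwise evidence for class i, and what the current p predicts.
            observed = sum(r[i][j] for j in others)
            fitted = sum(p[i] / (p[i] + p[j]) for j in others)
            # Scale p[i] so the fitted total moves toward the observed one.
            p[i] *= observed / fitted
            total = sum(p)
            p = [value / total for value in p]
        # Stop once a full sweep no longer changes the estimates.
        if max(abs(a - b) for a, b in zip(p, previous)) < tol:
            break
    return p
```

When the pairwise estimates are exactly consistent with some probability vector, that vector is a fixed point of the update, which makes a convenient sanity check for the sketch.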
A Bayesian approach to learning Bayesian networks with local structure
In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997
Cited by 152 (13 self)
Recently several researchers have investigated techniques for using data to learn Bayesian networks containing compact representations for the conditional probability distributions (CPDs) stored at each node. The majority of this work has concentrated on using decision-tree representations for the CPDs. In addition, researchers typically apply non-Bayesian (or asymptotically Bayesian) scoring functions such as MDL to evaluate the goodness-of-fit of networks to the data. In this paper we investigate a Bayesian approach to learning Bayesian networks that contain the more general decision-graph representations of the CPDs. First, we describe how to evaluate the posterior probability (that is, the Bayesian score) of such a network, given a database of observed cases. Second, we describe various search spaces that can be used, in conjunction with a scoring function and a search procedure, to identify one or more high-scoring networks. Finally, we present an experimental evaluation of the search spaces, using a greedy algorithm and a Bayesian scoring function.
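To make the idea of a Bayesian score concrete, here is a minimal sketch of the standard Dirichlet-multinomial marginal likelihood. The abstract does not state the prior or the exact score, so the Dirichlet prior, the per-leaf decomposition, and the function names below are assumptions made only for illustration.

```python
from math import lgamma


def log_marginal_likelihood(counts, alphas):
    """Log probability of an observed sequence of discrete outcomes with
    the given per-state counts, under a Dirichlet(alphas) prior on the
    unknown multinomial parameters (assumed prior, not from the abstract).
    """
    n = sum(counts)
    a = sum(alphas)
    score = lgamma(a) - lgamma(a + n)
    for n_k, a_k in zip(counts, alphas):
        score += lgamma(a_k + n_k) - lgamma(a_k)
    return score


def log_score_of_cpd(leaf_counts, alphas):
    """Score one node's CPD as a sum over its leaves, where each leaf is a
    group of parent configurations that share a single distribution.

    leaf_counts: one list of child-state counts per leaf.
    """
    return sum(log_marginal_likelihood(counts, alphas) for counts in leaf_counts)
```

Under these assumptions, merging two leaves means adding their count vectors, so comparing the score before and after a merge shows whether the data support sharing one distribution.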
Diversity creation methods: A survey and categorisation
Journal of Information Fusion, 2005
Cited by 63 (18 self)
Ensemble approaches to classification and regression have attracted a great deal of interest in recent years. These methods can be shown both theoretically and empirically to outperform single predictors on a wide range of tasks. One of the elements required for accurate prediction when using an ensemble is recognised to be error “diversity”. However, the exact meaning of this concept is not clear from the literature, particularly for classification tasks. In this paper we first review the varied attempts to provide a formal explanation of error diversity, including several heuristic and qualitative explanations in the literature. For completeness of discussion we include not only the classification literature but also some excerpts of the rather more mature regression literature, which we believe can still provide some insights. We proceed to survey the various techniques used for creating diverse ensembles, and categorise them, forming a preliminary taxonomy of diversity creation methods. As part of this taxonomy we introduce the idea of implicit and explicit diversity creation methods, and three dimensions along which these may be applied. Finally we propose some new directions that may prove fruitful in understanding classification error diversity.
Data mining for hypertext: A tutorial survey
ACM SIGKDD Explorations, 2000
Cited by 61 (0 self)
With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of ...
Tree induction vs. logistic regression: A learning-curve analysis
CEDER Working Paper #IS-01-02, Stern School of Business, 2001
Cited by 50 (16 self)
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.
Bias, Variance and Prediction Error for Classification Rules
1996
Cited by 30 (1 self)
We study the notions of bias and variance for classification rules. Following Efron (1978) we develop a decomposition of prediction error into its natural components. Then we derive bootstrap estimates of these components and illustrate how they can be used to describe the error behaviour of a classifier in practice. In the process we also obtain a bootstrap estimate of the error of a "bagged" classifier.
Keywords: classification, prediction error, bias, variance, bootstrap
Addresses: tibs@utstat.toronto.edu; http://www.utstat.toronto.edu/~tibs
1 Introduction
This article concerns classification rules that have been constructed from a set of training data. The training set $X = (x_1, x_2, \cdots, x_n)$ consists of $n$ observations $x_i = (t_i, g_i)$, with $t_i$ being the predictor or feature vector and $g_i$ being the response, taking values in $\{1, 2, \ldots, K\}$. On the basis of $X$ the statistician constructs a classification rule $C(t, X)$. Our objective here is to unde...
Diversity in Neural Network Ensembles
2004
Cited by 30 (3 self)
We study the issue of error diversity in ensembles of neural networks. In ensembles of regression estimators, the measurement of diversity can be formalised as the Bias-Variance-Covariance decomposition. In ensembles of classifiers, there is no neat theory in the literature to date. Our objective is to understand how to precisely define, measure, and create diverse errors for both cases. As a focal point we study one algorithm, Negative Correlation (NC) Learning which claimed, and showed empirical evidence, to enforce useful error diversity, creating neural network ensembles with very competitive performance on both classification and regression problems. With the lack of a solid understanding of its dynamics, we engage in a theoretical and empirical investigation. In an initial empirical stage, we demonstrate the application of an evolutionary search algorithm to locate the optimal value for λ, the configurable parameter in NC. We observe the behaviour of the optimal parameter under different ensemble architectures and datasets; we note a high degree of unpredictability, and embark on a more formal investigation. During the theoretical investigations, we find that NC succeeds due to exploiting the

