Results 1 - 10 of 82
Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey
- Data Mining and Knowledge Discovery, 1997
"... Decision trees have proved to be valuable tools for the description, classification and generalization of data. Work on constructing decision trees from data exists in multiple disciplines such as statistics, pattern recognition, decision theory, signal processing, machine learning and artificial ne ..."
Abstract
-
Cited by 224 (1 self)
- Add to MetaCart
(Show Context)
Decision trees have proved to be valuable tools for the description, classification and generalization of data. Work on constructing decision trees from data exists in multiple disciplines such as statistics, pattern recognition, decision theory, signal processing, machine learning and artificial neural networks. Researchers in these disciplines, sometimes working on quite different problems, identified similar issues and heuristics for decision tree construction. This paper surveys existing work on decision tree construction, attempting to identify the important issues involved, directions the work has taken and the current state of the art.
Keywords: classification, tree-structured classifiers, data compaction.
1. Introduction. Advances in data collection methods, storage and processing technology are providing a unique challenge and opportunity for automated data exploration techniques. Enormous amounts of data are being collected daily from major scientific projects, e.g., the Human Genome...
Efficient Algorithms for Online Decision Problems
- J. Comput. Syst. Sci., 2003
"... In an online decision problem, one makes a sequence of decisions without knowledge of the future. Tools from learning such as Weighted Majority and its many variants [13, 18, 4] demonstrate that online algorithms can perform nearly as well as the best single decision chosen in hindsight, even when t ..."
Abstract
-
Cited by 192 (3 self)
- Add to MetaCart
In an online decision problem, one makes a sequence of decisions without knowledge of the future. Tools from learning such as Weighted Majority and its many variants [13, 18, 4] demonstrate that online algorithms can perform nearly as well as the best single decision chosen in hindsight, even when there are exponentially many possible decisions. However, the naive application of these algorithms is inefficient for such large problems. For some problems with nice structure, specialized efficient solutions have been developed [10, 16, 17, 6, 3].
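For orientation, a minimal sketch of the multiplicative-weights (Weighted Majority / Hedge) idea this abstract refers to; the learning rate eta, the exponential update, and the [0, 1] loss range are illustrative textbook choices, not details taken from the paper:

    import math

    def multiplicative_weights(expert_losses, eta=0.5):
        """Run a multiplicative-weights (Hedge-style) update over a loss sequence.

        expert_losses: list of rounds; each round is a list of losses in [0, 1],
        one per expert.  Returns the algorithm's total weighted-average loss.
        """
        n = len(expert_losses[0])
        weights = [1.0] * n
        total_loss = 0.0
        for losses in expert_losses:
            z = sum(weights)
            # Predict by following the current weighted average of the experts.
            total_loss += sum(w * l for w, l in zip(weights, losses)) / z
            # Multiplicatively shrink each expert's weight according to its loss.
            weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        return total_loss

For example, multiplicative_weights([[0, 1], [1, 0], [0, 1]]) ends up favoring the first expert, whose cumulative loss is smallest; staying close to the best single expert in hindsight is the guarantee the abstract alludes to.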
Universal prediction
- IEEE Transactions on Information Theory, 1998
"... This paper consists of an overview on universal prediction from an information-theoretic perspective. Special attention is given to the notion of probability assignment under the self-information loss function, which is directly related to the theory of universal data compression. Both the probabili ..."
Abstract
-
Cited by 176 (16 self)
- Add to MetaCart
This paper gives an overview of universal prediction from an information-theoretic perspective. Special attention is given to the notion of probability assignment under the self-information loss function, which is directly related to the theory of universal data compression. Both the probabilistic setting and the deterministic setting of the universal prediction problem are described, with emphasis on the analogy and the differences between results in the two settings.
Using and combining predictors that specialize
- In 29th STOC, 1997
"... Abstract. We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called “experts. ” These simple algorithms belong to the multiplicative weights family of algorithms. The performance of these algorithms degrades only loga ..."
Abstract
-
Cited by 116 (16 self)
- Add to MetaCart
(Show Context)
We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called “experts.” These simple algorithms belong to the multiplicative weights family of algorithms. The performance of these algorithms degrades only logarithmically with the number of experts, making them particularly useful in applications where the number of experts is very large. However, in applications such as text categorization, it is often natural for some of the experts to abstain from making predictions on some of the instances. We show how to transform algorithms that assume that all experts are always awake to algorithms that do not require this assumption. We also show how to derive corresponding loss bounds. Our method is very general, and can be applied to a large family of online learning algorithms. We also give applications to various prediction models including decision graphs and “switching” experts.
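One common way to present the transformation described here (the paper's own bookkeeping differs in detail): only the awake experts contribute to the prediction and only they are updated, with the awake pool rescaled so that sleeping experts keep their share of the weight. A minimal sketch, with eta and the [0, 1] loss range as illustrative assumptions:

    import math

    def sleeping_experts_round(weights, awake, losses, eta=0.5):
        """One round of a sleeping-experts (specialists) style update.

        weights: current weight of every expert (mutated in place);
        awake: indices of the experts that predict this round;
        losses: loss in [0, 1] of each awake expert, in the same order as awake.
        Asleep experts are untouched; awake experts are updated multiplicatively
        and rescaled so the awake pool keeps its total mass.
        Returns the algorithm's weighted-average loss for the round.
        """
        pool = sum(weights[i] for i in awake)
        # The prediction/loss is the weighted average over awake experts only.
        algo_loss = sum(weights[i] * l for i, l in zip(awake, losses)) / pool
        updated = {i: weights[i] * math.exp(-eta * l) for i, l in zip(awake, losses)}
        scale = pool / sum(updated.values())   # preserve the awake pool's total mass
        for i in awake:
            weights[i] = updated[i] * scale
        return algo_loss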
PAC-Bayesian Model Averaging
- In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999
"... PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Building on earlier methods for PAC-Bayesian model selection, this paper presents a method for PAC-Bayesian model averaging. The main result is a bound on generalization error of a ..."
Abstract
-
Cited by 99 (3 self)
- Add to MetaCart
(Show Context)
PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Building on earlier methods for PAC-Bayesian model selection, this paper presents a method for PAC-Bayesian model averaging. The main result is a bound on generalization error of an arbitrary weighted mixture of concepts that depends on the empirical error of that mixture and the KL-divergence of the mixture from the prior. A simple characterization is also given for the error bound achieved by the optimal weighting.
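For orientation, one commonly quoted PAC-Bayesian bound of this shape (the exact constants vary across statements of the result and may differ from the paper's): with probability at least 1 - \delta over an i.i.d. sample of size m, simultaneously for every posterior Q,

    \mathrm{err}(Q) \;\le\; \widehat{\mathrm{err}}(Q) \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m}{\delta} + 2}{2m - 1}},

where P is the prior, \widehat{\mathrm{err}}(Q) is the empirical error of the Q-weighted mixture, and \mathrm{KL} is the Kullback-Leibler divergence.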
Competitive on-line statistics
- International Statistical Review, 1999
"... A radically new approach to statistical modelling, which combines mathematical techniques of Bayesian statistics with the philosophy of the theory of competitive on-line algorithms, has arisen over the last decade in computer science (to a large degree, under the influence of Dawid’s prequential sta ..."
Abstract
-
Cited by 97 (15 self)
- Add to MetaCart
(Show Context)
A radically new approach to statistical modelling, which combines mathematical techniques of Bayesian statistics with the philosophy of the theory of competitive on-line algorithms, has arisen over the last decade in computer science (to a large degree, under the influence of Dawid’s prequential statistics). In this approach, which we call “competitive on-line statistics”, it is not assumed that data are generated by some stochastic mechanism; the bounds derived for the performance of competitive on-line statistical procedures are guaranteed to hold (and not just hold with high probability or on the average). This paper reviews some results in this area; the new material in it includes the proofs for the performance of the Aggregating Algorithm in the problem of linear regression with square loss. Keywords: Bayes’s rule, competitive on-line algorithms, linear regression, prequential statistics, worst-case analysis.
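As a concrete illustration of the kind of procedure analysed in the linear-regression part, a minimal sketch of a ridge-regression-style online forecaster for square loss (the Vovk-Azoury-Warmuth form; the Aggregating Algorithm treated in the paper differs in detail, and the regularisation parameter a below is an illustrative assumption):

    import numpy as np

    def online_ridge_predictions(X, y, a=1.0):
        """Predict y[t] from X[t] online, seeing (X[s], y[s]) only for s < t.

        Uses the forecaster  yhat_t = b_{t-1}^T A_t^{-1} x_t  with
        A_t = a*I + sum_{s<=t} x_s x_s^T  and  b_{t-1} = sum_{s<t} y_s x_s.
        """
        n, d = X.shape
        A = a * np.eye(d)
        b = np.zeros(d)
        preds = []
        for t in range(n):
            x = X[t]
            A += np.outer(x, x)                    # fold the current input into A_t
            preds.append(float(b @ np.linalg.solve(A, x)))
            b += y[t] * x                          # y[t] is consumed only after predicting
        return preds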
Tracking the Best Disjunction
- Machine Learning, 1995
"... . Littlestone developed a simple deterministic on-line learning algorithm for learning k-literal disjunctions. This algorithm (called Winnow) keeps one weight for each of the n variables and does multiplicative updates to its weights. We develop a randomized version of Winnow and prove bounds for a ..."
Abstract
-
Cited by 87 (13 self)
- Add to MetaCart
Littlestone developed a simple deterministic on-line learning algorithm for learning k-literal disjunctions. This algorithm (called Winnow) keeps one weight for each of the n variables and does multiplicative updates to its weights. We develop a randomized version of Winnow and prove bounds for an adaptation of the algorithm for the case when the disjunction may change over time. In this case a possible target disjunction schedule T is a sequence of disjunctions (one per trial) and the shift size is the total number of literals that are added/removed from the disjunctions as one progresses through the sequence. We develop an algorithm that predicts nearly as well as the best disjunction schedule for an arbitrary sequence of examples. This algorithm, which allows us to track the predictions of the best disjunction, is hardly more complex than the original version. However, the amortized analysis needed for obtaining worst-case mistake bounds requires new techniques. In some cases our low...
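A minimal sketch of the basic (deterministic, non-tracking) Winnow update that the abstract starts from; the threshold n and the promotion/demotion factor 2 are common textbook choices, not necessarily those used in the paper, and the randomized and shifting-target variants are not shown:

    def winnow(examples, labels):
        """Basic Winnow for learning a monotone disjunction over n boolean inputs.

        examples: sequence of 0/1 tuples of length n; labels: 0/1 targets.
        On a false negative the weights of the active variables are doubled,
        on a false positive they are halved.  Returns (weights, mistake count).
        """
        n = len(examples[0])
        w = [1.0] * n
        mistakes = 0
        for x, y in zip(examples, labels):
            # Predict 1 iff the total weight of the active variables reaches n.
            yhat = 1 if sum(wi for wi, xi in zip(w, x) if xi) >= n else 0
            if yhat != y:
                mistakes += 1
                factor = 2.0 if y == 1 else 0.5    # promote or demote active variables
                w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
        return w, mistakes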
PAC-Bayesian stochastic model selection
- Machine Learning, 2003
"... Abstract PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a "posterior distribution " on classifiers. This p ..."
Abstract
-
Cited by 76 (2 self)
- Add to MetaCart
(Show Context)
PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a "posterior distribution" on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.
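The Gibbs form of the optimizing posterior mentioned at the end can be written schematically as

    Q(c) \;\propto\; P(c)\,\exp\!\bigl(-\lambda\, m\, \widehat{\mathrm{err}}(c)\bigr),

where P is the prior, \widehat{\mathrm{err}}(c) is the training error of concept c on the m examples, and \lambda is an inverse-temperature parameter; in this sketch \lambda is left free, whereas the paper derives a specific choice from its bound.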
Path Kernels and Multiplicative Updates
- Journal of Machine Learning Research, 2003
"... Kernels are typically applied to linear algorithms whose weight vector is a linear combination of the feature vectors of the examples. On-line versions of these algorithms are sometimes called "additive updates" because they add a multiple of the last feature vector to the current weight ..."
Abstract
-
Cited by 75 (11 self)
- Add to MetaCart
Kernels are typically applied to linear algorithms whose weight vector is a linear combination of the feature vectors of the examples. On-line versions of these algorithms are sometimes called "additive updates" because they add a multiple of the last feature vector to the current weight vector. In this ...
The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant
"... This paper addresses the familiar problem of predicting with a linear classifier . The ..."
Abstract
-
Cited by 60 (8 self)
- Add to MetaCart
This paper addresses the familiar problem of predicting with a linear classifier. The ...