An introduction to boosting and leveraging - Advanced Lectures on Machine Learning, LNCS, 2003
Using and Combining Predictors That Specialize, 1997
Abstract - Cited by 76 (11 self)
We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called "experts." These simple algorithms belong to the multiplicative weights family of algorithms. The performance of these algorithms degrades only logarithmically with the number of experts, making them particularly useful in applications where the number of experts is very large. However, in applications such as text categorization, it is often natural for some of the experts to abstain from making predictions on some of the instances. We show how to transform algorithms that assume that all experts are always awake to algorithms that do not require this assumption. We also show how to derive corresponding loss bounds. Our method is very general, and can be applied to a large family of online learning algorithms. We also give applications to various prediction models including decision graphs and "switching" experts.
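The idea of experts that abstain can be illustrated with a small sketch. This is one common form of a "sleeping experts" step, not necessarily the exact transformation from the paper: the absolute loss, the fixed `beta`, and the function name `sleeping_experts_step` are all assumptions made for illustration. An abstaining expert is marked with `None`; only awake experts vote and are updated, and the awake weights are rescaled so that their total mass is unchanged.

```python
def sleeping_experts_step(weights, predictions, outcome, beta=0.5):
    """One trial of a multiplicative-weights update with abstaining experts.

    predictions[i] is a number in [0, 1], or None if expert i is asleep.
    Returns (forecast, updated_weights). Illustrative sketch only.
    """
    awake = [i for i, p in enumerate(predictions) if p is not None]
    if not awake:
        # Nobody is awake: no forecast, weights unchanged.
        return None, list(weights)

    awake_mass = sum(weights[i] for i in awake)
    # Forecast is the weighted average over awake experts only.
    forecast = sum(weights[i] * predictions[i] for i in awake) / awake_mass

    updated = list(weights)
    for i in awake:
        # Assumed loss: absolute loss |prediction - outcome|.
        updated[i] = weights[i] * beta ** abs(predictions[i] - outcome)

    # Rescale awake experts so their total weight is the same as before;
    # sleeping experts are left untouched.
    scale = awake_mass / sum(updated[i] for i in awake)
    for i in awake:
        updated[i] *= scale
    return forecast, updated
```

With weights `[1, 1, 1]` and predictions `[1.0, 0.0, None]`, the third expert neither votes nor changes weight, while the mass of the two awake experts is redistributed toward the one that was right.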
On-line portfolio selection using multiplicative updates - Mathematical Finance, 1998
Abstract - Cited by 67 (10 self)
We present an on-line investment algorithm which achieves almost the same wealth as the best constant-rebalanced portfolio determined in hindsight from the actual market outcomes. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth. Our algorithm is very simple to implement and requires only constant storage and computing time per stock in each trading period. We tested the performance of our algorithm on real stock data from the New York Stock Exchange accumulated during a 22-year period. On this data, our algorithm clearly outperforms the best single stock as well as Cover's universal portfolio selection algorithm. We also present results for the situation in which the ...
We present an on-line investment algorithm which achieves almost the same wealth as the best constant-rebalanced portfolio investment strategy. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth [20]. Our algorithm is very simple to implement and its time and storage requirements grow linearly in the number of stocks.
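A multiplicative portfolio update of the kind described can be sketched as follows. This is a minimal illustration in the exponentiated-gradient style, assuming each period supplies a vector of price relatives (closing price divided by opening price); the learning rate `eta` and the function names are assumptions, not values or identifiers from the paper.

```python
import math


def eg_update(weights, price_relatives, eta=0.05):
    """Shift portfolio weight toward stocks that did well this period.

    Each weight is multiplied by exp(eta * x_i / (w . x)) and the result
    is renormalized, so the work per stock is constant.
    """
    portfolio_return = sum(w * x for w, x in zip(weights, price_relatives))
    unnormalized = [
        w * math.exp(eta * x / portfolio_return)
        for w, x in zip(weights, price_relatives)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def run_portfolio(price_relatives_by_period, eta=0.05):
    """Start from the uniform portfolio and return (final wealth, final weights)."""
    n = len(price_relatives_by_period[0])
    weights = [1.0 / n] * n
    wealth = 1.0
    for x in price_relatives_by_period:
        # Wealth grows by the weighted average of this period's price relatives.
        wealth *= sum(w * xi for w, xi in zip(weights, x))
        weights = eg_update(weights, x, eta)
    return wealth, weights
```

The update touches each stock once per period, which matches the abstract's claim that time and storage grow linearly in the number of stocks.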
Tracking the Best Disjunction - Machine Learning, 1995
Abstract - Cited by 64 (11 self)
Littlestone developed a simple deterministic on-line learning algorithm for learning k-literal disjunctions. This algorithm (called Winnow) keeps one weight for each of the n variables and does multiplicative updates to its weights. We develop a randomized version of Winnow and prove bounds for an adaptation of the algorithm for the case when the disjunction may change over time. In this case a possible target disjunction schedule T is a sequence of disjunctions (one per trial) and the shift size is the total number of literals that are added/removed from the disjunctions as one progresses through the sequence. We develop an algorithm that predicts nearly as well as the best disjunction schedule for an arbitrary sequence of examples. This algorithm, which allows us to track the predictions of the best disjunction, is hardly more complex than the original version. However, the amortized analysis needed for obtaining worst-case mistake bounds requires new techniques. In some cases our low...
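For orientation, the deterministic Winnow algorithm that the abstract starts from can be sketched as below. This is the basic variant with promotion and demotion by a factor `alpha`, not the randomized or tracking version developed in the paper; the threshold choice (equal to n) and `alpha = 2` are common textbook settings assumed here.

```python
def winnow_predict(weights, x, threshold):
    """Predict 1 if the total weight of the active variables reaches the threshold."""
    score = sum(w for w, xi in zip(weights, x) if xi)
    return 1 if score >= threshold else 0


def winnow_train(examples, n, alpha=2.0):
    """Run Winnow over (x, label) pairs with x a 0/1 vector of length n.

    Returns (weights, threshold, mistakes). Weights change only on mistakes,
    and only for variables that are active in the misclassified example.
    """
    weights = [1.0] * n
    threshold = float(n)
    mistakes = 0
    for x, label in examples:
        if winnow_predict(weights, x, threshold) != label:
            mistakes += 1
            # Promote on a false negative, demote on a false positive.
            factor = alpha if label == 1 else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
    return weights, threshold, mistakes
```

Because the update is multiplicative, a relevant variable needs only about log n promotions to reach the threshold, which is the source of the logarithmic dependence on n in Winnow's mistake bounds.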
A New Approximate Maximal Margin Classification Algorithm - Journal of Machine Learning Research, 2001
Abstract - Cited by 60 (5 self)
A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm $p \ge 2$ for a set of linearly separable data. Our algorithm, called ALMA$_p$ (Approximate Large Margin algorithm w.r.t. norm $p$), takes $O\left(\frac{p-1}{\alpha^2 \gamma^2}\right)$ corrections to separate the data with $p$-norm margin larger than $(1-\alpha)\gamma$, where $\gamma$ is the (normalized) $p$-norm margin of the data. ALMA$_p$ avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's Perceptron algorithm. We performed extensive experiments on both real-world and artificial datasets. We compared ALMA$_2$ (i.e., ALMA$_p$ with $p = 2$) to standard Support Vector Machines (SVM) and to two incremental algorithms: the Perceptron algorithm and Li and Long's ROMMA. The accuracy levels achieved by ALMA$_2$ are superior to those achieved by the Perceptron algorithm and ROMMA, but slightly inferior to SVM's. On the other hand, ALMA$_2$ is much faster and easier to implement than standard SVM training algorithms. When learning sparse target vectors, ALMA$_p$ with $p > 2$ largely outperforms Perceptron-like algorithms, such as ALMA$_2$.
Tutorial on Practical Prediction Theory for Classification, 2005
Abstract - Cited by 59 (4 self)
We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and practically useful. There are two important implications...
Sequential Prediction of Individual Sequences Under General Loss Functions - IEEE Transactions on Information Theory, 1998
Abstract - Cited by 58 (7 self)
We consider adaptive sequential prediction of arbitrary binary sequences when the performance is evaluated using a general loss function. The goal is to predict on each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction strategies, called experts. By using a general loss function, we generalize previous work on universal prediction, forecasting, and data compression. However, here we restrict ourselves to the case when the comparison class is finite. For a given sequence, we define the regret as the total loss on the entire sequence suffered by the adaptive sequential predictor, minus the total loss suffered by the predictor in the comparison class that performs best on that particular sequence. We show that for a large class of loss functions, the minimax regret is either $\Theta(\log N)$ or $\Omega(\sqrt{\ell \log N})$, depending on the loss function, where N is the number of predictors in the comparison class a...
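The regret described in words above can be written out as a formula. The notation is assumed for illustration and is not taken from the paper: $y_1, \dots, y_\ell$ is the binary sequence, $\hat{y}_t$ is the adaptive predictor's prediction at trial $t$, $x_{t,i}$ is the prediction of expert $i$ at trial $t$, and $L$ is the loss function.

```latex
\[
  R(y_1, \dots, y_\ell)
  \;=\;
  \sum_{t=1}^{\ell} L(y_t, \hat{y}_t)
  \;-\;
  \min_{1 \le i \le N} \sum_{t=1}^{\ell} L(y_t, x_{t,i})
\]
```

The first sum is the total loss of the adaptive predictor, and the second is the total loss of the expert that turns out to be best on that particular sequence.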
Relative Loss Bounds for Multidimensional Regression Problems - Machine Learning, 2001
Abstract - Cited by 55 (11 self)
We study on-line generalized linear regression with multidimensional outputs, i.e., neural networks with multiple output nodes but no hidden nodes. We allow at the final layer transfer functions such as the softmax function that need to consider the linear activations to all the output neurons. The weight vectors used to produce the linear activations are represented indirectly by maintaining separate parameter vectors. We get the weight vector by applying a particular parameterization function to the parameter vector. Updating the parameter vectors upon seeing new examples is done additively, as in the usual gradient descent update. However, by using a nonlinear parameterization function between the parameter vectors and the weight vectors, we can make the resulting update of the weight vector quite different from a true gradient descent update. To analyse such updates, we define a notion of a matching loss function and apply it both to the transfer function and to the parameterization function. The loss function that matches the transfer function is used to measure the goodness of the predictions of the algorithm. The loss function that matches the parameterization function can be used both as a measure of divergence between models in motivating the update rule of the algorithm and as a measure of progress in analyzing its relative performance compared to an arbitrary fixed model. As a result, we have a unified treatment that generalizes earlier results for the gradient descent and exponentiated gradient algorithms to multidimensional outputs, including multiclass logistic regression.
Boosting as Entropy Projection, 1999
Abstract - Cited by 51 (8 self)
We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the present weak hypothesis. We show how AdaBoost's choice of the new distribution can be seen as an approximate solution to the following problem: Find a new distribution that is closest to the old distribution subject to the constraint that the new distribution is orthogonal to the vector of mistakes of the current weak hypothesis. The distance (or divergence) between distributions is measured by the relative entropy. Alternatively, we could say that AdaBoost approximately projects the distribution vector onto a hyperplane defined by the mistake vector. We show that this new view of AdaBoost as an entropy projection is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions.
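The orthogonality constraint can be checked numerically in the simplest setting. The sketch below assumes binary weak hypotheses with outputs in {-1, +1} and the standard AdaBoost coefficient; the function name and the encoding of the mistake vector as `margins[i] = y_i * h(x_i)` are assumptions for illustration. In this special case the reweighted distribution has zero inner product with the mistake vector, so the old hypothesis has weighted error exactly 1/2 under it.

```python
import math


def adaboost_reweight(dist, margins):
    """Standard AdaBoost distribution update for a {-1, +1}-valued hypothesis.

    dist:    current distribution over training examples (sums to 1)
    margins: margins[i] = y_i * h(x_i), i.e. -1 on a mistake, +1 otherwise
    Returns (new_dist, alpha).
    """
    eps = sum(d for d, m in zip(dist, margins) if m < 0)  # weighted error
    if not 0.0 < eps < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Mistakes are scaled up by exp(alpha), correct examples down by exp(-alpha).
    unnormalized = [d * math.exp(-alpha * m) for d, m in zip(dist, margins)]
    z = sum(unnormalized)  # normalization factor
    return [u / z for u in unnormalized], alpha
```

For a uniform distribution over four examples with one mistake, the misclassified example ends up with weight 1/2 and the three correct ones share the other half.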
Path Kernels and Multiplicative Updates - Journal of Machine Learning Research, 2003
Abstract - Cited by 51 (6 self)
Kernels are typically applied to linear algorithms whose weight vector is a linear combination of the feature vectors of the examples. On-line versions of these algorithms are sometimes called "additive updates" because they add a multiple of the last feature vector to the current weight vector.

