Results 1–10 of 12
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
, 1997
Abstract

Cited by 2307 (59 self)
In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case online framework. The model we study can be interpreted as a broad, abstract extension of the well-studied online prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update rule of Littlestone and Warmuth [20] can be adapted to this model yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games and prediction of points in R^n. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of...
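The multiplicative weight-update rule the abstract refers to can be sketched as follows. This is a minimal illustration of the Hedge-style allocator, not the paper's exact algorithm; the loss sequence and the parameter beta below are made-up examples.

```python
# Minimal sketch of a multiplicative weight-update (Hedge-style) allocator.
# Each round, resources are apportioned in proportion to current weights;
# each option then suffers a loss in [0, 1], and its weight is scaled by
# beta raised to that loss. The losses below are illustrative data only.

def hedge(loss_rounds, beta=0.5):
    n = len(loss_rounds[0])
    weights = [1.0] * n
    for losses in loss_rounds:
        total = sum(weights)
        allocation = [w / total for w in weights]  # fraction given to each option this round
        weights = [w * beta ** l for w, l in zip(weights, losses)]
    total = sum(weights)
    return [w / total for w in weights]

# Option 1 consistently loses less, so it ends up with most of the weight.
final = hedge([[1.0, 0.2], [0.8, 0.1], [0.9, 0.0]])
```

The cumulative loss of such an allocator is provably not much worse than that of the single best option in hindsight, which is the guarantee the abstract alludes to.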
How to Use Expert Advice
 Journal of the Association for Computing Machinery
, 1997
Abstract

Cited by 317 (66 self)
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
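The randomized expert-combining setting described above can be sketched as an exponentially weighted forecaster. The learning rate eta and the toy expert pool below are illustrative choices, not the paper's tuned constants.

```python
import math
import random

# Sketch of a randomized forecaster for binary prediction with expert advice.
# Experts that err are down-weighted exponentially; the forecaster predicts 1
# with probability equal to the weighted fraction of experts advising 1.
# The bit sequence, experts, and eta are hypothetical illustration data.

def predict_with_experts(bits, experts, eta=0.5, rng=None):
    rng = rng or random.Random(0)
    weights = [1.0] * len(experts)
    mistakes = 0
    for t, outcome in enumerate(bits):
        advice = [e(t) for e in experts]
        total = sum(weights)
        p_one = sum(w for w, a in zip(weights, advice) if a == 1) / total
        prediction = 1 if rng.random() < p_one else 0
        mistakes += prediction != outcome
        # exponential down-weighting of experts that erred this round
        weights = [w * math.exp(-eta * (a != outcome)) for w, a in zip(weights, advice)]
    return mistakes

bits = [1, 1, 0, 1, 1, 0, 1, 1]
experts = [lambda t: 1,        # always predicts 1 (best expert here: 2 mistakes)
           lambda t: 0,        # always predicts 0
           lambda t: t % 2]    # alternates
m = predict_with_experts(bits, experts)
```

The paper's bounds say the expected number of mistakes of such a forecaster exceeds that of the best expert by roughly the square root of the best expert's mistake count.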
Exponentiated Gradient Versus Gradient Descent for Linear Predictors
 Information and Computation
, 1995
Abstract

Cited by 247 (12 self)
this paper, we concentrate on linear predictors. To any vector u ∈ R
Regret in the Online Decision Problem
, 1999
Abstract

Cited by 115 (2 self)
At each point in time a decision maker must choose a decision. The payoff in a period from the decision chosen depends on the decision as well as the state of the world that obtains at that time. The difficulty is that the decision must be made in advance of any knowledge, even probabilistic, about which state of the world will obtain. A range of problems from a variety of disciplines can be framed in this way. In this...
Predicting Nearly as Well as the Best Pruning of a Decision Tree
 Machine Learning
, 1995
Abstract

Cited by 71 (5 self)
Many algorithms for inferring a decision tree from data involve a two-phase process: first, a very large decision tree is grown, which typically ends up "overfitting" the data. To reduce overfitting, in the second phase the tree is pruned using one of a number of available methods. The final tree is then output and used for classification on test data. In this paper, we suggest an alternative approach to the pruning phase. Using a given unpruned decision tree, we present a new method of making predictions on test data, and we prove that our algorithm's performance will not be "much worse" (in a precise technical sense) than the predictions made by the best reasonably small pruning of the given decision tree. Thus, our procedure is guaranteed to be competitive (in terms of the quality of its predictions) with any pruning algorithm. We prove that our procedure is very efficient and highly robust. Our method can be viewed as a synthesis of two previously studied techniques. First, we ...
Online Prediction and Conversion Strategies
 Machine Learning
, 1994
Abstract

Cited by 50 (18 self)
We study the problem of deterministically predicting boolean values by combining the boolean predictions...
Analysis of two gradient-based algorithms for online regression
 Journal of Computer and System Sciences
, 1999
Abstract

Cited by 40 (5 self)
In this paper we present a new analysis of two algorithms, Gradient Descent and Exponentiated Gradient, for solving regression problems in the online framework. Both these algorithms compute a prediction that depends linearly on the current instance, and then update the coefficients of this linear combination according to the gradient of the loss function. However, the two algorithms have distinctive ways of using the gradient information for updating the coefficients. For each algorithm, we show general regression bounds for any convex loss function. Furthermore, we show special bounds for the absolute and the square loss functions, thus extending previous results by Kivinen and Warmuth. In the nonlinear regression case, we show general bounds for pairs of transfer and loss functions satisfying a certain condition. We apply this result to the Hellinger loss and the entropic loss in the case of logistic regression (similar results, but only for the entropic loss, were also obtained by Helmbold et al. using a different analysis). Finally, we describe the connection between our approach and a general family of gradient-based algorithms proposed by Warmuth et al. in recent works.
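The contrast between the two update rules can be sketched side by side: both predict with a linear combination of the instance, but Gradient Descent updates additively along the negative gradient while Exponentiated Gradient updates multiplicatively. The square loss, the learning rate, and the toy data below are illustrative choices, not the authors' analysis.

```python
import math

# Gradient Descent (GD) vs Exponentiated Gradient (EG) for online linear
# regression under the square loss (y_hat - y)^2. GD adds the negative
# gradient to the weights; EG scales each weight by exp(-eta * g * x_i)
# and renormalizes, keeping the weights positive and summing to one.

def gd_step(w, x, y, eta=0.1):
    y_hat = sum(wi * xi for wi, xi in zip(w, x))
    g = 2 * (y_hat - y)                        # derivative of the square loss
    return [wi - eta * g * xi for wi, xi in zip(w, x)]

def eg_step(w, x, y, eta=0.1):
    y_hat = sum(wi * xi for wi, xi in zip(w, x))
    g = 2 * (y_hat - y)
    unnorm = [wi * math.exp(-eta * g * xi) for wi, xi in zip(w, x)]
    z = sum(unnorm)
    return [u / z for u in unnorm]             # EG renormalizes to the simplex

w_gd = [0.5, 0.5]
w_eg = [0.5, 0.5]
# Toy stream: the target depends only on the first coordinate.
for x, y in [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)] * 20:
    w_gd = gd_step(w_gd, x, y)
    w_eg = eg_step(w_eg, x, y)
```

Both learners shift weight toward the first coordinate; the differing geometry of the two updates is what yields the differing regression bounds discussed in the abstract.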
Tracking the Best Regressor
 In Proc. 11th Annu. Conf. on Comput. Learning Theory
, 1998
Abstract

Cited by 18 (6 self)
In most of the online learning research the total online loss of the algorithm is compared to the total loss of the best offline predictor u from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the comparison vector u_t at each trial t is allowed to change with time, and the total online loss of the algorithm is compared to the sum of the losses of u_t at each trial plus the total "cost" for shifting to successive comparison vectors. This is to model situations in which the examples change over time and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. Shifting bounds still hold for arbitrary sequences of examples and also for arbitrary partitions. The algorithm does not know the offline partition and the sequence of predictors that i...
On Bayes Methods for Online Boolean Prediction
, 1997
Abstract

Cited by 7 (0 self)
This paper proposes a general framework, based on weighting schemes, within which the Bayes method applied to online Boolean prediction can be studied. By applying standard tools in Bayes theory we propose an improved variant of the Weighted Majority algorithm for deterministic prediction. The mistake bound of our variant is asymptotically equal to the mistake bound of Weighted Majority when the latter has additional side information to optimally tune its update factor. We also show general bounds on the number of prediction mistakes made by conservative versions of Bayesian algorithms. Specific instances of our bounds match bounds previously shown for different online prediction algorithms proposed in the past. Finally, we study a generalization of these methods to randomized predictions.
A Dynamic Disk Spin-Down Technique for Mobile Computing
, 1997
Abstract
This thesis addresses the problem of deciding when to spin down the disk of a mobile computer in order to extend battery life. Since one of the most critical resources in mobile computing environments is battery life, good energy conservation methods can dramatically increase the utility of mobile systems. A simple and efficient algorithm based on machine learning techniques is applied to the disk spin-down problem. Experimental results are based on traces collected from HP C2474s disks. Using this data, the algorithm outperforms several algorithms that are theoretically optimal under various worst-case assumptions, other adaptive algorithms, as well as the best fixed timeout strategy. In particular, the algorithm reduces the power consumption of the disk to about half (depending on the disk's properties) of the energy consumed by a one-minute fixed timeout. Since the algorithm adapts to usage patterns, it uses as little as 88% of the energy consumed by the best fixed timeout comput...
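The flavor of machine-learning technique the thesis applies, treating each candidate fixed timeout as an "expert" and down-weighting it by the energy it would have cost, can be sketched as follows. The energy model (idle power, spin-up cost), the candidate timeouts, and the parameter beta are all hypothetical; the thesis's trace-driven algorithm and the real HP disk constants are not reproduced here.

```python
# Sketch: pick a disk spin-down timeout adaptively by weighting a pool of
# fixed-timeout "experts" multiplicatively by the energy each would have
# spent on the last idle period. All constants below are made up.

IDLE_POWER = 1.0    # energy per second spent spinning while idle (hypothetical)
SPINUP_COST = 4.0   # fixed energy cost of spinning the disk back up (hypothetical)

def energy(timeout, idle):
    """Energy a fixed `timeout` policy spends on one idle period of `idle` seconds."""
    if idle <= timeout:
        return idle * IDLE_POWER               # disk never spun down
    return timeout * IDLE_POWER + SPINUP_COST  # spun down after `timeout`, then spun up

def adaptive_timeout(idle_periods, candidates=(1, 5, 30, 60), beta=0.8):
    weights = [1.0] * len(candidates)
    for idle in idle_periods:
        worst = max(energy(c, idle) for c in candidates)
        # scale each candidate's weight by beta^(its normalized energy cost)
        weights = [w * beta ** (energy(c, idle) / worst)
                   for w, c in zip(weights, candidates)]
    total = sum(weights)
    # predict with the weighted average of the candidate timeouts
    return sum(w * c for w, c in zip(weights, candidates)) / total

# A stream of mostly long idle periods drives the mixture toward short timeouts.
t = adaptive_timeout([120, 90, 200, 150] * 10)
```

Because the mixture tracks whichever fixed timeout is currently cheapest, it can beat every single fixed timeout on usage patterns that change over time, which is the effect the experimental results above describe.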