Results 1 – 3 of 3
How to Use Expert Advice
 Journal of the Association for Computing Machinery, 1997
Abstract

Cited by 317 (65 self)
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
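The expert-advice setting the abstract describes can be illustrated with a randomized weighted-majority sketch. The function name, the fixed learning rate `eta`, and the data layout below are our own illustrative choices; the paper's algorithms tune the update more carefully to obtain the matching leading constants it mentions.

```python
import math

def weighted_majority(expert_preds, labels, eta=0.5):
    """Randomized weighted-majority sketch: each round, predict 1 with
    probability equal to the weighted fraction of experts predicting 1,
    then shrink (by exp(-eta)) the weight of every expert that erred.
    expert_preds[t] is the list of expert predictions at round t."""
    n = len(expert_preds[0])          # number of experts
    weights = [1.0] * n
    expected_mistakes = 0.0
    for preds, y in zip(expert_preds, labels):
        total = sum(weights)
        p_one = sum(w for w, p in zip(weights, preds) if p == 1) / total
        # probability the randomized prediction is wrong this round
        expected_mistakes += p_one if y == 0 else 1.0 - p_one
        # multiplicative update on the experts that erred
        weights = [w * (math.exp(-eta) if p != y else 1.0)
                   for w, p in zip(weights, preds)]
    return expected_mistakes, weights
```

Run on a short sequence with one always-correct expert, the algorithm's expected mistake count stays close to that best expert's zero mistakes, and the weights concentrate on it.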
How Well do Bayes Methods Work for On-Line Prediction of {±1} Values?
 In Proceedings of the Third NEC Symposium on Computation and Cognition. SIAM, 1992
Abstract

Cited by 18 (11 self)
We look at sequential classification and regression problems in which {±1}-labeled instances are given online, one at a time, and for each new instance, before seeing the label, the learning system must either predict the label or estimate the probability that the label is +1. We look at the performance of Bayes methods for this task, as measured by the total number of mistakes for the classification problem, and by the total log loss (or information gain) for the regression problem. Our results are given by comparing the performance of Bayes methods to the performance of a hypothetical "omniscient scientist" who is able to use extra information about the labeling process that would not be available in the standard learning protocol. The results show that Bayes methods perform only slightly worse than the omniscient scientist in many cases. These results generalize previous results of Haussler, Kearns and Schapire, and Opper and Haussler.
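A minimal sketch of the Bayes-mixture protocol described here, assuming a finite hypothesis class under a uniform prior (the function name and data layout are illustrative, not taken from the paper):

```python
import math

def bayes_mixture(hypotheses, labels):
    """Bayes-mixture sketch: hypotheses[i][t] is hypothesis i's P(label=+1)
    at round t. Before each label, output the posterior-weighted mixture
    probability, accumulate its log loss (base 2), then apply the Bayes
    update: reweight each hypothesis by its likelihood of the observed label."""
    n = len(hypotheses)
    posterior = [1.0 / n] * n
    log_loss = 0.0
    for t, y in enumerate(labels):
        p = sum(w * h[t] for w, h in zip(posterior, hypotheses))
        log_loss -= math.log2(p if y == 1 else 1.0 - p)
        likes = [w * (h[t] if y == 1 else 1.0 - h[t])
                 for w, h in zip(posterior, hypotheses)]
        z = sum(likes)
        posterior = [l / z for l in likes]
    return log_loss, posterior
```

With one well-calibrated hypothesis in the class, the posterior concentrates on it and the cumulative log loss beats the 1-bit-per-round loss of always predicting 1/2.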
Combining model selection procedures for online prediction
 Sankhya A, 2001
Abstract

Cited by 3 (2 self)
SUMMARY. Here we give a technique for online prediction that uses different model selection principles (MSPs) at different times. The central idea is that each MSP is associated with a collection of models for which it is best suited. This means one can use the data to choose an MSP. Then, the chosen MSP is used with the data to choose a model, and the parameters of the model are estimated so that predictions can be made. Depending on the degree of discrepancy between the predicted values and the actual outcomes, one may update the parameters within a model, reuse the MSP to rechoose the model and estimate its parameters, or start over by rechoosing the MSP. Our main formal result is a theorem giving conditions under which our technique performs better than always using the same MSP. We also discuss circumstances under which dropping data points may lead to better predictions.
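The switching idea can be sketched as follows. Here an "MSP" is idealized as any function mapping a data prefix to a fitted predictor, and the discrepancy measure is cumulative squared error; both are simplifying assumptions of this sketch, not the paper's formal setup.

```python
def switch_msp(msps, xs, ys):
    """Sketch of MSP switching: before each outcome, refit a predictor with
    every MSP on the data seen so far, predict with the MSP whose past
    cumulative squared error is smallest, then update every MSP's tally."""
    errors = [0.0] * len(msps)
    preds = []
    for t in range(1, len(xs)):
        fits = [msp(xs[:t], ys[:t]) for msp in msps]
        best = min(range(len(msps)), key=lambda i: errors[i])
        preds.append(fits[best](xs[t]))
        for i, f in enumerate(fits):
            errors[i] += (f(xs[t]) - ys[t]) ** 2
    return preds, errors
```

With two toy MSPs, say a prefix-mean predictor and a last-value predictor, the running tally lets the data favor whichever principle tracks the sequence better, which is the spirit of choosing the MSP from the data.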