Results 1 - 6 of 6
How to Use Expert Advice
Journal of the Association for Computing Machinery, 1997
Cited by 317 (66 self)
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
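The idea of combining expert predictions can be illustrated with a minimal sketch. This is a standard exponential-weights update, not necessarily the algorithm analyzed in the paper; the function name `exp_weights_predict`, the learning rate `eta`, and the data layout are assumptions made for illustration. The forecaster predicts 1 with probability equal to the weighted average of the experts' predictions, so its expected number of mistakes on a trial with outcome `y` is `|p - y|`.

```python
import math


def exp_weights_predict(expert_preds, outcomes, eta=0.5):
    """Run an exponentially weighted forecaster over a bit sequence.

    expert_preds[t][i] is expert i's prediction in [0, 1] at trial t
    (hypothetical layout); outcomes[t] is the true bit (0 or 1).
    Returns (expected_mistakes, expert_losses).
    """
    n_experts = len(expert_preds[0])
    weights = [1.0] * n_experts          # start with uniform weights
    expert_losses = [0.0] * n_experts
    expected_mistakes = 0.0

    for preds, y in zip(expert_preds, outcomes):
        total = sum(weights)
        # Probability of predicting 1: weighted average of expert predictions.
        p = sum(w * x for w, x in zip(weights, preds)) / total
        expected_mistakes += abs(p - y)

        # Penalize each expert multiplicatively according to its loss.
        for i, x in enumerate(preds):
            loss = abs(x - y)
            expert_losses[i] += loss
            weights[i] *= math.exp(-eta * loss)

    return expected_mistakes, expert_losses
```

For a fixed `eta`, this update satisfies the well-known bound `(eta * L_best + ln N) / (1 - exp(-eta))` on the expected number of mistakes, where `L_best` is the loss of the best of `N` experts. Tuning `eta` against `L_best` is what produces a regret on the order of the square root of `L_best`.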
Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk
Annals of Statistics, 1996
Cited by 39 (2 self)
Assume $\{P_\theta : \theta \in \Theta\}$ is a set of probability distributions with a common dominating measure on a complete separable metric space $Y$. A state $\theta \in \Theta$ is chosen by Nature. A statistician gets $n$ independent observations $Y_1, \ldots, Y_n$ from $Y$ distributed according to $P_\theta$. For each time $t$ between 1 and $n$, based on the observations $Y_1, \ldots, Y_{t-1}$, the statistician produces an estimated distribution $P_t$ for $P_\theta$, and suffers a loss $L(P_\theta, P_t)$. The cumulative risk for the statistician is the average total loss up to time $n$. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss $L(P_\theta, P_t)$ is the relative entropy between the true distribution $P_\theta$ and the estimated distribution $P_t$. Here the cumulative Bayes risk from time 1 to $n$ is the mutual information between the random parameter $\Theta$ and the observations $Y_1, \ldots,$ ...
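The identity stated at the end of this abstract can be written out as a short derivation. This is a sketch under assumptions not taken from the listing: a prior on $\Theta$ is fixed, $P_t$ is taken to be the Bayes predictive distribution, and $Y^{t-1}$ abbreviates $Y_1, \ldots, Y_{t-1}$.

```latex
% Assumption: P_t is the posterior predictive distribution given Y^{t-1}.
\begin{align*}
  P_t(\cdot) &= \mathbb{E}\bigl[\, P_\Theta(\cdot) \mid Y^{t-1} \bigr], \\
  \mathbb{E}\bigl[\, D(P_\Theta \,\|\, P_t) \bigr]
    &= I(\Theta ; Y_t \mid Y^{t-1})
    && \text{(observations are independent given } \Theta\text{)}, \\
  \sum_{t=1}^{n} \mathbb{E}\bigl[\, D(P_\Theta \,\|\, P_t) \bigr]
    &= \sum_{t=1}^{n} I(\Theta ; Y_t \mid Y^{t-1})
     = I(\Theta ; Y_1, \ldots, Y_n)
    && \text{(chain rule for mutual information)}.
\end{align*}
```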
How Well do Bayes Methods Work for On-Line Prediction of {±1} values?
In Proceedings of the Third NEC Symposium on Computation and Cognition. SIAM, 1992
Cited by 18 (11 self)
We look at sequential classification and regression problems in which $\{\pm 1\}$-labeled instances are given online, one at a time, and for each new instance, before seeing the label, the learning system must either predict the label, or estimate the probability that the label is +1. We look at the performance of Bayes method for this task, as measured by the total number of mistakes for the classification problem, and by the total log loss (or information gain) for the regression problem. Our results are given by comparing the performance of Bayes method to the performance of a hypothetical "omniscient scientist" who is able to use extra information about the labeling process that would not be available in the standard learning protocol. The results show that Bayes methods perform only slightly worse than the omniscient scientist in many cases. These results generalize previous results of Haussler, Kearns and Schapire, and Opper and Haussler. 1 Introduction Several recent papers in...
Mutual Information, Metric Entropy, and Risk in Estimation of Probability Distributions
1996
Cited by 3 (2 self)
Assume $\{P_\theta : \theta \in \Theta\}$ is a set of probability distributions with a common dominating measure on a complete separable metric space $Y$. A state $\theta \in \Theta$ is chosen by Nature. A statistician gets $n$ independent observations $Y_1, \ldots, Y_n$ from $Y$ distributed according to $P_\theta$. For each time $t$ between 1 and $n$, based on the observations $Y_1, \ldots, Y_{t-1}$, the statistician produces an estimated distribution $P_t$ for $P_\theta$, and suffers a loss $L(P_\theta, P_t)$. The cumulative risk for the statistician is the average total loss up to time $n$. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss $L(P_\theta, P_t)$ is the relative entropy between the true distribution $P_\theta$ and the estimated distribution $P_t$. Here the cumulative Bayes risk from time 1 to $n$ is the mutual information between the random parameter $\Theta$ and the observations $Y_1, \ldots,$ ...
Generalization errors of the simple perceptron
Journal of Physics A, 1998
Cited by 3 (3 self)
Abstract. To find an exact form for the generalization error of a learning machine is an open
Estimating Exact Form of Generalisation Errors
Abstract. A novel approach to estimating the worst-case generalisation errors of the simple perceptron is introduced. It is well known that the generalisation error of the simple perceptron is of the form $d/t$ with an unknown constant $d$ which depends only on the dimension of inputs, where $t$ is the number of learned examples. Based upon extreme value theory in statistics we obtain an exact form of the generalisation error of the simple perceptron. The method introduced in this paper opens up new possibilities to consider generalisation errors of a class of neural networks.