Results 1  10
of
17
How to Use Expert Advice
 JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY
, 1997
"... We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worstcase situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the ..."
Abstract

Cited by 317 (66 self)
 Add to MetaCart
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worstcase situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show howthis leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
The minimum description length principle in coding and modeling
 IEEE Trans. Inform. Theory
, 1998
"... Abstract — We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon’s basic source coding theorem. The normalized m ..."
Abstract

Cited by 305 (12 self)
 Add to MetaCart
Abstract — We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon’s basic source coding theorem. The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms. We assess the performance of the minimum description length criterion both from the vantage point of quality of data compression and accuracy of statistical inference. Context tree modeling, density estimation, and model selection in Gaussian linear regression serve as examples. Index Terms—Complexity, compression, estimation, inference, universal modeling.
Sequential Prediction of Individual Sequences Under General Loss Functions
 IEEE Transactions on Information Theory
, 1998
"... We consider adaptive sequential prediction of arbitrary binary sequences when the performance is evaluated using a general loss function. The goal is to predict on each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction st ..."
Abstract

Cited by 75 (7 self)
 Add to MetaCart
We consider adaptive sequential prediction of arbitrary binary sequences when the performance is evaluated using a general loss function. The goal is to predict on each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction strategies, called experts. By using a general loss function, we generalize previous work on universal prediction, forecasting, and data compression. However, here we restrict ourselves to the case when the comparison class is finite. For a given sequence, we define the regret as the total loss on the entire sequence suffered by the adaptive sequential predictor, minus the total loss suffered by the predictor in the comparison class that performs best on that particular sequence. We show that for a large class of loss functions, the minimax regret is either \Theta(log N) or \Omega\Gamma p ` log N ), depending on the loss function, where N is the number of predictors in the comparison class a...
Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk
 Annals of Statistics
, 1996
"... Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time ..."
Abstract

Cited by 39 (2 self)
 Add to MetaCart
Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time t between 1 and n, based on the observations Y 1 ; : : : ; Y t\Gamma1 , the statistician produces an estimated distribution P t for P ` , and suffers a loss L(P ` ; P t ). The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss L(P ` ; P t ) is the relative entropy between the true distribution P ` and the estimated distribution P t . Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter \Theta and the observations Y 1 ; : : : ;...
A General Minimax Result for Relative Entropy
 IEEE Trans. Inform. Theory
, 1996
"... : Suppose Nature picks a probability measure P ` on a complete separable metric space X at random from a measurable set P \Theta = fP ` : ` 2 \Thetag. Then, without knowing `, a statistician picks a measure Q on X. Finally, the statistician suffers a loss D(P ` jjQ), the relative entropy between P ..."
Abstract

Cited by 35 (2 self)
 Add to MetaCart
: Suppose Nature picks a probability measure P ` on a complete separable metric space X at random from a measurable set P \Theta = fP ` : ` 2 \Thetag. Then, without knowing `, a statistician picks a measure Q on X. Finally, the statistician suffers a loss D(P ` jjQ), the relative entropy between P ` and Q. We show that the minimax and maximin values of this game are always equal, and there is always a minimax strategy in the closure of the set of all Bayes strategies. This generalizes previous results of Gallager, and Davisson and LeonGarcia. Index terms: minimax theorem, minimax redundancy, minimax risk, Bayes risk, relative entropy, KullbackLeibler divergence, density estimation, source coding, channel capacity, computational learning theory 1 Introduction Consider a sequential estimation game in which a statistician is given n independent observations Y 1 ; : : : ; Yn distributed according to an unknown distribution ~ P ` chosen at random by Nature from the set f ~ P ` : ` 2 \...
Convergence and Loss Bounds for Bayesian Sequence Prediction
 In
, 2003
"... The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t1}$ can be computed with Bayes rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong to a class $M$ one can base ones prediction on the Baye ..."
Abstract

Cited by 22 (21 self)
 Add to MetaCart
The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t1}$ can be computed with Bayes rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong to a class $M$ one can base ones prediction on the Bayes mix $\xi$ defined as a weighted sum of distributions $ u\in M$. Various convergence results of the mixture posterior $\xi_t$ to the true posterior $\mu_t$ are presented. In particular a new (elementary) derivation of the convergence $\xi_t/\mu_t\to 1$ is provided, which additionally gives the rate of convergence. A general sequence predictor is allowed to choose an action $y_t$ based on $x_1...x_{t1}$ and receives loss $\ell_{x_t y_t}$ if $x_t$ is the next symbol of the sequence. No assumptions are made on the structure of $\ell$ (apart from being bounded) and $M$. The Bayesoptimal prediction scheme $\Lambda_\xi$ based on mixture $\xi$ and the Bayesoptimal informed prediction scheme $\Lambda_\mu$ are defined and the total loss $L_\xi$ of $\Lambda_\xi$ is bounded in terms of the total loss $L_\mu$ of $\Lambda_\mu$. It is shown that $L_\xi$ is bounded for bounded $L_\mu$ and $L_\xi/L_\mu\to 1$ for $L_\mu\to \infty$. Convergence of the instantaneous losses is also proven.
Adaptive Mixtures of Probabilistic Transducers
 Neural Computation
, 1996
"... We describe and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each probabilistic transducer in the mixture. Theoretical analysis and comparative simulations i ..."
Abstract

Cited by 19 (3 self)
 Add to MetaCart
We describe and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each probabilistic transducer in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best transducer from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer. 1 Introduction Supervised learning of probabilistic mappings between temporal sequences is an important goal of natural data analysis and classification with a broad range of applications, including handwriting and speech recognition, natural language processing and biological sequence analysis. Research efforts in supervised learning of probabilistic mappings have been almost exclusively focused on estimating the parameters of a predefined model. For example, Giles et al. (1992) used a...
Bounds for Predictive Errors in the Statistical Mechanics of Supervised Learning
 Physical Review letters
, 1995
"... Within a Bayesian framework, by generalizing inequalities known from statistical mechanics, we calculate general upper and lower bounds for a cumulative entropic error, which measures the success in the supervised learning of an unknown rule from examples. This performance measure is equivalent to t ..."
Abstract

Cited by 16 (3 self)
 Add to MetaCart
Within a Bayesian framework, by generalizing inequalities known from statistical mechanics, we calculate general upper and lower bounds for a cumulative entropic error, which measures the success in the supervised learning of an unknown rule from examples. This performance measure is equivalent to the mutual information between the data and the parameter that specifies the rule to be learnt. Both bounds match asymptotically, when the number m of observed data grows large. Under mild conditions, we find that the information gain from observing a new example decreases universally like d=m. Here d is a dimension that is defined from the scaling of small volumes with respect to a suitable distance in the space of rules. PACS numbers: 87.10, 05.90 Understanding a neural network's ability to infer an unknown rule from a set of examples has become a fascinating topic in Statistical Mechanics. Using techniques developed in the physics of disordered systems, exact learning curves, which measu...
General bounds on the mutual information between a parameter and n conditionally independent observations
 In Proceedings of the Seventh Annual ACM Workshop on Computational Learning Theory
, 1995
"... Each parameter in an abstract parameter space is associated with a di erent probability distribution on a set Y. A parameter is chosen at random from according to some a priori distribution on, and n conditionally independent random variables Y n = Y1�:::Yn are observed with common distribution dete ..."
Abstract

Cited by 15 (5 self)
 Add to MetaCart
Each parameter in an abstract parameter space is associated with a di erent probability distribution on a set Y. A parameter is chosen at random from according to some a priori distribution on, and n conditionally independent random variables Y n = Y1�:::Yn are observed with common distribution determined by. We obtain bounds on the mutual information between the random variable, giving the choice of parameter, and the random variable Y n, giving the sequence of observations. We also bound the supremum of the mutual information, over choices of the prior distribution on. These quantities have applications in density estimation, computational learning theory, universal coding, hypothesis testing, and portfolio selection theory. The bounds are given in terms of the metric and information dimensions of the parameter space with respect to the Hellinger distance. 1
Improved Lower Bounds for Learning from Noisy Examples: an InformationTheoretic Approach
 Proc eedings of the 11th Annual Conference on Computational Learning Theory
, 1998
"... This paper presents a general informationtheoretic approach for obtaining lower bounds on the number of examples needed to PAC learn in the presence of noise. This approach deals directly with the fundamental information quantities, avoiding a Bayesian analysis. The technique is applied to severa ..."
Abstract

Cited by 11 (0 self)
 Add to MetaCart
This paper presents a general informationtheoretic approach for obtaining lower bounds on the number of examples needed to PAC learn in the presence of noise. This approach deals directly with the fundamental information quantities, avoiding a Bayesian analysis. The technique is applied to several different models, illustrating its generality and power. The resulting bounds add logarithmic factors to (or improve the constants in) previously known lower bounds. 1