How to Use Expert Advice
- Journal of the Association for Computing Machinery, 1997
Cited by 267 (60 self)
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
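As a rough illustration of combining expert predictions, here is a minimal sketch of a standard exponentially weighted (randomized weighted majority) forecaster. It is not the tuned algorithm of the paper above; the function name `exponential_weights` and the fixed learning rate `eta` are assumptions made for this example.

```python
import math


def exponential_weights(expert_preds, outcomes, eta=0.5):
    """Sketch of a randomized exponential-weights forecaster (illustrative only).

    expert_preds[t][i] is expert i's bit (0 or 1) at round t;
    outcomes[t] is the true bit. Returns the forecaster's expected
    number of mistakes and each expert's mistake count.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    expected_mistakes = 0.0
    expert_mistakes = [0] * n
    for preds, y in zip(expert_preds, outcomes):
        total = sum(weights)
        # Predict 1 with probability equal to the weight share of experts voting 1.
        p_one = sum(w for w, p in zip(weights, preds) if p == 1) / total
        # Probability of a mistake this round (weight share of wrong experts).
        expected_mistakes += abs(p_one - y)
        for i, p in enumerate(preds):
            if p != y:
                expert_mistakes[i] += 1
                weights[i] *= math.exp(-eta)  # penalize experts that erred
    return expected_mistakes, expert_mistakes
```

For this simple variant, the expected number of mistakes is at most (eta * M + ln N) / (1 - exp(-eta)), where M is the best expert's mistake count and N the number of experts. The square-root regret described in the abstract requires a more careful choice of parameters than the fixed `eta` assumed here.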
Universal prediction of individual sequences
- IEEE Transactions on Information Theory, 1992
Cited by 129 (7 self)
The problem of predicting the next outcome of an individual binary sequence using finite memory is considered. The finite-state predictability of an infinite sequence is defined as the minimum fraction of prediction errors that can be made by any finite-state (FS) predictor. It is proved that this FS predictability can be attained by universal sequential prediction schemes. Specifically, an efficient prediction procedure based on the incremental parsing procedure of the Lempel-Ziv data compression algorithm is shown to achieve asymptotically the FS predictability. Finally, some relations between compressibility and predictability are pointed out, and the predictability is proposed as an additional measure of the complexity of a sequence. Index Terms: predictability, compressibility, complexity, finite-state machines, Lempel-Ziv algorithm.
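To make the idea of prediction via incremental parsing concrete, here is a minimal sketch assuming an LZ78-style phrase tree with a deterministic majority vote at each node. The predictor analyzed in the paper is more refined (it randomizes its predictions); the function name `lz_predict` and the tie-breaking rule are assumptions for illustration.

```python
def lz_predict(bits):
    """Count prediction errors of a simple LZ78-tree majority predictor.

    Each tree node stores how often 0 and 1 followed the context it
    represents. The predictor guesses the more frequent bit (ties go to 0),
    then walks down the tree; when it falls off the tree, a new leaf is
    added (a new phrase ends) and parsing restarts from the root.
    """
    def new_node():
        return {"count": [0, 0], "child": [None, None]}

    root = new_node()
    node = root
    errors = 0
    for b in bits:
        zeros, ones = node["count"]
        guess = 1 if ones > zeros else 0
        if guess != b:
            errors += 1
        node["count"][b] += 1
        nxt = node["child"][b]
        if nxt is None:
            node["child"][b] = new_node()  # phrase boundary: grow the tree
            node = root
        else:
            node = nxt
    return errors
```

On a constant sequence this sketch errs only at phrase boundaries, so its error fraction shrinks as the sequence grows, which is the qualitative behavior the abstract describes.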
Using and Combining Predictors That Specialize
- 1997
Cited by 76 (11 self)
We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called "experts." These simple algorithms belong to the multiplicative weights family of algorithms. The performance of these algorithms degrades only logarithmically with the number of experts, making them particularly useful in applications where the number of experts is very large. However, in applications such as text categorization, it is often natural for some of the experts to abstain from making predictions on some of the instances. We show how to transform algorithms that assume that all experts are always awake to algorithms that do not require this assumption. We also show how to derive corresponding loss bounds. Our method is very general, and can be applied to a large family of online learning algorithms. We also give applications to various prediction models including decision graphs and "switching" experts.
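One way to picture experts that abstain is the following sketch of a single round of a multiplicative-weights update restricted to awake experts. It assumes the common convention that asleep experts keep their weights and that the awake experts' total weight is preserved by rescaling; the function name `specialists_round` and the penalty factor `beta` are hypothetical choices for this example, not taken from the paper.

```python
def specialists_round(weights, preds, y, beta=0.5):
    """One round with abstaining experts (illustrative sketch).

    preds[i] is 0, 1, or None when expert i is asleep (abstains).
    Returns the probability of predicting 1 and the updated weights.
    """
    awake = [i for i, p in enumerate(preds) if p is not None]
    if not awake:
        return 0.5, list(weights)  # nobody predicts: guess uniformly
    total = sum(weights[i] for i in awake)
    # Vote only among the awake experts.
    p_one = sum(weights[i] for i in awake if preds[i] == 1) / total
    new_weights = list(weights)
    for i in awake:
        if preds[i] != y:
            new_weights[i] *= beta  # penalize awake experts that erred
    # Rescale so the awake experts' total weight is unchanged.
    scale = total / sum(new_weights[i] for i in awake)
    for i in awake:
        new_weights[i] *= scale
    return p_one, new_weights
```

With `weights = [1, 1, 1]`, `preds = [1, 0, None]` and `y = 1`, the asleep third expert keeps weight 1, while weight shifts from the second expert to the first within the awake pair.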
Agnostic Online Learning
Cited by 6 (0 self)
We study learnability of hypothesis classes in agnostic online prediction models. The analogous question in the PAC learning model [Valiant, 1984] was addressed by Haussler [1992] and others, who showed that the VC dimension characterization of the sample complexity of learnability extends to the agnostic (or "unrealizable") setting. In his influential work, Littlestone [1988] described a combinatorial characterization of hypothesis classes that are learnable in the online model. We extend Littlestone’s results in two respects. First, while Littlestone only dealt with the realizable case, namely, assuming there exists a hypothesis in the class that perfectly explains the entire data, we derive results for the non-realizable (agnostic) case as well. In particular, we describe several models of non-realizable data and derive upper and lower bounds on the achievable regret. Second, we extend the theory to include margin-based hypothesis classes, in which the prediction of each hypothesis is accompanied by a confidence value. We demonstrate how the newly developed theory seamlessly yields novel online regret bounds for the important class of large margin linear separators.
Sequential prediction and ranking in universal context modeling and data compression
- IEEE Trans. Inform. Theory, 1997
Cited by 5 (2 self)
Prediction is one of the oldest and most successful tools in the data compression practitioner's toolbox. It is particularly useful in situations where the data (e.g., a digital image) originates from a natural physical process (e.g., sensed light), and the data samples (e.g., real numbers) represent a continuously varying physical magnitude (e.g., brightness). In these cases, the value of the next sample can often be accurately
Universal filtering via prediction
- IEEE Trans. Inform. Theory, 2007
Cited by 5 (3 self)
We consider the filtering problem, where a finite-alphabet individual sequence is corrupted by a discrete memoryless channel, and the goal is to causally estimate each sequence component based on the past and present noisy observations. We establish a correspondence between the filtering problem and the problem of prediction of individual sequences, which leads to the following result: given an arbitrary finite set of filters, there exists a filter which performs, with high probability, essentially as well as the best in the set, regardless of the underlying noiseless individual sequence. We use this relationship between the problems to derive a filter guaranteed to attain the "finite-state filterability" of any individual sequence by leveraging results from the prediction problem.
Theory and Applications of Predictors That Specialize
We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called "experts." These simple algorithms belong to the multiplicative weights family of algorithms. The performance of these algorithms degrades only logarithmically with the number of experts, making them particularly useful in applications where the number of experts is very large. However, in applications such as text categorization, it is often natural for some of the experts to abstain from making predictions on some of the instances. We show how to transform algorithms that assume that all experts are always awake to algorithms that do not require this assumption. We also show how to derive corresponding loss bounds. Our method is very general, and can be applied to a large family of online learning algorithms. We also give applications to various prediction models including decision graphs and "switching" experts.
Toyota Technological Institute—Chicago
- 2008
We study a fundamental question: what classes of hypotheses are learnable in the online learning model? The analogous question in the PAC learning model [12] was addressed by Vapnik and others [13], who showed that the VC dimension characterizes the learnability of a hypothesis class. In his influential work, Littlestone [9] studied the online learnability of hypothesis classes, but only in the realizable case, namely, assuming that there exists a hypothesis in the class that perfectly explains the entire data. In this paper we study online learnability in the agnostic case, namely, no hypothesis perfectly predicts the entire data, and our goal is to minimize regret. We first present an impossibility result, discovered by Cover in the context of universal prediction of individual sequences, which implies that even a class whose Littlestone dimension is only 1 is not learnable in the agnostic online learning model. We then overcome the impossibility result by allowing randomized predictions, and show that in this case the Littlestone dimension does capture the learnability of hypothesis classes in the agnostic online learning model.
Individual Sequence Prediction using Memory-efficient Context Trees
Context trees are a popular and effective tool for tasks such as compression, sequential prediction, and language modeling. We present an algebraic perspective of context trees for the task of individual sequence prediction. Our approach stems from a generalization of the notion of margin used for linear predictors. By exporting the concept of margin to context trees, we are able to cast the individual sequence prediction problem as the task of finding a linear separator in a Hilbert space, and to apply techniques from machine learning and online optimization to this problem. Our main contribution is a memory-efficient adaptation of the Perceptron algorithm for individual sequence prediction. We name our algorithm the Shallow Perceptron and prove a shifting mistake bound, which relates its performance to the performance of any sequence of context trees. We also prove that the Shallow Perceptron grows a context tree at a rate that is upper-bounded by its mistake rate, which imposes an upper bound on the size of the trees grown by our algorithm. Index Terms: context trees, online learning, Perceptron, shifting bounds.

