## Tracking the best expert (1995)


Venue: Proceedings of the 12th International Conference on Machine Learning

Citations: 194 (17 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Herbster95trackingthe,
  author    = {Mark Herbster and Manfred K. Warmuth},
  title     = {Tracking the best expert},
  booktitle = {Proceedings of the 12th International Conference on Machine Learning},
  year      = {1995},
  pages     = {286--294},
  publisher = {Morgan Kaufmann}
}
```


### Abstract

We generalize the recent relative loss bounds for on-line algorithms where the additional loss of the algorithm on the whole sequence of examples over the loss of the best expert is bounded. The generalization allows the sequence to be partitioned into segments, and the goal is to bound the additional loss of the algorithm over the sum of the losses of the best experts for each segment. This models situations in which the examples change and different experts are best for certain segments of the sequence of examples. In the single-segment case, the additional loss is proportional to log n, where n is the number of experts and the constant of proportionality depends on the loss function. Our algorithms do not produce the best partition; however, the loss bound shows that our predictions are close to those of the best partition. When the number of segments is k + 1 and the sequence is of length ℓ, we can bound the additional loss of our algorithm over the best partition by O(k log n + k log(ℓ/k)). For the case when the loss per trial is bounded by one, we obtain an algorithm whose additional loss over the loss of the best partition is independent of the length of the sequence. The additional loss becomes O(k log n + k log(L/k)), where L is the loss of the best partition with k + 1 segments. Our algorithms for tracking the predictions of the best expert are simple adaptations of Vovk's original algorithm for the single best expert case. As in the original algorithms, we keep one weight per expert, and spend O(1) time per weight in each trial.
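The scheme the abstract describes (one weight per expert, a multiplicative loss update, and a redistribution step that lets the algorithm track shifts between segments) can be sketched as follows. This is a hedged illustration in the spirit of the Fixed-share variant, not the paper's exact algorithm: the function name `fixed_share`, the parameter names `eta` and `alpha`, and the concrete loss passed in are illustrative choices.

```python
import math

def fixed_share(expert_predictions, outcomes, loss, eta=1.0, alpha=0.05):
    """Sketch of a Fixed-share-style tracking algorithm.

    expert_predictions[t][i] is expert i's prediction at trial t,
    outcomes[t] is the true outcome, and loss(y, yhat) is the per-trial
    loss. eta (learning rate) and alpha (share rate) are assumed tuning
    parameters; the paper discusses how to choose them.
    """
    n = len(expert_predictions[0])
    w = [1.0 / n] * n          # one weight per expert
    total_loss = 0.0
    for t, y in enumerate(outcomes):
        W = sum(w)
        # Predict with the weight-normalized average of the experts.
        yhat = sum(w[i] * expert_predictions[t][i] for i in range(n)) / W
        total_loss += loss(y, yhat)
        # Loss Update: decrease each weight exponentially in its expert's loss.
        w = [w[i] * math.exp(-eta * loss(y, expert_predictions[t][i]))
             for i in range(n)]
        # Share Update: each weight cedes a fraction alpha to a common pool
        # that is split equally, so no weight can vanish permanently.
        pool = alpha * sum(w)
        w = [(1 - alpha) * w[i] + pool / n for i in range(n)]
    return total_loss, w
```

The Share Update is what distinguishes this from the single-best-expert setting: with alpha = 0 a weight driven near zero could never recover, whereas the redistribution keeps every expert available when it becomes best in a later segment. Each trial touches each weight O(1) times, matching the time bound stated in the abstract.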

### Citations

8563 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context: ... If α∗ is interpreted as the probability that a shift occurs on any of the ℓ − 1 trials, then the term (ℓ − 1)[H(α∗) + D(α∗‖α)] corresponds to the expected optimal code length (see Chapter 5 of (Cover & Thomas, 1991)) if we code the shifts with the estimate α instead of the true probability α∗. This bound is thus an example of the close similarity between prediction and coding as brought out by many papers (e...

2308 | A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT
- Freund, Schapire
- 1995
Citation Context: ...ameters, and thus it must be tuned. In practice, the tuning of η may be produced by numerical minimization of the upper bounds. However, we use a tuning of η produced by Freund and Schapire (Freund & Schapire, 1997). Theorem 4 (Lemma 4 of (Freund & Schapire, 1997)): Suppose 0 ≤ P ≤ P̃ and 0 < Q ≤ Q̃. Let η = g(P̃/Q̃), where g(z) = ln(1 + √(2/z)); then (ηP + Q)/(1 − e^(−η)) ≤ P + √(2P̃Q̃) + Q. ...

669 | The weighted majority algorithm
- Littlestone, Warmuth
- 1994
Citation Context: ...ght per expert, representing the belief in the expert's prediction, and then decreases the weight as a function of the loss of the expert. Previous work of Vovk (Vovk, 1998) and others (Littlestone & Warmuth, 1994; Haussler, Kivinen & Warmuth, 1998) has produced an algorithm for which there is an upper bound on the additional loss of the algorithm over the loss of the best expert. Algorithms that compare again...

314 | How to use expert advice
- Cesa-Bianchi, Freund, et al.
- 1997
Citation Context: ... ln(W_{t+1}/W_t). Hence, since W_1 = 1, L(S, A) = ∑_{t=1}^{ℓ} L(y_t, ŷ_t) ≤ −c ln W_{ℓ+1} ≤ −c ln w_{ℓ+1,i}. So far we have used the same basic technique as in (Littlestone & Warmuth, 1994; Vovk, 1995; Cesa-Bianchi et al., 1997; Haussler et al., 1998), i.e., c ln W_t becomes the potential function in an amortized analysis. In the static expert case (when η = 1/c) the final weights have the form w^s_{ℓ+1,i} = e^(−L(S, E_i))...

247 | Exponentiated gradient versus gradient descent for linear predictors
- Kivinen, Warmuth
- 1997
Citation Context: ...xamples are algorithms for learning k-literal disjunctions over n variables [Lit88, Lit89] and algorithms whose additional loss bound over the loss of the best linear combination of experts is bounded [KW95]. We hope that some of our methods will be useful for tracking the best disjunction or the best linear combination of experts. 2 PRELIMINARIES: Let ℓ denote the number of trials and n denote the number...

246 | Aggregating strategies
- Vovk
- 1990
Citation Context: ...r problem domain. It simply keeps one weight per expert representing the belief in the expert's prediction and then decreases the weight as a function of the loss of the expert. Previous work of Vovk [Vov90] and others [HKW94] has produced an algorithm with an upper bound on the additional loss of the algorithm over the best expert of the form c_L ln n for a large class of loss functions, where c_L is a ...

155 | Universal prediction of individual sequences
- Feder, Merhav, et al.
- 1992

155 | A dynamic disk spin-down technique for mobile computing
- Helmbold, Long, et al.
- 1996
Citation Context: ... functions for which the original Static-expert Algorithm of Vovk was developed (Vovk, 1998; Haussler et al., 1998). Our share updates have been applied experimentally for predicting disk idle times (Helmbold et al., 1996) and for the on-line management of investment portfolios (Singer, 1997). In addition, a reduction has been shown between expert and metrical task systems algorithms (Blum & Burch, 1997). The Share Up...

134 | Additive versus exponentiated gradient updates for linear prediction
- Kivinen, Warmuth
- 1997
Citation Context: ...absolute loss and the discrete loss (counting prediction mistakes), which are treated as special cases (Littlestone & Warmuth, 1994; Vovk, 1995; Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire & Warmuth, 1997). For example, if the loss function is the square or relative entropy loss, then c = 1/2 or c = 1, respectively (see Section 2 for definitions of the loss functions). In the paper we consider a modif...

108 | Mistake bounds and logarithmic linear-threshold learning algorithms
- Littlestone
- 1989
Citation Context: ...), which only use the Loss Update, and are the basis of this work, a number of such algorithms have been developed. Examples are algorithms for learning linear threshold functions (Littlestone, 1988; Littlestone, 1989), and algorithms whose additional loss bound over the loss of the best linear combination of experts or sigmoided linear combination of experts is bounded (Kivinen & Warmuth, 1997; Helmbold, Kivinen ...

107 | Learning when irrelevant attributes abound: a new linear-threshold algorithm
- Littlestone
- 1988
Citation Context: ...one & Warmuth, 1994), which only use the Loss Update, and are the basis of this work, a number of such algorithms have been developed. Examples are algorithms for learning linear threshold functions (Littlestone, 1988; Littlestone, 1989), and algorithms whose additional loss bound over the loss of the best linear combination of experts or sigmoided linear combination of experts is bounded (Kivinen & Warmuth, 1997;...

103 | A game of prediction with expert advice
- Vovk
- 1995
Citation Context: ...roblem domain. It simply keeps one weight per expert, representing the belief in the expert's prediction, and then decreases the weight as a function of the loss of the expert. Previous work of Vovk (Vovk, 1998) and others (Littlestone & Warmuth, 1994; Haussler, Kivinen & Warmuth, 1998) has produced an algorithm for which there is an upper bound on the additional loss of the algorithm over the loss of the b...

92 | Using and combining predictors that specialize
- Freund, Schapire, et al.
- 1997

74 | Sequential prediction of individual sequences under general loss functions
- Haussler, Kivinen, et al.
- 1998
Citation Context: ...the loss is the discrete loss (i.e., counting mistakes). In contrast our methods work for the same general class of continuous loss functions that the Static-expert algorithms can handle (Vovk, 1998; Haussler et al., 1998). This class includes all common loss functions such as the square loss, the relative entropy loss, and the Hellinger loss. For this class there are tight bounds on the additional loss (Haussler et a...

72 | Tracking the best disjunction
- Auer, Warmuth
- 1998
Citation Context: ...lief in the expert's prediction, and then decreases the weight as a function of the loss of the expert. Previous work of Vovk (Vovk, 1998) and others (Littlestone & Warmuth, 1994; Haussler, Kivinen & Warmuth, 1998) has produced an algorithm for which there is an upper bound on the additional loss of the algorithm over the loss of the best expert. Algorithms that compare against the loss of the best expert are ...

51 | Tight worst-case loss bounds for predicting with expert advice
- Haussler, Kivinen, et al.
- 1995
Citation Context: ...t simply keeps one weight per expert representing the belief in the expert's prediction and then decreases the weight as a function of the loss of the expert. Previous work of Vovk [Vov90] and others [HKW94] has produced an algorithm with an upper bound on the additional loss of the algorithm over the best expert of the form c_L ln n for a large class of loss functions, where c_L is a constant which only...

45 | Derandomizing stochastic prediction strategies
- Vovk
- 1999
Citation Context: ...Blum & Burch, 1997). The Share Update has been used successfully in the new domain of metrical task systems. A natural probabilistic interpretation of the Share algorithms has recently been given in (Vovk, 1997). In any particular application of the Share algorithms, it is necessary to consider how to choose the parameter α. Theoretical techniques exist for the Fixed-share Algorithm for eliminating the nee...

34 | Learning Probabilistic Prediction Functions
- DeSantis, Markowski, et al.
- 1992
Citation Context: ...{0, 1} and the relative entropy loss simplifies to the standard log loss, it is easily seen that ŷ_t = (1/W_t) ∑_{i=1}^{n} w_{t,i} x_{t,i} can be used to witness that the log loss function is (1, 1)-realizable [DMW88]. Otherwise we ... Initialization: Set the weights to initial values w^s_{1,1} = ... = w^s_{1,n} = 1/n. Parameters: 0 < c, η and 0 ≤ α ≤ 1. Prediction: Let v_{t,i} = w^s_{t,i}/W_t, where W_t = ∑_{i=1}^{n} w^s_{t,i}. ...

15 | Worst-case loss bounds for sigmoided linear neurons
- Helmbold, Kivinen, et al.
- 1995

2 | Tracking the best expert II. Unpublished Manuscript
- Herbster
- 1997
Citation Context: ...ugh the bounds produced this way are not always optimal. Another method incorporates a prior distribution on all possible values of α. For the sake of simplicity we have not discussed these methods (Herbster, 1997; Vovk, 1997; Singer, 1997) in this paper. Acknowledgments: We would like to thank Peter Auer, Phillip Long, Robert Schapire, and Volodya Vovk for valuable discussions...

1 | On-line learning and the metrical task system
- Blum
- 1997
Citation Context: ...idle times (Helmbold et al., 1996) and for the on-line management of investment portfolios (Singer, 1997). In addition, a reduction has been shown between expert and metrical task systems algorithms (Blum & Burch, 1997). The Share Update has been used successfully in the new domain of metrical task systems. A natural probabilistic interpretation of the Share algorithms has recently been given in (Vovk, 1997). In an...

1 | Towards realistic and competitive portfolio selection algorithms
- Singer
- 1997
Citation Context: ...(Vovk, 1998; Haussler et al., 1998). Our share updates have been applied experimentally for predicting disk idle times (Helmbold et al., 1996) and for the on-line management of investment portfolios (Singer, 1997). In addition, a reduction has been shown between expert and metrical task systems algorithms (Blum & Burch, 1997). The Share Update has been used successfully in the new domain of metrical task syst...

1 | Predicting with the dot-product in the experts framework. Unpublished Manuscript
- Warmuth
- 1997
Citation Context: ...absolute loss and the discrete loss (counting prediction mistakes), which are treated as special cases (Littlestone & Warmuth, 1994; Vovk, 1995; Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire & Warmuth, 1997). For example, if the loss function is the square or relative entropy loss, then c = 1/2 or c = 1, respectively (see Section 2 for definitions of the loss functions). In the paper we consider a modifi...