## Tracking a Small Set of Experts by Mixing Past Posteriors (2002)


Venue: Journal of Machine Learning Research

Citations: 61 (10 self)

### BibTeX

@ARTICLE{Bousquet02trackinga,
  author  = {Olivier Bousquet and Manfred K. Warmuth},
  title   = {Tracking a Small Set of Experts by Mixing Past Posteriors},
  journal = {Journal of Machine Learning Research},
  year    = {2002},
  volume  = {3},
  pages   = {363--396}
}


### Abstract

In this paper, we examine on-line learning problems in which the target concept is allowed to change over time. In each trial a master algorithm receives predictions from a large set of n experts. Its goal is to predict almost as well as the best sequence of such experts chosen off-line by partitioning the training sequence into k + 1 sections and then choosing the best expert for each section. We build on methods developed by Herbster and Warmuth and consider an open problem posed by Freund where the experts in the best partition are from a small pool of size m. Since k >> m, the best expert shifts back and forth between the experts of the small pool. We propose algorithms that solve this open problem by mixing the past posteriors maintained by the master algorithm. We relate the number of bits needed for encoding the best partition to the loss bounds of the algorithms. Instead of paying log n for choosing the best expert in each section, we first pay log(n choose m) bits in the bounds for identifying the pool of m experts and then log m bits per new section. In the bounds we also pay twice for encoding the boundaries of the sections.
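The two-step scheme the abstract describes (a Loss Update in which weights decay exponentially with loss, followed by a Mixing Update that mixes back past posteriors) can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the learning rate `eta`, the mixing coefficient `alpha`, and the uniform averaging over past posteriors are assumed parameters; the paper analyzes several mixing schemes.

```python
import math

def loss_update(w, losses, eta=1.0):
    # Loss Update: each expert's weight decays exponentially with its loss;
    # for log loss with eta = 1 this is exactly Bayes rule for the posterior.
    v = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    z = sum(v)
    return [vi / z for vi in v]

def mixing_update(v, past_posteriors, alpha=0.05):
    # Mixing Update: keep a (1 - alpha) share of the current posterior and
    # mix in a fraction alpha of the (here, uniformly averaged) past
    # posteriors, so an expert that was good earlier can be recovered quickly.
    n, k = len(v), len(past_posteriors)
    past_avg = [sum(p[i] for p in past_posteriors) / k for i in range(n)]
    return [(1.0 - alpha) * vi + alpha * ai for vi, ai in zip(v, past_avg)]

# One trial with n = 4 experts: start uniform, expert 0 incurs no loss.
w = [0.25] * 4
v = loss_update(w, [0.0, 1.0, 1.0, 1.0])
w_next = mixing_update(v, past_posteriors=[w])
```

After the trial, expert 0's weight grows while all weights remain a probability vector; the mixing step pulls the posterior slightly back toward the earlier (uniform) posterior.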

### Citations

11502 |
Computers and Intractability, A Guide to the Theory of NPCompleteness
- Garey, Johnson
- 1979
Citation Context ...yt for each trial t. Question: Is there a partition of the T trials with k shifts from a pool of m convex combinations that has loss zero? Proof The problem reduces to three-dimensional matching (see Garey and Johnson, 1979, page 221). We have T = 3q trials. Trials 1, 2, ..., 3q correspond respectively to the elements w1, w2, ..., wq, r1, r2, ..., rq, s1, s2, ..., sq. Choose the xt,i and yt so that each trip... |

701 | The weighted majority algorithm - Littlestone, Warmuth - 1994 |

323 | How to use expert advice
- Cesa-Bianchi, Freund, et al.
- 1997
Citation Context ...duction We consider the following standard on-line learning model in which a master algorithm has to combine the predictions from a set of experts (see e.g. Littlestone and Warmuth, 1994, Vovk, 1990, Cesa-Bianchi et al., 1997, Kivinen and Warmuth, 1999). Learning proceeds in trials. In each trial the master receives the predictions from n experts and uses them to form its own prediction. At the end of the trial both the m... |

263 |
Aggregating strategies
- Vovk
- 1990
Citation Context ...es. 1. Introduction We consider the following standard on-line learning model in which a master algorithm has to combine the predictions from a set of experts (see e.g. Littlestone and Warmuth, 1994, Vovk, 1990, Cesa-Bianchi et al., 1997, Kivinen and Warmuth, 1999). Learning proceeds in trials. In each trial the master receives the predictions from n experts and uses them to form its own prediction. At the ... |

204 | Tracking the best expert
- Herbster, Warmuth
- 1995
Citation Context ...to the k + 1 sections. Now consider the following "direct" algorithm proposed by Freund. Run the Mixing Algorithm with the Fixed Share to Start Vector mixing scheme (i.e. the Fixed Share Algorithm of Herbster and Warmuth, 1998) on every pool/subset of m out of the n experts. Each run becomes an expert that feeds into the Mixing Algorithm with the Static Experts mixing scheme. If ut is the comparator sequence of the best pa... |
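The Fixed Share to Start Vector scheme named in the context above amounts to a single blending step after the loss update. The sketch below is a hedged illustration (the share rate `alpha` and the uniform start vector are assumptions; Herbster and Warmuth also analyze other share schemes):

```python
def fixed_share_to_start(v, alpha=0.05):
    # Fixed Share to Start Vector: each expert keeps a (1 - alpha) share of
    # its posterior weight, and a fraction alpha is redistributed according
    # to the start vector (uniform here). No weight ever decays to zero, so
    # the algorithm can track a best expert that shifts over time.
    n = len(v)
    start = 1.0 / n
    return [(1.0 - alpha) * vi + alpha * start for vi in v]

# A concentrated posterior is nudged back toward uniform.
shared = fixed_share_to_start([0.9, 0.05, 0.03, 0.02], alpha=0.1)
```

The redistribution keeps the weight vector normalized while bounding every weight away from zero, which is what makes recovery after a shift fast.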

136 |
Additive versus exponentiated gradient updates for linear prediction
- Kivinen, Warmuth
- 1997
Citation Context ... Exponentiated Gradient (EG) family of updates (see e.g. Kivinen and Warmuth, 1999). 2 Any update in the EG family is derived and analyzed with the relative entropy as a measure of progress (see e.g. Kivinen and Warmuth, 1997). Thus by Lemma 2 the mixing update can be used with any member of this family such as EG with square loss for linear regression (see e.g. Kivinen and Warmuth, 1997) or normalized Winnow (see e.g. He... |

109 | A game of prediction with expert advice - Vovk - 1998 |

76 | Sequential prediction of individual sequences under general loss functions
- Haussler, Kivinen, et al.
- 1998
Citation Context ...98). There are two types of updates: a Loss Update followed by a Mixing Update. The Loss Update is the standard update used for the expert setting (see e.g. Littlestone and Warmuth, 1994, Vovk, 1990, Haussler et al., 1998, Kivinen and Warmuth, 1999) in which the weights of the experts decay exponentially with the loss. In the case of the log loss this becomes Bayes rule for computing the posterior weights for the expe... |

75 | Tracking the best disjunction - Auer, Warmuth - 1998 |

70 | Adaptive and self-confident on-line learning algorithms
- Auer, Cesa-Bianchi, et al.
- 2002
Citation Context ...efficients. Another open question is whether the parameters α, γ, and (in the case of absolute loss) η can be tuned on-line using techniques from on-line learning (see e.g. Cesa-Bianchi et al., 1998, Auer et al., 2000) and universal coding (see e.g. Willems, 1996, Shamir and Merhav, 1999). Following Herbster and Warmuth (1998), slightly improved upper bounds should be obtainable for the Variable Share modification... |

60 | Averaging expert predictions
- Kivinen, Warmuth
- 1999
Citation Context ...lowing standard on-line learning model in which a master algorithm has to combine the predictions from a set of experts (see e.g. Littlestone and Warmuth, 1994, Vovk, 1990, Cesa-Bianchi et al., 1997, Kivinen and Warmuth, 1999). Learning proceeds in trials. In each trial the master receives the predictions from n experts and uses them to form its own prediction. At the end of the trial both the master and the experts recei... |

56 | Tracking the best linear predictor - Herbster, Warmuth - 2001 |

45 | Derandomizing stochastic prediction strategies - Vovk - 1999 |

33 | Switching portfolios - Singer - 1997 |

28 | Coding for a binary independent piecewise-identically-distributed source - Willems - 1996 |

24 | Low-complexity sequential lossless coding for piecewise-stationary memoryless sources
- Shamir, Merhav
- 1999
Citation Context ... and (in the case of absolute loss) η can be tuned on-line using techniques from on-line learning (see e.g. Cesa-Bianchi et al., 1998, Auer et al., 2000) and universal coding (see e.g. Willems, 1996, Shamir and Merhav, 1999). Following Herbster and Warmuth (1998), slightly improved upper bounds should be obtainable for the Variable Share modification of the updates when the losses of the experts lie in [0, 1]. So far we... |

17 | Tracking the best regressor - Herbster, Warmuth - 1998 |

14 | private communication - Freund |

8 | The binary exponentiated gradient algorithm for learning linear functions
- Bylander
- 1997
Citation Context ...other families of updates such as EGU family (analyzed with the unnormalized entropy) and the BEG family (analyzed with the componentwise sum of binary entropies) (see e.g. Kivinen and Warmuth, 1997, Bylander, 1997). Acknowledgments This research was done while the first author was visiting UC Santa Cruz. The authors are grateful to Yoav Freund for posting an open problem which inspired this research. We also t... |

6 | Direct and indirect algorithms for online learning of disjunctions, Theoretical Computer Science
- Warmuth, Helmbold, et al.
- 2002
Citation Context ...97). Thus by Lemma 2 the mixing update can be used with any member of this family such as EG with square loss for linear regression (see e.g. Kivinen and Warmuth, 1997) or normalized Winnow (see e.g. Helmbold et al., 1999). A next goal would be to adapt the mixing updates to other families of updates such as EGU family (analyzed with the unnormalized entropy) and the BEG family (analyzed with the componentwise sum of ... |

5 | Adaptive and self-confident on-line learning algorithms - Auer, Cesa-Bianchi, et al. - 2002 |

2 |
Adaptive and self-confident on-line learning algorithms
- Auer, Gentile
Citation Context ...fficients. Another open question is whether the parameters α, γ, and (in the case of absolute loss) η can be tuned on-line using techniques from on-line learning (see e.g. Cesa-Bianchi et al., 1998, Auer et al., 2000) and universal coding (see e.g. Willems, 1996, Shamir and Merhav, 1999). Following Herbster and Warmuth (1998), slightly improved upper bounds should be obtainable for the Variable Share modification... |

2 |
On Bayes methods for on-line boolean prediction Algorithmica
- Cesa-Bianchi, Helmbold, et al.
- 1998
Citation Context ... is to modify the decay coefficients. Another open question is whether the parameters α, γ, and (in the case of absolute loss) η can be tuned on-line using techniques from on-line learning (see e.g. Cesa-Bianchi et al., 1998, Auer et al., 2000) and universal coding (see e.g. Willems, 1996, Shamir and Merhav, 1999). Following Herbster and Warmuth (1998), slightly improved upper bounds should be obtainable for the Variable... |

1 | Private Communication - Herbster - 1998 |

1 | Tracking a Small Set of Experts - Auer, Cesa-Bianchi, et al. - 2000 |