Relative Loss Bounds for Online Density Estimation with the Exponential Family of Distributions
Machine Learning, 2000
Cited by 116 (11 self)
We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the negative log-likelihood of the example with respect to the past parameter of the algorithm. An offline algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the online algorithm over the total loss of the best offline parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two exponential distributions. We also use our methods to prove relative loss bounds for linear regression.
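The protocol described in this abstract can be illustrated with a minimal sketch. It assumes a unit-variance Gaussian as the exponential-family member and treats a hypothetical starting value `mu0` as one pseudo-example. Both choices, and the function names, are illustrative assumptions and not the paper's algorithm.

```python
import math

def gaussian_nll(x, mu):
    """Negative log-likelihood of x under a unit-variance Gaussian with mean mu."""
    return 0.5 * (x - mu) ** 2 + 0.5 * math.log(2.0 * math.pi)

def relative_loss(xs, mu0=0.0):
    """Total online loss minus the total loss of the best fixed (offline) mean.

    Assumption: the online parameter is the average of mu0 and all past
    examples, i.e. mu0 counts as a single pseudo-example.
    """
    total, count = mu0, 1
    online = 0.0
    for x in xs:
        mu = total / count              # parameter built from past examples only
        online += gaussian_nll(x, mu)   # loss is charged before the update
        total += x
        count += 1
    best_mu = sum(xs) / len(xs)         # best offline parameter in hindsight
    offline = sum(gaussian_nll(x, best_mu) for x in xs)
    return online - offline

print(relative_loss([1.0, 2.0, 3.0, 2.0]))
```

The returned quantity is the "additional total loss" the abstract refers to; the log(2π) terms cancel in the difference, so only the squared deviations matter in this Gaussian case.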
Competitive online statistics
International Statistical Review, 1999
Cited by 63 (10 self)
A radically new approach to statistical modelling, which combines mathematical techniques of Bayesian statistics with the philosophy of the theory of competitive online algorithms, has arisen over the last decade in computer science (to a large degree, under the influence of Dawid’s prequential statistics). In this approach, which we call “competitive online statistics”, it is not assumed that data are generated by some stochastic mechanism; the bounds derived for the performance of competitive online statistical procedures are guaranteed to hold (and not just hold with high probability or on the average). This paper reviews some results in this area; the new material in it includes the proofs for the performance of the Aggregating Algorithm in the problem of linear regression with square loss. Keywords: Bayes’s rule, competitive online algorithms, linear regression, prequential statistics, worst-case analysis.
Adaptive and Self-Confident On-Line Learning Algorithms
2000
Cited by 62 (7 self)
We study online learning in the linear regression framework. Most of the performance bounds for online algorithms in this framework assume a constant learning rate. To achieve these bounds the learning rate must be optimized based on a posteriori information. This information depends on the whole sequence of examples and thus it is not available to any strictly online algorithm. We introduce new techniques for adaptively tuning the learning rate as the data sequence is progressively revealed. Our techniques allow us to prove essentially the same bounds as if we knew the optimal learning rate in advance. Moreover, such techniques apply to a wide class of online algorithms, including p-norm algorithms for generalized linear regression and Weighted Majority for linear regression with absolute loss. Our adaptive tunings are radically different from previous techniques, such as the so-called doubling trick. Whereas the doubling trick restarts the online algorithm several ti...
Tracking a Small Set of Experts by Mixing Past Posteriors
Journal of Machine Learning Research, 2002
Cited by 59 (9 self)
In this paper, we examine online learning problems in which the target concept is allowed to change over time. In each trial a master algorithm receives predictions from a large set of n experts. Its goal is to predict almost as well as the best sequence of such experts chosen offline by partitioning the training sequence into k + 1 sections and then choosing the best expert for each section. We build on methods developed by Herbster and Warmuth and consider an open problem posed by Freund where the experts in the best partition are from a small pool of size m. Since k >> m, the best expert shifts back and forth between the experts of the small pool. We propose algorithms that solve this open problem by mixing the past posteriors maintained by the master algorithm. We relate the number of bits needed for encoding the best partition to the loss bounds of the algorithms. Instead of paying log n for choosing the best expert in each section we first pay log (n choose m) bits in the bounds for identifying the pool of m experts and then log m bits per new section. In the bounds we also pay twice for encoding the boundaries of the sections.
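The bit-counting argument at the end of this abstract can be made concrete with a rough arithmetic sketch. It assumes that identifying the pool costs the log of the number of size-m subsets of the n experts, and it leaves out the cost of encoding section boundaries. The function names and the example numbers are hypothetical.

```python
import math

def bits_naive(n, sections):
    """Pay log2(n) bits to name the best expert in every section."""
    return sections * math.log2(n)

def bits_small_pool(n, m, sections):
    """Pay once to identify a pool of m experts out of n, then log2(m) per section.

    Assumption: the one-time pool cost is log2 of the number of m-subsets of n.
    The cost of encoding section boundaries is not included here.
    """
    return math.log2(math.comb(n, m)) + sections * math.log2(m)

# Hypothetical setting: 1024 experts, a pool of 4, and 100 sections.
print(bits_naive(1024, 100))
print(bits_small_pool(1024, 4, 100))
```

When the number of sections is much larger than the pool size, the one-time pool cost is amortized and the per-section cost drops from log n to log m.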
Boosting as Entropy Projection
1999
Cited by 58 (8 self)
We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the present weak hypothesis. We show how AdaBoost's choice of the new distribution can be seen as an approximate solution to the following problem: Find a new distribution that is closest to the old distribution subject to the constraint that the new distribution is orthogonal to the vector of mistakes of the current weak hypothesis. The distance (or divergence) between distributions is measured by the relative entropy. Alternatively, we could say that AdaBoost approximately projects the distribution vector onto a hyperplane defined by the mistake vector. We show that this new view of AdaBoost as an entropy projection is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions.
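The orthogonality constraint mentioned in this abstract can be checked numerically in a simple special case. The sketch below assumes a weak hypothesis with ±1 outputs and the standard AdaBoost step size; the encoding of the mistake vector as +1 (correct) and -1 (wrong), and the function name, are assumptions made for illustration.

```python
import math

def adaboost_update(dist, margins):
    """One AdaBoost reweighting step.

    dist:    current distribution over training examples (sums to 1)
    margins: +1 where the weak hypothesis is correct, -1 where it is wrong
             (assumed encoding of the mistake vector)
    Requires a weighted error strictly between 0 and 1.
    """
    eps = sum(d for d, u in zip(dist, margins) if u < 0)   # weighted error
    alpha = 0.5 * math.log((1.0 - eps) / eps)              # standard step size
    unnorm = [d * math.exp(-alpha * u) for d, u in zip(dist, margins)]
    z = sum(unnorm)                                        # normalization factor
    return [w / z for w in unnorm], alpha

old = [0.25, 0.25, 0.25, 0.25]
u = [1, 1, 1, -1]
new, alpha = adaboost_update(old, u)
# Inner product of the new distribution with the mistake vector.
print(sum(d * ui for d, ui in zip(new, u)))
```

In this ±1 case the inner product is zero up to rounding, which is the same as saying the weak hypothesis has weighted error 1/2 under the new distribution.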
On Bregman Voronoi Diagrams
In Proc. 18th ACM-SIAM Sympos. Discrete Algorithms, 2007
Cited by 42 (22 self)
The Voronoi diagram of a point set is a fundamental geometric structure that partitions the space into elementary regions of influence defining a discrete proximity graph and dually a well-shaped Delaunay triangulation. In this paper, we investigate a framework for defining and building the Voronoi diagrams for a broad class of distortion measures called Bregman divergences, that includes not only the traditional (squared) Euclidean distance, but also various divergence measures based on entropic functions. As a byproduct, Bregman Voronoi diagrams allow one to define information-theoretic Voronoi diagrams in statistical parametric spaces based on the relative entropy of distributions. We show that for a given Bregman divergence, one can define several types of Voronoi diagrams related to each other
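As a small illustration of the class of distortion measures named here, the sketch below uses the standard definition of a Bregman divergence, D_F(p, q) = F(p) - F(q) - <p - q, grad F(q)>, and checks two well-known instances. The helper names are hypothetical, and the code is not taken from the paper.

```python
import math

def bregman(F, grad_F, p, q):
    """Bregman divergence D_F(p, q) = F(p) - F(q) - <p - q, grad F(q)>."""
    g = grad_F(q)
    return F(p) - F(q) - sum((pi - qi) * gi for pi, qi, gi in zip(p, q, g))

# Generator for the squared Euclidean distance.
def sq_norm(x):
    return sum(xi * xi for xi in x)

def sq_norm_grad(x):
    return [2.0 * xi for xi in x]

# Generator for the relative entropy (negative Shannon entropy, positive inputs).
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

print(bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 5.0]))
print(bregman(neg_entropy, neg_entropy_grad, [0.5, 0.5], [0.25, 0.75]))
```

The first call gives the squared Euclidean distance between the two points; the second gives the relative entropy between the two probability vectors. The divergence is generally not symmetric in p and q.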
Scaling Up Context-Sensitive Text Correction
2001
Cited by 31 (9 self)
The main challenge in an effort to build a realistic system with context-sensitive inference capabilities, beyond accuracy, is scalability. This paper studies this problem in the context of a learning-based approach to context-sensitive text correction, the task of fixing spelling errors that result in valid words, such as substituting to for too, casual for causal, and so on. Research papers on this problem have developed algorithms that can achieve fairly high accuracy, in many cases over 90%. However, this level of performance is not sufficient for a large coverage practical system since it implies a low sentence level performance. We examine and offer solutions to several issues relating to scaling up a context-sensitive text correction system. In particular, we suggest methods to reduce the memory requirements while maintaining a high level of performance and show that this can still allow the system to adapt to new domains. Most important, we show how to significantly increase the coverage of the system to realistic levels, while providing a very high level of performance, at the 99% level.
Learning Concept Drift with a Committee of Decision Trees
2001
Cited by 21 (0 self)
Concept drift occurs when a target concept changes over time. I present a new method for learning shifting target concepts during concept drift. The method, called Concept Drift Committee (CDC), uses a weighted committee of hypotheses that votes on the current classification. When a committee member's voting record drops below a minimal threshold, the member is forced to retire. A new committee member then takes the open place on the committee. The algorithm is compared to a leading algorithm on a number of concept drift problems. The results show that using a committee to track drift has several advantages over more customary window-based approaches.
Universal switching linear least squares prediction
In Proc. of the 2006 Information Theory and its Applications Workshop. La Jolla, CA: UCSD, 2006
Cited by 13 (4 self)
In this paper we consider sequential regression of individual sequences under the square-error loss. We focus on the class of switching linear predictors that can segment a given individual sequence into an arbitrary number of blocks within each of which a fixed linear regressor is applied. Using a competitive algorithm framework, we construct sequential algorithms that are competitive with the best linear regression algorithms for any segmenting of the data as well as the best partitioning of the data into any fixed number of segments, where both the segmenting of the data and the linear predictors within each segment can be tuned to the underlying individual sequence. The algorithms do not require knowledge of the data length or the number of piecewise linear segments used by the members of the competing class, yet can achieve the performance of the best member that can choose both the partitioning of the sequence as well as the best regressor within each segment. We use a transition diagram [1] to compete with an exponential number of algorithms in the class, using complexity that is linear in the data length. The regret with respect to the best member is O(ln(n)) per transition for not knowing the best transition times and O(ln(n)) for not knowing the best regressor within each segment, where n is the data length. We construct lower bounds on the performance of any sequential algorithm, demonstrating a form of min-max optimality under certain settings. We also consider the case where the members are restricted to choose the best algorithm in each segment from a finite collection of candidate algorithms. Performance on synthetic and real data is given along with a Matlab implementation of the universal switching linear predictor.
Relative Loss Bounds for Temporal-Difference Learning
Machine Learning, 2000
Cited by 4 (0 self)
Foster and Vovk proved relative loss bounds for linear regression where the total loss of the online algorithm minus the total loss of the best linear predictor (chosen in hindsight) grows logarithmically with the number of trials. We give similar bounds for temporal-difference learning. Learning takes place in a sequence of trials where the learner tries to predict discounted sums of future reinforcement signals. The quality of the prediction is measured with the square loss and we bound the total loss of the online algorithm minus the total loss of the best linear predictor for the whole sequence of trials. Again the difference of the losses grows logarithmically with the number of trials. The bounds hold for an arbitrary (worst-case) sequence of examples. We also give a bound on the expected difference for the case when the instances are chosen from an unknown distribution. For linear regression a corresponding lower bound shows that this expected bound cannot be improved substantia...