Results 1  10
of
45
Relative Loss Bounds for Online Density Estimation with the Exponential Family of Distributions
 MACHINE LEARNING
, 2000
"... We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the n ..."
Abstract

Cited by 116 (11 self)
 Add to MetaCart
We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the negative loglikelihood of the example with respect to the past parameter of the algorithm. An oline algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the online algorithm over the total loss of the best oline parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two exponential distributions. We also use our methods to prove relative loss bounds for linear regression.
Toward a method of selecting among computational models of cognition
 Psychological Review
, 2002
"... The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to ..."
Abstract

Cited by 74 (4 self)
 Add to MetaCart
The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to guide the evaluation and selection of these models. This article introduces a method of selecting among mathematical models of cognition known as minimum description length, which provides an intuitive and theoretically wellgrounded understanding of why one model should be chosen. A central but elusive concept in model selection, complexity, can also be derived with the method. The adequacy of the method is demonstrated in 3 areas of cognitive modeling: psychophysics, information integration, and categorization. How should one choose among competing theoretical explanations of data? This question is at the heart of the scientific enterprise, regardless of whether verbal models are being tested in an experimental setting or computational models are being evaluated in simulations. A number of criteria have been proposed to assist in this endeavor, summarized nicely by Jacobs and Grainger
Mining Product Reputations on the Web
, 2002
"... Knowing the reputations of your own and/or competitors products is important for marketing and customer relationship management. It is, however, very costly to collect and analyze survey data manually. This paper presents a new framework for mining product reputations on the Internet. It automatica ..."
Abstract

Cited by 64 (1 self)
 Add to MetaCart
Knowing the reputations of your own and/or competitors products is important for marketing and customer relationship management. It is, however, very costly to collect and analyze survey data manually. This paper presents a new framework for mining product reputations on the Internet. It automatically collects people's opinions about target products from Web pages, and uses text mining techniques to obtain reputations of the products. In advance, we generate, on the basis of humantested examples, syntactic and linguistic rules to determine whether any given statement is an opinion or not, and the positive/negative nature of that opinion. We first collect statements regarding target products using a general search engine, then, using the rules, extract opinions from them and attach to each of the opinions the labels
Competitive online statistics
 International Statistical Review
, 1999
"... A radically new approach to statistical modelling, which combines mathematical techniques of Bayesian statistics with the philosophy of the theory of competitive online algorithms, has arisen over the last decade in computer science (to a large degree, under the influence of Dawid’s prequential sta ..."
Abstract

Cited by 63 (10 self)
 Add to MetaCart
A radically new approach to statistical modelling, which combines mathematical techniques of Bayesian statistics with the philosophy of the theory of competitive online algorithms, has arisen over the last decade in computer science (to a large degree, under the influence of Dawid’s prequential statistics). In this approach, which we call “competitive online statistics”, it is not assumed that data are generated by some stochastic mechanism; the bounds derived for the performance of competitive online statistical procedures are guaranteed to hold (and not just hold with high probability or on the average). This paper reviews some results in this area; the new material in it includes the proofs for the performance of the Aggregating Algorithm in the problem of linear regression with square loss. Keywords: Bayes’s rule, competitive online algorithms, linear regression, prequential statistics, worstcase analysis.
Adaptive and SelfConfident OnLine Learning Algorithms
, 2000
"... We study online learning in the linear regression framework. Most of the performance bounds for online algorithms in this framework assume a constant learning rate. To achieve these bounds the learning rate must be optimized based on a posteriori information. This information depends on the wh ..."
Abstract

Cited by 62 (7 self)
 Add to MetaCart
We study online learning in the linear regression framework. Most of the performance bounds for online algorithms in this framework assume a constant learning rate. To achieve these bounds the learning rate must be optimized based on a posteriori information. This information depends on the whole sequence of examples and thus it is not available to any strictly online algorithm. We introduce new techniques for adaptively tuning the learning rate as the data sequence is progressively revealed. Our techniques allow us to prove essentially the same bounds as if we knew the optimal learning rate in advance. Moreover, such techniques apply to a wide class of online algorithms, including pnorm algorithms for generalized linear regression and Weighted Majority for linear regression with absolute loss. Our adaptive tunings are radically dierent from previous techniques, such as the socalled doubling trick. Whereas the doubling trick restarts the online algorithm several ti...
A tutorial introduction to the minimum description length principle
 in Advances in Minimum Description Length: Theory and Applications. 2005
"... ..."
Statistical Inference, Occam’s Razor, and Statistical Mechanics on the Space of Probability Distributions
, 1997
"... The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of lowtemperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known ..."
Abstract

Cited by 54 (3 self)
 Add to MetaCart
The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of lowtemperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known results in the literature. In particular, I arrive at a precise understanding of how Occam’s razor, the principle that simpler models should be preferred until the data justify more complex models, is automatically embodied by probability theory. These results require a measure on the space of model parameters and I derive and discuss an interpretation of Jeffreys ’ prior distribution as a uniform prior over the distributions indexed by a family. Finally, I derive a theoretical index of the complexity of a parametric family relative to some true distribution that I call the razor of the model. The form of the razor immediately suggests several interesting questions in the theory of learning that can be studied using the techniques of statistical mechanics.
Analysis of two gradientbased algorithms for online regression
 Journal of Computer and System Sciences
, 1999
"... In this paper we present a new analysis of two algorithms, Gradient Descent and Exponentiated Gradient, for solving regression problems in the online framework. Both these algorithms compute a prediction that depends linearly on the current instance, and then update the coefficients of this linear ..."
Abstract

Cited by 40 (5 self)
 Add to MetaCart
In this paper we present a new analysis of two algorithms, Gradient Descent and Exponentiated Gradient, for solving regression problems in the online framework. Both these algorithms compute a prediction that depends linearly on the current instance, and then update the coefficients of this linear combination according to the gradient of the loss function. However, the two algorithms have distinctive ways of using the gradient information for updating the coefficients. For each algorithm, we show general regression bounds for any convex loss function. Furthermore, we show special bounds for the absolute and the square loss functions, thus extending previous results by Kivinen and Warmuth. In the nonlinear regression case, we show general bounds for pairs of transfer and loss functions satisfying a certain condition. We apply this result to the Hellinger loss and the entropic loss in case of logistic regression (similar results, but only for the entropic loss, were also obtained by Helmbold et al. using a different analysis.) Finally, we describe the connection between our approach and a general family of gradientbased algorithms proposed by Warmuth et al. in recent works. 1999 Academic Press 1.
Predicting a Binary Sequence Almost as Well as the Optimal Biased Coin
, 1996
"... We apply the exponential weight algorithm, introduced and Littlestone and Warmuth [17] and by Vovk [24] to the problem of predicting a binary sequence almost as well as the best biased coin. We first show that for the case of the logarithmic loss, the derived algorithm is equivalent to the Bayes alg ..."
Abstract

Cited by 40 (5 self)
 Add to MetaCart
We apply the exponential weight algorithm, introduced and Littlestone and Warmuth [17] and by Vovk [24] to the problem of predicting a binary sequence almost as well as the best biased coin. We first show that for the case of the logarithmic loss, the derived algorithm is equivalent to the Bayes algorithm with Jeffrey's prior, that was studied by Xie and Barron under probabilistic assumptions [26]. We derive a uniform bound on the regret which holds for any sequence. We also show that if the empirical distribution of the sequence is bounded away from 0 and from 1, then, as the length of the sequence increases to infinity, the difference between this bound and a corresponding bound on the average case regret of the same algorithm (which is asymptotically optimal in that case) is only 1=2. We show that this gap of 1=2 is necessary by calculating the regret of the minmax optimal algorithm for this problem and showing that the asymptotic upper bound is tight. We also study the application...