Results 1–10 of 38
Relative Loss Bounds for Online Density Estimation with the Exponential Family of Distributions
 Machine Learning, 2000
Cited by 116 (11 self)

Abstract
We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example, the algorithm incurs a loss, which is the negative log-likelihood of the example with respect to the past parameter of the algorithm. An offline algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the online algorithm over the total loss of the best offline parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two exponential distributions. We also use our methods to prove relative loss bounds for linear regression.
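The setup described above can be sketched for the simplest exponential-family member, a Bernoulli density: the learner's parameter is a (smoothed) running average of past examples, and each example is charged its negative log-likelihood under the previous parameter. The smoothing constants and this particular model are illustrative assumptions, not taken from the paper:

```python
import math

def online_bernoulli_log_loss(xs, a=1.0, b=1.0):
    """Online density estimation for a Bernoulli model.

    The parameter kept by the learner is a smoothed average of the
    past examples; the loss on each example is its negative
    log-likelihood under the *previous* parameter.  The smoothing
    constants a, b are illustrative, not from the paper.
    """
    ones, total, loss = 0, 0, 0.0
    for x in xs:
        theta = (ones + a) / (total + a + b)   # smoothed running average
        loss += -math.log(theta if x == 1 else 1.0 - theta)
        ones += x
        total += 1
    return loss

def best_offline_log_loss(xs):
    """Total log loss of the single best hindsight Bernoulli parameter."""
    p = sum(xs) / len(xs)
    loss = 0.0
    for x in xs:
        q = p if x == 1 else 1.0 - p
        loss += -math.log(q) if q > 0 else 0.0
    return loss

xs = [1, 0, 1, 1, 0, 1, 1, 1]
regret = online_bernoulli_log_loss(xs) - best_offline_log_loss(xs)
```

The difference `regret` is the "additional total loss" that the relative loss bounds control; with the Laplace-style smoothing used here it is nonnegative on every sequence.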
An experimental and theoretical comparison of model selection methods
 Machine Learning 27, 1997
Cited by 110 (5 self)

Abstract
In the model selection problem, we must balance the complexity of a statistical model with its goodness of fit to the training data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general.
Sequential Prediction of Individual Sequences Under General Loss Functions
 IEEE Transactions on Information Theory, 1998
Cited by 75 (7 self)

Abstract
We consider adaptive sequential prediction of arbitrary binary sequences when the performance is evaluated using a general loss function. The goal is to predict on each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction strategies, called experts. By using a general loss function, we generalize previous work on universal prediction, forecasting, and data compression. However, here we restrict ourselves to the case when the comparison class is finite. For a given sequence, we define the regret as the total loss on the entire sequence suffered by the adaptive sequential predictor, minus the total loss suffered by the predictor in the comparison class that performs best on that particular sequence. We show that for a large class of loss functions, the minimax regret is either Θ(log N) or Ω(√(ℓ log N)), depending on the loss function, where N is the number of predictors in the comparison class a...
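The regret defined above can be made concrete with a generic exponentially weighted forecaster over a finite expert class under the absolute loss; the learning rate and the toy data below are illustrative assumptions, not the paper's construction:

```python
import math

def exp_weights_regret(outcomes, experts, eta=1.0):
    """Regret of the exponentially weighted average forecaster against
    the best of N fixed experts, under the absolute loss.

    A generic sketch of the setting described above, not the paper's
    exact algorithm.  `experts[i][t]` is expert i's prediction at time t.
    """
    n = len(experts)
    w = [1.0] * n                      # one weight per expert
    total = 0.0                        # forecaster's cumulative loss
    expert_loss = [0.0] * n            # each expert's cumulative loss
    for t, y in enumerate(outcomes):
        s = sum(w)
        pred = sum(wi * e[t] for wi, e in zip(w, experts)) / s
        total += abs(pred - y)
        for i, e in enumerate(experts):
            li = abs(e[t] - y)
            expert_loss[i] += li
            w[i] *= math.exp(-eta * li)   # exponential weight update
    return total - min(expert_loss)

outcomes = [1, 1, 0, 1, 1, 1]
experts = [[1] * 6, [0] * 6]           # two constant experts
r = exp_weights_regret(outcomes, experts)
```

For losses in [0, 1] this forecaster's regret is bounded by ln N / η + ηT / 8, one instance of the N-dependent regimes the theorem classifies.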
An Information-Theoretic External Cluster-Validity Measure
 Research Report RJ 10219, IBM, 2001
Cited by 62 (3 self)

Abstract
In this paper we propose a measure of similarity/association between two partitions of a set of objects. Our motivation is the desire to use the measure to characterize the quality or accuracy of clustering algorithms by somehow comparing the clusters they produce with "ground truth" consisting of classes assigned to the patterns by manual means or some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also allows clusterings with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels of the patterns are as predictors of their class labels. When all clusterings to be compared have the same number of clusters, the measure is equivalent to the mutual information between the cluster labels and the class labels. In cases where the numbers of clusters are different, however, it computes the reduction in the number of bits that w...
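When the clusterings compared have the same number of clusters, the measure reduces to the mutual information between cluster labels and class labels, which can be computed directly from the label pairs. A minimal sketch of that special case (the paper's handling of unequal cluster counts is not reproduced here):

```python
import math
from collections import Counter

def mutual_information(cluster_labels, class_labels):
    """Mutual information (in bits) between cluster and class labels.

    This is the quantity the external validity measure reduces to when
    the clusterings compared have equal numbers of clusters.
    """
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))  # co-occurrence counts
    pc = Counter(cluster_labels)                        # cluster marginals
    pk = Counter(class_labels)                          # class marginals
    mi = 0.0
    for (c, k), nck in joint.items():
        p = nck / n
        # p(c,k) * log2[ p(c,k) / (p(c) p(k)) ]
        mi += p * math.log2(p * n * n / (pc[c] * pk[k]))
    return mi
```

A perfectly aligned two-cluster/two-class labeling yields 1 bit, while independent labelings yield 0, matching the intuition that MI measures how well cluster labels predict class labels.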
Predicting a Binary Sequence Almost as Well as the Optimal Biased Coin
1996
Cited by 40 (5 self)

Abstract
We apply the exponential weight algorithm, introduced by Littlestone and Warmuth [17] and by Vovk [24], to the problem of predicting a binary sequence almost as well as the best biased coin. We first show that for the case of the logarithmic loss, the derived algorithm is equivalent to the Bayes algorithm with Jeffreys' prior, that was studied by Xie and Barron under probabilistic assumptions [26]. We derive a uniform bound on the regret which holds for any sequence. We also show that if the empirical distribution of the sequence is bounded away from 0 and from 1, then, as the length of the sequence increases to infinity, the difference between this bound and a corresponding bound on the average-case regret of the same algorithm (which is asymptotically optimal in that case) is only 1/2. We show that this gap of 1/2 is necessary by calculating the regret of the minimax optimal algorithm for this problem and showing that the asymptotic upper bound is tight. We also study the application...
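For the Bernoulli model, the Bayes algorithm with Jeffreys' Beta(1/2, 1/2) prior is the classical Krichevsky–Trofimov "add-1/2" rule. A minimal sketch of that rule and of its regret against the best biased coin (the test sequence below is illustrative):

```python
import math

def kt_log_loss(bits):
    """Total log loss (in nats) of the Bayes predictor with Jeffreys'
    Beta(1/2, 1/2) prior -- the Krichevsky-Trofimov 'add-1/2' rule."""
    ones = zeros = 0
    loss = 0.0
    for b in bits:
        p1 = (ones + 0.5) / (ones + zeros + 1.0)  # add-1/2 estimate
        loss += -math.log(p1 if b == 1 else 1.0 - p1)
        ones += b
        zeros += 1 - b
    return loss

def best_coin_loss(bits):
    """Total log loss of the best fixed biased coin in hindsight."""
    n, k = len(bits), sum(bits)
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return -(k * math.log(p) + (n - k) * math.log(1 - p))

bits = [1, 1, 0, 1, 1, 1, 0, 1]
regret = kt_log_loss(bits) - best_coin_loss(bits)
```

The classical uniform bound for this rule is (1/2) log n + log 2 nats on every sequence, of the same (1/2) log n order as the bounds discussed in the abstract.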
On Selecting Models for Nonlinear Time Series
 Physica D, 1995
Cited by 39 (11 self)

Abstract
Constructing models from time series with nontrivial dynamics involves the problem of how to choose the best model from within a class of models, or to choose between competing classes. This paper discusses a method of building nonlinear models of possibly chaotic systems from data, while maintaining good robustness against noise. The models that are built are close to the simplest possible according to a description length criterion. The method will deliver a linear model if that has shorter description length than a nonlinear model. We show how our models can be used for prediction, smoothing and interpolation in the usual way. We also show how to apply the results to identification of chaos by detecting the presence of homoclinic orbits directly from time series.

1 The Model Selection Problem

As our understanding of chaotic and other nonlinear phenomena has grown, it has become apparent that linear models are inadequate to model most dynamical processes. Nevertheless, linear models...
Inducing the morphological lexicon of a natural language from unannotated text
 In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05)
Cited by 32 (7 self)

Abstract
This work presents an algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language. A probabilistic maximum a posteriori model is utilized, which builds hierarchical representations for a set of morphs, which are morpheme-like units discovered from unannotated text corpora. The induced morph lexicon stores parameters related to both the “meaning” and “form” of the morphs it contains. These parameters affect the role of the morphs in words. The model is implemented in a task of unsupervised morpheme segmentation of Finnish and English words. Very good results are obtained for Finnish and almost as good results are obtained in the English task.
A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split
 Neural Computation, 1996
Cited by 24 (0 self)

Abstract
We give an analysis of the generalization error of cross validation in terms of two natural measures of the difficulty of the problem under consideration: the approximation rate (the accuracy to which the target function can be ideally approximated as a function of the number of hypothesis parameters), and the estimation rate (the deviation between the training and generalization errors as a function of the number of hypothesis parameters). The approximation rate captures the complexity of the target function with respect to the hypothesis model, and the estimation rate captures the extent to which the hypothesis model suffers from overfitting. Using these two measures, we give a rigorous and general bound on the error of cross validation. The bound clearly shows the tradeoffs involved with making γ, the fraction of data saved for testing, too large or too small. By optimizing the bound with respect to γ, we then argue (through a combination of formal analysis, plotting, and ...
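The quantity being traded off can be sketched with a plain holdout estimator parameterized by the test fraction γ; the helper below and its toy use are illustrative, not the paper's analysis:

```python
import random

def holdout_error(data, fit, loss, gamma, seed=0):
    """Estimate generalization error with a train/test split that
    reserves a fraction `gamma` of the data for testing.

    A generic sketch of the setting analysed above: larger gamma gives
    a lower-variance test estimate but a worse-trained model, and vice
    versa.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(gamma * len(shuffled)))
    test, train = shuffled[:n_test], shuffled[n_test:]
    model = fit(train)
    return sum(loss(model, z) for z in test) / len(test)

# toy use: estimate the mean of noisy observations under squared loss
data = [1.0, 1.2, 0.8, 1.1, 0.9, 1.05, 0.95, 1.15]
err = holdout_error(data,
                    fit=lambda tr: sum(tr) / len(tr),
                    loss=lambda m, z: (m - z) ** 2,
                    gamma=0.25)
```

Sweeping `gamma` and plotting `err` against it is the empirical counterpart of optimizing the bound with respect to γ.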
The Acquisition of a Unification-Based Generalised Categorial Grammar
2002
Cited by 24 (4 self)

Abstract
The purpose of this work is to investigate the process of grammatical acquisition from data. In order to do that, a computational learning system is used, composed of a Universal Grammar with associated parameters, and a learning algorithm, following the Principles and Parameters Theory. The Universal Grammar is implemented as a Unification-Based Generalised Categorial Grammar, embedded in a default inheritance network of lexical types. The learning algorithm receives input from a corpus of spontaneous child-directed transcribed speech annotated with logical forms and sets the parameters based on this input. This framework is used as a basis to investigate several aspects of language acquisition. In this thesis I concentrate on the acquisition of subcategorisation frames and word order information, from data. The data to which the learner is exposed can be noisy and ambiguous, and I investigate how these factors affect the learning process. The results obtained show a robust learner converging towards the target grammar given the input data available. They also show how the amount of noise present in the input data affects the speed of convergence of the learner towards the target grammar. Future work is suggested for investigating the developmental stages of language acquisition as predicted by the learning model, with a thorough comparison with the developmental stages of a child. This is primarily a cognitive computational model of language learning that can be used to investigate and gain a better understanding of human language acquisition, and can potentially be relevant to the development of more adaptive NLP technology.
Application of change detection to dynamic contact sensing
 The International Journal of Robotics Research, 1994
Cited by 21 (1 self)

Abstract
The forces of contact during manipulation convey substantial information about the state of the manipulation.