Results 11–20 of 346
Sequential prediction of individual sequences under general loss functions
IEEE Trans. on Information Theory, 1998
Abstract

Cited by 84 (8 self)
Abstract—We consider adaptive sequential prediction of arbitrary binary sequences when the performance is evaluated using a general loss function. The goal is to predict on each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction strategies, called experts. By using a general loss function, we generalize previous work on universal prediction, forecasting, and data compression. However, here we restrict ourselves to the case when the comparison class is finite. For a given sequence, we define the regret as the total loss on the entire sequence suffered by the adaptive sequential predictor, minus the total loss suffered by the predictor in the comparison class that performs best on that particular sequence. We show that for a large class of loss functions, the minimax regret is either (log N) ...
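The small-regret behaviour described in this abstract is typically achieved by weighting each expert exponentially in its cumulative past loss. The sketch below is a generic exponentially weighted average forecaster for binary outcomes under the absolute loss; the function name, the fixed learning rate `eta`, and the loss choice are illustrative assumptions, not details taken from the paper:

```python
import math

def exp_weighted_forecaster(expert_preds, outcomes, eta):
    """Aggregate N experts on a binary sequence under the absolute loss.

    expert_preds: length-T list of length-N lists of predictions in [0, 1].
    outcomes:     length-T list of binary outcomes.
    eta:          learning rate (an assumed tuning parameter).
    Returns (forecaster_loss, best_expert_loss).
    """
    n_experts = len(expert_preds[0])
    weights = [1.0] * n_experts
    forecaster_loss = 0.0
    expert_loss = [0.0] * n_experts
    for preds, y in zip(expert_preds, outcomes):
        # Predict with the weight-averaged expert advice.
        p = sum(w * f for w, f in zip(weights, preds)) / sum(weights)
        forecaster_loss += abs(p - y)
        # Exponentially down-weight each expert in proportion to its loss.
        for i, f in enumerate(preds):
            loss = abs(f - y)
            expert_loss[i] += loss
            weights[i] *= math.exp(-eta * loss)
    return forecaster_loss, min(expert_loss)
```

On a sequence where one expert is always correct, the forecaster's cumulative loss stays bounded by a constant depending on `eta` and log N, which matches the flavour of the minimax regret results the abstract discusses.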
Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity
IEEE Transactions on Information Theory, 1998
Abstract

Cited by 79 (7 self)
The relationship between the Bayesian approach and the minimum description length approach is established. We sharpen and clarify the general modeling principles MDL and MML, abstracted as the ideal MDL principle and defined from Bayes's rule by means of Kolmogorov complexity. The basic condition under which the ideal principle should be applied is encapsulated as the Fundamental Inequality, which in broad terms states that the principle is valid when the data are random, relative to every contemplated hypothesis and also these hypotheses are random relative to the (universal) prior. Basically, the ideal principle states that the prior probability associated with the hypothesis should be given by the algorithmic universal probability, and the sum of the log universal probability of the model plus the log of the probability of the data given the model should be minimized. If we restrict the model class to the finite sets then application of the ideal principle turns into Kolmogorov's mi...
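The two-part objective the abstract describes — minimize the log universal probability of the model plus the log probability of the data given the model — can be imitated with a computable toy. In the sketch below a uniform prior over a finite grid of Bernoulli hypotheses stands in for the (uncomputable) algorithmic universal prior; the function name and the grid are hypothetical, not the paper's construction:

```python
import math

def ideal_mdl_select(data, hypotheses):
    """Minimize -log2 P(H) - log2 P(data | H) over a finite hypothesis grid.

    data:       list of 0/1 observations.
    hypotheses: Bernoulli parameters, each strictly between 0 and 1; a
                uniform prior over the grid stands in for the universal prior.
    Returns (best_theta, total_code_length_in_bits).
    """
    prior_bits = math.log2(len(hypotheses))  # -log2 P(H) under a uniform prior
    ones = sum(data)
    zeros = len(data) - ones
    best = None
    for theta in hypotheses:
        data_bits = -(ones * math.log2(theta) + zeros * math.log2(1 - theta))
        total = prior_bits + data_bits
        if best is None or total < best[1]:
            best = (theta, total)
    return best
```

With a uniform prior the prior term is constant, so the selection reduces to maximum likelihood on the grid; the ideal principle differs precisely in that its prior term (universal probability) genuinely varies with the complexity of the hypothesis.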
Strong Optimality of the Normalized ML Models as Universal Codes
IEEE Transactions on Information Theory, 2000
Abstract

Cited by 78 (8 self)
We show that the normalized maximum likelihood (NML) distribution as a universal code for a parametric class of models is closest to the negative logarithm of the maximized likelihood in the mean code length distance, where the mean is taken with respect to the worst case model inside or outside the parametric class. We strengthen this result by showing that the same minimax bound results even when the data generating models are restricted to be most `benevolent' in minimizing the mean of the negative logarithm of the maximized likelihood. Further, we show for the class of exponential models that the bound cannot be beaten in essence by any code except when the mean is taken with respect to the most benevolent data generating models in a set of vanishing size. These results allow us to decompose the data into two parts, the first having all the useful information that can be extracted with the parametric models and the rest, which has none. We also show that, if we change Ak...
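A concrete instance of the NML distribution: for the Bernoulli class, P_NML(x) is the maximized likelihood of x divided by the sum of maximized likelihoods over all strings of the same length. The brute-force normalizer below is only feasible for tiny n and is purely illustrative:

```python
from itertools import product

def nml_probability(x):
    """NML probability of a binary string under the Bernoulli class.

    P_NML(x) = max_p P(x | p) / sum over all same-length y of max_p P(y | p),
    where the maximum is attained at the ML estimate p = (#ones)/n.
    """
    def max_lik(seq):
        n, k = len(seq), sum(seq)
        if k in (0, n):
            return 1.0  # ML parameter is 0 or 1, likelihood is exactly 1
        p = k / n
        return p ** k * (1 - p) ** (n - k)

    norm = sum(max_lik(y) for y in product((0, 1), repeat=len(x)))
    return max_lik(x) / norm
```

The logarithm of the normalizer is the parametric complexity of the class; −log2 P_NML(x) is the universal code length whose optimality the abstract analyzes.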
Spam filtering using statistical data compression models
Journal of Machine Learning Research, 2006
Abstract

Cited by 70 (12 self)
Spam filtering poses a special problem in text categorization, whose defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updatable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
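The classification principle here is: train one adaptive model per class, then assign a message to the class whose model encodes (compresses) it in fewer bits. The sketch below uses a much simpler order-1 character model with add-one smoothing, not the DMC or PPM algorithms the paper evaluates; all names and the alphabet size are assumptions:

```python
import math
from collections import defaultdict

class CharModel:
    """Adaptive order-1 character model with add-one smoothing."""

    def __init__(self, alphabet_size=256):
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def update(self, text):
        # Incremental training: just bump bigram counts.
        for a, b in zip(text, text[1:]):
            self.counts[a][b] += 1
            self.totals[a] += 1

    def code_length(self, text):
        """Bits the model needs to encode text (its compressed size)."""
        bits = 0.0
        for a, b in zip(text, text[1:]):
            p = (self.counts[a][b] + 1) / (self.totals[a] + self.alphabet_size)
            bits -= math.log2(p)
        return bits

def classify(message, spam_model, ham_model):
    # Assign the class whose model compresses the message better.
    if spam_model.code_length(message) < ham_model.code_length(message):
        return "spam"
    return "ham"
```

Because the models operate on raw character sequences, no tokenizer is needed, and `update` can be called on each new user-labeled message, giving the fast incremental behaviour the abstract calls for.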
3D Statistical Shape Models Using Direct Optimisation of Description Length
2002
Abstract

Cited by 69 (7 self)
We describe an automatic method for building optimal 3D statistical shape models from sets of training shapes. Although shape models show considerable promise as a basis for segmenting and interpreting images, a major drawback of the approach is the need to establish a dense correspondence across a training set of example shapes. It is important to establish the correct correspondence, otherwise poor models can result. In 2D, this can be achieved using manual landmarking, but in 3D this becomes impractical. We show it is possible to establish correspondences automatically, by casting the correspondence problem as one of finding the `optimal' parameterisation of each shape in the training set. We describe an explicit representation of surface parameterisation that ensures the resulting correspondences are legal, and show how this representation can be manipulated to minimise the description length of the training set using the model. This results in compact models with good generalisation properties. Results are reported for two sets of biomedical shapes, showing significant improvement in model properties compared to those obtained using a uniform surface parameterisation.
A Unifying Framework for Detecting Outliers and Change Points from Non-Stationary Time Series Data
In Proc. of the Eighth ACM SIGKDD, ACM, 2002
Abstract

Cited by 65 (3 self)
We are concerned with the issues of outlier detection and change point detection from a data stream. In the area of data mining, there has been increased interest in these issues, since the former is related to fraud detection, rare event discovery, etc., while the latter is related to event/trend change detection, activity monitoring, etc. Specifically, it is important to consider the situation where the data source is non-stationary, since the nature of the data source may change over time in real applications. Although in most previous work outlier detection and change point detection have not been related explicitly, this paper presents a unifying framework for dealing with both of them on the basis of the theory of online learning of non-stationary time series. In this framework a probabilistic model of the data source is incrementally learned using an online discounting learning algorithm, which can track the changing data source adaptively by forgetting the effect of past data gradually. Then the score for any given data is calculated to measure its deviation from the learned model, with a higher score indicating a higher possibility of being an outlier. Further, change points in a data stream are detected by applying this scoring method to a time series of moving-averaged prediction losses under the learned model. Specifically, we develop efficient algorithms for online discounting learning of autoregression models from time series data, and demonstrate the validity of our framework through simulation and experimental applications to stock market data analysis.
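The two ingredients the abstract describes — a discounted (forgetting) online model update and a deviation score for each point — can be sketched with a discounted Gaussian in place of the paper's autoregressive model. The function name, the initial model, and the rate `r` are assumptions:

```python
import math

def discounting_scores(xs, r=0.05):
    """Score each point by its log-loss under a discounted Gaussian model.

    r is the discounting (forgetting) rate: larger r tracks changes faster.
    A high score flags a likely outlier or the onset of a change point.
    """
    mu, var = 0.0, 1.0  # assumed initial model
    scores = []
    for x in xs:
        # Log-loss of x under the model learned from past data only.
        scores.append(0.5 * math.log(2 * math.pi * var)
                      + (x - mu) ** 2 / (2 * var))
        # Sequentially discounting update: old data fades geometrically.
        mu = (1 - r) * mu + r * x
        var = (1 - r) * var + r * (x - mu) ** 2
    return scores
```

For change-point (rather than single-outlier) detection, the paper's framework applies the same scoring a second time to a moving average of these losses, so that a sustained shift, not just one spike, produces a high score.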
Hypothesis Selection and Testing by the MDL Principle
The Computer Journal, 1998
Abstract

Cited by 64 (3 self)
… cases where the variance is known or taken as a parameter.

1. INTRODUCTION

Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particular hypothesis, called the `null hypothesis', has already been selected as a favorite model, and it will be abandoned in favor of another model only when it clearly fails to explain the currently available data. In model selection, by contrast, all the models considered are regarded on the same footing, and the objective is simply to pick the one that best explains the data. For the Bayesians, certain models may be favored in terms of a prior probability, but in the minimum description length (MDL) approach to be outlined below, prior knowledge of any kind is to be used in selecting the tentative models, which in the end, unlike in the Bayesians' case, can and will be fitted to data
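The asymmetry described above — a favored null that is abandoned only when it clearly fails — has a natural MDL reading: retain the null unless the fitted alternative codes the data more cheaply even after paying roughly (1/2) log2 n bits for its estimated parameter. This is a generic textbook-style sketch on Bernoulli data, not the paper's specific construction:

```python
import math

def mdl_test(data, null_theta=0.5):
    """Retain the null Bernoulli hypothesis unless the fitted alternative
    codes the data more cheaply even after paying ~(1/2) log2 n bits
    for its estimated parameter."""
    n, k = len(data), sum(data)

    def bits(theta):
        if theta == 0.0:
            return 0.0 if k == 0 else float("inf")
        if theta == 1.0:
            return 0.0 if k == n else float("inf")
        return -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))

    null_bits = bits(null_theta)
    alt_bits = bits(k / n) + 0.5 * math.log2(n)  # data cost + parameter cost
    return ("reject", alt_bits) if alt_bits < null_bits else ("retain", null_bits)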
Model Selection based on Minimum Description Length
Journal of Mathematical Psychology, 1999
Abstract

Cited by 59 (3 self)
This paper is, of necessity, quite technical. For a first but much gentler glimpse, we advise reading just the following section (2) and the last section (7), which discusses in what sense we may expect Occam's razor to actually work.

2 The Fundamental Idea
MDL Denoising
IEEE Transactions on Information Theory, 1999
Abstract

Cited by 59 (10 self)
The so-called denoising problem, relative to normal models for noise, is formalized such that `noise' is defined as the incompressible part in the data, while the compressible part defines the meaningful information-bearing signal. Such a decomposition is effected by minimization of the ideal code length, called for by the Minimum Description Length (MDL) principle, and obtained by an application of the normalized maximum likelihood technique to the primary parameters, their range, and their number. For any orthonormal regression matrix, such as defined by wavelet transforms, the minimization can be done with a threshold for the squared coefficients resulting from the expansion of the data sequence in the basis vectors defined by the matrix.

keywords: linear regression, wavelet transforms, threshold, stochastic complexity, Kolmogorov sufficient statistics

1 Introduction

Intuitively speaking, the so-called `denoising' problem is to separate an observed data sequence x_1, x_2, ...
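The thresholding idea above can be sketched with a one-level orthonormal Haar transform and a hard threshold on the squared detail coefficients. The threshold `sigma2 * ln(n)` is a universal-threshold-style stand-in, not the NML-derived threshold the paper obtains, and all names are assumptions:

```python
import math

def haar(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    s2 = math.sqrt(2)
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) / s2 for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / s2 for i in range(half)]
    return approx, detail

def inverse_haar(approx, detail):
    s2 = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / s2, (a - d) / s2]
    return out

def denoise(signal, sigma2):
    """Zero every detail coefficient whose square falls below
    sigma2 * ln(n): what survives is kept as signal, the rest as noise."""
    approx, detail = haar(signal)
    threshold = sigma2 * math.log(len(signal))
    detail = [d if d * d > threshold else 0.0 for d in detail]
    return inverse_haar(approx, detail)
```

Because the transform is orthonormal, zeroing small coefficients in the transform domain and inverting gives the least-squares projection onto the retained basis vectors, which is exactly the structure the abstract describes for wavelet regression matrices.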