Results 11  20
of
235
Mutual information, Fisher information and population coding
 Neural Computation
, 1998
"... In the context of parameter estimation and model selection, it is only quite recently that a direct link between the Fisher information and information theoretic quantities has been exhibited. We give an interpretation of this link within the standard framework of information theory. We show that in ..."
Abstract

Cited by 61 (3 self)
 Add to MetaCart
In the context of parameter estimation and model selection, it is only quite recently that a direct link between the Fisher information and information theoretic quantities has been exhibited. We give an interpretation of this link within the standard framework of information theory. We show that in the context of population coding, the mutual information between the activity of a large array of neurons and a stimulus to which the neurons are tuned is naturally related to the Fisher information. In the light of this result we consider the optimization of the tuning curves parameters in the case of neurons responding to a stimulus represented by an angular variable. To appear in Neural Computation Vol. 10, Issue 7, published by the MIT press. 1 Laboratory associated with C.N.R.S. (U.R.A. 1306), ENS, and Universities Paris VI and Paris VII 1 Introduction A natural framework to study how neurons communicate, or transmit information, in the nervous system is information theory (see e...
Strong Optimality of the Normalized ML Models as Universal Codes
 IEEE Transactions on Information Theory
, 2000
"... We show that the normalized maximum likelihood (NML) distribution as a universal code for a parametric class of models is closest to the negative logarithm of the maximized likelihood in the mean code length distance, where the mean is taken with respect to the worst case model inside or outside ..."
Abstract

Cited by 59 (7 self)
 Add to MetaCart
We show that the normalized maximum likelihood (NML) distribution as a universal code for a parametric class of models is closest to the negative logarithm of the maximized likelihood in the mean code length distance, where the mean is taken with respect to the worst case model inside or outside the parametric class. We strengthen this result by showing that the same minmax bound results even when the data generating models are restricted to be most `benevolent' in minimizing the mean of the negative logarithm of the maximized likelihood. Further, we show for the class of exponential models that the bound cannot be beaten in essence by any code except when the mean is taken with respect to the most benevolent data generating models in a set of vanishing size. These results allow us to decompose the data into two parts, the first having all the useful information that can be extracted with the parametric models and the rest which has none. We also show that, if we change Ak...
A tutorial introduction to the minimum description length principle
 in Advances in Minimum Description Length: Theory and Applications. 2005
"... ..."
Hypothesis Selection and Testing by the MDL Principle
 The Computer Journal
, 1998
"... ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particul ..."
Abstract

Cited by 58 (3 self)
 Add to MetaCart
ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particular hypothesis, called the `null hypothesis', has already been selected as a favorite model and it will be abandoned in favor of another model only when it clearly fails to explain the currently available data. In model selection, by contrast, all the models considered are regarded on the same footing and the objective is simply to pick the one that best explains the data. For the Bayesians certain models may be favored in terms of a prior probability, but in the minimum description length (MDL) approach to be outlined below, prior knowledge of any kind is to be used in selecting the tentative models, which in the end, unlike in the Bayesians' case, can and will be fitted to data
3D Statistical Shape Models Using Direct Optimisation of Description Length
, 2002
"... We describea n a26`('9b method for buildingoptima 3D sta22j9b'2 sha e models from sets oftraj'Hj sha es. Althoughsha e models showconsideraj promisea a bami for segmentingan interpreting imainga ma jordra wba k of theae9`2j h is the need toestaH69 a dense correspondenceadenc a trance9 set ofexa') ..."
Abstract

Cited by 55 (4 self)
 Add to MetaCart
We describea n a26`('9b method for buildingoptima 3D sta22j9b'2 sha e models from sets oftraj'Hj sha es. Althoughsha e models showconsideraj promisea a bami for segmentingan interpreting imainga ma jordra wba k of theae9`2j h is the need toestaH69 a dense correspondenceadenc a trance9 set ofexa')( sha es. It is importa t to esta)`9b the correct correspondence, otherwise poor models ca result. In 2D, thisca be a hieved usingma ua `la9`'H`9b but in 3D this becomesimpra2`269 We show it is possible toesta6jH9 correspondences automatically, byca6)22 the correspondence problema one of finding the`optima) paima)9b`2'2)9 of ea hsha e in thetra'22 set. We describea n explicit representares ofsurfa6 paa6(9b`j"`9a tha ensures the resulting correspondencesad legac ag show how this representaen9ca bemaH('9b2)j to minimise thed933J292 length of the tra'H22 set using the model. This results incompaH models with good generab2('H9 properties. Resultsas reported for two sets ofbiomedica sha es, showingsignifica t improvement in model propertiescompa9' to thoseobta9j) usinga uniform surfam paam92))559b2'6 1
Spam filtering using statistical data compression models
 Journal of Machine Learning Research
, 2006
"... Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task call ..."
Abstract

Cited by 52 (12 self)
 Add to MetaCart
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on characterlevel or binary sequences. By modeling messages as sequences, tokenization and other errorprone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms; dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
MDL Denoising
 IEEE Transactions on Information Theory
, 1999
"... The socalled denoising problem, relative to normal models for noise, is formalized such that `noise' is defined as the incompressible part in the data while the compressible part defines the meaningful information bearing signal. Such a decomposition is effected by minimization of the ideal code ..."
Abstract

Cited by 50 (9 self)
 Add to MetaCart
The socalled denoising problem, relative to normal models for noise, is formalized such that `noise' is defined as the incompressible part in the data while the compressible part defines the meaningful information bearing signal. Such a decomposition is effected by minimization of the ideal code length, called for by the Minimum Description Length (MDL) principle, and obtained by an application of the normalized maximum likelihood technique to the primary parameters, their range, and their number. For any orthonormal regression matrix, such as defined by wavelet transforms, the minimization can be done with a threshold for the squared coefficients resulting from the expansion of the data sequence in the basis vectors defined by the matrix. keywords: linear regression, wavelet transforms, threshold, stochastic complexity, Kolmogorov sufficient statistics 1 Introduction Intuitively speaking the socalled `denoising' problem is to separate an observed data sequence x 1 ; x 2 ; ...
A Vector Quantization Approach to Universal Noiseless Coding and Quantization
 IEEE Trans. Inform. Theory
, 1996
"... AbstractA twostage code is a block code in which each block of data is coded in two stages: the first stage codes the identity of a block code among a collection of codes, and the second stage codes the data using the identified code. The collection of codes may he noiseless codes, fixedrate quan ..."
Abstract

Cited by 44 (10 self)
 Add to MetaCart
AbstractA twostage code is a block code in which each block of data is coded in two stages: the first stage codes the identity of a block code among a collection of codes, and the second stage codes the data using the identified code. The collection of codes may he noiseless codes, fixedrate quantizers, or variablerate quantizers. We take a vector quantization approach to twostage coding, in which the first stage code can be regarded as a vector quantizer that “quantizes ” the input data of length n to one of a fixed collection of block codes. We apply the generalized Lloyd algorithm to the firststage quantizer, using induced measures of rate and distortion, to design locally optimal twostage, codes. On a source of medical images, twostage variahlerate vector quantizers designed in this way outperform standard (onestage) fixedrate vector quantizers by over 9 dB. The tail of the operational distortionrate function of the firststage quantizer determines the optimal rate of convergence of the redundancy of a universal sequence of twostage codes. We show that there exist twostage universal noiseless codes, fixedrate quantizers, and variablerate quantizers whose perletter rate and distortion redundancies converge to zero as (k/2)n ’ logn, when the universe of sources has finite dimension k. This extends the achievability part of Rissanen’s theorem from universal noiseless codes to universal quantizers. Further, we show that the redundancies converge as O(n’) when the universe of sources is countable, and as O(r~l+‘) when the universe of sources is infinitedimensional, under appropriate conditions. Index TermsTwostage, adaptive, compression, minimum description length, clustering. I.
A Unifying Framework for Detecting Outliers and Change Points from NonStationary Time Series Data
 In Proc. of the Eighth ACM SIGKDD, ACM
, 2002
"... We m'e concerned vith the issues of outlier detection and change point detection from a data stream. In the area of data mining, there have been increased interest in these issues since the former is related to fraud detection, rare event discovery, etc., vhile the latter is related to event/trend c ..."
Abstract

Cited by 41 (2 self)
 Add to MetaCart
We m'e concerned vith the issues of outlier detection and change point detection from a data stream. In the area of data mining, there have been increased interest in these issues since the former is related to fraud detection, rare event discovery, etc., vhile the latter is related to event/trend change detection, activity monitoring, etc. Specifically, it is important to consider the situation where the data source is nonstationary, since the nature of data source may change over time in real applications. Although in most previous work outlier detection and change point detection have not been related explicitly, this paper presents a unifying frame vork for dealing vith both of them on the basis of the theory of online learning of nonstationary time series. In this framevork a probabilistic model of the data source is inerementally learned using an online discounting learning algorithm, which can track the changing data source adaplively by forgetting the effect of past data gradually. Then the score for any given data is calculated to measure its deviation from the learned model, vith a higher score indicating a high possibility of being an outlier. Further change points in a data stream are detected by applying this scoring method into a time series of moving averaged losses for prediction using the learned model. Specifically ve develop an efficient algorithms for online discounting learning of autoregression models from time series data, and demonstrate the validity of our framework through simulation and experimental applications to stock market data analysis.
Model Selection based on Minimum Description Length
 Journal of Mathematical Psychology
, 1999
"... this paper is, of necessity, quite technical. To get a first but much gentler glimpse, we advise to just read the following (section 2) and the last section (7), which discusses in what sense we may expect Occam's razor to actually work. 2 The Fundamental Idea ..."
Abstract

Cited by 39 (3 self)
 Add to MetaCart
this paper is, of necessity, quite technical. To get a first but much gentler glimpse, we advise to just read the following (section 2) and the last section (7), which discusses in what sense we may expect Occam's razor to actually work. 2 The Fundamental Idea