A Fast Normalized Maximum Likelihood Algorithm for Multinomial Data
- In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05)
, 2005
Cited by 5 (3 self)
Stochastic complexity of a data set is defined as the shortest possible code length for the data obtainable by using some fixed set of models. This measure is of great theoretical and practical importance as a tool for tasks such as model selection or data clustering. In the case of multinomial data, computing the modern version of stochastic complexity, defined as the Normalized Maximum Likelihood (NML) criterion, requires computing a sum with an exponential number of terms. Furthermore, in order to apply NML in practice, one often needs to compute a whole table of these exponential sums. In our previous work, we were able to compute this table by a recursive algorithm. The purpose of this paper is to significantly improve the time complexity of this algorithm. The techniques used here are based on the discrete Fourier transform and the convolution theorem.
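As a rough illustration of the sums this abstract refers to, the sketch below builds a table of multinomial NML normalizing sums C(K, n) with the standard splitting identity C(K, n) = sum over r of binom(n, r) (r/n)^r ((n-r)/n)^(n-r) C(K-1, r), using C(1, m) = 1. It is a plain recursion checked against brute-force enumeration, not the DFT-based method of the paper; the names `term`, `regret_table`, and `brute_force` are hypothetical.

```python
from itertools import product
from math import comb, factorial, isclose


def term(n, r):
    """binom(n, r) * (r/n)^r * ((n-r)/n)^(n-r), with 0^0 taken as 1."""
    if n == 0:
        return 1.0
    return comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)


def regret_table(max_k, n):
    """Return table[k][m] = C(k, m) for 1 <= k <= max_k and 0 <= m <= n."""
    table = {1: [1.0] * (n + 1)}  # a one-valued variable has a single data set
    for k in range(2, max_k + 1):
        prev = table[k - 1]
        # Split the k values into (k - 1 values) + (1 value); C(1, m - r) = 1.
        table[k] = [
            sum(term(m, r) * prev[r] for r in range(m + 1))
            for m in range(n + 1)
        ]
    return table


def brute_force(k, n):
    """Sum the maximized likelihood over all count vectors (exponential in k)."""
    total = 0.0
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) != n:
            continue
        coef = factorial(n)
        lik = 1.0
        for h in counts:
            coef //= factorial(h)  # partial multinomial quotients stay integral
            lik *= (h / n) ** h if n else 1.0
        total += coef * lik
    return total


if __name__ == "__main__":
    table = regret_table(3, 4)
    print(table[3][4], brute_force(3, 4))
```

The recursion fills each row in O(n^2) arithmetic operations, whereas the brute-force check enumerates (n + 1)^K candidate count vectors, which is why it is only usable for tiny inputs.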
Computing the Regret Table for Multinomial Data
, 2005
Cited by 4 (2 self)
Stochastic complexity of a data set is defined as the shortest possible code length for the data obtainable by using some fixed set of models. This measure is of great theoretical and practical importance as a tool for tasks such as model selection or data clustering. In the case
Analyzing the Stochastic Complexity via Tree Polynomials
, 2005
Cited by 4 (4 self)
Stochastic complexity of a data set is defined as the shortest possible code length for the data obtainable by using some fixed set of models. This measure
NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks
, 2007
Cited by 4 (3 self)
Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.
The Precise Minimax Redundancy
- In Proceedings of IEEE Symposium on Information Theory
, 2002
Cited by 4 (0 self)
We start with a quick introduction of the redundancy problem. A code $C_n : \mathcal{A}^n \to \{0,1\}^*$ is defined as a mapping from the set $\mathcal{A}^n$ of all sequences $x_1^n = (x_1, \ldots, x_n)$ of length $n$ over the finite alphabet $\mathcal{A}$ to the set $\{0,1\}^*$ of all binary sequences. Given a probabilistic source model, we let $P(x_1^n)$ be the probability of the message $x_1^n$; given a code $C_n$, we let $L(C_n, x_1^n)$ be the code length for $x_1^n$. From Shannon's works we know that the entropy $H_n(P) = -\sum_{x_1^n} P(x_1^n) \lg P(x_1^n)$ is the absolute lower bound on the expected code length, where $\lg := \log_2$ denotes the binary logarithm. Hence $-\lg P(x_1^n)$ can be viewed as the "ideal" code length. The next natural question is to ask by how much the length $L(C_n, x_1^n)$ of a code differs from the ideal code length, either for individual sequences or on average. The pointwise redundancy is $R_n(C_n, P, x_1^n) = L(C_n, x_1^n) + \lg P(x_1^n)$, while the average redundancy $\bar{R}_n(C_n, P)$ and the maximal redundancy $R_n^{*}(C_n, P)$ ...
Calculating the normalized maximum likelihood distribution for Bayesian forests
- In Proc. IADIS International Conference on Intelligent Systems and Agents
, 2007
Cited by 3 (2 self)
When learning Bayesian network structures from sample data, an important issue is how to evaluate the goodness of alternative network structures. Perhaps the most commonly used model (class) selection criterion is the marginal likelihood, which is obtained by integrating over a prior distribution for the model parameters. However, the problem of determining a reasonable prior for the parameters is a highly controversial issue, and no completely satisfying Bayesian solution has yet been presented in the non-informative setting. The normalized maximum likelihood (NML), based on Rissanen’s information-theoretic MDL methodology, offers an alternative, theoretically solid criterion that is objective and non-informative and requires no parameter prior. It has previously been shown that for discrete data, this criterion can be computed in linear time for Bayesian networks with no arcs, and in quadratic time for the so-called Naive Bayes network structure. Here we extend the previous results by showing how to compute the NML criterion in polynomial time for tree-structured Bayesian networks. The order of the polynomial depends on the number of values of the variables, but on neither the number of variables nor the sample size.
Some sufficient conditions on an arbitrary class of stochastic processes for the existence of a predictor
- In Proc. 19th International Conf. on Algorithmic Learning Theory (ALT’08), LNAI 5254
, 2008
Cited by 3 (2 self)
We consider the problem of sequence prediction in a probabilistic setting. Let there be given a class C of stochastic processes (probability measures on the set of one-way infinite sequences). We are interested in the question of what are the conditions on C under which there exists a predictor (also a stochastic process) for which the predicted probabilities converge to the correct ones if any of the processes in C is chosen to generate the data. We find some sufficient conditions on C under which such a predictor exists. Some of the conditions are asymptotic in nature, while others are based on the local (truncated to first observations) behaviour of the processes. The conditions lead to constructions of the predictors. In some cases we obtain rates of convergence that are optimal up to an additive logarithmic term. We emphasize that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to some parametric family.
Average Redundancy for Known Sources: Ubiquitous Trees in Source Coding
, 2008
Cited by 2 (0 self)
Analytic information theory aims at studying problems of information theory using analytic techniques of computer science and combinatorics. Following Hadamard’s precept, these problems are tackled by complex analysis methods such as generating functions, Mellin transform, Fourier series, saddle point method, analytic poissonization and depoissonization, and singularity analysis. This approach lies at the crossroads of computer science and information theory. In this survey we concentrate on one facet of information theory (i.e., source coding, better known as data compression), namely the redundancy rate problem. The redundancy rate problem determines by how much the actual code length exceeds the optimal code length. We further restrict our interest to the average redundancy for known sources, that is, when statistics of information sources are known. We present precise analyses of three types of lossless data compression schemes, namely fixed-to-variable (FV) length codes, variable-to-fixed (VF) length codes, and variable-to-variable (VV) length codes. In particular, we investigate average redundancy of Huffman, Tunstall, and Khodak codes. These codes have succinct representations as trees, either as coding or parsing trees, and we analyze here some of their parameters (e.g., the average path from the root to a leaf).
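To make the notion of average redundancy for a known source concrete, the following sketch builds Huffman code lengths for a given symbol distribution and subtracts the entropy from the expected code length. It covers only the simplest single-symbol case rather than the block codes analyzed in the survey, and the function names `huffman_lengths` and `average_redundancy` are hypothetical.

```python
import heapq
from math import log2


def huffman_lengths(probs):
    """Return the Huffman codeword length of each symbol (needs >= 2 symbols)."""
    # Heap items: (subtree probability, tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees puts every symbol in them one level deeper.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


def average_redundancy(probs):
    """Expected Huffman code length minus the source entropy, in bits."""
    lengths = huffman_lengths(probs)
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    entropy = -sum(p * log2(p) for p in probs if p > 0)
    return avg_len - entropy


if __name__ == "__main__":
    print(average_redundancy([0.5, 0.25, 0.25]))  # dyadic source
    print(average_redundancy([0.9, 0.1]))         # skewed binary source
```

For a dyadic distribution such as (1/2, 1/4, 1/4) the code lengths match the ideal lengths and the redundancy vanishes, while a skewed binary source still pays one full bit per symbol.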
Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Cited by 2 (0 self)
We present worst-case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms.
Combining Expert Advice Efficiently
Cited by 2 (0 self)
We show how models for prediction with expert advice can be defined concisely and clearly using hidden Markov models (HMMs); standard HMM algorithms can then be used to efficiently calculate how the expert predictions should be weighted according to the model. We cast many existing models as HMMs and recover the best known running times in each case. We also describe two new models: the switch distribution, which was recently developed to improve Bayesian/Minimum Description Length model selection, and a new generalisation of the fixed share algorithm based on run-length coding. We give loss bounds for all models and shed new light on the relationships between them.
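The HMM view described above can be sketched for one common formulation of fixed share: the hidden state is the expert currently in charge, and at each round it switches to a uniformly chosen different expert with probability alpha. The forward algorithm then yields the expert weights. This is an assumption-laden illustration under log loss, not the paper's construction; `fixed_share_weights`, `expert_probs`, and `alpha` are hypothetical names.

```python
def fixed_share_weights(expert_probs, alpha):
    """Forward-algorithm weights for a fixed-share style HMM over experts.

    expert_probs[t][i] is the probability that expert i assigned to the
    outcome actually observed at round t. Requires at least two experts.
    Returns the weight vectors used before each round, plus the final one.
    """
    n = len(expert_probs[0])
    weights = [1.0 / n] * n  # uniform prior over the initial expert
    history = [weights]
    for probs in expert_probs:
        # Emission step: Bayesian update on the observed outcome.
        posterior = [w * p for w, p in zip(weights, probs)]
        total = sum(posterior)
        posterior = [x / total for x in posterior]
        # Transition step: stay with probability 1 - alpha, otherwise move
        # to one of the other n - 1 experts uniformly at random.
        weights = [
            (1 - alpha) * posterior[i] + alpha * (1 - posterior[i]) / (n - 1)
            for i in range(n)
        ]
        history.append(weights)
    return history


if __name__ == "__main__":
    rounds = [[0.9, 0.1], [0.2, 0.8]]
    for w in fixed_share_weights(rounds, alpha=0.1):
        print(w)
```

Setting alpha to 0 removes the transition step and recovers the ordinary Bayesian mixture over experts, which is a quick sanity check on the update.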

