Universal lossless compression with unknown alphabets - The average case
, 2006
Cited by 20 (4 self)
Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. It is shown that if the alphabet size $k$ is essentially small, then the average minimax and maximin redundancies, as well as the redundancy of every code for almost every source, when compressing a pattern, consist of at least $0.5 \log(n/k^3)$ bits per each unknown probability parameter, and if all alphabet letters are likely to occur, there exist codes whose redundancy is at most $0.5 \log(n/k^2)$ bits per each unknown probability parameter, where $n$ is the length of the data sequences. Otherwise, if the alphabet is large, these redundancies are essentially at least $O(n^{-2/3})$ bits per symbol, and there exist codes that achieve redundancy of essentially $O(n^{-1/2})$ bits per symbol. Two suboptimal low-complexity sequential algorithms for compression of patterns are presented and their description lengths
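The pattern definition in this abstract (indices assigned in increasing order of first occurrence) can be illustrated with a minimal sketch. The function name `pattern` and the use of 1-based indices are assumptions made for illustration; this is not one of the paper's compression algorithms, only the mapping from a sequence to its pattern.

```python
def pattern(sequence):
    """Map each symbol to the 1-based rank of its first occurrence.

    Illustrative helper (hypothetical name): the first distinct symbol
    seen becomes 1, the second distinct symbol becomes 2, and so on.
    """
    index_of = {}   # symbol -> index assigned at first occurrence
    result = []
    for symbol in sequence:
        if symbol not in index_of:
            index_of[symbol] = len(index_of) + 1
        result.append(index_of[symbol])
    return result


print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

The pattern discards the identity of the symbols, so any relabeling of the alphabet yields the same output.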
On the entropy rate of pattern processes
 Proceedings of the 2005 Data Compression Conference, Snowbird
, 2005
Cited by 10 (0 self)
We study the entropy rate of pattern sequences of stochastic processes, and its relationship to the entropy rate of the original process. We give a complete characterization of this relationship for i.i.d. processes over arbitrary alphabets, stationary ergodic processes over discrete alphabets, and a broad family of stationary ergodic processes over uncountable alphabets. For cases where the entropy rate of the pattern process is infinite, we characterize the possible growth rate of the block entropy.
Competitive Closeness Testing
 24TH ANNUAL CONFERENCE ON LEARNING THEORY
, 2011
Cited by 10 (4 self)
We test whether two sequences are generated by the same distribution or by two different ones. Unlike previous work, we make no assumptions on the distributions' support size. Additionally, we compare our performance to that of the best possible test. We describe an efficiently computable algorithm based on pattern maximum likelihood that is near optimal whenever the best possible error probability is $\le \exp(-14 n^{2/3})$ using length-$n$ sequences.
Optimal information storage: Nonsequential sources and neural channels
, 2006
Cited by 9 (6 self)
Information storage and retrieval systems are communication systems from the present to the future and fall naturally into the framework of information theory. The goal of information storage is to preserve as much signal fidelity under resource constraints as possible. The information storage theorem delineates average fidelity and average resource values that are achievable and those that are not. Moreover, observable properties of optimal information storage systems and the robustness of optimal systems
Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes
, 2008
Cited by 8 (2 self)
This paper deals with the problem of universal lossless coding on a countable infinite alphabet. It focuses on some classes of sources defined by an envelope condition on the marginal distribution, namely exponentially decreasing envelope classes with exponent $\alpha$. The minimax redundancy of exponentially decreasing envelope classes is proved to be equivalent to $\frac{1}{4\alpha \log e} \log^2 n$. Then a coding strategy is proposed, with a Bayes redundancy equivalent to the maximin redundancy. At last, an adaptive algorithm is provided, whose redundancy is equivalent to the minimax redundancy.
Tight Bounds on Profile Redundancy and Distinguishability
Cited by 7 (3 self)
The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machine-learning and information-theory researchers have recently considered label-invariant observations and properties induced by i.i.d. distributions. A sufficient statistic for all these properties is the data's profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by length-$n$ i.i.d. sequences is between $0.3 \cdot n^{1/3}$ and $n^{1/3} \log^2 n$, in particular, establishing its exact growth power.
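The profile described in this abstract (the multiset of the number of times each data element appears) can be sketched as follows. The function name `profile` and the choice to represent the multiset as a descending sorted list are assumptions for illustration, not notation taken from the paper.

```python
from collections import Counter


def profile(sequence):
    """Return the multiset of symbol multiplicities as a sorted list.

    Illustrative helper (hypothetical name). A sorted list is one
    convenient canonical form for a multiset of counts.
    """
    counts = Counter(sequence)  # symbol -> number of appearances
    return sorted(counts.values(), reverse=True)


print(profile("abracadabra"))  # [5, 2, 2, 1, 1]
```

Because only the counts are kept, the profile is label-invariant: relabeling or reordering the symbols leaves it unchanged.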
Identifying statistical dependence in genomic sequences via mutual information estimates
 EURASIP J. Bioinform. Syst. Biol
, 2007
Cited by 4 (0 self)
Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
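As a rough illustration of the kind of quantity this abstract refers to, the sketch below computes a plug-in (empirical) mutual information estimate between two aligned symbol sequences. The estimator, the function name, and the toy inputs are assumptions for illustration; the abstract does not specify the paper's estimator or its threshold function, and neither is reproduced here.

```python
import math
from collections import Counter


def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired samples.

    Assumption: xs and ys are equal-length sequences whose i-th
    elements form one paired observation.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have equal length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # (x, y) -> joint count
    count_x = Counter(xs)         # x -> marginal count
    count_y = Counter(ys)         # y -> marginal count
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


print(empirical_mutual_information("ACGT", "ACGT"))  # fully dependent
print(empirical_mutual_information("AACC", "ACAC"))  # empirically independent
```

With short sequences the plug-in estimate is biased upward, which is one reason a significance threshold of some kind is needed before declaring two segments dependent.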
Universal compression of Markov and related sources over arbitrary alphabets
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2006
Cited by 4 (2 self)
Recent work has considered encoding a string by separately conveying its symbols and its pattern, the order in which the symbols appear. It was shown that the patterns of i.i.d. strings can be losslessly compressed with diminishing per-symbol redundancy. In this paper the pattern redundancy of distributions with memory is considered. Close lower and upper bounds are established on the pattern redundancy of strings generated by Hidden Markov Models with a small number of states, showing in particular that their per-symbol pattern redundancy diminishes with increasing string length. The upper bounds are obtained by analyzing the growth rate of the number of multidimensional integer partitions, and the lower bounds, using Hayman's Theorem.
Classification using pattern probability estimators
 In Proceedings of IEEE Symposium on Information Theory
, 2010
Cited by 4 (2 self)
We consider the problem of classification, where the data of the classes are generated i.i.d. according to unknown probability distributions. The goal is to classify test data with minimum error probability, based on the training data available for the classes. The Likelihood Ratio Test (LRT) is the optimal decision rule when the distributions are known. Hence, a popular approach for classification is to estimate the likelihoods using well-known probability estimators, e.g., the Laplace and Good-Turing estimators, and use them in an LRT. We are primarily interested in situations where the alphabet of the underlying distributions is large compared to the training data available, which is indeed the case in most practical applications. We motivate and propose LRTs based on pattern probability estimators that are known to achieve low redundancy for universal compression of large alphabet sources. While a complete proof for optimality of these decision rules is warranted, we demonstrate their performance and compare it with other well-known classifiers by various experiments on synthetic data and real data for text classification.
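The baseline approach this abstract mentions, plugging Laplace (add-one) probability estimates into a likelihood ratio test, can be sketched as below. This is not the pattern-based classifier the paper proposes. The function names, the class labels "A" and "B", the tie-breaking rule, and the assumption that the alphabet size is known are all illustrative choices.

```python
import math
from collections import Counter


def laplace_log_likelihood(test, train, alphabet_size):
    """Log-likelihood of `test` under the add-one estimate from `train`.

    Assumption: `alphabet_size` is known and counts every symbol that
    can occur, so each estimate (count + 1) / (n + alphabet_size) > 0.
    """
    counts = Counter(train)
    n = len(train)
    return sum(
        math.log((counts[s] + 1) / (n + alphabet_size)) for s in test
    )


def classify(test, train_a, train_b, alphabet_size):
    """Likelihood ratio test between two classes; ties go to "A"."""
    llr = (laplace_log_likelihood(test, train_a, alphabet_size)
           - laplace_log_likelihood(test, train_b, alphabet_size))
    return "A" if llr >= 0 else "B"


print(classify("aab", train_a="aaab", train_b="bbba", alphabet_size=2))  # A
```

When the alphabet is large relative to the training data, most symbols are unseen and the add-one estimates become dominated by the smoothing term, which is the regime the abstract identifies as motivating other estimators.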
On Universal Coding of Unordered Data
Cited by 3 (1 self)
There are several applications in information transfer and storage where the order of source letters is irrelevant at the destination. For these source-destination pairs, multiset communication rather than the more difficult task of sequence communication may be performed. In this work, we study universal multiset communication. For classes of countable-alphabet sources that meet Kieffer's condition for sequence communication, we present a scheme that universally achieves a rate of n + o(n) bits per multiset letter for multiset communication. We also define redundancy measures that are normalized by the logarithm of the multiset size rather than per multiset letter and show that these redundancy measures cannot be driven to zero for the class of finite-alphabet memoryless multisets. This further implies that finite-alphabet memoryless multisets cannot be encoded universally with vanishing fractional redundancy.