Results 1  10
of
11
Universal lossless compression with unknown alphabets  The average case
, 2006
"... Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabe ..."
Abstract

Cited by 11 (3 self)
 Add to MetaCart
Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. It is shown that if the alphabet size k is essentially small, then the average minimax and maximin redundancies as well as the redundancy of every code for almost every source, when compressing a pattern, consist of at least 0.5 log ( n/k 3) bits per each unknown probability parameter, and if all alphabet letters are likely to occur, there exist codes whose redundancy is at most 0.5 log ( n/k 2) bits per each unknown probability parameter, where n is the length of the data sequences. Otherwise, if the alphabet is large, these redundancies are essentially at least O ( n −2/3) bits per symbol, and there exist codes that achieve redundancy of essentially O ( n −1/2) bits per symbol. Two suboptimal lowcomplexity sequential algorithms for compression of patterns are presented and their description lengths
On the entropy rate of pattern processes
 Proceedings of the 2005 Data Compression Conference, Snowbird
, 2005
"... We study the entropy rate of pattern sequences of stochastic processes, and its relationship to the entropy rate of the original process. We give a complete characterization of this relationship for i.i.d. processes over arbitrary alphabets, stationary ergodic processes over discrete alphabets, and ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
We study the entropy rate of pattern sequences of stochastic processes, and its relationship to the entropy rate of the original process. We give a complete characterization of this relationship for i.i.d. processes over arbitrary alphabets, stationary ergodic processes over discrete alphabets, and a broad family of stationary ergodic processes over uncountable alphabets. For cases where the entropy rate of the pattern process is infinite, we characterize the possible growth rate of the block entropy. 1
Optimal information storage: Nonsequential sources and neural channels
, 2006
"... Information storage and retrieval systems are communication systems from the present to the future and fall naturally into the framework of information theory. The goal of information storage is to preserve as much signal fidelity under resource constraints as possible. The information storage theor ..."
Abstract

Cited by 6 (5 self)
 Add to MetaCart
Information storage and retrieval systems are communication systems from the present to the future and fall naturally into the framework of information theory. The goal of information storage is to preserve as much signal fidelity under resource constraints as possible. The information storage theorem delineates average fidelity and average resource values that are achievable and those that are not. Moreover, observable properties of optimal information storage systems and the robustness of optimal systems
Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates ∗
"... Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of informationtheoretic tools for the task of identi ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of informationtheoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5 ’ untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of asyet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI’s Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
Universal compression of Markov and related sources over arbitrary alphabets
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2006
"... Recent work has considered encoding a string by separately conveying its symbols and its pattern—the order in which the symbols appear. It was shown that the patterns of i.i.d. strings can be losslessly compressed with diminishing persymbol redundancy. In this paper the pattern redundancy of distri ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
Recent work has considered encoding a string by separately conveying its symbols and its pattern—the order in which the symbols appear. It was shown that the patterns of i.i.d. strings can be losslessly compressed with diminishing persymbol redundancy. In this paper the pattern redundancy of distributions with memory is considered. Close lower and upper bounds are established on the pattern redundancy of strings generated by Hidden Markov Models with a small number of states, showing in particular that their persymbol pattern redundancy diminishes with increasing string length. The upper bounds are obtained by analyzing the growth rate of the number of multidimensional integer partitions, and the lower bounds, using Hayman’s Theorem.
On Universal Coding of Unordered Data
"... Abstract — There are several applications in information transfer and storage where the order of source letters is irrelevant at the destination. For these sourcedestination pairs, multiset communication rather than the more difficult task of sequence communication may be performed. In this work, w ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
Abstract — There are several applications in information transfer and storage where the order of source letters is irrelevant at the destination. For these sourcedestination pairs, multiset communication rather than the more difficult task of sequence communication may be performed. In this work, we study universal multiset communication. For classes of countablealphabet sources that meet Kieffer’s condition for sequence communication, we present a scheme that universally achieves a rate of n + o(n) bits per multiset letter for multiset communication. We also define redundancy measures that are normalized by the logarithm of the multiset size rather than per multiset letter and show that these redundancy measures cannot be driven to zero for the class of finitealphabet memoryless multisets. This further implies that finitealphabet memoryless multisets cannot be encoded universally with vanishing fractional redundancy. I.
Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes
, 2008
"... This paper deals with the problem of universal lossless coding on a countable infinite alphabet. It focuses on some classes of sources defined by an envelope condition on the marginal distribution, namely exponentially decreasing envelope classes with exponent α. The minimax redundancy of exponentia ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
This paper deals with the problem of universal lossless coding on a countable infinite alphabet. It focuses on some classes of sources defined by an envelope condition on the marginal distribution, namely exponentially decreasing envelope classes with exponent α. The minimax redundancy of exponentially decreasing envelope 1 classes is proved to be equivalent to 4α log e log² n. Then a coding strategy is proposed, with a Bayes redundancy equivalent to the maximin redundancy. At last, an adaptive algorithm is provided, whose redundancy is equivalent to the minimax redundancy.
Patterns of i.i.d. Sequences and Their Entropy Part II: Bounds for Some Distributions ∗
, 711
"... A pattern of a sequence is a sequence of integer indices with each index describing the order of first occurrence of the respective symbol in the original sequence. In a recent paper, tight general bounds on the block entropy of patterns of sequences generated by independent and identically distribu ..."
Abstract
 Add to MetaCart
A pattern of a sequence is a sequence of integer indices with each index describing the order of first occurrence of the respective symbol in the original sequence. In a recent paper, tight general bounds on the block entropy of patterns of sequences generated by independent and identically distributed (i.i.d.) sources were derived. In this paper, precise approximations are provided for the pattern block entropies for patterns of sequences generated by i.i.d. uniform and monotonic distributions, including distributions over the integers, and the geometric distribution. Numerical bounds on the pattern block entropies of these distributions are provided even for very short blocks. Tight bounds are obtained even for distributions that have infinite i.i.d. entropy rates. The approximations are obtained using general bounds and their derivation techniques. Conditional index entropy is also studied for distributions over smaller alphabets. Index Terms: patterns, monotonic distributions, uniform distributions, entropy.
Tight Bounds on Profile Redundancy and Distinguishability
"... The minimax KLdivergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estim ..."
Abstract
 Add to MetaCart
The minimax KLdivergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estimation and learning, it is the lowest expected logloss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machinelearning and informationtheory researchers have recently considered labelinvariant observations and properties induced by i.i.d. distributions. A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by lengthn i.i.d. sequences is between 0.3 · n 1/3 and n 1/3 log 2 n, in particular, establishing its exact growth power. 1