Results 1 - 5 of 5
On the entropy rate of pattern processes
- In Proceedings of the Data Compression Conference
, 2005
Cited by 2 (0 self)
We study the entropy rate of pattern sequences of stochastic processes, and its relationship to the entropy rate of the original process. We give a complete characterization of this relationship for i.i.d. processes over arbitrary alphabets, stationary ergodic processes over discrete alphabets, and a broad family of stationary ergodic processes over uncountable alphabets. For cases where the entropy rate of the pattern process is infinite, we characterize the possible growth rate of the block entropy.
Universal compression of Markov and related sources over arbitrary alphabets
- IEEE Transactions on Information Theory
, 2006
Cited by 1 (0 self)
Recent work has considered encoding a string by separately conveying its symbols and its pattern, the order in which the symbols appear. It was shown that the patterns of i.i.d. strings can be losslessly compressed with diminishing per-symbol redundancy. In this paper the pattern redundancy of distributions with memory is considered. Close lower and upper bounds are established on the pattern redundancy of strings generated by Hidden Markov Models with a small number of states, showing in particular that their per-symbol pattern redundancy diminishes with increasing string length. The upper bounds are obtained by analyzing the growth rate of the number of multi-dimensional integer partitions, and the lower bounds, using Hayman’s Theorem. Index Terms: Hidden Markov Models, integer partitions, large alphabets, multi-dimensional partitions, patterns,
Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates
Cited by 1 (0 self)
Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI’s Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
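The abstract above builds on mutual information between sequence segments. As a rough illustration only, the following sketch computes a plug-in (empirical-frequency) estimate of mutual information between two aligned symbol sequences. The function name `empirical_mutual_information` and the choice of a plug-in estimator are assumptions made here; the listing does not state which estimator or threshold function the paper uses.

```python
from collections import Counter
from math import log2


def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from two aligned sequences.

    Hypothetical helper for illustration; not the paper's estimator.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0

    joint = Counter(zip(xs, ys))  # counts of co-occurring symbol pairs
    px = Counter(xs)              # marginal counts for the first sequence
    py = Counter(ys)              # marginal counts for the second sequence

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi
```

With this sketch, a segment compared with itself yields its empirical entropy, while two segments whose symbols co-occur exactly as often as independence predicts yield zero. Plug-in estimates are biased upward on short sequences, which is one reason a significance threshold is needed in practice.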
On Universal Coding of Unordered Data
Cited by 1 (1 self)
There are several applications in information transfer and storage where the order of source letters is irrelevant at the destination. For these source-destination pairs, multiset communication rather than the more difficult task of sequence communication may be performed. In this work, we study universal multiset communication. For classes of countable-alphabet sources that meet Kieffer’s condition for sequence communication, we present a scheme that universally achieves a rate of n + o(n) bits per multiset letter for multiset communication. We also define redundancy measures that are normalized by the logarithm of the multiset size rather than per multiset letter and show that these redundancy measures cannot be driven to zero for the class of finite-alphabet memoryless multisets. This further implies that finite-alphabet memoryless multisets cannot be encoded universally with vanishing fractional redundancy.
Patterns of i.i.d. Sequences and Their Entropy - Part II: Bounds for Some Distributions
, 711
A pattern of a sequence is a sequence of integer indices with each index describing the order of first occurrence of the respective symbol in the original sequence. In a recent paper, tight general bounds on the block entropy of patterns of sequences generated by independent and identically distributed (i.i.d.) sources were derived. In this paper, precise approximations are provided for the pattern block entropies for patterns of sequences generated by i.i.d. uniform and monotonic distributions, including distributions over the integers, and the geometric distribution. Numerical bounds on the pattern block entropies of these distributions are provided even for very short blocks. Tight bounds are obtained even for distributions that have infinite i.i.d. entropy rates. The approximations are obtained using general bounds and their derivation techniques. Conditional index entropy is also studied for distributions over smaller alphabets. Index Terms: patterns, monotonic distributions, uniform distributions, entropy.
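The definition of a pattern given in the abstract above can be made concrete with a short sketch. The function name `pattern` and the 1-based indexing are assumptions chosen for illustration; the abstract only says that each index describes the order of first occurrence of the corresponding symbol.

```python
def pattern(seq):
    """Replace each symbol by the order in which it first appeared.

    Indices start at 1 (an assumed convention for this sketch).
    """
    first_seen = {}  # symbol -> index assigned at its first occurrence
    result = []
    for symbol in seq:
        if symbol not in first_seen:
            # A new symbol gets the next unused index.
            first_seen[symbol] = len(first_seen) + 1
        result.append(first_seen[symbol])
    return result


print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

The pattern discards the identity of the symbols and keeps only the repetition structure, so "abracadabra" and any relabeling of it (for example with a, b, r, c, d swapped for five other distinct symbols) share the same pattern.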

