Results 1  10
of
17
Universal lossless compression with unknown alphabets  The average case
, 2006
"... Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabe ..."
Abstract

Cited by 14 (4 self)
 Add to MetaCart
(Show Context)
Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. It is shown that if the alphabet size k is essentially small, then the average minimax and maximin redundancies as well as the redundancy of every code for almost every source, when compressing a pattern, consist of at least 0.5 log ( n/k 3) bits per each unknown probability parameter, and if all alphabet letters are likely to occur, there exist codes whose redundancy is at most 0.5 log ( n/k 2) bits per each unknown probability parameter, where n is the length of the data sequences. Otherwise, if the alphabet is large, these redundancies are essentially at least O ( n −2/3) bits per symbol, and there exist codes that achieve redundancy of essentially O ( n −1/2) bits per symbol. Two suboptimal lowcomplexity sequential algorithms for compression of patterns are presented and their description lengths
On the entropy rate of pattern processes
 Proceedings of the 2005 Data Compression Conference, Snowbird
, 2005
"... We study the entropy rate of pattern sequences of stochastic processes, and its relationship to the entropy rate of the original process. We give a complete characterization of this relationship for i.i.d. processes over arbitrary alphabets, stationary ergodic processes over discrete alphabets, and ..."
Abstract

Cited by 7 (0 self)
 Add to MetaCart
(Show Context)
We study the entropy rate of pattern sequences of stochastic processes, and its relationship to the entropy rate of the original process. We give a complete characterization of this relationship for i.i.d. processes over arbitrary alphabets, stationary ergodic processes over discrete alphabets, and a broad family of stationary ergodic processes over uncountable alphabets. For cases where the entropy rate of the pattern process is infinite, we characterize the possible growth rate of the block entropy. 1
Optimal information storage: Nonsequential sources and neural channels
, 2006
"... Information storage and retrieval systems are communication systems from the present to the future and fall naturally into the framework of information theory. The goal of information storage is to preserve as much signal fidelity under resource constraints as possible. The information storage theor ..."
Abstract

Cited by 6 (5 self)
 Add to MetaCart
(Show Context)
Information storage and retrieval systems are communication systems from the present to the future and fall naturally into the framework of information theory. The goal of information storage is to preserve as much signal fidelity under resource constraints as possible. The information storage theorem delineates average fidelity and average resource values that are achievable and those that are not. Moreover, observable properties of optimal information storage systems and the robustness of optimal systems
Identifying statistical dependence in genomic sequences via mutual information estimates
 EURASIP J. Bioinform. Syst. Biol
, 2007
"... Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of informationtheoretic tools for the task of iden ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
(Show Context)
Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of informationtheoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5 ’ untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of asyet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI’s Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
Universal compression of Markov and related sources over arbitrary alphabets
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2006
"... Recent work has considered encoding a string by separately conveying its symbols and its pattern—the order in which the symbols appear. It was shown that the patterns of i.i.d. strings can be losslessly compressed with diminishing persymbol redundancy. In this paper the pattern redundancy of distri ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
Recent work has considered encoding a string by separately conveying its symbols and its pattern—the order in which the symbols appear. It was shown that the patterns of i.i.d. strings can be losslessly compressed with diminishing persymbol redundancy. In this paper the pattern redundancy of distributions with memory is considered. Close lower and upper bounds are established on the pattern redundancy of strings generated by Hidden Markov Models with a small number of states, showing in particular that their persymbol pattern redundancy diminishes with increasing string length. The upper bounds are obtained by analyzing the growth rate of the number of multidimensional integer partitions, and the lower bounds, using Hayman’s Theorem.
Benefiting from disorder: Source coding for unordered data. arXiv
, 708
"... The order of letters is not always relevant in a communication task. This paper discusses the implications of order irrelevance on source coding, presenting results in several major branches of source coding theory: lossless coding, universal lossless coding, ratedistortion, highrate quantization, ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
(Show Context)
The order of letters is not always relevant in a communication task. This paper discusses the implications of order irrelevance on source coding, presenting results in several major branches of source coding theory: lossless coding, universal lossless coding, ratedistortion, highrate quantization, and universal lossy coding. The main conclusions demonstrate that there is a significant rate savings when order is irrelevant. In particular, lossless coding of n letters from a finite alphabet requires Θ(log n) bits and universal lossless coding requires n + o(n) bits for many countable alphabet sources. However, there are no universal schemes that can drive a strong redundancy measure to zero. Results for lossy coding include distributionfree expressions for the rate savings from order irrelevance in various highrate quantization schemes. Ratedistortion bounds are given, and it is shown that the analogue of the Shannon lower bound is loose at all finite rates.
Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes
, 2008
"... This paper deals with the problem of universal lossless coding on a countable infinite alphabet. It focuses on some classes of sources defined by an envelope condition on the marginal distribution, namely exponentially decreasing envelope classes with exponent α. The minimax redundancy of exponentia ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
This paper deals with the problem of universal lossless coding on a countable infinite alphabet. It focuses on some classes of sources defined by an envelope condition on the marginal distribution, namely exponentially decreasing envelope classes with exponent α. The minimax redundancy of exponentially decreasing envelope 1 classes is proved to be equivalent to 4α log e log² n. Then a coding strategy is proposed, with a Bayes redundancy equivalent to the maximin redundancy. At last, an adaptive algorithm is provided, whose redundancy is equivalent to the minimax redundancy.
Competitive Closeness Testing
 24TH ANNUAL CONFERENCE ON LEARNING THEORY
, 2011
"... We test whether two sequences are generated by the same distribution or by two different ones. Unlike previous work, we make no assumptions on the distributions ’ support size. Additionally, we compare our performance to that of the best possible test. We describe an efficientlycomputable algorithm ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
We test whether two sequences are generated by the same distribution or by two different ones. Unlike previous work, we make no assumptions on the distributions ’ support size. Additionally, we compare our performance to that of the best possible test. We describe an efficientlycomputable algorithm based on pattern maximum likelihood that is near optimal whenever the best possible error probability is ≤ exp(−14n 2/3) using lengthn sequences.
On Universal Coding of Unordered Data
"... Abstract — There are several applications in information transfer and storage where the order of source letters is irrelevant at the destination. For these sourcedestination pairs, multiset communication rather than the more difficult task of sequence communication may be performed. In this work, w ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
(Show Context)
Abstract — There are several applications in information transfer and storage where the order of source letters is irrelevant at the destination. For these sourcedestination pairs, multiset communication rather than the more difficult task of sequence communication may be performed. In this work, we study universal multiset communication. For classes of countablealphabet sources that meet Kieffer’s condition for sequence communication, we present a scheme that universally achieves a rate of n + o(n) bits per multiset letter for multiset communication. We also define redundancy measures that are normalized by the logarithm of the multiset size rather than per multiset letter and show that these redundancy measures cannot be driven to zero for the class of finitealphabet memoryless multisets. This further implies that finitealphabet memoryless multisets cannot be encoded universally with vanishing fractional redundancy. I.
Patterns of i.i.d. sequences and their entropy
 IEEE Trans. Inf. Theory
"... Bounds on the entropy of patterns of sequences generated by independently identically distributed (i.i.d.) sources are derived. A pattern is a sequence of indices that contains all consecutive integer indices in increasing order of first occurrence. If the alphabet of a source that generated a seque ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
(Show Context)
Bounds on the entropy of patterns of sequences generated by independently identically distributed (i.i.d.) sources are derived. A pattern is a sequence of indices that contains all consecutive integer indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. The bounds derived here are functions of the i.i.d. source entropy, alphabet size, and letter probabilities. It is shown that for large alphabets, the pattern entropy must decrease from the i.i.d. one. The decrease is in many cases more significant than the universal coding redundancy bounds derived in prior works. The pattern entropy is confined between two bounds that depend on the arrangement of the letter probabilities in the probability space. For very large alphabets whose size may be greater than the coded pattern length, all low probability letters are packed into one symbol. The pattern entropy is upper and lower bounded in terms of the i.i.d. entropy of the new packed alphabet. Correction terms are provided for both upper and lower bounds. The bounds are used to approximate the pattern entropy for various specific distributions, with focus on uniform and monotonic ones. Tight bounds are obtained on the pattern entropy even for distributions that have infinite i.i.d. entropy rates.