A lower bound on compression of unknown alphabets
- Theoret. Comput. Sci., 2005
Cited by 6 (3 self)
Many applications call for universal compression of strings over large, possibly infinite, alphabets. However, it has long been known that the resulting redundancy is infinite even for i.i.d. distributions. It was recently shown that the redundancy of the strings’ patterns, which abstract the values of the symbols, retaining only their relative precedence, is sublinear in the blocklength n, hence the per-symbol redundancy diminishes to zero. In this paper we show that pattern redundancy is at least $(1.5 \log_2 e)\, n^{1/3}$ bits. To do so, we construct a generating function whose coefficients lower bound the redundancy, and use Hayman’s saddle-point approximation technique to determine the coefficients’ asymptotic behavior.
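To get a feel for the size of the stated lower bound, the sketch below simply evaluates (1.5 log2 e) n^(1/3) for a few blocklengths and divides by n. The function name is hypothetical and nothing here reproduces the paper's generating-function argument; it only illustrates that a bound growing like n^(1/3) is compatible with a per-symbol redundancy that tends to zero.

```python
import math


def pattern_redundancy_lower_bound(n: int) -> float:
    """Evaluate (1.5 * log2(e)) * n**(1/3), the bound quoted in the abstract, in bits.

    The function name is an illustrative assumption, not taken from the paper.
    """
    return 1.5 * math.log2(math.e) * n ** (1.0 / 3.0)


if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        bound = pattern_redundancy_lower_bound(n)
        # The total bound grows with n, while the per-symbol value bound / n shrinks.
        print(f"n={n:>10}  bound={bound:10.2f} bits  per-symbol={bound / n:.3e}")
```

The per-symbol column shrinks as n grows, which matches the abstract's remark that sublinear redundancy implies diminishing per-symbol redundancy.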
Population estimation with performance guarantees
- In Proceedings of IEEE Symposium on Information Theory, 2007
Cited by 3 (0 self)
We estimate the population size by sampling uniformly from the population. Given an accuracy to which we need to estimate the population with a pre-specified confidence, we provide a simple stopping rule for the sampling process.
I. SUMMARY
Many applications such as species estimation [1], database sampling [2], and epidemiologic studies [3], [4], [5] call for estimating a population size based on a relatively small sample. We derive a simple, yet nearly optimal, stopping rule for sampling and an estimation formula for alphabet size from uniform samples taken from the population. We will consider an approach outlined for the species estimation problem by Good [6] further on in the summary. For a more complete survey of prior results obtained in the species estimation problem, see [1]. For problems in database sampling see [7], [2]. The results obtained in this paper are also related to capture-recapture problems [3], [4], [5], where the unknown population size is estimated given the number of samples that are recaptured (repetitions) when sampling randomly from the population. Here, we are interested in how many recaptures are necessary to estimate the population to a given accuracy with a specified confidence. Intuitively, the more recaptures there are, the better the population size can be estimated. Formally, in an n-element sample let m denote the number of distinct elements, and let r = n − m denote the number of repeated elements. For example, in c,g,c,s,g,c,v there are n = 7 samples, m = 4 distinct elements (c, g, s, and v), and r = 7 − 4 = 3 repeated elements (one g and two c’s). In the following, n independent samples are drawn uniformly from a k-element population, and $M^k_n$ and $R^k_n = n - M^k_n$ are the random numbers of distinct and repeated elements observed. We drop the subscripts and superscripts when there is no ambiguity.
A. Good’s approach
By linearity of expectations, E(M) = k 1 −
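The quantities n, m, and r defined in the summary can be computed directly, as in the sketch below. Because the source text is cut off before the formula for E(M), the closed form in `expected_distinct` is the standard occupancy identity k(1 − (1 − 1/k)^n), included here as an assumption rather than as the paper's statement. All function names are hypothetical, and the paper's stopping rule is not reproduced.

```python
import random


def sample_counts(sample):
    """Return (n, m, r): sample size, distinct elements, and repeats r = n - m."""
    n = len(sample)
    m = len(set(sample))
    return n, m, n - m


def expected_distinct(k: int, n: int) -> float:
    """Standard occupancy identity for n uniform draws from k elements.

    Assumption: this is the textbook formula, not quoted from the truncated source.
    """
    return k * (1.0 - (1.0 - 1.0 / k) ** n)


def simulate_mean_distinct(k: int, n: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean number of distinct elements."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(k) for _ in range(n)})
    return total / trials


if __name__ == "__main__":
    print(sample_counts("cgcsgcv"))  # the example from the text: (7, 4, 3)
    print(expected_distinct(50, 30), simulate_mean_distinct(50, 30))
```

Comparing the simulated mean with the closed form is a quick sanity check on the identity for small k and n.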
Average Redundancy for Known Sources: Ubiquitous Trees in Source Coding
- 2008
Cited by 2 (0 self)
Analytic information theory aims at studying problems of information theory using analytic techniques of computer science and combinatorics. Following Hadamard’s precept, these problems are tackled by complex analysis methods such as generating functions, Mellin transform, Fourier series, saddle point method, analytic poissonization and depoissonization, and singularity analysis. This approach lies at the crossroads of computer science and information theory. In this survey we concentrate on one facet of information theory (i.e., source coding, better known as data compression), namely the redundancy rate problem. The redundancy rate problem determines by how much the actual code length exceeds the optimal code length. We further restrict our interest to the average redundancy for known sources, that is, when statistics of information sources are known. We present precise analyses of three types of lossless data compression schemes, namely fixed-to-variable (FV) length codes, variable-to-fixed (VF) length codes, and variable-to-variable (VV) length codes. In particular, we investigate average redundancy of Huffman, Tunstall, and Khodak codes. These codes have succinct representations as trees, either as coding or parsing trees, and we analyze here some of their parameters (e.g., the average path from the root to a leaf).
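As a small illustration of average redundancy for a known source, the sketch below builds a binary Huffman code for a given probability vector and reports the average code length minus the entropy. It assumes single-symbol (block length one) coding, which is only the simplest case of what the survey analyzes, and the function names are hypothetical.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return the codeword lengths of a binary Huffman code for `probs`."""
    if len(probs) == 1:
        return [1]
    # Heap entries are (probability, unique tie-breaker, symbol indices in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf below them one level deeper.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths


def average_redundancy(probs):
    """Average Huffman code length minus the source entropy, in bits per symbol."""
    lengths = huffman_code_lengths(probs)
    average_length = sum(p * l for p, l in zip(probs, lengths))
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return average_length - entropy


if __name__ == "__main__":
    print(average_redundancy([0.5, 0.25, 0.125, 0.125]))  # dyadic source
    print(average_redundancy([0.9, 0.1]))                 # skewed binary source
```

For a dyadic source the redundancy vanishes, while a skewed binary source pays a large fraction of a bit per symbol, which is the kind of gap the redundancy rate problem quantifies.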
On the entropy rate of pattern processes
- In Proceedings of the Data Compression Conference, 2005
Cited by 2 (0 self)
We study the entropy rate of pattern sequences of stochastic processes, and its relationship to the entropy rate of the original process. We give a complete characterization of this relationship for i.i.d. processes over arbitrary alphabets, stationary ergodic processes over discrete alphabets, and a broad family of stationary ergodic processes over uncountable alphabets. For cases where the entropy rate of the pattern process is infinite, we characterize the possible growth rate of the block entropy.
Strong consistency of the Good-Turing estimator
- In IEEE Int. Symp. Inf. Theor. Proc., 2006
Cited by 1 (1 self)
We consider the problem of estimating the total probability of all symbols that appear with a given frequency in a string of i.i.d. random variables with unknown distribution. We focus on the regime in which the block length is large yet no symbol appears frequently in the string. This is accomplished by allowing the distribution to change with the block length. Under a natural convergence assumption on the sequence of underlying distributions, we show that the total probabilities converge to a deterministic limit, which we characterize. We then show that the Good-Turing total probability estimator is strongly consistent.
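For concreteness, a minimal sketch of the Good-Turing total probability estimator follows. It assumes the usual textbook form: the total probability of all symbols that appear exactly r times in a sample of size n is estimated by (r + 1) N_{r+1} / n, where N_{r+1} is the number of distinct symbols appearing r + 1 times. The abstract does not spell out the formula, so this form and the function name are assumptions.

```python
from collections import Counter


def good_turing_total_probability(sample, r: int) -> float:
    """Estimate the total probability of symbols seen exactly r times.

    Uses the textbook Good-Turing form (r + 1) * N_{r+1} / n, which is an
    assumption here since the abstract does not state the formula.
    """
    n = len(sample)
    symbol_counts = Counter(sample)
    # freq_of_freq[c] is the number of distinct symbols that appear exactly c times.
    freq_of_freq = Counter(symbol_counts.values())
    return (r + 1) * freq_of_freq.get(r + 1, 0) / n


if __name__ == "__main__":
    text = "abracadabra"
    for r in range(5):
        print(r, good_turing_total_probability(text, r))
```

With r = 0 the same expression gives the familiar missing-mass estimate, the fraction of the sample made up of symbols seen exactly once.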
Universal compression of Markov and related sources over arbitrary alphabets
- IEEE Transactions on Information Theory, 2006
Cited by 1 (0 self)
Recent work has considered encoding a string by separately conveying its symbols and its pattern, the order in which the symbols appear. It was shown that the patterns of i.i.d. strings can be losslessly compressed with diminishing per-symbol redundancy. In this paper the pattern redundancy of distributions with memory is considered. Close lower and upper bounds are established on the pattern redundancy of strings generated by Hidden Markov Models with a small number of states, showing in particular that their per-symbol pattern redundancy diminishes with increasing string length. The upper bounds are obtained by analyzing the growth rate of the number of multi-dimensional integer partitions, and the lower bounds by using Hayman’s Theorem. Index Terms: Hidden Markov Models, integer partitions, large alphabets, multi-dimensional partitions, patterns,
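The idea of conveying a string as its symbols plus its pattern can be sketched as a lossless split and rebuild. The sketch assumes the symbols are listed in order of first appearance and that pattern indices start at 1; the function names are hypothetical, and no actual compression of either part is attempted.

```python
def split_symbols_and_pattern(sequence):
    """Split a sequence into (symbols in order of first appearance, pattern).

    The pattern replaces each symbol by the 1-based index of its first appearance.
    """
    index_of = {}
    symbols = []
    pattern = []
    for x in sequence:
        if x not in index_of:
            symbols.append(x)
            index_of[x] = len(symbols)
        pattern.append(index_of[x])
    return symbols, pattern


def rebuild(symbols, pattern):
    """Invert the split: look each pattern index up in the symbol list."""
    return [symbols[i - 1] for i in pattern]


if __name__ == "__main__":
    symbols, pattern = split_symbols_and_pattern("abracadabra")
    print(symbols)  # ['a', 'b', 'r', 'c', 'd']
    print(pattern)  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
    print("".join(rebuild(symbols, pattern)))
```

The pattern does not depend on what the symbols actually are, which is why its redundancy can be studied separately from the alphabet.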
On Universal Coding of Unordered Data
Cited by 1 (1 self)
There are several applications in information transfer and storage where the order of source letters is irrelevant at the destination. For these source-destination pairs, multiset communication rather than the more difficult task of sequence communication may be performed. In this work, we study universal multiset communication. For classes of countable-alphabet sources that meet Kieffer’s condition for sequence communication, we present a scheme that universally achieves a rate of n + o(n) bits per multiset letter for multiset communication. We also define redundancy measures that are normalized by the logarithm of the multiset size rather than per multiset letter and show that these redundancy measures cannot be driven to zero for the class of finite-alphabet memoryless multisets. This further implies that finite-alphabet memoryless multisets cannot be encoded universally with vanishing fractional redundancy.
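To see why multiset communication is easier than sequence communication, the sketch below maps a sequence to its multiset and counts how many orderings collapse to that multiset. The base-2 logarithm of that count is the number of bits that ordering alone would cost if all orderings were equally likely. This is only an illustration of the order-irrelevant setting under that assumption, not the paper's universal scheme, and the function names are hypothetical.

```python
import math
from collections import Counter


def to_multiset(sequence) -> Counter:
    """Discard order: keep only how often each letter occurs."""
    return Counter(sequence)


def number_of_orderings(sequence) -> int:
    """Count distinct sequences sharing this multiset: n! / (c_1! * c_2! * ...)."""
    counts = to_multiset(sequence)
    result = math.factorial(len(sequence))
    for c in counts.values():
        result //= math.factorial(c)
    return result


if __name__ == "__main__":
    word = "banana"
    print(to_multiset(word) == to_multiset("aaabnn"))  # same multiset, different order
    orderings = number_of_orderings(word)
    print(orderings, math.log2(orderings))
```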
Minimax Redundancy for Large Alphabets
We study the minimax redundancy of universal coding for large alphabets over memoryless sources and present two main results: We first complete studies initiated in Orlitsky and Santhanam [12], deriving precise asymptotics of the minimax redundancy for all ranges of the alphabet sizes. Second, we consider the minimax redundancy of a source model in which some symbol probabilities are fixed. The latter model leads to an interesting binomial sum asymptotics with superexponential growth functions. Our findings could be used to approximate numerically the minimax redundancy for various ranges of the sequence length and the alphabet size. These results are obtained by analytic techniques such as tree-like generating functions and the saddle point method.
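For a sense of what a numerical evaluation of minimax redundancy can look like, the sketch below computes the classical Shtarkov sum for a binary memoryless source, the simplest special case. It assumes the worst-case (pointwise) definition of minimax redundancy; whether that matches the paper's exact definition, and how it extends to large alphabets, is not established by the abstract. The function name is hypothetical, and plain floating-point arithmetic limits this to moderate n.

```python
import math


def binary_minimax_redundancy(n: int) -> float:
    """log2 of the Shtarkov sum for binary memoryless sources of length n.

    Assumption: worst-case minimax redundancy, binary alphabet only.
    """
    total = 0.0
    for k in range(n + 1):
        # Maximum-likelihood probability of any sequence with k ones.
        # Python evaluates 0.0 ** 0 as 1.0, which handles k = 0 and k = n.
        ml_probability = (k / n) ** k * ((n - k) / n) ** (n - k)
        total += math.comb(n, k) * ml_probability
    return math.log2(total)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        exact = binary_minimax_redundancy(n)
        leading_term = 0.5 * math.log2(n * math.pi / 2)
        print(n, exact, leading_term)
```

The second printed column is the well-known leading-order approximation (1/2) log2(n π / 2), shown only for comparison with the exact sum.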
UNIVERSAL PREDICTION OVER LARGE ALPHABETS
Insurance transfers losses associated with risks to the insurer for a price, the premium. Considering a natural probabilistic framework for the insurance problem, we derive a necessary and sufficient condition on loss models such that the insurer remains solvent despite the losses taken on. In particular, there need not be any upper bound on the loss; rather, it is the structure of the model space that decides insurability. Insurance is a way of managing losses associated with risks (for example, floods, network outages, and earthquakes), primarily by transferring risk to another entity, the insurer, for a price, the premium. The insurer attempts to break even by balancing the possible loss that may be suffered by a few (risk) with the guaranteed payments of many
Patterns of i.i.d. Sequences and Their Entropy - Part II: Bounds for Some Distributions
, 711
A pattern of a sequence is a sequence of integer indices with each index describing the order of first occurrence of the respective symbol in the original sequence. In a recent paper, tight general bounds on the block entropy of patterns of sequences generated by independent and identically distributed (i.i.d.) sources were derived. In this paper, precise approximations are provided for the pattern block entropies for patterns of sequences generated by i.i.d. uniform and monotonic distributions, including distributions over the integers, and the geometric distribution. Numerical bounds on the pattern block entropies of these distributions are provided even for very short blocks. Tight bounds are obtained even for distributions that have infinite i.i.d. entropy rates. The approximations are obtained using general bounds and their derivation techniques. Conditional index entropy is also studied for distributions over smaller alphabets. Index Terms: patterns, monotonic distributions, uniform distributions, entropy.
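The definition of a pattern and of its block entropy can be checked by brute force for very short blocks. The sketch below enumerates all length-n sequences over a k-letter uniform i.i.d. source, maps each to its pattern, and computes the entropy of the resulting pattern distribution. It is exponential in n and only meant for tiny cases; it does not implement the paper's bounds or approximations, and the function names are hypothetical.

```python
import itertools
import math
from collections import Counter


def pattern(sequence):
    """Replace each symbol by the 1-based order of its first occurrence."""
    index_of = {}
    result = []
    for x in sequence:
        if x not in index_of:
            index_of[x] = len(index_of) + 1
        result.append(index_of[x])
    return tuple(result)


def pattern_block_entropy_uniform(k: int, n: int) -> float:
    """Entropy (bits) of the pattern of n i.i.d. uniform draws from k symbols.

    Brute-force enumeration of all k**n sequences, so keep k and n small.
    """
    counts = Counter(pattern(s) for s in itertools.product(range(k), repeat=n))
    total = k ** n
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    k, n = 3, 5
    print(pattern("abracadabra"))
    print(pattern_block_entropy_uniform(k, n), n * math.log2(k))
```

The second number printed is the block entropy of the sequence itself, n log2 k, which the pattern entropy cannot exceed because the pattern is a function of the sequence.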

