## Universal compression of memoryless sources over unknown alphabets (2004)

Venue: IEEE Transactions on Information Theory

Citations: 32 (10 self)

### BibTeX

@ARTICLE{Orlitsky04universalcompression,
  author  = {Alon Orlitsky and Narayana P. Santhanam and Junan Zhang},
  title   = {Universal compression of memoryless sources over unknown alphabets},
  journal = {IEEE Transactions on Information Theory},
  year    = {2004},
  volume  = {50},
  number  = {7},
  pages   = {1469--1481}
}

### Abstract

It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern—the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
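As a concrete illustration of the pattern map the abstract describes (each symbol is replaced by the order of its first appearance), here is a minimal Python sketch; the function name `pattern` is mine, not the paper's:

```python
def pattern(s):
    """Map each symbol of s to the order in which it first appeared."""
    seen = {}
    out = []
    for c in s:
        if c not in seen:
            seen[c] = len(seen) + 1  # next unused index
        out.append(seen[c])
    return out

print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Two strings with the same structure but different spellings ("abracadabra" and any relabeling of its letters) share one pattern, which is why patterns over unknown alphabets remain compressible.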

### Citations

1461 | An Introduction to Probability Theory - Feller - 1971 |

Citation Context ... The following lemma provides Lemma 4: When and (4) and when, in addition, Proof: Feller’s bounds on Stirling’s approximation [38] state that for every Hence, for all Taking derivatives, it is easy to see that for all , hence, for all and Therefore, for all where proving the first part of the lemma. When and , , and the second p... |
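The snippet above invokes the two-sided refinement of Stirling's formula presented in Feller's text: sqrt(2*pi*n) (n/e)^n e^{1/(12n+1)} < n! < sqrt(2*pi*n) (n/e)^n e^{1/(12n)}. A quick numerical check (the helper name is illustrative):

```python
from math import exp, factorial, pi, sqrt

def stirling_bounds(n):
    """Feller's two-sided bounds sandwiching n! around Stirling's approximation."""
    base = sqrt(2 * pi * n) * (n / exp(1)) ** n
    return base * exp(1 / (12 * n + 1)), base * exp(1 / (12 * n))

lo, hi = stirling_bounds(10)
assert lo < factorial(10) < hi  # 3628800 lies strictly between the bounds
```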

857 | An empirical study of smoothing techniques for language modeling - Chen, Goodman - 1996 |

854 | An introduction to the theory of numbers - Hardy, Wright - 1979 |

Citation Context ... is an unordered partition of iff Henceforth, we use the notation developed for profiles of patterns in Section IV-B for unordered partitions as well. Lemma 10: (Hardy and Ramanujan [40], see also [41]) The number of unordered partitions of is Lemmas 8 and 10 imply the following upper bound on the pattern redundancy of . Theorem 11: For all In particular, the pattern redundancy of i.i.d. strings is... |

347 | Universal Codeword Sets and Representations of the Integers - Elias - 1975 |

Citation Context ...ided general distributions over large alphabets. On the theoretical side, researchers constructed compression algorithms for subclasses of i.i.d. distributions that satisfy Kieffer’s condition. Elias [16], Györfi, Pali, and Van der Meulen [17], and Foster, Stine, and Wyner [18] considered monotone, namely, , i.i.d. distributions over the natural numbers, Uyematsu and Kanaya [19] studied bounded-moment... |

323 | A course in combinatorics - van Lint, Wilson - 2001 |

Citation Context ... and so on, hence, Note that is for and otherwise. The numbers are called Stirling numbers of the second kind while the numbers are called Bell numbers. Many results are known for both [36]. In particular, it is easy to see that for all , Bell numbers satisfy the recursion Set partitions are equivalent to patterns. To see that, let the mapping assign to the set partition where denotes t... |
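The Bell and Stirling numbers discussed above satisfy standard recursions, B(n+1) = sum_k C(n,k) B(k) and S(n,k) = k*S(n-1,k) + S(n-1,k-1), and by the paper's pattern/set-partition correspondence, B(n) counts the patterns of length n while S(n,k) counts those with exactly k distinct symbols. A small sketch (function names mine):

```python
from math import comb

def bell(n):
    """Bell numbers via the recursion B(m+1) = sum_k C(m,k) B(k)."""
    B = [1]
    for m in range(n):
        B.append(sum(comb(m, k) * B[k] for k in range(m + 1)))
    return B[n]

def stirling2(n, k):
    """Stirling numbers of the second kind: S(n,k) = k*S(n-1,k) + S(n-1,k-1)."""
    if n == k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

# Patterns of length 4: B(4); patterns of length 4 with exactly 2 symbols: S(4,2).
print(bell(4), stirling2(4, 2))  # 15 7
```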

285 | Universal Coding, Information Prediction and Estimation - Rissanen - 1984 |

275 | Fisher information and stochastic complexity - Rissanen - 1996 |

194 | A general language model for information retrieval - Song, Croft - 1999 |

Citation Context ... (r), the first that appeared (a again), the fourth (c), the first (a), etc. Of the pattern and the dictionary parts of describing a string, the former has a greater bearing on many applications [24]–[30]. For example, in language modeling, the pattern reflects the structure of the language while the dictionary reflects the spelling of words. We therefore concentrate on pattern compression... |

151 | Universal portfolios - Cover - 1991 |

135 | Universal prediction - Merhav, Feder - 1998 |

Citation Context ...orms almost optimally by approaching the entropy no matter which distribution in generates the data. The following is a brief introduction to universal compression. For an extensive overview, see [3]–[6]... |

126 | Universal sequential coding of single messages - Shtarkov - 1987 |

Citation Context ...size in ways. These two decompositions uniquely define the partition, hence, the number of profile- partitions of is Shtarkov’s Sum We will frequently evaluate redundancies using a result by Shtarkov [37] showing that the distribution achieving in (1) is It follows that the redundancy of a collection of distributions over is determined by Shtarkov’s sum Approximation of Binomial Coefficients While fin... |
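Shtarkov's result reduces worst-case redundancy to the log of a sum over all sequences of their maximum-likelihood probability. A sketch for the i.i.d. binary case, grouping strings by their count of ones (the alphabet choice and function name are mine, for concreteness):

```python
from math import comb, log2

def shtarkov_redundancy(n):
    """log2 of Shtarkov's sum for i.i.d. Bernoulli strings of length n:
    sum over strings x of max_p p(x), grouped by k = number of ones."""
    total = 0.0
    for k in range(n + 1):
        p = k / n
        # maximum-likelihood probability of any individual string with k ones
        ml = (p ** k) * ((1 - p) ** (n - k)) if 0 < k < n else 1.0
        total += comb(n, k) * ml
    return log2(total)

for n in (10, 100, 1000):
    print(n, round(shtarkov_redundancy(n), 3))
```

The printed values grow roughly like (1/2) log2 n, matching the logarithmic per-alphabet-letter growth the surrounding contexts describe for fixed alphabet size.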

113 | Asymptotic formulae in combinatory analysis - Hardy, Ramanujan - 1918 |

Citation Context ...emma 9: A vector is an unordered partition of iff Henceforth, we use the notation developed for profiles of patterns in Section IV-B for unordered partitions as well. Lemma 10: (Hardy and Ramanujan [40], see also [41]) The number of unordered partitions of is Lemmas 8 and 10 imply the following upper bound on the pattern redundancy of . Theorem 11: For all In particular, the pattern redundancy of i.... |
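Lemma 10's count is the integer-partition function p(n), whose Hardy-Ramanujan asymptotic is p(n) ~ exp(pi*sqrt(2n/3)) / (4n*sqrt(3)). A small sketch comparing the exact count (by dynamic programming) against the leading-order formula; function names are illustrative:

```python
from math import exp, pi, sqrt

def partitions(n):
    """Exact number of unordered integer partitions of n (dynamic programming)."""
    p = [1] + [0] * n
    for part in range(1, n + 1):          # allow parts of size `part`
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]

def hardy_ramanujan(n):
    """Leading-order Hardy-Ramanujan asymptotic for p(n)."""
    return exp(pi * sqrt(2 * n / 3)) / (4 * n * sqrt(3))

print(partitions(100))  # 190569292
```

At n = 100 the asymptotic already agrees with the exact count to within about 5%, which is why the sub-exponential growth exp(O(sqrt(n))) of p(n) yields a diminishing per-symbol pattern redundancy.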

103 | A game of prediction with expert advice - Vovk - 1998 |

90 | Universal noiseless coding - Davisson - 1973 |

Citation Context ...ngs to some natural class , such as the collection of independent and identically distributed (i.i.d.), Markov, or stationary distributions, but that the precise distribution within is not known [1], [2]. The objective then is to compress the data almost as well as when the distribution is known in advance, namely, to find a universal compression scheme that performs almost optimally by approaching t... |

86 | The number of partitions of a set - Rota - 1964 |

85 | Universal portfolios with side information - Cover, Ordentlich - 1996 |

55 | Asymptotic minimax regret for data compression, gambling, and prediction - Xie, Barron - 1996 |

51 | Probability Scoring for Spelling Correction - Church, Gale - 1991 |

Citation Context ...ppear (r), the first that appeared (a again), the fourth (c), the first (a), etc. Of the pattern and the dictionary parts of describing a string, the former has a greater bearing on many applications [24]–[30]. For example, in language modeling, the pattern reflects the structure of the language while the dictionary reflects the spelling of words. We therefore concentrate on pattern compression... |

49 | A decision-theoretic extension of stochastic complexity and its applications to learning - Yamanishi - 1998 |

42 | Always Good Turing: Asymptotically optimal probability estimation - Orlitsky, Santhanam, et al. - 2003 |

Citation Context ...ishes to zero. These results can be extended to distributions with memory [31]. They can also be adapted to yield an asymptotically optimal solution for the Good-Turing probability-estimation problem [32]. III. PATTERNS We formally describe patterns and their redundancy. Let be any alphabet. For denotes the set of symbols appearing in . The index of is and one more than the number of distinct symbols ... |
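The Good-Turing connection can be illustrated with the estimator's best-known special case: the total probability of symbols not yet seen is estimated by N1/n, the fraction of the sample occupied by symbols that appeared exactly once. A hedged sketch (function name mine, not from the paper):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the total probability of unseen symbols:
    N1 / n, where N1 counts symbols observed exactly once in the sample."""
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(sample)

print(good_turing_missing_mass("abracadabra"))  # 2/11 ≈ 0.1818 (c and d are singletons)
```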

40 | A unified approach to weak universal source coding - Kieffer |

Citation Context ...eases to infinity as grows. Similar conclusions hold also when is allowed to grow with the block length [15]. A systematic study of universal compression over arbitrary alphabets was undertaken by Kieffer [4] who analyzed a slightly less restrictive form of redundancy, related to weak universal compression. He derived a necessary and sufficient condition for weak universality, and used it to show that eve... |

30 | A Method for Disambiguating Word Senses - Gale, Church, et al. - 1993 |

29 | The performance of universal coding - Krichevsky, Trofimov - 1981 |

Citation Context ...d. distributions. Consequently, the redundancy of , the collection of i.i.d. distributions over sequences of length drawn from an alphabet of size , was studied extensively, and a succession of papers [7]–[14] has shown that for any fixed as increases where is the gamma function, and the term diminishes with increasing at a rate determined by . For any fixed alphabet size , this redundancy grows logar... |

21 | The On-Line Encyclopedia of Integer Sequences, The OEIS Foundation Inc., http://oeis.org - Sloane |

Citation Context ...erations per symbol. IV. PRELIMINARIES We first establish a correspondence between patterns and set partitions. Set partitions have been studied extensively by a number of well-known researchers [33]–[35], and in this section and in Section V we use their properties to derive the asymptotics of and of the growth rate of . A. Set Partitions and Patterns A partition of a set is a collection of disjoint ... |

17 | Grammar based codes: A new class of universal lossless source codes - Kieffer, Yang |

Citation Context ...7], and Foster, Stine, and Wyner [18] considered monotone, namely, , i.i.d. distributions over the natural numbers, Uyematsu and Kanaya [19] studied bounded-moment distributions, and Kieffer and Yang [20] and He and Yang [21] showed that any collection satisfying Kieffer’s condition can be universally compressed by grammar codes. Since actual distributions may not satisfy Kieffer’s condition, practica... |

16 | Multialphabet coding with separate alphabet description - Åberg, Shtarkov, Smeets - 1997 |

Citation Context ...tion and the discussion above, we recently took a different approach to compression of large, possibly infinite, alphabets [15], [22]. A similar approach was considered by Åberg, Shtarkov, and Smeets [23] who lower-bounded its performance when the underlying alphabet is finite and the sequence length increases to infinity, see Section V-C. To motivate this approach, consider perhaps the simplest infin... |

16 | Redundancy rates for renewal and other processes - Csiszár, Shields - 1996 |

Citation Context ...number of compressed symbols increases. We note that the number of integer partitions has also been used by Csiszár and Shields [42] to bound the redundancy of renewal processes. C. Lower Bound In the last section, we showed that the redundancy of patterns of i.i.d. strings is . We now show that it is . We provide a simple proof of... |

14 | On asymptotics of certain recurrences arising in universal coding - Szpankowski - 1998 |

13 | Multi-alphabet universal coding of memoryless sources - Shtarkov, Tjalkens, et al. - 1995 |

12 | On universal noiseless source coding for infinite source alphabets - Györfi, Pali, Van der Meulen |

Citation Context ...lphabets. On the theoretical side, researchers constructed compression algorithms for subclasses of i.i.d. distributions that satisfy Kieffer’s condition. Elias [16], Györfi, Pali, and Van der Meulen [17], and Foster, Stine, and Wyner [18] considered monotone, namely, , i.i.d. distributions over the natural numbers, Uyematsu and Kanaya [19] studied bounded-moment distributions, and Kieffer and Yang [2... |

12 | Universal compression of unknown alphabets - Jevtić, Orlitsky, et al. |

Citation Context ...ificantly fewer bits. Motivated by language modeling for speech recognition and the discussion above, we recently took a different approach to compression of large, possibly infinite, alphabets [15], [22]. A similar approach was considered by Åberg, Shtarkov, and Smeets [23] who lower-bounded its performance when the underlying alphabet is finite and the sequence length increases to infinity, see Sect... |

11 | Universal lossless compression with unknown alphabets—the average case - Shamir - 2006 |

Citation Context ... general. If so, it would yield a lower bound similar to those described here. For a more complete discussion, see [44]. Note also that subsequent to the derivation of Theorem 13, Shamir et al. [45], [46] showed that the average-case pattern redundancy is lower-bounded by for arbitrarily small . VI. SEQUENTIAL COMPRESSION The compression schemes considered so far operated on the whole block of symbols... |

10 | Performance of universal codes over infinite alphabets - Orlitsky, Santhanam - 2003 |

Citation Context ...l letters. When, as in the above applications, the alphabet size is large, is high too, and increases to infinity as grows. Similar conclusions hold also when is allowed to grow with the block length [15]. A systematic study of universal compression over arbitrary alphabets was undertaken by Kieffer [4] who analyzed a slightly less restrictive form of redundancy, related to weak universal compression. He d... |

10 | Universal codes for finite sequences of integers drawn from a monotone distribution - Foster, Stine, et al. - 2002 |

Citation Context ...researchers constructed compression algorithms for subclasses of i.i.d. distributions that satisfy Kieffer’s condition. Elias [16], Györfi, Pali, and Van der Meulen [17], and Foster, Stine, and Wyner [18] considered monotone, namely, , i.i.d. distributions over the natural numbers, Uyematsu and Kanaya [19] studied bounded-moment distributions, and Kieffer and Yang [20] and He and Yang [21] showed that... |

10 | A lower bound on compression of unknown alphabets - Jevtić, Orlitsky, et al. - 2005 |

Citation Context ...eased by taking in the proof, yielding Generating functions and Hayman’s theorem can be used to evaluate the exact asymptotic growth of thereby improving the lower bound to the following. Theorem 13: [43] (see also [15]) As increases These lower bounds should be compared with those in Åberg, Shtarkov, and Smeets [23] who lower-bounded pattern redundancy when the number of symbols is fixed and finite a... |

9 | Coding of discrete sources with unknown statistics - Shtarkov - 1977 |

Citation Context ...performs almost optimally by approaching the entropy no matter which distribution in generates the data. The following is a brief introduction to universal compression. For an extensive overview, see [3]–[6]... |

8 | Bounds on the entropy of patterns of i.i.d. sequences - Shamir |

Citation Context ...old in general. If so, it would yield a lower bound similar to those described here. For a more complete discussion, see [44]. Note also that subsequent to the derivation of Theorem 13, Shamir et al. [45], [46] showed that the average-case pattern redundancy is lower-bounded by for arbitrarily small . VI. SEQUENTIAL COMPRESSION The compression schemes considered so far operated on the whole block of s... |

7 | The universality of grammar-based codes for sources with countably infinite alphabets - He, Yang - 2005 |

Citation Context ..., and Wyner [18] considered monotone, namely, , i.i.d. distributions over the natural numbers, Uyematsu and Kanaya [19] studied bounded-moment distributions, and Kieffer and Yang [20] and He and Yang [21] showed that any collection satisfying Kieffer’s condition can be universally compressed by grammar codes. Since actual distributions may not satisfy Kieffer’s condition, practical compression algorit... |

6 | Minimax regret under log loss for general classes of experts - Cesa-Bianchi, Lugosi - 1999 |

4 | The precise minimax redundancy - Drmota, Szpankowski - 2002 |

Citation Context ...istributions. Consequently, the redundancy of , the collection of i.i.d. distributions over sequences of length drawn from an alphabet of size , was studied extensively, and a succession of papers [7]–[14] has shown that for any fixed as increases where is the gamma function, and the term diminishes with increasing at a rate determined by . For any fixed alphabet size , this redundancy grows logarithmi... |

4 | Asymptotic optimality of two variations of Lempel-Ziv codes for sources with countably infinite alphabet - Uyematsu, Kanaya |

Citation Context ...fer’s condition. Elias [16], Györfi, Pali, and Van der Meulen [17], and Foster, Stine, and Wyner [18] considered monotone, namely, , i.i.d. distributions over the natural numbers, Uyematsu and Kanaya [19] studied bounded-moment distributions, and Kieffer and Yang [20] and He and Yang [21] showed that any collection satisfying Kieffer’s condition can be universally compressed by grammar codes. Since ac... |

3 | Universal methods of coding for the case of unknown statistics - Fittingoff - 1972 |

Citation Context ... belongs to some natural class , such as the collection of independent and identically distributed (i.i.d.), Markov, or stationary distributions, but that the precise distribution within is not known [1], [2]. The objective then is to compress the data almost as well as when the distribution is known in advance, namely, to find a universal compression scheme that performs almost optimally by approach... |

2 | On the redundancy of HMM patterns - Dhulipala, Orlitsky |

Citation Context ...lock and sequential compression, the redundancy grows sublinearly with the block length, hence the per-symbol redundancy diminishes to zero. These results can be extended to distributions with memory [31]. They can also be adapted to yield an asymptotically optimal solution for the Good-Turing probability-estimation problem [32]. III. PATTERNS We formally describe patterns and their redundancy. Let be... |

1 | On modeling profiles instead of values (manuscript, submitted for publication) - Orlitsky, Santhanam, et al. |

Citation Context ...e Shtarkov’s sum (4) as In the next subsection, we use this sum to compute and . However, for larger , exact calculation of the maximum-likelihood probabilities of patterns, namely, , seems difficult [39]. Hence, in Sections V-B and V-C, respectively, we prove upper and lower bounds on the maximum-likelihood probabilities of patterns and use these bounds to upper and lower bound . A. The Redundancy of... |

1 | Speaking of infinity (manuscript, submitted for publication) - Orlitsky, Santhanam |

Citation Context ... extends to arbitrary , which may grow with , the bound they derive may still hold in general. If so, it would yield a lower bound similar to those described here. For a more complete discussion, see [44]. Note also that subsequent to the derivation of Theorem 13, Shamir et al. [45], [46] showed that the average-case pattern redundancy is lower-bounded by for arbitrarily small . VI. SEQUENTIAL COMPRES... |