## Asymptotic Properties Of Data Compression And Suffix Trees (1993)

Venue: IEEE Trans. Inform. Theory

Citations: 40 (11 self)

### BibTeX

@ARTICLE{Szpankowski93asymptoticproperties,

author = {Wojciech Szpankowski},

title = {Asymptotic Properties Of Data Compression And Suffix Trees},

journal = {IEEE Trans. Inform. Theory},

year = {1993},

volume = {39},

pages = {1647--1659}

}

### Abstract

Recently, Wyner and Ziv have proved that the typical length of a repeated subword found within the first $n$ positions of a stationary ergodic sequence is $(1/h) \log n$ in probability, where $h$ is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data compression schemes, most notably the Lempel-Ziv data compression algorithm. Wyner and Ziv have also conjectured that their result can be extended to a stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so-called right domain asymptotic, that is, during a dynamic phase of expanding the data base. We prove, under an additional assumption involving mixing conditions, that the length of a typical repeated subword oscillates almost surely (a.s.) between $(1/h_1) \log n$ and $(1/h_2) \log n$, where $0 < h_2 < h \le h_1 < \infty$. We also show that the length of the $n$th block in the Lempel-Ziv parsing algorithm reveals a similar behavior. We relate our findings to...
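As a rough illustration of the block-length statement in the abstract, the Python sketch below parses a random binary string into phrases and compares recent phrase lengths with $(1/h) \log m$, where $m$ is the number of phrases. Several things here are assumptions rather than the paper's construction: the parsing is the LZ78-style incremental variant with non-overlapping phrases, the source is memoryless (Bernoulli) with a hypothetical parameter `p`, and the helper names `lz78_phrase_lengths` and `bernoulli_entropy` are invented for this example. A single simulated run cannot show the almost sure oscillation itself; it only shows the logarithmic order of growth.

```python
import math
import random


def lz78_phrase_lengths(seq):
    """Return the lengths of the LZ78-style phrases of seq.

    Each phrase is the shortest prefix of the remaining input that has not
    occurred as an earlier phrase (the last phrase may be a repeat if the
    input runs out first).
    """
    seen = set()
    lengths = []
    start = 0
    n = len(seq)
    while start < n:
        end = start + 1
        # Extend the candidate while it duplicates an earlier phrase.
        while end <= n and seq[start:end] in seen:
            end += 1
        phrase = seq[start:min(end, n)]
        seen.add(phrase)
        lengths.append(len(phrase))
        start += len(phrase)
    return lengths


def bernoulli_entropy(p):
    """Entropy (natural logarithm) of a binary memoryless source, 0 < p < 1."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


if __name__ == "__main__":
    random.seed(0)
    p = 0.3  # hypothetical probability of the symbol "1"
    bits = "".join("1" if random.random() < p else "0" for _ in range(200_000))
    lengths = lz78_phrase_lengths(bits)
    m = len(lengths)
    tail = lengths[-1000:]
    print("number of phrases:", m)
    print("mean length of last 1000 phrases:", sum(tail) / len(tail))
    print("(1/h) log m:", math.log(m) / bernoulli_entropy(p))
```

Natural logarithms are used throughout, which keeps the entropy and $\log m$ in the same units.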

### Citations

8564 |
Elements of Information Theory
- Cover, Thomas
- 2003
Citation Context ...ve limit is guaranteed by Shannon's Theorem (cf. [8]). It is also known that $h \le \log V$. Hereafter, all logarithms -- unless stated explicitly otherwise -- are natural logarithms. It is well known (cf. [12], [13], [22]) that the entropy of a stationary ergodic information source is intimately related to coding and certain data compression schemes, most notably the universal compression scheme of Lempel ... |

2435 |
The Design and Analysis of Computer Algorithms
- Aho, Hopcroft, et al.
- 1974
Citation Context ...shown how to construct such a 1 There is another way of parsing a sequence in which phrases do not overlap. For example, our sequence parsing by using a special data structure called suffix tree (cf. [1], [4], [15]). Naturally, one is interested in the length of a block in the parsing algorithm. Let $l_n$ be the length of the $n$th block in the Lempel-Ziv parsing algorithm. There is a relationship betwee... |

2196 |
The Art of Computer Programming
- Knuth
- 2005
Citation Context ...cussed in Example 1.2 (cf. [3]) in which the phrases do not overlap. Such a parsing scheme can be conveniently modeled by another digital tree, namely the so-called digital search tree (cf. [3], [1], [26]). Then the length $l_n$ of the $n$th block is exactly equal to the depth of the $n$th node in such a digital search tree. Applying Pittel's result for independent digital search trees [31], we can prove un... |

888 |
Information Theory: Coding Theorems for Discrete Memoryless Channels
- Csiszár, Körner
- 1981
Citation Context ...it is guaranteed by Shannon's Theorem (cf. [8]). It is also known that $h \le \log V$. Hereafter, all logarithms -- unless stated explicitly otherwise -- are natural logarithms. It is well known (cf. [12], [13], [22]) that the entropy of a stationary ergodic information source is intimately related to coding and certain data compression schemes, most notably the universal compression scheme of Lempel and Zi... |

548 | A space–economical suffix tree construction algorithm - McCreight - 1976 |

426 |
Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ...h $l_n$ in the Lempel-Ziv parsing algorithm. EXAMPLE 1.3 String Matching Algorithms. Repeated substrings also arise in many algorithms on strings, notably string matching algorithms (cf. [1], [2], [33], [40]). A string matching algorithm searches for all (exact or approximate) occurrences of the pattern string $P$ in the text string $T$. Consider either the Knuth-Morris-Pratt algorithm or the Boyer-Moore al... |

245 |
Ergodic theory and information
- Billingsley
- 1965
Citation Context ...$(X_1^n) = \Pr\{X_k = x_k,\ 1 \le k \le n,\ x_k \in \Sigma\}$ (2.1). The entropy of $\{X_k\}$ is $h = \lim_{n \to \infty} \frac{E \log P^{-1}(X_1^n)}{n}$ (2.2). The existence of the above limit is guaranteed by Shannon's Theorem (cf. [8]). It is also known that $h \le \log V$. Hereafter, all logarithms -- unless stated explicitly otherwise -- are natural logarithms. It is well known (cf. [12], [13], [22]) that the entropy of a stationary e... |

242 |
Comparison Methods for Queues and Other Stochastic Models
- Stoyan
- 1983
Citation Context ...e second moment method to evaluate the appropriate probability. Note that $H_n$ is stochastically non-decreasing, that is, if $m \le n$, then $H_m \le_{st} H_n$, where $\le_{st}$ means stochastically smaller, and hence (cf. [36]) $\Pr\{H_n \ge k\} \ge \Pr\{H_m \ge k\}$ for $m \le n$ (3.6), with $k = O(\log n)$. We select $m$ in such a way that the probability of the RHS of the above will be easier to evaluate than the original probability. In order to est... |

222 |
On the Complexity of Finite Sequences
- Lempel, Ziv
- 1976
Citation Context ...rings and other kinds of regularities in words. In data compression, such a repeated subsequence can be used to reduce the size of the original sequence (e.g., universal data compression schemes [7], [27], [43]). In exact string matching algorithms the longest suffix that matches a substring of the pattern string is used for "fast" shift of the pattern over a text string (cf. Knuth-Morris-Pratt and Bo... |

124 | Algorithms for finding patterns in strings - Aho - 1990 |

69 |
Entropy and data compression schemes
- Ornstein, Weiss
- 1993
Citation Context ...his is due to the fact that $\tilde{L}_n$ is a nondecreasing sequence as opposed to $L_n$. In the non-Markovian case, the proof for the (a.s.) convergence of $\tilde{L}_n$ is more intricate and due to Ornstein and Weiss [30]. In this paper, we mainly deal with the more interesting right domain asymptotic which has also several applications in the analysis and design of algorithms on words. In particular, during the cours... |

63 |
Linear algorithm for data compression via string matching
- Rodeh, Pratt, et al.
- 1981
Citation Context ...a2 0 = $X_{-m_0}^{-m_0+\tilde{L}_n-2}$, and transmits $m_0$, $\tilde{L}_n$ and $X_{\tilde{L}_n-1}$. To encode $m_0$ we need $\log_V n$ symbols, and it is known that $\tilde{L}_n$ may be represented by $\log \tilde{L}_n$ bits (cf. [33]). As noted by Wyner and Ziv [41], the number of encoded symbols per source symbol is asymptotically $\frac{\log_V n}{\tilde{L}_n} + \frac{\log \tilde{L}_n}{\tilde{L}_n} + \frac{\log V}{\tilde{L}_n}$. Hence, the ratio $\log_V n / \tilde{L}_n$ determines an asymp... |

57 |
Asymptotical growth of a class of random trees
- PITTEL
- 1985
Citation Context ...d rigorously the latter result regarding the average size of a suffix tree. Some related topics were discussed by Guibas and Odlyzko in [16]. Our findings were inspired by the seminal paper of Pittel [31] who considered a typical behavior of a trie constructed from independent words (i.e., independent tries). Pittel was the first who noticed that the depth of insertion in an independent trie does not ... |

57 |
Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression
- Wyner, Ziv
- 1989
Citation Context ... data base to the new position. This idea can be modeled mathematically in two different fashions that are discussed next. A. Static Model -- Left Domain Asymptotic. This is the model of Wyner and Ziv [41]. It is assumed that the subsequence to be compressed $\{X_k\}_{k=0}^{\infty}$ is always the same (by definition fixed at position $k = 0$), and the data base $\{X_k\}_{k=-n}^{-1}$ expands only to the left. Ther... |

54 |
Autocorrelation on words and its applications: analysis of suffix trees by string-ruler approach
- Jacquet, Szpankowski
- 1994
Citation Context ...ernal structure (i.e., repeated substrings) of the first $n$ symbols of the pattern $P$. It turns out that this problem can be efficiently solved by means of a suffix tree (cf. [1], [4], [5], [9], [14], [18], [28], [40]). In particular, recently Chang and Lawler [11] used suffix trees to design an algorithm that on average needs $O((|T|/|P|) \log |P|)$ steps to find all occurrences of the pattern $P$ of le... |

54 | Paths in a random digital tree: limiting distributions - Pittel - 1986 |

52 | A generalized suffix tree and its (un)expected asymptotic behaviors - Szpankowski - 1993 |

46 |
Approximate string matching in sublinear expected time
- CHANG, LAWLER
- 1990
Citation Context ... matching algorithms the longest suffix that matches a substring of the pattern string is used for "fast" shift of the pattern over a text string (cf. Knuth-Morris-Pratt and Boyer-Moore [2]; see also [11]), and so forth. The problem of repeated patterns is studied here in a probabilistic framework. We assume that a stationary and ergodic source of information generates an infinite sequence $\{X_k\}_{k=\ldots}^{\infty}$ |

40 |
Analysis of digital tries with Markovian dependency
- Jacquet, Szpankowski
- 1991
Citation Context ...ined for the Markovian model. This is due to two facts: (i) the limiting behavior of independent tries does not differ too much from asymptotics of suffix trees (cf. [18]); (ii) Jacquet and Szpankowski [17] established the limiting distribution of the depth for independent tries in a Markovian framework. (iv) Second Order Behavior for the Lempel-Ziv Parsing Scheme. The limiting distribution of a randoml... |

29 |
The Erdős-Rényi strong law for pattern matching with a given proportion of mismatches. The Annals of Probability 17
- Arratia, Waterman
- 1989
Citation Context ...f the height was recently obtained by Devroye, Szpankowski and Rais [14], and the limiting distribution of the depth in a suffix tree is reported in Jacquet and Szpankowski [18]. Arratia and Waterman [6] investigated a related problem, namely the longest contiguous matching within a single sequence, and obtained several interesting results in this direction. Their findings are related to the height of... |

26 |
Sample converses in source coding theory
- Kieffer
- 1991
Citation Context ...scanty in the literature. To our best knowledge, asymptotic analysis of universal data compressions was pursued by Ziv and Lempel (cf. [43], [27]; see also [7]), Wyner and Ziv [41], [42], and Kieffer [22]. The average case analysis of suffix trees was initialized by Grassberger [15], and Apostolico and Szpankowski [5]. For the Bernoulli model, the asymptotic behavior of the height was recently obtaine... |

25 |
A Universal Algorithm for Sequential Data
- Ziv, Lempel
Citation Context ...and other kinds of regularities in words. In data compression, such a repeated subsequence can be used to reduce the size of the original sequence (e.g., universal data compression schemes [7], [27], [43]). In exact string matching algorithms the longest suffix that matches a substring of the pattern string is used for "fast" shift of the pattern over a text string (cf. Knuth-Morris-Pratt and Boyer-Mo... |

23 | Estimating the information content of symbol sequences and efficient codes - Grassberger - 1989 |

22 |
Subadditive processes, in École d'Été de Probabilités de Saint-Flour V
- Kingman
- 1975
Citation Context ... for example, for the Bernoulli model, the Markovian model, the hidden Markov model and $m$-dependent model (cf. [13]). In order to apply the Borel-Cantelli Lemma, we use the trick suggested by Kingman [23]; that is, we construct a subsequence $n_r$ of $n$ for which $O(1/\sqrt{\log n_r})$ is summable. Fix $s$, and define $n_r = 2^{s^2 2^{2r}}$. Note that $\tilde{L}_{n_r} / \log n_r \to 1/h$ (a.s.) provided $P(B_n)$ is summable. To pr... |

22 |
On the height of digital trees and related problems
- Szpankowski
- 1991
Citation Context ...$\sum_{i,j=1}^{V} \pi_i p_{i,j} \log p_{i,j}$ where $\pi_i$ is the stationary distribution of the Markov chain. The other quantities, that is, $h_1$ and $h_2$, are a little harder to evaluate. Pittel [31] and Szpankowski [38] evaluated the height of regular tries with Markovian dependency, and they showed that the parameter $h_2$ is a function of the largest eigenvalue $\ell$ of the matrix $P_{[2]} = P \circ P$ which represents the Sc... |

21 |
A note on the height of suffix trees
- Devroye, Szpankowski, et al.
- 1992
Citation Context ...he internal structure (i.e., repeated substrings) of the first $n$ symbols of the pattern $P$. It turns out that this problem can be efficiently solved by means of a suffix tree (cf. [1], [4], [5], [9], [14], [18], [28], [40]). In particular, recently Chang and Lawler [11] used suffix trees to design an algorithm that on average needs $O((|T|/|P|) \log |P|)$ steps to find all occurrences of the pattern P... |

19 |
Fixed data base version of the Lempel-Ziv data compression algorithm
- Wyner, Ziv
- 1991
Citation Context ...ssions are rather scanty in the literature. To our best knowledge, asymptotic analysis of universal data compressions was pursued by Ziv and Lempel (cf. [43], [27]; see also [7]), Wyner and Ziv [41], [42], and Kieffer [22]. The average case analysis of suffix trees was initialized by Grassberger [15], and Apostolico and Szpankowski [5]. For the Bernoulli model, the asymptotic behavior of the height wa... |

15 |
Entropy and prefixes
- Shields
- 1992
Citation Context ...ility (pr.) (1.3b) where $h$ is the entropy of $X$. This result concerns the convergence in probability (pr.) of $\tilde{L}_n$. In fact, a similar result also holds for $L_n$ in the right domain asymptotics (cf. [35], [39]). Wyner and Ziv [41] asked whether it can be extended to a stronger almost sure (a.s.) convergence. In the right domain asymptotic, we shall settle this question in the negative for the Markovi... |

15 |
Some results on V -ary asymmetric tries
- Szpankowski
- 1988
Citation Context ...); (2.19b) for some $\varepsilon > 0$, where $H_2 = \sum_{i=1}^{V} p_i^2 \log p_i$, and $P_1(x)$ and $P_2(x)$ are fluctuating periodic functions with small amplitudes (an explicit formula for the constant $C$ can be found in [37]). We conjecture that the same type of limiting distributions can be obtained for the Markovian model. This is due to two facts: (i) the limiting behavior of independent tries does not differ too much f... |

13 |
String Overlaps
- Guibas, Odlyzko
- 1981
Citation Context ...ns an oscillating term). Jacquet and Szpankowski [18] established rigorously the latter result regarding the average size of a suffix tree. Some related topics were discussed by Guibas and Odlyzko in [16]. Our findings were inspired by the seminal paper of Pittel [31] who considered a typical behavior of a trie constructed from independent words (i.e., independent tries). Pittel was the first who noti... |

10 |
Counts of long aligned word matches among random letter sequences
- KARLIN, OST
- 1987
Citation Context ... $P^2(w_k)\Bigr)^{3/2} = c_1 (E P(w_k))^{3/2}$, where (A) follows directly from (3.9) (by setting $w_k = w'_k$), and (B) is a consequence of the following inequality, which can be found in Karlin and Ost [20] and Szpankowski [38]: $\ell \le r \Rightarrow \Bigl(\sum_{W_k} P^{\ell}(w_k)\Bigr)^{1/\ell} \ge \Bigl(\sum_{W_k} P^{r}(w_k)\Bigr)^{1/r}$ (3.10). Finally, the above implies the following estimate for the second sum in the denominator of (3.8) $\sum_{(i,j)}$... |

9 |
A Diffusion Limit for a Class of Random-Growing Binary
- Aldous, Shields
- 1988
Citation Context ...r independent tries, we obtain the desired result. Finally, we can easily prove our conjecture for a modified version of the Lempel-Ziv parsing algorithm that we already discussed in Example 1.2 (cf. [3]) in which the phrases do not overlap. Such a parsing scheme can be conveniently modeled by another digital tree, namely the so-called digital search tree (cf. [3], [1], [26]). Then the length $l_n$ of ... |

8 |
Optimization of stationary control of discrete deterministic process
- Romanovski
- 1967
Citation Context ...rix $P_{[2]} = P \circ P$ which represents the Schur product of $P$ (i.e., elementwise product). More precisely, $h_2 = (1/2) \log \ell^{-1}$. With respect to $h_1$ we need a result from digraphs (cf. Romanovski [34], Karp [21]). Consider a digraph on $\Sigma$ with weights equal to $-\log p_{ij}$ where $\omega_i, \omega_j \in \Sigma$. Define a cycle $C = \{\omega_1, \omega_2, \ldots, \omega_v, \omega_1\}$ for some $v \le V$ such that $\omega_i \in \Sigma$, and let... |

7 |
New asymptotic bounds and improvements on the Lempel-Ziv data compression algorithm
- Bender, Wolf
- 1991
Citation Context ...substrings and other kinds of regularities in words. In data compression, such a repeated subsequence can be used to reduce the size of the original sequence (e.g., universal data compression schemes [7], [27], [43]). In exact string matching algorithms the longest suffix that matches a substring of the pattern string is used for "fast" shift of the pattern over a text string (cf. Knuth-Morris-Pratt ... |

6 |
The Myriad Virtues of Suffix Trees, Combinatorial Algorithms on Words
- Apostolico
- 1985
Citation Context ... how to construct such a 1 There is another way of parsing a sequence in which phrases do not overlap. For example, our sequence parsing by using a special data structure called suffix tree (cf. [1], [4], [15]). Naturally, one is interested in the length of a block in the parsing algorithm. Let $l_n$ be the length of the $n$th block in the Lempel-Ziv parsing algorithm. There is a relationship between $l_n$... |

5 |
On the Application of the Borel-Cantelli
- Chung, Erdos
- 1952
Citation Context ...in such a way that the probability of the RHS of the above will be easier to evaluate than the original probability. In order to estimate $\Pr\{H_m \ge k\}$ we use the second moment method (cf. Chung and Erdos [10]), which states that for events $A_i$, $\Pr\bigl\{\bigcup_{i=1}^{m} A_i\bigr\} \ge \frac{\bigl(\sum_{i=1}^{m} \Pr\{A_i\}\bigr)^2}{\sum_{i=1}^{m} \Pr\{A_i\} + \sum_{i \ne j} \Pr\{A_i \cap A_j\}}$ (3.7). In our case, we set $A_{i,j} = \{C_{i,j} \ge k\}$, and hence $\Pr\{H_m \ge k\} = \Pr\bigl\{\bigcup_{i,j=1}^{m} A$... |

4 |
Self-alignments in Words and Their
- Apostolico, Szpankowski
- 1992
Citation Context ...pends on the internal structure (i.e., repeated substrings) of the first $n$ symbols of the pattern $P$. It turns out that this problem can be efficiently solved by means of a suffix tree (cf. [1], [4], [5], [9], [14], [18], [28], [40]). In particular, recently Chang and Lawler [11] used suffix trees to design an algorithm that on average needs $O((|T|/|P|) \log |P|)$ steps to find all occurrences of th... |

4 |
On the Variance of the External
- Kirschenhofer, Prodinger, et al.
- 1989
Citation Context ...1 $L_n(m)$. Even in the Bernoulli model major difficulties arise in the evaluation of the limiting distribution of $E_n$ in a suffix tree. For independent tries, Kirschenhofer, Prodinger and Szpankowski [24] recently obtained for the symmetric alphabet the variance of $E_n$ which is $\mathrm{var}\, E_n = (\alpha + P_3(\log n))\, n + O(\log^2 n)$ where $\alpha \approx 4.35\ldots$ (explicit formula for $\alpha$ can be found in [24]) and $P_3(\log n$... |

3 |
A Characterization of the Minimum Cycle Mean
- Karp
- 1978
Citation Context ... $P \circ P$ which represents the Schur product of $P$ (i.e., elementwise product). More precisely, $h_2 = (1/2) \log \ell^{-1}$. With respect to $h_1$ we need a result from digraphs (cf. Romanovski [34], Karp [21]). Consider a digraph on $\Sigma$ with weights equal to $-\log p_{ij}$ where $\omega_i, \omega_j \in \Sigma$. Define a cycle $C = \{\omega_1, \omega_2, \ldots, \omega_v, \omega_1\}$ for some $v \le V$ such that $\omega_i \in \Sigma$, and let $\ell(C) = -$... |

1 |
On the Lempel-Ziv Parsing Algorithm and Its Digital Tree Representation, INRIA Rapport de Recherche
- Jacquet, Szpankowski
- 1992
Citation Context ...n for the number of phrases $M_n$ and the internal path length of the associated digital tree, however, without explicit formula for the variance. This was recently rectified by Jacquet and Szpankowski [19] who used a result of Kirschenhofer, Prodinger and Szpankowski [25] to show that $\mathrm{var}\, M_n \sim (\beta + P_4(\log n))\, n / \log^3 n$ where $\beta \approx 0.26600\ldots$ (explicit but complicated formula on $\beta$ can be found i... |