Results 1  10
of
31
A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors
 SIAM J. Computing
, 1996
"... Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compressions and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees  further calle ..."
Abstract

Cited by 52 (29 self)
 Add to MetaCart
Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compressions and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees  further called bsuffix trees  built from the first n suffixes of a random word. In this family a noncompact suffix tree (i.e., such that every edge is labeled by a single symbol) is represented by b = 1, and a compact suffix tree (i.e., without unary nodes) is asymptotically equivalent to b ! 1 as n ! 1. We study several parameters of bsuffix trees, namely: the depth of a given suffix, the depth of insertion, the height and the shortest feasible path. Some new results concerning typical (i.e., almost sure) behaviors of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology and universal data compression schemes. Key Wo...
Dynamical Sources in Information Theory: A General Analysis of Trie Structures
 ALGORITHMICA
, 1999
"... Digital trees, also known as tries, are a general purpose flexible data structure that implements dictionaries built on sets of words. An analysis is given of three major representations of tries in the form of arraytries, list tries, and bsttries ("ternary search tries"). The size and the sear ..."
Abstract

Cited by 50 (7 self)
 Add to MetaCart
Digital trees, also known as tries, are a general purpose flexible data structure that implements dictionaries built on sets of words. An analysis is given of three major representations of tries in the form of arraytries, list tries, and bsttries ("ternary search tries"). The size and the search costs of the corresponding representations are analysed precisely in the average case, while a complete distributional analysis of height of tries is given. The unifying data model used is that of dynamical sources and it encompasses classical models like those of memoryless sources with independent symbols, of finite Markovchains, and of nonuniform densities. The probabilistic behaviour of the main parameters, namely size, path length, or height, appears to be determined by two intrinsic characteristics of the source: the entropy and the probability of letter coincidence. These characteristics are themselves related in a natural way to spectral properties of specific transfer operators of the Ruelle type.
Asymptotic Properties Of Data Compression And Suffix Trees
 IEEE Trans. Inform. Theory
, 1993
"... Recently, Wyner and Ziv have proved that the typical length of a repeated subword found within the first n positions of a stationary ergodic sequence is (1=h) log n in probability where h is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data com ..."
Abstract

Cited by 40 (11 self)
 Add to MetaCart
Recently, Wyner and Ziv have proved that the typical length of a repeated subword found within the first n positions of a stationary ergodic sequence is (1=h) log n in probability where h is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data compression schemes, most notably the LempelZiv data compression algorithm. Wyner and Ziv have also conjectured that their result can be extended to a stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so called right domain asymptotic, that is, during a dynamic phase of expanding the data base. We prove  under an additional assumption involving mixing conditions  that the length of a typical repeated subword oscillates almost surely (a.s.) between (1=h 1 ) log n and (1=h 2 ) log n where 0 ! h 2 ! h h 1 ! 1. We also show that the length of the nth block in the LempelZiv parsing algorithm reveals a similar behavior. We relate our findings to...
Entropy Computations Via Analytic Depoissonization
 IEEE Trans. Information Theory
, 1998
"... We investigate the basic question of information theory, namely, evaluation of Shannon entropy, and a more general Rényi entropy, for some discrete distributions (e.g., binomial, negative binomial, etc.). We aim at establishing analytic methods (i.e., those in which complex analysis plays a pivotal ..."
Abstract

Cited by 29 (12 self)
 Add to MetaCart
We investigate the basic question of information theory, namely, evaluation of Shannon entropy, and a more general Rényi entropy, for some discrete distributions (e.g., binomial, negative binomial, etc.). We aim at establishing analytic methods (i.e., those in which complex analysis plays a pivotal role) for such computations which often yield estimates of unparalleled precision. The main analytic tool used here is that of analytic poissonization and depoissonization. We illustrate our approach on the entropy evaluation of the binomial distribution, that is, we prove that for Binomial(n; p) distribution the entropy h n becomes h n i 1 2 ln n+ 1 2 +ln p 2ßp(1 \Gamma p)+ P k1 a k n \Gammak where a k are explicitly computable constants. Moreover, we shall argue that analytic methods (e.g., complex asymptotics such as Rice's method and singularity analysis, Mellin transforms, poissonization and depoissonization) can offer new tools for information theory, especially for studying ...
Burst Tries: A Fast, Efficient Data Structure for String Keys
 ACM Transactions on Information Systems
, 2002
"... Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, t ..."
Abstract

Cited by 28 (10 self)
 Add to MetaCart
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or nearsorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
The Analysis of Hybrid Trie Structures
, 1998
"... This paper provides a detailed analysis of various implementations of digital tries, including the “ternary search tries” of Bentley and Sedgewick. The methods employed combine symbolic uses of generating functions, Poisson models, and MeIlin transforms. Theoretical results are matched against real ..."
Abstract

Cited by 24 (2 self)
 Add to MetaCart
This paper provides a detailed analysis of various implementations of digital tries, including the “ternary search tries” of Bentley and Sedgewick. The methods employed combine symbolic uses of generating functions, Poisson models, and MeIlin transforms. Theoretical results are matched against reallife data and justify the claim that ternary search tries are a highly efficient dynamic dictionary structure for strings and textual data.
Profile of Tries
, 2006
"... Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) wi ..."
Abstract

Cited by 18 (8 self)
 Add to MetaCart
Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) with the same distance from the root. It is a function of the number of strings stored in a trie and the distance from the root. Several, if not all, trie parameters such as height, size, depth, shortest path, and fillup level can be uniformly analyzed through the (external and internal) profiles. Although profiles represent one of the most fundamental parameters of tries, they have been hardly studied in the past. The analysis of profiles is surprisingly arduous but once it is carried out it reveals unusually intriguing and interesting behavior. We present a detailed study of the distribution of the profiles in a trie built over random strings generated by a memoryless source. We first derive recurrences satisfied by the expected profiles and solve them asymptotically for all possible ranges of the distance from the root. It appears that profiles of tries exhibit several fascinating phenomena. When moving from the root to the leaves of a trie, the growth of the expected profiles vary. Near the root, the external profiles tend to zero in an exponentially rate, then the rate gradually rises to being logarithmic; the external profiles then abruptly tend to infinity, first logarithmically
Average Profile Of The LempelZiv Parsing Scheme For A Markovian Source
 Algorithmica
, 1998
"... For a Markovian source, we analyze the LempelZiv parsing scheme that partitions sequences into phrases such that a new phrase is the shortest phrase not seen in the past. We consider three models: In the Markov Independent model, several sequences are generated independently by Markovian sources, ..."
Abstract

Cited by 17 (11 self)
 Add to MetaCart
For a Markovian source, we analyze the LempelZiv parsing scheme that partitions sequences into phrases such that a new phrase is the shortest phrase not seen in the past. We consider three models: In the Markov Independent model, several sequences are generated independently by Markovian sources, and the ith phrase is the shortest prefix of the ith sequence that was not seen before as a phrase (i.e., a prefix of previous (i \Gamma 1) sequences). In the other two models, only a single sequence is generated by a Markovian source. In the second model, for which we coin the name GilbertKadota model, a fixed number of phrases is generated according to the LempelZiv algorithm, thus producing a sequence of a variable (random) length. In the last model, known also as the LempelZiv model, a string of fixed length is partitioned into a variable (random) number of phrases. These three models can be efficiently represented and analyzed by digital search trees that are of interest to other al...
Laws of large numbers and tail inequalities for random tries and Patricia trees
 Journal of Computational and Applied Mathematics
, 2002
"... Abstract. We consider random tries and random patricia trees constructed from n independent strings of symbols drawn from any distribution on any discrete space. If Hn is the height of this tree, we show that Hn/E{Hn} tends to one in probability. Additional tail inequalities are given for the height ..."
Abstract

Cited by 15 (5 self)
 Add to MetaCart
Abstract. We consider random tries and random patricia trees constructed from n independent strings of symbols drawn from any distribution on any discrete space. If Hn is the height of this tree, we show that Hn/E{Hn} tends to one in probability. Additional tail inequalities are given for the height, depth, size, and profile of these trees and ordinary tries that apply without any conditions on the string distributions—they need not even be identically distributed.