Results 1-10 of 29
Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees
Theoretical Computer Science, 1995
Cited by 73 (35 self)
The Lempel-Ziv parsing scheme finds a wide range of applications, most notably in data compression and algorithms on words. It partitions a sequence of length n into variable phrases such that a new phrase is the shortest substring not seen in the past as a phrase. The parameter of interest is the number $M_n$ of phrases that one can construct from a sequence of length n. In this paper, for the memoryless source with unequal probabilities of symbol generation, we derive the limiting distribution of $M_n$, which turns out to be normal. This settles a long-standing open problem. In fact, to obtain this result we solved another open problem, namely, that of establishing the limiting distribution of the internal path length in a digital search tree. The latter is a consequence of an asymptotic solution of a multiplicative differential-functional equation often arising in the analysis of algorithms on words. Interestingly enough, our findings are proved by a combination of probabilistic techniques such as renewal equation and uniform integrability, and analytical techniques such as Mellin transform, differential-functional equations, de-Poissonization, and so forth. In concluding remarks we indicate a possibility of extending our results to Markovian models.
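The parsing rule described in this abstract (each new phrase is the shortest substring not yet seen as a phrase) can be sketched in a few lines. This is only an illustrative sketch: the function name `lz78_phrases`, the two-letter alphabet, and the symbol probabilities in the usage part are assumptions made for the example, not taken from the paper.

```python
import random


def lz78_phrases(sequence):
    """Split `sequence` into Lempel-Ziv phrases.

    Each phrase is the shortest prefix of the remaining input that has not
    already appeared as a phrase. A final leftover piece (which may repeat an
    earlier phrase) is kept so that the phrases always concatenate back to
    the input.
    """
    seen = set()
    phrases = []
    current = ""
    for symbol in sequence:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        phrases.append(current)  # incomplete trailing phrase
    return phrases


if __name__ == "__main__":
    # Hypothetical memoryless source with unequal symbol probabilities.
    rng = random.Random(0)
    n = 10_000
    text = "".join(rng.choices("ab", weights=[0.7, 0.3], k=n))
    print("n =", n, "number of phrases =", len(lz78_phrases(text)))
```

Repeating the experiment over many independent sequences and plotting the phrase counts gives an empirical view of the distribution that the abstract describes as asymptotically normal.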
Autocorrelation on Words and Its Applications: Analysis of Suffix Trees by String-Ruler Approach
J. Combin. Theory Ser. A, 1994
Cited by 58 (23 self)
We study in a probabilistic framework some topics concerning the way words can overlap. Our probabilistic model assumes that a word is a sequence of i.i.d. symbols generated from a finite alphabet. This defines the so-called Bernoulli model. We investigate the length of a subword that can be recopied, that is, a subword that occurs at least twice in a given word. An occurrence of such repeated substrings is easy to detect in a digital tree called a suffix tree. The length of a repeated substring corresponds to the typical depth in the associated suffix tree. Our main finding shows that the typical depth in a suffix tree is asymptotically distributed in the same manner as the typical depth in a digital tree that stores independent keys (i.e., independent tries). More precisely, we prove that the typical depth in a suffix tree built from the first n suffixes of a random word is normally distributed with the mean asymptotically becoming $\frac{1}{h_1} \log n$ and the variance $\alpha \cdot \log n$, where...
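To make the notion of a subword that occurs at least twice concrete, here is a small sketch that finds the longest such subword by sorting suffixes and comparing neighbours. It is a naive stand-in for a suffix tree, chosen for brevity, and it reports the longest repeat (related to the height of the tree) rather than the typical depth studied in the abstract. The function name is hypothetical.

```python
def longest_repeated_substring(word):
    """Return the longest substring that occurs at least twice in `word`.

    Occurrences may overlap. Sorting all suffixes places suffixes that share
    a long common prefix next to each other, so it is enough to compare
    adjacent pairs. This takes quadratic space and is only meant for short
    inputs; a suffix tree would do the same job far more efficiently.
    """
    suffixes = sorted(word[i:] for i in range(len(word)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best
```

For a random word over a finite alphabet, one would expect the length returned here to grow logarithmically in the word length, in line with the logarithmic scale that appears in the abstract.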
Probability Metrics and Recursive Algorithms
Cited by 54 (9 self)
In this paper it is shown by several examples that probability metrics are a useful tool to study the asymptotic behaviour of (stochastic) recursive algorithms. The basic idea of this approach is to find a 'suitable' probability metric which yields contraction properties of the transformations describing the limits of the algorithm. In order to demonstrate the wide range of applicability of this contraction method we investigate examples from various fields, some of which have already been analyzed in the literature.
On the Distribution for the Duration of a Randomized Leader Election Algorithm
Ann. Appl. Probab., 1996
Cited by 44 (11 self)
We investigate the duration of an elimination process for identifying a winner by coin tossing, or, equivalently, the height of a random incomplete trie. Applications of the process include the election of a leader in a computer network. Using direct probabilistic arguments we obtain exact expressions for the discrete distribution and the moments of the height. Elementary approximation techniques then yield asymptotics for the distribution. We show that no limiting distribution exists, as the asymptotic expressions exhibit periodic fluctuations. In many similar problems associated with digital trees, no such exact expressions can be derived. We therefore outline a powerful general approach, based on the analytic techniques of Mellin transforms, Poissonization, and de-Poissonization, from which distributional asymptotics for the height can also be derived. In fact, it was this complex variables approach that led to our original discovery of the exact distribution. Complex analysis metho...
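The elimination process can be simulated directly. The sketch below assumes the usual form of the rule, which the abstract does not spell out: in each round every remaining candidate tosses a fair coin, those who toss heads advance, and if nobody tosses heads the whole group repeats the round. The function names are hypothetical.

```python
import random
from collections import Counter


def election_rounds(n, rng):
    """Number of coin-tossing rounds until one of n candidates remains.

    Assumed rule: candidates tossing heads advance; if no one tosses heads,
    everybody stays in and the round is repeated (it still counts).
    """
    rounds = 0
    while n > 1:
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if heads > 0:
            n = heads
        rounds += 1
    return rounds


def duration_histogram(n, trials, seed=0):
    """Empirical distribution of the duration over independent trials."""
    rng = random.Random(seed)
    return Counter(election_rounds(n, rng) for _ in range(trials))


if __name__ == "__main__":
    for n in (64, 96, 128):
        hist = duration_histogram(n, trials=2000)
        print(n, sorted(hist.items()))
```

Comparing the histograms for several values of n is one informal way to look at the fluctuating behaviour mentioned in the abstract.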
Burst Tries: A Fast, Efficient Data Structure for String Keys
ACM Transactions on Information Systems, 2002
Cited by 42 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
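A rough sketch of the general idea follows, under several assumptions that are not taken from the paper: containers are plain Python dictionaries mapping the remaining suffix to a counter, a container is burst into a trie node once it holds more than `limit` distinct suffixes, and all class and method names are hypothetical. The paper itself explores other container types and burst heuristics.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # first character -> TrieNode or container (dict)
        self.end_count = 0  # occurrences of the string ending exactly here


class BurstTrie:
    def __init__(self, limit=4):
        self.limit = limit  # assumed burst threshold (distinct suffixes)
        self.root = TrieNode()

    def add(self, word):
        node, i = self.root, 0
        while i < len(word):
            child = node.children.get(word[i])
            if child is None:
                child = node.children[word[i]] = {}  # new empty container
            if isinstance(child, TrieNode):
                node, i = child, i + 1
                continue
            suffix = word[i + 1:]
            child[suffix] = child.get(suffix, 0) + 1
            if len(child) > self.limit:
                node.children[word[i]] = self._burst(child)
            return
        node.end_count += 1

    def _burst(self, container):
        """Replace a container by a trie node with one level of containers."""
        node = TrieNode()
        for suffix, count in container.items():
            if suffix == "":
                node.end_count += count
            else:
                node.children.setdefault(suffix[0], {})[suffix[1:]] = count
        return node

    def count(self, word):
        node = self.root
        for i, ch in enumerate(word):
            child = node.children.get(ch)
            if child is None:
                return 0
            if isinstance(child, dict):
                return child.get(word[i + 1:], 0)
            node = child
        return node.end_count

    def items(self, node=None, prefix=""):
        """Yield (string, count) pairs in lexicographic order."""
        node = self.root if node is None else node
        if node.end_count:
            yield prefix, node.end_count
        for ch in sorted(node.children):
            child = node.children[ch]
            if isinstance(child, dict):
                for suffix in sorted(child):
                    yield prefix + ch + suffix, child[suffix]
            else:
                yield from self.items(child, prefix + ch)
```

Sorting happens only inside the small containers during traversal, which is one way to read the "sorted or near-sorted order" property: the trie levels are already ordered by character, and only the container contents need local sorting.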
Entropy Computations Via Analytic Depoissonization
IEEE Trans. Information Theory, 1998
Cited by 36 (12 self)
We investigate the basic question of information theory, namely, evaluation of Shannon entropy, and a more general Rényi entropy, for some discrete distributions (e.g., binomial, negative binomial, etc.). We aim at establishing analytic methods (i.e., those in which complex analysis plays a pivotal role) for such computations which often yield estimates of unparalleled precision. The main analytic tool used here is that of analytic poissonization and depoissonization. We illustrate our approach on the entropy evaluation of the binomial distribution, that is, we prove that for the Binomial(n, p) distribution the entropy $h_n$ becomes $h_n \sim \frac{1}{2} \ln n + \frac{1}{2} + \ln \sqrt{2\pi p(1-p)} + \sum_{k \ge 1} a_k n^{-k}$, where $a_k$ are explicitly computable constants. Moreover, we shall argue that analytic methods (e.g., complex asymptotics such as Rice's method and singularity analysis, Mellin transforms, poissonization and depoissonization) can offer new tools for information theory, especially for studying ...
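As a quick numerical sanity check of the leading terms of the binomial entropy expansion, the exact entropy can be summed directly and compared with $\frac{1}{2}\ln n + \frac{1}{2} + \ln\sqrt{2\pi p(1-p)}$. This sketch uses natural logarithms (entropy in nats), ignores the correction terms $a_k n^{-k}$, and uses function names chosen for the example.

```python
import math


def binomial_entropy(n, p):
    """Exact Shannon entropy (in nats) of Binomial(n, p), for 0 < p < 1."""
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for k in range(n + 1):
        # log of C(n, k) p^k (1-p)^(n-k), computed via log-gamma for stability
        log_pk = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
        total -= math.exp(log_pk) * log_pk
    return total


def leading_terms(n, p):
    """Leading part of the expansion, without the a_k / n^k corrections."""
    return (0.5 * math.log(n) + 0.5
            + math.log(math.sqrt(2.0 * math.pi * p * (1.0 - p))))


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        exact = binomial_entropy(n, 0.3)
        print(n, exact, leading_terms(n, 0.3), exact - leading_terms(n, 0.3))
```

The printed difference should shrink as n grows, consistent with a remainder made of negative powers of n.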
A probabilistic analysis of some tree algorithms
Annals of Applied Probability, 2005
Cited by 21 (6 self)
In this paper a general class of tree algorithms is analyzed. It is shown that, by using an appropriate probabilistic representation of the quantities of interest, the asymptotic behavior of these algorithms can be obtained quite easily without resorting to the usual complex analysis techniques. This approach gives a unified probabilistic treatment of these questions. It simplifies and extends some of the results known in this domain.
Profile of Tries
2006
Cited by 21 (8 self)
Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to the (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) with the same distance from the root. It is a function of the number of strings stored in a trie and the distance from the root. Several, if not all, trie parameters such as height, size, depth, shortest path, and fill-up level can be uniformly analyzed through the (external and internal) profiles. Although profiles represent one of the most fundamental parameters of tries, they have hardly been studied in the past. The analysis of profiles is surprisingly arduous, but once it is carried out it reveals unusually intriguing and interesting behavior. We present a detailed study of the distribution of the profiles in a trie built over random strings generated by a memoryless source. We first derive recurrences satisfied by the expected profiles and solve them asymptotically for all possible ranges of the distance from the root. It appears that profiles of tries exhibit several fascinating phenomena. When moving from the root to the leaves of a trie, the growth of the expected profiles varies. Near the root, the external profiles tend to zero at an exponential rate, then the rate gradually rises to being logarithmic; the external profiles then abruptly tend to infinity, first logarithmically
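The definition of the profile can be illustrated with a small sketch that counts internal and external nodes per level of a trie built over a set of strings. It assumes the strings are distinct and that none is a prefix of another (for example, distinct strings of equal length); the function name and the source parameters in the usage part are hypothetical.

```python
import random
from collections import Counter


def trie_profiles(strings, depth=0, external=None, internal=None):
    """Return (external, internal) profiles of the trie over `strings`.

    external[k] counts leaves (one stored string each) at distance k from
    the root; internal[k] counts branching nodes at distance k. Assumes
    distinct strings, none of which is a prefix of another.
    """
    if external is None:
        external, internal = Counter(), Counter()
    if len(strings) == 1:
        external[depth] += 1
    elif len(strings) > 1:
        internal[depth] += 1
        groups = {}
        for s in strings:
            groups.setdefault(s[depth], []).append(s)
        for group in groups.values():
            trie_profiles(group, depth + 1, external, internal)
    return external, internal


if __name__ == "__main__":
    # Hypothetical memoryless binary source with P('0') = 0.7.
    rng = random.Random(0)
    keys = {"".join(rng.choices("01", weights=[0.7, 0.3], k=64))
            for _ in range(1000)}
    ext, _ = trie_profiles(sorted(keys))
    for level in sorted(ext):
        print(level, ext[level])
```

Printing the external profile level by level shows how the counts are close to zero near the root, rise in the middle levels, and fall off again toward the deepest leaves.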
Multidimensional Digital Searching And Some New Parameters In Tries
1993
Cited by 13 (2 self)
Multidimensional digital searching (M-d tries) is analyzed from the viewpoint of partial match retrieval. Our first result extends the analysis of Flajolet and Puech of the average cost of retrieval under the Bernoulli model to biased probabilities of symbol occurrences in a key. The second main finding concerns the variance of the cost of the retrieval in the unbiased case. This variance is of order $O(N^{1-s/M})$, where N is the number of records stored in an M-d trie, and s is the number of specified components in a query of size M. For M = 2 and s = 1 we present a detailed analysis of the variance, which identifies the constant at $\sqrt{N}$. This analysis, which is the central part of our paper, requires certain series transformation identities which go back to Ramanujan. In the Appendix we provide a Mellin transform approach to these results. This work was supported by Fonds zur Förderung der Wissenschaftlichen Forschung Project P7497TEC. Support for this research was pr...