Results 1-10 of 15
Autocorrelation On Words And Its Applications: Analysis of Suffix Trees by String-Ruler Approach
J. Combin. Theory Ser. A, 1994
Cited by 54 (23 self)
We study in a probabilistic framework some topics concerning the way words can overlap. Our probabilistic model assumes that a word is a sequence of i.i.d. symbols generated from a finite alphabet. This defines the so-called Bernoulli model. We investigate the length of a subword that can be recopied, that is, a subword that occurs at least twice in a given word. An occurrence of such a repeated substring is easy to detect in a digital tree called a suffix tree. The length of a repeated substring corresponds to the typical depth in the associated suffix tree. Our main finding shows that the typical depth in a suffix tree is asymptotically distributed in the same manner as the typical depth in a digital tree that stores independent keys (i.e., independent tries). More precisely, we prove that the typical depth in a suffix tree built from the first $n$ suffixes of a random word is normally distributed with the mean asymptotically becoming $(1/h_1) \log n$ and the variance $\alpha \cdot \log n$, where...
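As a rough illustration of the repeated-subword quantity described in this abstract, the following Python sketch measures, by brute force, how long a prefix of a given suffix reappears elsewhere in a random word drawn from a Bernoulli (i.i.d.) model. It does not build a suffix tree, and the function names `longest_repeat_at`, `entropy`, and `bernoulli_word` are hypothetical helpers introduced only for this example.

```python
import math
import random


def longest_repeat_at(word, i):
    """Length of the longest prefix of word[i:] that also starts at some j != i."""
    best = 0
    for j in range(len(word)):
        if j == i:
            continue
        k = 0
        # Extend the match while both positions stay inside the word.
        while i + k < len(word) and j + k < len(word) and word[i + k] == word[j + k]:
            k += 1
        best = max(best, k)
    return best


def entropy(probs):
    """Entropy (natural logarithm) of a memoryless source with these symbol probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def bernoulli_word(n, probs, rng):
    """Sequence of n i.i.d. symbols 0..len(probs)-1 drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=n)


if __name__ == "__main__":
    rng = random.Random(1)          # fixed seed, purely for repeatability
    probs = [0.3, 0.7]              # assumed example source, not from the paper
    n = 300
    word = bernoulli_word(n, probs, rng)
    mean_len = sum(longest_repeat_at(word, i) for i in range(n)) / n
    print("average repeat length:", mean_len)
    print("log(n) / entropy:     ", math.log(n) / entropy(probs))
```

The two printed numbers can be compared informally for growing `n`; the sketch makes no attempt to reproduce the paper's exact constants or its limiting distribution.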
A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors
SIAM J. Computing, 1996
Cited by 52 (29 self)
Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compression and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees, further called b-suffix trees, built from the first $n$ suffixes of a random word. In this family a noncompact suffix tree (i.e., such that every edge is labeled by a single symbol) is represented by $b = 1$, and a compact suffix tree (i.e., without unary nodes) is asymptotically equivalent to $b \to \infty$ as $n \to \infty$. We study several parameters of b-suffix trees, namely: the depth of a given suffix, the depth of insertion, the height and the shortest feasible path. Some new results concerning typical (i.e., almost sure) behaviors of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology and universal data compression schemes. Key Wo...
Dynamical Sources in Information Theory: A General Analysis of Trie Structures
ALGORITHMICA, 1999
Cited by 50 (7 self)
Digital trees, also known as tries, are a general purpose flexible data structure that implements dictionaries built on sets of words. An analysis is given of three major representations of tries in the form of array-tries, list-tries, and bst-tries ("ternary search tries"). The size and the search costs of the corresponding representations are analysed precisely in the average case, while a complete distributional analysis of height of tries is given. The unifying data model used is that of dynamical sources and it encompasses classical models like those of memoryless sources with independent symbols, of finite Markov chains, and of nonuniform densities. The probabilistic behaviour of the main parameters, namely size, path length, or height, appears to be determined by two intrinsic characteristics of the source: the entropy and the probability of letter coincidence. These characteristics are themselves related in a natural way to spectral properties of specific transfer operators of the Ruelle type.
Asymptotic Properties Of Data Compression And Suffix Trees
IEEE Trans. Inform. Theory, 1993
Cited by 40 (11 self)
Recently, Wyner and Ziv have proved that the typical length of a repeated subword found within the first $n$ positions of a stationary ergodic sequence is $(1/h) \log n$ in probability, where $h$ is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data compression schemes, most notably the Lempel-Ziv data compression algorithm. Wyner and Ziv have also conjectured that their result can be extended to a stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so-called right domain asymptotic, that is, during a dynamic phase of expanding the data base. We prove, under an additional assumption involving mixing conditions, that the length of a typical repeated subword oscillates almost surely (a.s.) between $(1/h_1) \log n$ and $(1/h_2) \log n$, where $0 < h_2 < h \le h_1 < \infty$. We also show that the length of the $n$th block in the Lempel-Ziv parsing algorithm exhibits similar behavior. We relate our findings to...
Burst Tries: A Fast, Efficient Data Structure for String Keys
ACM Transactions on Information Systems, 2002
Cited by 28 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
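The abstract does not spell out the burst trie's internals, so the following Python sketch only illustrates the general idea under stated assumptions: trie nodes are dictionaries, leaves are small unsorted containers (plain lists of remaining suffixes), and a container is "burst" into a new trie node once it grows past a fixed `limit`. The class name, the list-based containers, and the size-based burst rule are choices made for this example and are not taken from the paper.

```python
class BurstTrie:
    """Toy burst trie storing a set of distinct strings (illustrative only)."""

    def __init__(self, limit=4):
        self.limit = limit   # assumed burst threshold: max suffixes per container
        self.root = {}       # trie node: char -> dict (trie node) or list (container)
                             # the key "" marks a string that ends at this node

    def insert(self, key):
        node, i = self.root, 0
        while True:
            if i == len(key):
                node[""] = True              # key is exhausted at a trie node
                return
            c = key[i]
            child = node.get(c)
            if child is None:
                node[c] = [key[i + 1:]]      # start a new container
                return
            if isinstance(child, dict):
                node, i = child, i + 1       # descend through the access trie
                continue
            suffix = key[i + 1:]
            if suffix not in child:
                child.append(suffix)
            if len(child) > self.limit:
                node[c] = self._burst(child)
            return

    @staticmethod
    def _burst(container):
        """Replace a container by a trie node with one container per leading character."""
        new_node = {}
        for s in container:
            if s == "":
                new_node[""] = True
            else:
                new_node.setdefault(s[0], []).append(s[1:])
        return new_node

    def __contains__(self, key):
        node = self.root
        for i, c in enumerate(key):
            child = node.get(c)
            if child is None:
                return False
            if isinstance(child, list):
                return key[i + 1:] in child
            node = child
        return "" in node

    def __iter__(self):
        """Yield the stored strings in lexicographic order."""
        yield from self._walk(self.root, "")

    def _walk(self, node, prefix):
        if "" in node:
            yield prefix
        for c in sorted(k for k in node if k != ""):
            child = node[c]
            if isinstance(child, list):
                for s in sorted(child):      # containers are sorted only on traversal
                    yield prefix + c + s
            else:
                yield from self._walk(child, prefix + c)
```

For example, inserting a handful of words with `BurstTrie(limit=3)` and then calling `list(trie)` returns them in sorted order. A container produced by a burst may itself exceed the limit; in this sketch it is simply burst again on the next insertion that reaches it.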
Profile of Tries
2006
Cited by 18 (8 self)
Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to the (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) with the same distance from the root. It is a function of the number of strings stored in a trie and the distance from the root. Several, if not all, trie parameters such as height, size, depth, shortest path, and fill-up level can be uniformly analyzed through the (external and internal) profiles. Although profiles represent one of the most fundamental parameters of tries, they have hardly been studied in the past. The analysis of profiles is surprisingly arduous but once it is carried out it reveals unusually intriguing and interesting behavior. We present a detailed study of the distribution of the profiles in a trie built over random strings generated by a memoryless source. We first derive recurrences satisfied by the expected profiles and solve them asymptotically for all possible ranges of the distance from the root. It appears that profiles of tries exhibit several fascinating phenomena. When moving from the root to the leaves of a trie, the growth of the expected profiles varies. Near the root, the external profiles tend to zero at an exponential rate, then the rate gradually rises to being logarithmic; the external profiles then abruptly tend to infinity, first logarithmically
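To make the notion of a profile concrete, here is a small Python sketch that counts external nodes (leaves holding one string) and internal nodes (branching points holding two or more strings) at each distance from the root. It assumes the input strings are distinct and of equal length, so that no string is a prefix of another; the function name `trie_profiles` is hypothetical and chosen for this example.

```python
from collections import Counter


def trie_profiles(strings):
    """Return (external, internal) Counters mapping depth -> number of nodes.

    Assumes distinct strings of equal length, so every group of two or more
    strings can always be split further on its next symbol.
    """
    external, internal = Counter(), Counter()

    def build(group, depth):
        if len(group) == 1:
            external[depth] += 1        # a single string is stored in a leaf
            return
        internal[depth] += 1            # two or more strings: a branching node
        branches = {}
        for s in group:
            branches.setdefault(s[depth], []).append(s)
        for branch in branches.values():
            build(branch, depth + 1)

    if strings:
        build(list(strings), 0)
    return external, internal
```

With this in hand, the height of the trie is `max(external)`, the size is `sum(internal.values())`, and the shortest path is `min(external)`, which is one way to see how the parameters listed in the abstract can be read off the profiles.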
Laws of large numbers and tail inequalities for random tries and Patricia trees
Journal of Computational and Applied Mathematics, 2002
Cited by 15 (5 self)
We consider random tries and random Patricia trees constructed from $n$ independent strings of symbols drawn from any distribution on any discrete space. If $H_n$ is the height of this tree, we show that $H_n/\mathbb{E}\{H_n\}$ tends to one in probability. Additional tail inequalities are given for the height, depth, size, and profile of these trees and ordinary tries that apply without any conditions on the string distributions; they need not even be identically distributed.
Asymptotic Behavior Of The Height In A Digital Search Tree And The Longest Phrase Of The Lempel-Ziv Scheme
SIAM J. Computing, 2000
Cited by 11 (5 self)
We study the height of a digital search tree (DST) built from $n$ random strings generated by an unbiased memoryless source (i.e., all symbols are equally likely). We shall argue that the height of such a tree is equivalent to the length of the longest phrase in the Lempel-Ziv parsing scheme that partitions a random sequence into $n$ phrases. We also analyze the longest phrase in the Lempel-Ziv scheme in which a string of fixed length $m$ is parsed into a random number of phrases. In the course of our analysis, we shall identify four natural regions of the height distribution and characterize them asymptotically for large $n$. In particular, for the region where most of the probability mass is concentrated, the asymptotic distribution of the height exhibits an exponential of a Gaussian distribution (with an oscillating term) around the most probable value $k_1 = \lfloor \log_2 n + \sqrt{2 \log_2 n} - \log_2(\sqrt{2 \log_2 n}) + \frac{1}{\log 2} - \frac{1}{2} \rfloor + 1$. More precisely, we shall prove that the asymptoti...
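The link between Lempel-Ziv phrases and a digital tree can be illustrated with a short Python sketch of incremental (LZ78-style) parsing: each new phrase is the shortest prefix of the remaining text that has not yet appeared as a phrase, and the phrases are stored along paths of a tree of nested dictionaries. This is an illustration under that assumed parsing rule, not the paper's analysis; the name `lz78_phrases` is hypothetical.

```python
def lz78_phrases(text):
    """Split text into Lempel-Ziv (LZ78-style) phrases using a tree of dicts."""
    root = {}
    phrases = []
    node, start = root, 0
    for i, ch in enumerate(text):
        if ch in node:
            node = node[ch]                      # still inside a known phrase
        else:
            node[ch] = {}                        # new tree node ends a new phrase
            phrases.append(text[start:i + 1])
            node, start = root, i + 1
    if start < len(text):
        phrases.append(text[start:])             # trailing, already-seen remainder
    return phrases


if __name__ == "__main__":
    parsed = lz78_phrases("ababbabb")
    print(parsed)
    print("longest phrase length:", max(len(p) for p in parsed))
```

When the text ends exactly at a phrase boundary, the longest phrase length equals the depth of the deepest node in the tree, which is the quantity the abstract relates to the height of a digital search tree.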
Greedy Algorithms For The Shortest Common Superstring That Are Asymptotically Optimal
1997
Cited by 5 (4 self)
There has recently been a resurgence of interest in the shortest common superstring problem due to its important applications in molecular biology (e.g., recombination of DNA) and data compression. The problem is NP-hard, but it has been known for some time that greedy algorithms work well for this problem. More precisely, it was proved in a recent sequence of papers that in the worst case a greedy algorithm produces a superstring that is at most $\beta$ times ($2 \le \beta \le 4$) worse than optimal. We analyze the problem in a probabilistic framework, and consider the optimal total overlap $O_n^{\mathrm{opt}}$ and the overlap $O_n^{\mathrm{gr}}$ produced by various greedy algorithms. These turn out to be asymptotically equivalent. We show that with high probability $\lim_{n \to \infty} \frac{O_n^{\mathrm{opt}}}{n \log n} = \lim_{n \to \infty} \frac{O_n^{\mathrm{gr}}}{n \log n} = \frac{1}{H}$, where $n$ is the number of original strings, and $H$ is the entropy of the underlying alphabet. Our results hold under the condition that none of the strings is too short.
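The abstract refers to "various greedy algorithms" without describing them here, so the following Python sketch shows only one common greedy heuristic as an assumption: repeatedly merge the ordered pair of strings with the largest suffix-prefix overlap, and accumulate the total overlap along the way. The function names `overlap` and `greedy_superstring` are hypothetical, and the quadratic pair search is chosen for clarity rather than speed.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Return (superstring, total_overlap) produced by greedy pairwise merging."""
    # Drop duplicates and strings already contained in another input string.
    items = sorted(s for s in set(strings)
                   if not any(s != t and s in t for t in strings))
    total_overlap = 0
    while len(items) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:               # first maximal pair wins ties
                        best_k, best_i, best_j = k, i, j
        merged = items[best_i] + items[best_j][best_k:]
        total_overlap += best_k
        items = [s for idx, s in enumerate(items)
                 if idx not in (best_i, best_j)] + [merged]
    return (items[0] if items else ""), total_overlap
```

For instance, `greedy_superstring(["abcd", "cdef", "efgh"])` merges the three strings into a single superstring containing each input; the accumulated overlap plays the role of the greedy overlap in the abstract, computed for one particular tie-breaking rule.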