Burst Tries: A Fast, Efficient Data Structure for String Keys
ACM Transactions on Information Systems, 2002
Cited by 28 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
Profile of Tries
2006
Cited by 18 (8 self)
Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) with the same distance from the root. It is a function of the number of strings stored in a trie and the distance from the root. Several, if not all, trie parameters such as height, size, depth, shortest path, and fill-up level can be uniformly analyzed through the (external and internal) profiles. Although profiles represent one of the most fundamental parameters of tries, they have hardly been studied in the past. The analysis of profiles is surprisingly arduous, but once it is carried out it reveals unusually intriguing and interesting behavior. We present a detailed study of the distribution of the profiles in a trie built over random strings generated by a memoryless source. We first derive recurrences satisfied by the expected profiles and solve them asymptotically for all possible ranges of the distance from the root. It appears that profiles of tries exhibit several fascinating phenomena. When moving from the root to the leaves of a trie, the growth of the expected profiles varies. Near the root, the external profiles tend to zero at an exponential rate, then the rate gradually rises to being logarithmic; the external profiles then abruptly tend to infinity, first logarithmically
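As an illustration of the profile defined in this abstract, the following is a minimal sketch that builds a trie implicitly and counts internal and external nodes per depth. The function name `trie_profiles` is hypothetical, and the sketch assumes the input strings are distinct and that none is a prefix of another; it is not taken from the paper.

```python
from collections import Counter


def trie_profiles(strings):
    """Return (internal, external) profiles as Counters keyed by depth.

    Assumption: the strings are distinct and prefix-free, so every string
    ends up alone in an external node after finitely many splits.
    """
    internal, external = Counter(), Counter()

    def build(group, depth):
        if len(group) == 1:
            # A single remaining string is stored in an external node.
            external[depth] += 1
            return
        # Two or more strings share this prefix: an internal node.
        internal[depth] += 1
        buckets = {}
        for s in group:
            buckets.setdefault(s[depth], []).append(s)
        for sub in buckets.values():
            build(sub, depth + 1)

    if strings:
        build(list(strings), 0)
    return internal, external
```

For the strings "000", "001", "01" and "1", this gives one internal node at each of depths 0, 1 and 2, and external nodes at depths 1 (one), 2 (one) and 3 (two).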
A probabilistic analysis of some tree algorithms
Annals of Applied Probability, 2005
Cited by 16 (5 self)
In this paper a general class of tree algorithms is analyzed. It is shown that, by using an appropriate probabilistic representation of the quantities of interest, the asymptotic behavior of these algorithms can be obtained quite easily without resorting to the usual complex analysis techniques. This approach gives a unified probabilistic treatment of these questions. It simplifies and extends some of the results known in this domain.
Laws of large numbers and tail inequalities for random tries and Patricia trees
Journal of Computational and Applied Mathematics, 2002
Cited by 15 (5 self)
We consider random tries and random Patricia trees constructed from n independent strings of symbols drawn from any distribution on any discrete space. If H_n is the height of this tree, we show that H_n/E{H_n} tends to one in probability. Additional tail inequalities are given for the height, depth, size, and profile of these trees and ordinary tries that apply without any conditions on the string distributions; they need not even be identically distributed.
The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces
In Proc. of VLDB, 2003
Cited by 14 (8 self)
Similarity searches in multidimensional Non-ordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as genome sequence databases.
The Complete Analysis of a Polynomial Factorization Algorithm Over Finite Fields
2001
Cited by 14 (3 self)
This paper derives basic probabilistic properties of random polynomials over finite fields that are of interest in the study of polynomial factorization algorithms. We show that the main characteristics of random polynomials can be treated systematically by methods of "analytic combinatorics" based on the combined use of generating functions and of singularity analysis. Our object of study is the classical factorization chain which is described in Fig. 1 and which, despite its simplicity, does not appear to have been totally analysed so far. In this paper, we provide a complete average-case analysis.
Continued Fractions, Comparison Algorithms, and Fine Structure Constants
2000
Cited by 10 (2 self)
There are known algorithms based on continued fractions for comparing fractions and for determining the sign of 2x2 determinants. The analysis of such extremely simple algorithms leads to an incursion into a surprising variety of domains. We take the reader through a light tour of dynamical systems (symbolic dynamics), number theory (continued fractions), special functions (multiple zeta values), functional analysis (transfer operators), numerical analysis (series acceleration), and complex analysis (the Riemann hypothesis). These domains all eventually contribute to a detailed characterization of the complexity of comparison and sorting algorithms, either on average or in probability.
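To make the idea of comparing fractions through continued fractions concrete, here is a minimal sketch. It expands both fractions quotient by quotient and decides at the first difference, flipping the orientation at each level. The function name `compare_fractions` and the restriction to non-negative numerators and positive denominators are assumptions made for illustration; the algorithms analysed in the paper may differ in detail.

```python
def compare_fractions(a, b, c, d):
    """Return -1, 0 or 1 as a/b is less than, equal to or greater than c/d.

    Assumes a, c >= 0 and b, d > 0 (an assumption of this sketch).
    """
    sign = 1  # +1 while a larger quotient means a larger fraction
    while True:
        qa, ra = divmod(a, b)
        qc, rc = divmod(c, d)
        if qa != qc:
            # First differing partial quotient decides the comparison.
            return sign if qa > qc else -sign
        if ra == 0 and rc == 0:
            return 0  # both expansions end together: equal fractions
        if ra == 0:
            return -sign  # a/b equals the quotient; c/d is strictly above it
        if rc == 0:
            return sign
        # Continue with the reciprocals of the fractional parts;
        # taking reciprocals reverses the order, so flip the sign.
        a, b = b, ra
        c, d = d, rc
        sign = -sign
```

Only integer division and remainders are used, so no products of numerators and denominators are ever formed.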
Dynamic indexing for multidimensional non-ordered discrete data spaces using a data-partitioning approach
ACM Trans. Datab. Syst., 2006
Cited by 9 (5 self)
Similarity searches in multidimensional Non-ordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as bioinformatics, biometrics, data mining and E-commerce. Efficient similarity searches require robust indexing techniques. Unfortunately, existing indexing methods developed for multidimensional (ordered) Continuous Data Spaces (CDS) such as the R-tree cannot be directly applied to an NDDS. This is because some essential geometric concepts/properties such as the minimum bounding region and the area of a region in a CDS are no longer valid in an NDDS. Other indexing methods based on metric spaces such as the M-tree and the Slim-trees are too general to effectively utilize the special characteristics of NDDSs, resulting in non-optimized performance. In this paper, we propose a new dynamic data-partitioning-based indexing technique, called the ND-tree, to support efficient similarity searches in an NDDS. The key idea is to extend the relevant geometric concepts as well as some indexing strategies used in CDSs to NDDSs. Efficient algorithms for ND-tree construction and techniques to solve relevant issues such as handling dimensions with different alphabets in an NDDS are presented. Our experimental results on synthetic data and real genome sequence data demonstrate that the ND-tree outperforms the linear scan, the M-tree and the Slim-trees for similarity searches in multidimensional NDDSs. A theoretical model is also developed to predict the performance of the ND-tree for random data.
Hidden Word Statistics
Cited by 9 (3 self)
We consider the sequence comparison problem, also known as "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
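The distinction drawn in this abstract between a subsequence and a string of consecutive symbols can be shown with a short sketch. The function name `occurs_as_subsequence` is hypothetical and the code only illustrates the search problem as stated, not any statistic studied in the paper.

```python
def occurs_as_subsequence(pattern, text):
    """Return True if the symbols of pattern appear in text in order,
    with arbitrary gaps allowed between them (a "hidden" occurrence)."""
    remaining = iter(text)
    # Each membership test consumes the iterator up to the matched symbol,
    # so later pattern symbols must be found further to the right.
    return all(symbol in remaining for symbol in pattern)
```

For example, "ace" occurs as a subsequence of "abcde" although it is not a substring of it, while "aec" does not occur because the order of symbols is not respected.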
Distributional convergence for the number of symbol comparisons used by QuickSort
2012
Cited by 9 (3 self)
Most previous studies of the sorting algorithm QuickSort have used the number of key comparisons as a measure of the cost of executing the algorithm. Here we suppose that the n independent and identically distributed (iid) keys are each represented as a sequence of symbols from a probabilistic source and that QuickSort operates on individual symbols, and we measure the execution cost as the number of symbol comparisons. Assuming only a mild “tameness” condition on the source, we show that there is a limiting distribution for the number of symbol comparisons after normalization: first centering by the mean and then dividing by n. Additionally, under a condition that grows more restrictive as p increases, we have convergence of moments of orders p and smaller. In particular, we have convergence in distribution and convergence of moments of every order whenever the source is memoryless, i.e., whenever each key is generated as an infinite string of iid symbols. This is somewhat surprising: Even for the classical model that each key is an iid string of unbiased (“fair”) bits, the mean exhibits periodic fluctuations of order n.
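The cost measure described in this abstract can be illustrated with a small sketch that sorts string keys with QuickSort and counts symbol comparisons rather than key comparisons. The names `compare_keys` and `quicksort_symbol_cost` are hypothetical. The sketch assumes finite keys, a first-element pivot, and one unit of cost per pair of symbols examined; the paper's model (infinite symbol sequences from a probabilistic source) is more general.

```python
def compare_keys(x, y, counter):
    """Compare two keys symbol by symbol, adding one to counter[0]
    for every pair of symbols examined. Returns -1, 0 or 1."""
    for a, b in zip(x, y):
        counter[0] += 1
        if a != b:
            return -1 if a < b else 1
    # One key is a prefix of the other: the shorter key sorts first.
    return (len(x) > len(y)) - (len(x) < len(y))


def quicksort_symbol_cost(keys):
    """Return (sorted keys, total number of symbol comparisons)."""
    counter = [0]

    def sort(items):
        if len(items) <= 1:
            return items
        pivot, rest = items[0], items[1:]
        smaller, larger = [], []
        for key in rest:
            if compare_keys(key, pivot, counter) < 0:
                smaller.append(key)
            else:
                larger.append(key)
        return sort(smaller) + [pivot] + sort(larger)

    return sort(list(keys)), counter[0]
```

Sorting the keys "ba", "ab", "aa" this way uses three key comparisons but four symbol comparisons, because the comparison of "aa" with "ab" has to look at two symbols.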