Results 1  10
of
39
Burst Tries: A Fast, Efficient Data Structure for String Keys
 ACM Transactions on Information Systems
, 2002
"... Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, t ..."
Abstract

Cited by 28 (10 self)
 Add to MetaCart
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or nearsorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
Profile of Tries
, 2006
"... Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) wi ..."
Abstract

Cited by 18 (8 self)
 Add to MetaCart
Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) with the same distance from the root. It is a function of the number of strings stored in a trie and the distance from the root. Several, if not all, trie parameters such as height, size, depth, shortest path, and fillup level can be uniformly analyzed through the (external and internal) profiles. Although profiles represent one of the most fundamental parameters of tries, they have been hardly studied in the past. The analysis of profiles is surprisingly arduous but once it is carried out it reveals unusually intriguing and interesting behavior. We present a detailed study of the distribution of the profiles in a trie built over random strings generated by a memoryless source. We first derive recurrences satisfied by the expected profiles and solve them asymptotically for all possible ranges of the distance from the root. It appears that profiles of tries exhibit several fascinating phenomena. When moving from the root to the leaves of a trie, the growth of the expected profiles vary. Near the root, the external profiles tend to zero in an exponentially rate, then the rate gradually rises to being logarithmic; the external profiles then abruptly tend to infinity, first logarithmically
A probabilistic analysis of some tree algorithms, in "Annals of Applied Probability
, 2005
"... In this paper a general class of tree algorithms is analyzed. It is shown that, by using an appropriate probabilistic representation of the quantities of interest, the asymptotic behavior of these algorithms can be obtained quite easily without resorting to the usual complex analysis techniques. Thi ..."
Abstract

Cited by 16 (5 self)
 Add to MetaCart
In this paper a general class of tree algorithms is analyzed. It is shown that, by using an appropriate probabilistic representation of the quantities of interest, the asymptotic behavior of these algorithms can be obtained quite easily without resorting to the usual complex analysis techniques. This approach gives a unified probabilistic treatment of these questions. It simplifies and extends some of the results known in this domain. 1. Introduction. A
Laws of large numbers and tail inequalities for random tries and Patricia trees
 Journal of Computational and Applied Mathematics
, 2002
"... Abstract. We consider random tries and random patricia trees constructed from n independent strings of symbols drawn from any distribution on any discrete space. If Hn is the height of this tree, we show that Hn/E{Hn} tends to one in probability. Additional tail inequalities are given for the height ..."
Abstract

Cited by 15 (5 self)
 Add to MetaCart
Abstract. We consider random tries and random patricia trees constructed from n independent strings of symbols drawn from any distribution on any discrete space. If Hn is the height of this tree, we show that Hn/E{Hn} tends to one in probability. Additional tail inequalities are given for the height, depth, size, and profile of these trees and ordinary tries that apply without any conditions on the string distributions—they need not even be identically distributed.
The NDTree: A Dynamic Indexing Technique for Multidimensional Nonordered Discrete Data Spaces
 In Proc. of VLDB
, 2003
"... Similarity searches in multidimensional Nonordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as genome sequence databases. ..."
Abstract

Cited by 14 (8 self)
 Add to MetaCart
Similarity searches in multidimensional Nonordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as genome sequence databases.
The Complete Analysis of a Polynomial Factorization Algorithm Over Finite Fields
, 2001
"... This paper derives basic probabilistic properties of random polynomials over finite fields that are of interest in the study of polynomial factorization algorithms. We show that the main characteristics of random polynomial can be treated systematically by methods of "analytic combinatorics" based o ..."
Abstract

Cited by 14 (3 self)
 Add to MetaCart
This paper derives basic probabilistic properties of random polynomials over finite fields that are of interest in the study of polynomial factorization algorithms. We show that the main characteristics of random polynomial can be treated systematically by methods of "analytic combinatorics" based on the combined use of generating functions and of singularity analysis. Our object of study is the classical factorization chain which is described in Fig. 1 and which, despite its simplicity, does not appear to have been totally analysed so far. In this paper, we provide a complete averagecase analysis.
Continued Fractions, Comparison Algorithms, and Fine Structure Constants
, 2000
"... There are known algorithms based on continued fractions for comparing fractions and for determining the sign of 2x2 determinants. The analysis of such extremely simple algorithms leads to an incursion into a surprising variety of domains. We take the reader through a light tour of dynamical systems ..."
Abstract

Cited by 10 (2 self)
 Add to MetaCart
There are known algorithms based on continued fractions for comparing fractions and for determining the sign of 2x2 determinants. The analysis of such extremely simple algorithms leads to an incursion into a surprising variety of domains. We take the reader through a light tour of dynamical systems (symbolic dynamics), number theory (continued fractions), special functions (multiple zeta values), functional analysis (transfer operators), numerical analysis (series acceleration), and complex analysis (the Riemann hypothesis). These domains all eventually contribute to a detailed characterization of the complexity of comparison and sorting algorithms, either on average or in probability.
Dynamic indexing for multidimensional nonordered discrete data spaces using a datapartitioning approach
 ACM Trans. Datab. Syst
, 2006
"... Similarity searches in multidimensional Nonordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as bioinformatics, biometrics, data mining and Ecommerce. Efficient similarity searches require robust indexing techniques. Unfortunately, existing indexing ..."
Abstract

Cited by 9 (5 self)
 Add to MetaCart
Similarity searches in multidimensional Nonordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as bioinformatics, biometrics, data mining and Ecommerce. Efficient similarity searches require robust indexing techniques. Unfortunately, existing indexing methods developed for multidimensional (ordered) Continuous Data Spaces (CDS) such as the Rtree cannot be directly applied to an NDDS. This is because some essential geometric concepts/properties such as the minimum bounding region and the area of a region in a CDS are no longer valid in an NDDS. Other indexing methods based on metric spaces such as the Mtree and the Slimtrees are too general to effectively utilize the special characteristics of NDDSs, resulting in nonoptimized performance. In this paper, we propose a new dynamic datapartitioningbased indexing technique, called the NDtree, to support efficient similarity searches in an NDDS. The key idea is to extend the relevant geometric concepts as well as some indexing strategies used in CDSs to NDDSs. Efficient algorithms for NDtree construction and techniques to solve relevant issues such as handling dimensions with different alphabets in an NDDS are presented. Our experimental results on synthetic data and real genome sequence data demonstrate that the NDtree outperforms the linear scan, the Mtree and the Slimtrees for similarity searches in multidimensional NDDSs. A theoretical model is also developed to predict the performance of the NDtree for random data.
Hidden Word Statistics
"... We consider the sequence comparison problem, also known as "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is... ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
We consider the sequence comparison problem, also known as "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
A spacepartitioningbased indexing method for multidimensional nonordered discrete data spaces
 ACM Trans. on Information Syst
, 2006
"... There is an increasing demand for similarity searches in a multidimensional nonordered discrete data space (NDDS) from application areas such as bioinformatics and data mining. The nonordered and discrete nature of an NDDS raises new challenges for developing efficient indexing methods for similar ..."
Abstract

Cited by 8 (4 self)
 Add to MetaCart
There is an increasing demand for similarity searches in a multidimensional nonordered discrete data space (NDDS) from application areas such as bioinformatics and data mining. The nonordered and discrete nature of an NDDS raises new challenges for developing efficient indexing methods for similarity searches. In this article, we propose a new indexing technique, called the NSPtree, to support efficient similarity searches in an NDDS. As we know, overlap causes a performance degradation for indexing methods (e.g., the Rtree) for a continuous data space. In an NDDS, this problem is even worse due to the limited number of elements available on each dimension of an NDDS. The key idea of the NSPtree is to use a novel discrete spacepartitioning (SP) scheme to ensure no overlap at each level in the tree. A number of heuristics and strategies are incorporated into the tree construction algorithms to deal with the challenges for developing an SPbased index tree for an NDDS. Our experiments demonstrate that the NSPtree is quite promising in supporting efficient similarity searches in NDDSs. We have compared the NSPtree with the NDtree, a datapartitioningbased indexing technique for NDDSs that was proposed recently, and the linear scan using different NDDSs. It was found that the search performance of the NSPtree was better than those of both methods.