Red-Black Trie Hashing
, 1995
Cited by 4 (0 self)
Trie hashing is a scheme, proposed by Litwin, for indexing records with very long alphanumeric keys. The records are grouped into buckets of capacity b and maintained on secondary storage. To retrieve a record, the memory-resident trie is traversed from the root to a leaf node, where the address of the target bucket is found. The data bucket at that address is then read into memory and searched to determine whether the record is present. For all practical purposes, the scheme locates a record in one or two disk accesses. Unlike a trie, the scheme suffers from potential degeneracy when the inserted keys are ordered, and has an expensive reconstruction cost if a system failure occurs during a session. We present a new approach to implementing Trie Hashing that resolves the degeneracy problem. Our approach combines the basic trie hashing algorithm with the balancing techniques of the Red-Black Binary Search Tree to produce a relatively balanced trie hashing scheme. As...
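The retrieval path described in this abstract (walk an in-memory trie to a leaf, take the bucket address stored there, read that one bucket, search it) can be sketched as follows. This is a minimal illustration under stated assumptions, not Litwin's node format or the paper's red-black variant: the `Inner` layout (a character position plus a split character), the names `find_bucket` and `lookup`, and the use of a dict in place of secondary storage are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Leaf:
    bucket_address: int  # address of the data bucket on (simulated) disk


@dataclass
class Inner:
    # Assumed, simplified node layout: inspect one character of the key
    # and branch on a comparison with a split character.
    position: int
    split_char: str
    left: "Node"
    right: "Node"


Node = Union[Leaf, Inner]


def find_bucket(root: Node, key: str) -> int:
    """Traverse the memory-resident trie from the root to a leaf."""
    node = root
    while isinstance(node, Inner):
        # Keys shorter than `position` compare as the empty string and go left.
        ch = key[node.position] if node.position < len(key) else ""
        node = node.left if ch <= node.split_char else node.right
    return node.bucket_address


def lookup(root: Node, buckets: Dict[int, List[str]], key: str) -> bool:
    """Read the addressed bucket (one simulated disk access) and search it."""
    address = find_bucket(root, key)
    bucket = buckets[address]  # stands in for reading the bucket from disk
    return key in bucket


# Hypothetical three-bucket example.
root = Inner(0, "m", Leaf(0), Inner(0, "s", Leaf(1), Leaf(2)))
buckets = {0: ["apple", "kiwi"], 1: ["pear"], 2: ["zebra"]}
```

With this toy trie, `lookup(root, buckets, "pear")` touches only bucket 1. The trie itself lives in memory, so the bucket read is the only storage access in the sketch.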
An analysis of the height of tries with random weights on the edges
- Combinatorics, Probability and Computing
Cited by 2 (2 self)
We analyze the weighted height of random tries built from independent strings of i.i.d. symbols on the finite alphabet {1,..., d}. The edges receive random weights whose distribution depends upon the number of strings that visit that edge. Such a model covers the hybrid tries of de la Briandais (1959) and the TST of Bentley and Sedgewick (1997), where the search time for a string can be decomposed as a sum of processing times for each symbol in the string. Our weighted trie model also permits one to study maximal path imbalance. In all cases, the weighted height is shown to be asymptotic to c log n in probability, where c is determined by the behavior of the core of the trie (the part where all nodes have a full set of children) and the fringe of the trie (the part of the trie where nodes have only one child and form spaghetti-like trees). It can be found by maximizing a function that is related to the Cramér exponent of the distribution of the edge weights.
Distribution of inter-node distances in digital trees
- in 2005 International Conference on Analysis of Algorithms, C. Martínez (ed.), Discrete Mathematics and Theoretical Computer Science, Proceedings AD
, 2005
Cited by 1 (0 self)
We investigate distances between pairs of nodes in digital trees (digital search trees (DST), and tries). By analytic techniques, such as the Mellin Transform and poissonization, we describe a program to determine the moments of these distances. The program is illustrated on the mean and variance. One encounters delayed Mellin transform equations, which we solve by inspection. Interestingly, the unbiased case gives a bounded variance, whereas the biased case gives a variance growing with the number of keys. It is therefore possible in the biased case to show that an appropriately normalized version of the distance converges to a limit. The complexity of moment calculation increases substantially with each higher moment; a shortcut to the limit is needed via a method that avoids the computation of all moments. Toward this end, we utilize the contraction method to show that in biased digital search trees the distribution of a suitably normalized version of the distances approaches a limit that is the fixed-point solution (in the Wasserstein space) of a distributional equation. An explicit solution to the fixed-point equation is readily demonstrated to be Gaussian.
Distances in random digital search trees
, 2006
Cited by 1 (1 self)
Distances between nodes in random trees is a popular topic, and several classes of trees have recently been investigated. We look into this matter in digital search trees. By analytic techniques, such as the Mellin Transform and poissonization, we describe a program to determine the moments of these distances. The program is illustrated on the mean and variance. One encounters delayed Mellin transform equations, which we solve by inspection. In addition to various asymptotics, we give an exact expression for the mean and for the variance in the unbiased case. Interestingly, the unbiased case gives a bounded variance, whereas the biased case gives a variance growing with the number of keys. It is therefore possible in the biased case to show that an appropriately normalized version of the distance converges to a limit. The complexity of moment calculation increases substantially with each higher moment; it is prudent to seek a shortcut to the limit via a method that avoids the computation of all moments. Toward this end, we utilize the contraction method to show that in biased digital search trees the distribution of a suitably normalized version of the distances approaches a limit that is the fixed-point solution of a distributional equation (distances being measured in the Wasserstein metric space). An explicit solution to the fixed-point equation is readily demonstrated to be Gaussian.
Red-Black Balanced Trie Hashing
, 1995
Trie hashing is a scheme, proposed by Litwin, for indexing records with very long alphanumeric keys. The records are grouped into buckets with a capacity of b records each and maintained on secondary storage. To retrieve a record, the memory-resident trie is traversed from the root to a leaf node, where the address of the target bucket is found. The data bucket at that address is then read into memory and searched to determine whether the record is present. For all practical purposes, the scheme locates a record in one or two disk accesses. Unlike a trie, the scheme suffers from (i) potential degeneracy when the inserted keys are ordered and (ii) an expensive reconstruction cost if a system failure occurs during a session. We present a new approach to implementing Trie Hashing that resolves the problem of potential degeneracy. Our approach combines the basic trie hashing algorithm with the balancing techniques of the Red-Black Binary Search Tree to produce a relatively balanced tr...
Efficient Discovery of Common Substructures in Macromolecules
- In IEEE Intl. Conference on Data Mining ’02
, 2002
Biological macromolecules play a fundamental role in disease; therefore, they are of great interest to fields such as pharmacology and chemical genomics. Yet due to macromolecules' complexity, development of effective techniques for elucidating structure-function macromolecular relationships has been ill explored. Previous techniques have focused either on sequence analysis, which only approximates structure-function relationships, or on small coordinate datasets, an approach that does not scale to large datasets or handle noise. We present a novel scalable approach to efficiently discover macromolecule substructures based on three-dimensional coordinate data, without domain-specific knowledge. The approach combines structure-based frequent pattern discovery with search space reduction and coordinate noise handling. We analyze computational performance compared to traditional approaches, validate that our approach can discover meaningful substructures in noisy macromolecule data by automated discovery of primary and secondary protein structures, and show that our technique is superior to sequence-based approaches at determining structural, and thus functional, similarity between proteins.
The Height and Size of Random Hash Trees and Random Pebbled Hash Trees
, 1999
The random hash tree and the N-tree were introduced by Ehrlich in 1981. In the random hash tree, n data points are hashed to values X_1, ..., X_n, independently and identically distributed random variables taking values that are uniformly distributed on [0, 1]. Place the X_i's in n equal-sized buckets as in hashing with chaining. For each bucket with at least two points, repeat the same process, keeping the branch factor always equal to the number of bucketed points. If H_n is the height of the tree obtained in this manner, we show that H_n / log_2 n → 1 in probability. In the random pebbled hash tree, we remove one point randomly and place it in the present node (as with the digital search tree modification of a trie) and perform the bucketing step as above on the remaining points (if any). With this simple modification, H_n [...] in probability. We also show that the expected number of nodes in the random hash tree and random pebbled hash tree is asymptotic to 2.3020238...n and 1.4183342...n, respectively.
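The construction in this abstract is concrete enough to simulate. The sketch below builds the (non-pebbled) random hash tree recursively and returns its height; it is an illustration only, not code from the paper. The function name `hash_tree_height` is hypothetical, and the convention that a node holding at most one point has height 0 is an assumption. Points are assumed distinct, since identical values would never be separated by repeated bucketing.

```python
import math
import random


def hash_tree_height(points, lo=0.0, hi=1.0):
    """Height of the hash tree built on `points`, all lying in [lo, hi].

    A node with k >= 2 points splits its interval into k equal-sized
    buckets (branch factor = number of points) and recurses into each.
    Assumed convention: a node with 0 or 1 points has height 0.
    """
    k = len(points)
    if k <= 1:
        return 0
    width = (hi - lo) / k
    buckets = [[] for _ in range(k)]
    for x in points:
        # Clamp the index so x == hi and rounding errors stay in range.
        i = min(max(int((x - lo) / width), 0), k - 1)
        buckets[i].append(x)
    return 1 + max(
        hash_tree_height(b, lo + i * width, lo + (i + 1) * width)
        for i, b in enumerate(buckets)
    )


if __name__ == "__main__":
    random.seed(0)
    n = 10_000
    h = hash_tree_height([random.random() for _ in range(n)])
    print(f"n={n}, height={h}, height/log2(n)={h / math.log2(n):.3f}")
```

The printed ratio is a single sample at one value of n, so it only loosely indicates the limiting behaviour stated in the abstract.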
Contracted Suffix Trees: A Simple and Dynamic Text Indexing Data Structure
We address the problem of finding the locations of all instances of a string P in a text T, where preprocessing of T is allowed to facilitate the queries. Previous data structures for this problem include the suffix tree, the suffix array, and the compact DAWG. We modify a data structure called a sequence tree, which was proposed by Coffman and Eve for hashing, and adapt it to the new problem. We can then produce a list of k occurrences of any string P in T in O(||P|| + k) time. Because of properties shared by suffixes of a text that are not shared by arbitrary hash keys, we can build the structure in O(||T||) time, which is much faster than Coffman and Eve’s algorithm. These bounds are as good as those for the suffix tree, suffix array, and the compact DAWG. The advantages are the elementary nature of some of the algorithms for constructing and using the data structure and the asymptotic bounds we can give for updating the data structure when the text is edited.
The total path length of split trees
, 2011
We consider the model of random trees introduced by Devroye [SIAM J Comput 28, 409–432, 1998]. The model encompasses many important randomized algorithms and data structures. The pieces of data (items) are stored in a randomized fashion in the nodes of a tree. The total path length (sum of depths of the items) is a natural measure of the efficiency of the algorithm/data structure. Using renewal theory, we prove convergence in distribution of the total path length towards a distribution characterized uniquely by a fixed point equation. Our result covers, using a unified approach, many data structures such as binary search trees, m-ary search trees, quad trees, median-of-(2k + 1) trees, and simplex trees.
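To make the quantity in this abstract concrete, the sketch below computes the total path length (sum of item depths) for the simplest structure it lists, a plain binary search tree. It illustrates the definition only, not the split-tree model or the renewal-theory argument. The function name `bst_total_path_length`, the list-based node layout, and the convention that the root has depth 0 are assumptions.

```python
import random


def bst_total_path_length(keys):
    """Insert `keys` in order into an unbalanced BST; return the sum of depths.

    Each node is a list [key, left, right]. The root is taken to have
    depth 0 (an assumed convention); duplicates are sent to the right.
    """
    root = None
    total = 0
    for k in keys:
        if root is None:
            root = [k, None, None]
            continue
        node, depth = root, 0
        while True:
            depth += 1
            side = 1 if k < node[0] else 2  # index of left or right child
            if node[side] is None:
                node[side] = [k, None, None]
                total += depth
                break
            node = node[side]
    return total


if __name__ == "__main__":
    random.seed(0)
    keys = list(range(1000))
    random.shuffle(keys)  # random insertion order gives a random BST
    print(bst_total_path_length(keys))
```

A perfectly balanced tree on seven keys, built by inserting `[4, 2, 6, 1, 3, 5, 7]`, has two items at depth 1 and four at depth 2, for a total path length of 10. Inserting sorted keys produces a path, and the total grows quadratically.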

