Results 1 - 8 of 8
Burst Tries: A Fast, Efficient Data Structure for String Keys
- ACM Transactions on Information Systems, 2002
Cited by 21 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
Trie methods for text and spatial data on secondary storage
- 1995
Cited by 7 (2 self)
This thesis presents three trie organizations for various binary tries. The new trie structures have two distinctive features: (1) they store no pointers and require two bits per node in the worst case, and (2) they partition tries into pages and are suitable for secondary storage. We apply trie structures to indexing, storing and querying both text and spatial data on secondary storage. We are interested in practical problems such as storage compactness, I/O efficiency, and large trie construction. We use our tries to index and search arbitrary substrings of a text. For an index of 100 million keys, our trie is 10%-25% smaller than the best known method. This difference is important since the index size is crucial for trie methods. We provide methods for dynamic tries and allow texts to be changed. We also use our tries to compress and approximately search large dictionaries. Our algorithm can find strings with k mismatches in sublinear time. To our knowledge, no other published sublinear algorithm is known for this problem. Besides, we use our tries to store and query spatial data such as maps. A trie structure is proposed to permit querying and retrieving spatial data at arbitrary levels of resolution, without reading from secondary storage any more data than is needed for the specified resolution. The trie structure also compresses spatial data substantially. The performance results on map data have confirmed our expectations: the querying cost is linear in the amount of data needed and independent of the data size in practice. We give algorithms for a set of sample queries including geometrical selection, geometrical join and the nearest neighbour. We also show how to control query cost by specifying an acceptable resolution.
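To make the phrase "find strings with k mismatches" concrete, here is a small sketch of a mismatch-bounded search over an ordinary in-memory character trie. It is not the thesis's pointerless, paged binary trie, and it makes no claim about the sublinear bound; the nested-dict representation, the `END` marker, and the function names are assumptions for illustration. The point it shows is that a trie lets the search abandon a whole subtree as soon as the mismatch budget is exceeded.

```python
END = ""  # hypothetical end-of-word marker; the empty string is never a character


def build_trie(words):
    """Build a nested-dict trie: each node maps a character to a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def search_k_mismatches(root, query, k):
    """Return stored words of len(query) that differ from query in at most k positions."""
    found = []

    def walk(node, depth, mismatches, path):
        if depth == len(query):
            if END in node:
                found.append("".join(path))
            return
        for ch, child in node.items():
            if ch == END:
                continue
            cost = mismatches + (ch != query[depth])
            if cost <= k:  # prune the whole subtree once the budget is spent
                path.append(ch)
                walk(child, depth + 1, cost, path)
                path.pop()

    walk(root, 0, 0, [])
    return sorted(found)
```

With the words `trie`, `tree`, `trip`, `tried`, and `true` stored, a query for `trie` with `k = 1` returns the four words of length four, while `k = 0` returns only `trie`.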
Performance of Data Structures for Small Sets of Strings
- Proc. of the Australasian conference on Computer Science, 2002
Cited by 2 (0 self)
Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures -- different forms of trees, tries, and hash tables -- for the task of managing sets of millions of strings, and have developed new variants of each that are more efficient for this task than previous alternatives. In this paper we test the performance of the same data structures on small sets of strings, in the context of document processing for index construction. Our results show that the new structures, in particular our burst trie, are the most efficient choice for this task, thus demonstrating that they are suitable for managing sets of hundreds to millions of distinct strings, and for input of hundreds to billions of occurrences.
Improved Behaviour of Tries by the "Symmetrization" of the Source
Cited by 1 (0 self)
In this paper, we propose and study a pre-processing technique for improving the performance of digital tree (trie)-based search algorithms under asymmetric memoryless sources. This technique (which we call a symmetrization of the source) bijectively maps the sequences of symbols from the original (asymmetric) source into symbols of an output alphabet, resulting in a more uniform distribution. We introduce a criterion of efficiency for such a mapping, and demonstrate that the problem of finding a symmetrization transform that is optimal for a given source (or universal) is equivalent to the problem of constructing a minimum-redundancy variable-length-to-block code for this source (or class of sources). Based on this result, we propose search algorithms that incorporate known variable-length-to-block codes (both optimal for a given source and universal) and study their asymptotic behaviour. We complement our analysis with a description of an efficient algorithm for universal symmetrization of binary memoryless sources, and compare the performance of the resulting search structure with that of standard tries.
Query Based Client Indexing in Client/Server Information Systems
© 2007 Science Publications
One issue in client/server information systems is the storage of the relationships between clients and the data used by these clients. In particular, in scenarios that allow the caching of data on the client site, this information can be used to keep the global database consistent. Thus, if the data on the server are updated, it is possible to detect the caches affected by the update. In a subsequent step, these caches can be either patched or invalidated. In this study we discuss approaches that use posted queries to index the clients on the server site.
Analysis of a Class of Tries with Adaptive Multi-Digit Branching
We study a class of adaptive multi-digit tries in which the number of digits $r_n$ processed by nodes with $n$ incoming strings is such that, in the memoryless model (with $n \to \infty$), $r_n \to \frac{\log n}{\eta}$ (in probability), where $\eta$ is an algorithm-specific constant. Examples of known data structures from this class include LC-tries (Andersson and Nilsson, 1993), "relaxed" LC-tries (Nilsson and Tikkanen, 1998), tries with logarithmic selection of degrees of nodes, etc. We show that the average depth $D_n$ of such tries in the asymmetric memoryless model has the following asymptotic behavior (with $n \to \infty$): $D_n = \frac{\log \log n}{-\log(1 - h/\eta)} \, (1 + o(1))$, where $n$ is the number of strings inserted in the trie, and $h$ is the entropy of the source. We use this formula to compare the performance of known adaptive trie structures, and to predict properties of other possible implementations of tries in this class.
On Time-Space Efficiency of Digital Trees with Adaptive Multi-Digit Branching
- 2003
We consider a class of digital trees (tries) with adaptive selection of degrees of their nodes. This class includes LC-tries of Andersson and Nilsson (1993), which recursively replace all complete subtrees in the original tries with larger (multi-digit) nodes, as well as dynamic tries of Nilsson and Tikkanen (1998), which recursively replace all subtrees of bounded sparseness (the ratio of the number of missing nodes at the last level to the total number of nodes at this level). In this paper we study the average behavior of such tries with respect to a hybrid time/space efficiency criterion. We demonstrate that there exists an interesting connection between the efficiency and sparseness of nodes in adaptive tries. In particular, we show that in a symmetric memoryless model, the nodes that are optimal in the sense of time/space efficiency are 1/e (≈ 36.8%) sparse. On the other hand, if the source is asymmetric, the sparseness of the time/space efficient nodes is somewhat larger, asymptotically (with a large number of strings) approaching 50%. These results can be used to support the trie construction algorithm of Nilsson and Tikkanen, and suggest the optimal choice of constants in this procedure.
Redesigning the String Hash Table, Burst Trie, and BST to Exploit Cache
- 2011
A key decision when developing in-memory computing applications is choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with move-to-front chains and the burst trie, both of which use linked lists as a substructure, and variants of binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not efficient in use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cache-friendly variants of fundamental data structures can yield remarkable gains in performance.
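The "more drastic step" described above, replacing each chain of nodes by a contiguous array of characters, can be sketched roughly as follows. This is an assumed layout for illustration only: each bucket is a single `bytearray` of length-prefixed strings, and the class name `ArrayHashSet`, the one-byte length prefix, and the use of Python's built-in `hash` are not taken from the article. Python will not reproduce the cache behaviour the article measures; the sketch only shows the data layout.

```python
class ArrayHashSet:
    """String set whose buckets are contiguous byte arrays instead of linked lists."""

    def __init__(self, slots=1024):
        # One bytearray per slot; each holds records of the form [length][bytes...]
        self.buckets = [bytearray() for _ in range(slots)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    @staticmethod
    def _find(bucket, key):
        """Scan the bucket record by record; return the record offset or -1."""
        pos = 0
        while pos < len(bucket):
            length = bucket[pos]
            if bucket[pos + 1:pos + 1 + length] == key:
                return pos
            pos += 1 + length
        return -1

    def add(self, word):
        """Insert word; return True if it was new, False if already present."""
        key = word.encode("utf-8")
        if len(key) > 255:
            raise ValueError("this sketch uses a one-byte length prefix")
        bucket = self._bucket(key)
        if self._find(bucket, key) >= 0:
            return False
        bucket.append(len(key))  # length prefix
        bucket.extend(key)       # string bytes stored inline, no per-string node
        return True

    def __contains__(self, word):
        key = word.encode("utf-8")
        return self._find(self._bucket(key), key) >= 0
```

The design point is that a lookup touches one contiguous region per bucket rather than following a pointer per string, and the only per-string overhead in this layout is the length byte.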

