Results 1-10 of 15
Burst Tries: A Fast, Efficient Data Structure for String Keys
 ACM Transactions on Information Systems
, 2002
Cited by 31 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
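To make the idea in this abstract concrete, here is a minimal Python sketch of a burst trie. It assumes the commonly described design in which trie nodes route on one character and leaves are small containers that "burst" into a new trie node once they grow past a threshold. The names `BurstTrie`, `bucket_limit`, and `END` are hypothetical, and the unsorted-list container and the size-based burst rule are simplifying assumptions; the abstract does not specify the paper's container types or burst heuristics.

```python
END = ""  # hypothetical marker key: a string ends exactly at this trie node


class BurstTrie:
    """Sketch of a burst trie: dict nodes route on one character,
    list buckets hold the remaining suffixes of a few strings."""

    def __init__(self, bucket_limit=4):
        self.bucket_limit = bucket_limit  # assumed burst threshold
        self.root = {}

    def insert(self, key):
        node, i = self.root, 0
        while True:
            if i == len(key):
                node[END] = True          # key is fully consumed at a trie node
                return
            c = key[i]
            child = node.get(c)
            if child is None:
                node[c] = [key[i + 1:]]   # start a new bucket with the suffix
                return
            if isinstance(child, dict):   # internal trie node: keep descending
                node, i = child, i + 1
                continue
            suffix = key[i + 1:]
            if suffix not in child:
                child.append(suffix)
                if len(child) > self.bucket_limit:
                    node[c] = self._burst(child)
            return

    @staticmethod
    def _burst(bucket):
        """Replace a bucket by a trie node with one smaller bucket per lead character."""
        new_node = {}
        for suffix in bucket:
            if suffix == "":
                new_node[END] = True
            else:
                new_node.setdefault(suffix[0], []).append(suffix[1:])
        return new_node

    def __contains__(self, key):
        node = self.root
        for i, c in enumerate(key):
            child = node.get(c)
            if child is None:
                return False
            if isinstance(child, list):
                return key[i + 1:] in child
            node = child
        return node.get(END, False)

    def __iter__(self):
        """Yield all stored strings in sorted order."""
        yield from self._walk(self.root, "")

    def _walk(self, node, prefix):
        if node.get(END):
            yield prefix
        for c in sorted(k for k in node if k != END):
            child = node[c]
            if isinstance(child, list):
                for suffix in sorted(child):  # buckets are sorted only when read
                    yield prefix + c + suffix
            else:
                yield from self._walk(child, prefix + c)


words = ["the", "then", "there", "trie", "tree", "to", "a", "and", "the"]
trie = BurstTrie(bucket_limit=2)
for w in words:
    trie.insert(w)
```

In this sketch the buckets stay unsorted and are sorted only during traversal, which is one way to read the abstract's "sorted or near-sorted order"; a real implementation would choose its container and burst policy experimentally.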
Trie methods for text and spatial data on secondary storage
, 1995
Cited by 7 (2 self)
This thesis presents three trie organizations for various binary tries. The new trie structures have two distinctive features: (1) they store no pointers and require two bits per node in the worst case, and (2) they partition tries into pages and are suitable for secondary storage. We apply trie structures to indexing, storing and querying both text and spatial data on secondary storage. We are interested in practical problems such as storage compactness, I/O efficiency, and large trie construction. We use our tries to index and search arbitrary substrings of a text. For an index of 100 million keys, our trie is 10%-25% smaller than the best known method. This difference is important since the index size is crucial for trie methods. We provide methods for dynamic tries and allow texts to be changed. We also use our tries to compress and approximately search large dictionaries. Our algorithm can find strings with k mismatches in sublinear time. To our knowledge, no other published sublinear algorithm is known for this problem. Besides, we use our tries to store and query spatial data such as maps. A trie structure is proposed to permit querying and retrieving spatial data at arbitrary levels of resolution, without reading from secondary storage any more data than is needed for the specified resolution. The trie structure also compresses spatial data substantially. The performance results on map data have confirmed our expectations: the querying cost is linear in the amount of data needed and independent of the data size in practice. We give algorithms for a set of sample queries including geometrical selection, geometrical join and the nearest neighbour. We also show how to control query cost by specifying an acceptable resolution.
Trie-join: a trie-based method for efficient string similarity joins
 THE VLDB JOURNAL
, 2012
Cited by 4 (2 self)
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for the data sets with short strings (the average string length is not larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.
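The general idea of indexing strings in a trie and pruning whole subtries during an edit-distance search can be illustrated with a short Python sketch. This is not the paper's trie-join algorithm: it swaps in the textbook technique of carrying one dynamic-programming row per trie node and abandoning a subtrie as soon as every entry of the row exceeds the threshold. The function names (`build_trie`, `search`, `similarity_join`) and the parameter name `tau` are hypothetical.

```python
def build_trie(words):
    """Index words in a dict-based trie; the key None stores the word ending at a node."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = w
    return root


def search(root, query, tau):
    """Return all indexed words within edit distance tau of query."""
    results = []

    def walk(node, row):
        # row[j] = edit distance between the current trie prefix and query[:j]
        if None in node and row[-1] <= tau:
            results.append(node[None])
        for ch, child in node.items():
            if ch is None:
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match or substitution
            # Subtrie pruning: no extension of this prefix can get back under tau.
            if min(new_row) <= tau:
                walk(child, new_row)

    walk(root, list(range(len(query) + 1)))
    return results


def similarity_join(words, tau):
    """Self-join sketch: all unordered pairs of distinct words within distance tau."""
    root = build_trie(words)
    pairs = set()
    for w in words:
        for m in search(root, w, tau):
            if m != w:
                pairs.add((min(w, m), max(w, m)))
    return sorted(pairs)


pairs = similarity_join(["trie", "tree", "tries", "join", "joins", "hash"], tau=1)
```

Here each string is searched against the trie independently, so shared prefixes among the query strings are not exploited; a join algorithm that traverses the trie once for both sides would avoid that repeated work.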
Redesigning the String Hash Table, Burst Trie, and BST to Exploit Cache
, 2011
Cited by 2 (1 self)
A key decision when developing in-memory computing applications is choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with move-to-front chains and the burst trie, both of which use linked lists as a substructure, and variants of binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not efficient in use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cache-friendly variants of fundamental data structures can yield remarkable gains in performance.
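The second alternative mentioned above, replacing a linked list of string nodes with one contiguous array of characters, can be sketched in Python as follows. The layout is an assumption made for illustration: each string is stored as a one-byte length followed by its UTF-8 bytes, packed back to back in a single `bytearray`. The class name `ArrayBucket` is hypothetical, and the abstract does not state the encoding the article actually uses.

```python
class ArrayBucket:
    """Sketch of a pointer-free bucket: strings packed back to back in one
    bytearray, each preceded by a 1-byte length (assumed layout)."""

    def __init__(self):
        self.data = bytearray()

    def _entries(self):
        """Yield (start, length) of every stored string by scanning the array."""
        pos = 0
        while pos < len(self.data):
            n = self.data[pos]
            yield pos + 1, n
            pos += 1 + n

    def contains(self, s):
        raw = s.encode("utf-8")
        for start, n in self._entries():
            # Compare lengths first, then bytes; no pointers are followed.
            if n == len(raw) and self.data[start:start + n] == raw:
                return True
        return False

    def add(self, s):
        """Append s if absent; return True when it was newly stored."""
        raw = s.encode("utf-8")
        if len(raw) > 255:
            raise ValueError("sketch supports strings of at most 255 bytes")
        if self.contains(s):
            return False
        self.data.append(len(raw))
        self.data += raw
        return True

    def __iter__(self):
        for start, n in self._entries():
            yield self.data[start:start + n].decode("utf-8")


bucket = ArrayBucket()
bucket.add("cat")
bucket.add("dog")
```

In this layout the per-string overhead is one length byte instead of a node header and a next pointer, and a search scans one contiguous region of memory, which is the cache-related property the abstract emphasizes.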
Performance of Data Structures for Small Sets of Strings
 Proc. of the Australasian conference on Computer Science
, 2002
Cited by 2 (0 self)
Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures (different forms of trees, tries, and hash tables) for the task of managing sets of millions of strings, and have developed new variants of each that are more efficient for this task than previous alternatives. In this paper we test the performance of the same data structures on small sets of strings, in the context of document processing for index construction. Our results show that the new structures, in particular our burst trie, are the most efficient choice for this task, thus demonstrating that they are suitable for managing sets of hundreds to millions of distinct strings, and for input of hundreds to billions of occurrences.
Improved Behaviour of Tries by the "Symmetrization" of the Source
Cited by 1 (0 self)
In this paper, we propose and study a preprocessing technique for improving performance of digital tree (trie)-based search algorithms under asymmetric memoryless sources. This technique (which we call a symmetrization of the source) bijectively maps the sequences of symbols from the original (asymmetric) source into symbols of an output alphabet resulting in a more uniform distribution. We introduce a criterion of efficiency for such a mapping, and demonstrate that a problem of finding an optimal for a given source (or universal) symmetrization transform is equivalent to a problem of constructing a minimum redundancy variable-length-to-block code for this source (or class of sources). Based on this result, we propose search algorithms that incorporate known (optimal for a given source and universal) variable-length-to-block codes and study their asymptotic behaviour. We complement our analysis with a description of an efficient algorithm for universal symmetrization of binary memoryless sources, and compare the performance of the resulting search structure with the standard tries.
Analysis of a Class of Tries with Adaptive Multi-Digit Branching
We study a class of adaptive multi-digit tries, in which the numbers of digits $r_n$ processed by nodes with $n$ incoming strings are such that, in the memoryless model (with $n \to \infty$): $r_n \to \frac{\log n}{\eta}$ (pr.), where $\eta$ is an algorithm-specific constant. Examples of known data structures from this class include LC-tries (Andersson and Nilsson, 1993), "relaxed" LC-tries (Nilsson and Tikkanen, 1998), tries with logarithmic selection of degrees of nodes, etc. We show that the average depth $D_n$ of such tries in the asymmetric memoryless model has the following asymptotic behavior (with $n \to \infty$): $D_n = \frac{\log \log n}{-\log(1 - h/\eta)} \, (1 + o(1))$, where $n$ is the number of strings inserted in the trie, and $h$ is the entropy of the source. We use this formula to compare performance of known adaptive trie structures, and to predict properties of other possible implementations of tries in this class.
On Time-Space Efficiency of Digital Trees with Adaptive Multi-Digit Branching
, 2003
We consider a class of digital trees (tries) with adaptive selection of degrees of their nodes. This class includes LC-tries of Andersson and Nilsson (1993), which recursively replace all complete subtrees in the original tries with larger (multi-digit) nodes, as well as dynamic tries of Nilsson and Tikkanen (1998), which recursively replace all subtrees of bounded sparseness (a ratio of the number of missing nodes at the last level to the total number of nodes at this level). In this paper we study the average behavior of such tries with respect to a hybrid time/space efficiency criterion. We demonstrate that there exists an interesting connection between the efficiency and sparseness of nodes in adaptive tries. In particular, we show that in a symmetric memoryless model, the nodes that are optimal in the sense of time/space efficiency are 1/e-times (≈ 36.8%) sparse. On the other hand, if the source is asymmetric, the sparseness of the time/space efficient nodes is somewhat larger, asymptotically (with a large number of strings) approaching 50%. These results can be used to support the trie construction algorithm of Nilsson and Tikkanen, and suggest the optimal choice of constants in this procedure.
Information filtering and query . . .
, 2009
In the information filtering paradigm, clients subscribe to a server with continuous queries or profiles that express their information needs. Clients can also publish documents to servers. Whenever a document is published, the continuous queries satisfying this document are found and notifications are sent to appropriate clients. This article deals with the filtering problem that needs to be solved efficiently by each server: Given a database of continuous queries db and a document d, find all queries q ∈ db that match d. We present data structures and indexing algorithms that enable us to solve the filtering problem efficiently for large databases of queries expressed in the model AWP. AWP is based on named attributes with values of type text, and its query language includes
I. A PROGRESS REPORT ON SMART
, 1965
The SMART project was initiated in the fall of 1962 with the objective of designing, and implementing on a computer, a fully automatic document retrieval system, capable of processing documents and search requests available in English, and of retrieving those documents most nearly