Results 1-10 of 17
Burst Tries: A Fast, Efficient Data Structure for String Keys
ACM Transactions on Information Systems, 2002
Cited by 31 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
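The abstract above does not spell out how a burst trie works, so the following is only a minimal Python sketch of the general idea, not the authors' implementation: an access trie whose leaves are small containers, where a container that grows past a threshold is "burst" into a new trie node. The names `TrieNode`, `BurstTrie`, and `limit`, the use of a plain dict as the container, and the burst rule (more than `limit` distinct strings) are all assumptions made for illustration.

```python
class TrieNode:
    """Access-trie node with one child slot per leading character."""

    def __init__(self):
        self.children = {}  # char -> TrieNode, or a dict container {suffix: count}
        self.count = 0      # occurrences of the string that ends exactly at this node


class BurstTrie:
    def __init__(self, limit=4):
        self.root = TrieNode()
        self.limit = limit  # hypothetical threshold: max distinct strings per container

    def add(self, word):
        """Record one occurrence of word."""
        node, i = self.root, 0
        while i < len(word):
            ch = word[i]
            child = node.children.setdefault(ch, {})
            if isinstance(child, TrieNode):
                node, i = child, i + 1  # descend one character
                continue
            # child is a container holding the suffixes that follow ch
            suffix = word[i + 1:]
            child[suffix] = child.get(suffix, 0) + 1
            if len(child) > self.limit:
                node.children[ch] = self._burst(child)
            return
        node.count += 1  # the word ends exactly at a trie node

    @staticmethod
    def _burst(container):
        """Replace a container by a trie node with smaller child containers."""
        node = TrieNode()
        for suffix, count in container.items():
            if suffix == "":
                node.count = count
            else:
                # A child container may itself be over the limit; in this
                # sketch it is burst on the next insertion that reaches it.
                node.children.setdefault(suffix[0], {})[suffix[1:]] = count
        return node

    def items(self):
        """Yield (string, count) pairs in sorted order."""
        yield from self._walk(self.root, "")

    def _walk(self, node, prefix):
        if node.count:
            yield prefix, node.count
        for ch in sorted(node.children):
            child = node.children[ch]
            if isinstance(child, TrieNode):
                yield from self._walk(child, prefix + ch)
            else:
                for suffix in sorted(child):
                    yield prefix + ch + suffix, child[suffix]
```

Calling `add` once per word occurrence and then iterating `items()` returns the distinct words with their counters in sorted order. Sorting each container only at traversal time is a simplification chosen here for brevity.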
Fast construction of overlay networks
SPAA
Cited by 24 (2 self)
An asynchronous algorithm is described for rapidly constructing an overlay network in a peer-to-peer system where all nodes can in principle communicate with each other directly through an underlying network, but each participating node initially has pointers to only a handful of other participants. The output of the mechanism is a linked list of all participants sorted by their identifiers, which can be used as a foundation for building various linear overlay networks such as Chord or skip graphs. Assuming the initial pointer graph is weakly connected with maximum degree d and the length of a node identifier is W, the mechanism constructs a binary search tree of nodes of depth O(W) in expected O(W log n) time using expected O((d + W) n log n) messages of size O(W) each. Furthermore, the algorithm has low contention: at any time there are only O(d) undelivered messages for any given recipient. A lower bound of Ω(d + log n) is given for the running time of any procedure in a related synchronous model that yields a sorted list from a degree-d weakly connected graph of n nodes. We conjecture that this lower bound is tight and could be attained by further improvements to our algorithms.
Profile of Tries
2006
Cited by 17 (7 self)
Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to the (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) with the same distance from the root. It is a function of the number of strings stored in a trie and the distance from the root. Several, if not all, trie parameters such as height, size, depth, shortest path, and fill-up level can be uniformly analyzed through the (external and internal) profiles. Although profiles represent one of the most fundamental parameters of tries, they have hardly been studied in the past. The analysis of profiles is surprisingly arduous, but once it is carried out it reveals unusually intriguing and interesting behavior. We present a detailed study of the distribution of the profiles in a trie built over random strings generated by a memoryless source. We first derive recurrences satisfied by the expected profiles and solve them asymptotically for all possible ranges of the distance from the root. It appears that profiles of tries exhibit several fascinating phenomena. When moving from the root to the leaves of a trie, the growth of the expected profiles varies. Near the root, the external profiles tend to zero at an exponential rate, then the rate gradually rises to being logarithmic; the external profiles then abruptly tend to infinity, first logarithmically
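As a concrete illustration of the profile defined in this abstract, the sketch below counts the external and internal nodes at each depth of a trie built over random binary strings from a memoryless source. It is an empirical toy, not the paper's analysis; the function names, the binary alphabet, the fixed string length, and the symbol probability `p` are assumptions chosen for the example.

```python
import random
from collections import Counter


def random_strings(n, length=64, p=0.5, seed=0):
    """Draw n binary strings; each symbol is '0' with probability p (memoryless source)."""
    rng = random.Random(seed)
    return ["".join("0" if rng.random() < p else "1" for _ in range(length))
            for _ in range(n)]


def trie_profiles(strings):
    """Return (external, internal) Counters mapping depth -> number of nodes.

    Assumes equal-length strings; duplicates are dropped so that every
    string ends up alone in its own external node (leaf).
    """
    external, internal = Counter(), Counter()

    def split(group, depth):
        if len(group) == 1:
            external[depth] += 1  # a single string is stored in a leaf
            return
        internal[depth] += 1      # two or more strings: branch on the next symbol
        buckets = {}
        for s in group:
            buckets.setdefault(s[depth], []).append(s)
        for sub in buckets.values():
            split(sub, depth + 1)

    distinct = sorted(set(strings))
    if distinct:
        split(distinct, 0)
    return external, internal
```

For example, `trie_profiles(random_strings(1000))` gives the two profiles for one random sample; averaging over many seeds would approximate the expected profiles that the abstract refers to.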
An extensible index for spatial databases
In Statistical and Scientific Database Management, 2001
Cited by 10 (7 self)
Emerging database applications require the use of new indexing structures beyond B-trees and R-trees. Examples are the kD-tree, the trie, the quadtree, and their variants. They are often proposed as supporting structures in data mining, GIS, and CAD/CAM applications. A common feature of all these indexes is that they recursively divide the space into partitions. A new extensible index structure, termed SP-GiST, is presented that supports this class of data structures, mainly the class of space-partitioning unbalanced trees. Simple method implementations are provided that demonstrate how SP-GiST can behave as a kD-tree, a trie, a quadtree, or any of their variants. Issues related to clustering tree nodes into pages as well as concurrency control for SP-GiST are addressed. A dynamic minimum-height clustering technique is applied to minimize disk accesses and to make using such trees in database systems possible and efficient. A prototype implementation of SP-GiST is presented as well as performance studies of the various SP-GiST tuning parameters.
Sail: A spatial index library for efficient application integration
GeoInformatica, 2005
Cited by 8 (0 self)
With the proliferation of spatial and spatiotemporal data that are produced every day by a wide range of applications, Geographic Information Systems (GIS) have to cope with millions of objects with diverse spatial characteristics. Clearly, under these circumstances, substantial performance speedup can be achieved with the use of spatial, spatiotemporal and other multidimensional indexing techniques. Due to the increasing research effort on developing new indexing methods, the number of available alternatives is becoming overwhelming, making the task of selecting the most appropriate method for indexing the data according to application needs rather challenging. Therefore, developing a library that can combine a variety of indexing techniques under a common application programming interface can prove to be a valuable tool. In this paper we present SaIL (SpAtial Index Library), an extensible framework that enables easy integration of spatial and spatiotemporal index structures into existing applications. We focus on design issues and elaborate on techniques for making the framework generic enough, so that it can support user-defined data types, customizable spatial queries, and a broad range of spatial (and spatiotemporal) index structures, in a way that does not compromise functionality, extensibility and, primarily, ease of use. SaIL is publicly available and has already been successfully utilized for research and commercial applications.
Pushing XPath Accelerator to its Limits
2006
Cited by 7 (5 self)
Two competing encoding concepts are known to scale well with growing amounts of XML data: XPath Accelerator encoding implemented by MonetDB for in-memory documents and X-Hive’s Persistent DOM for on-disk storage. We identified two ways to improve XPath Accelerator and present prototypes for the respective techniques: BaseX boosts in-memory performance with optimized data and value index structures, while Idefix introduces native block-oriented persistence with logarithmic update behavior for true scalability, overcoming main-memory constraints. An easy-to-use Java-based benchmarking framework was developed and used to consistently compare these competing techniques and perform scalability measurements. The established XMark benchmark was applied to all four systems under test. Additional full-text-sensitive queries against the well-known DBLP database complement the XMark results. Not only did the latest version of X-Hive finally surprise with good scalability and performance numbers, but both BaseX and Idefix also hold their promise to push XPath Accelerator to its limits: BaseX efficiently exploits available main memory to speed up XML queries, while Idefix surpasses main-memory constraints and rivals the on-disk leadership of X-Hive. The competition between XPath Accelerator and Persistent DOM is definitely relaunched.
Weighted height of random trees
Manuscript
Cited by 5 (4 self)
We consider a model of random trees similar to the split trees of Devroye [30] in which a set of items is recursively partitioned. Our model allows for more flexibility in the choice of the partitioning procedure, and has weighted edges. We prove that for this model, the height H_n of a random tree is asymptotic to c log n in probability for a constant c that is uniquely characterized in terms of multivariate large deviations rate functions. This extension permits us to obtain the height of pebbled tries, pebbled ternary search tries, d-ary pyramids, and to study geometric properties of partitions generated by k-d trees. The model also includes all polynomial families of increasing trees recently studied by Broutin, Devroye, McLeish, and de la Salle [17].
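For readers unfamiliar with the phrasing, the statement that the height is "asymptotic to c log n in probability" can be written out as below. This only restates the standard definition of convergence in probability using the abstract's symbols; it assumes the amsmath and amssymb packages, and the value of c is not given in the abstract.

```latex
\[
  \frac{H_n}{\log n} \;\xrightarrow{\;\mathbb{P}\;}\; c
  \quad\Longleftrightarrow\quad
  \forall \varepsilon > 0:\;
  \lim_{n \to \infty}
  \Pr\!\left( \left| \frac{H_n}{\log n} - c \right| > \varepsilon \right) = 0 .
\]
```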
An analysis of the height of tries with random weights on the edges
Combinatorics, Probability and Computing
Cited by 2 (2 self)
We analyze the weighted height of random tries built from independent strings of i.i.d. symbols on the finite alphabet {1, ..., d}. The edges receive random weights whose distribution depends upon the number of strings that visit that edge. Such a model covers the hybrid tries of de la Briandais (1959) and the TST of Bentley and Sedgewick (1997), where the search time for a string can be decomposed as a sum of processing times for each symbol in the string. Our weighted trie model also permits one to study maximal path imbalance. In all cases, the weighted height is shown to be asymptotic to c log n in probability, where c is determined by the behavior of the core of the trie (the part where all nodes have a full set of children) and the fringe of the trie (the part of the trie where nodes have only one child and form spaghetti-like trees). It can be found by maximizing a function that is related to the Cramér exponent of the distribution of the edge weights.
A framework for supporting the class of space-partitioning trees
2001
Cited by 2 (2 self)
Emerging database applications require the use of new indexing structures beyond B-trees and R-trees. Examples are the kD-tree, the trie, the quadtree, and their variants. They are often proposed as supporting structures in data mining, GIS, and CAD/CAM applications. A common feature of all these indexes is that they recursively divide the space into partitions. A new extensible index structure, termed SP-GiST, is presented that supports this class of data structures, mainly the class of space-partitioning unbalanced trees. Simple method implementations are provided that demonstrate how SP-GiST can behave as a kD-tree, a trie, a quadtree, or any of their variants. Issues related to clustering tree nodes into pages as well as concurrency control for SP-GiST are addressed. A dynamic minimum-height clustering technique is applied to minimize disk accesses and to make using such trees in database systems possible and efficient. A prototype implementation of SP-GiST is presented as well as performance studies of the various SP-GiST tuning parameters. Keywords: SP-GiST, space-partitioning trees, GiST, spatial tree indexes, access methods, clustering.
Performance of Data Structures for Small Sets of Strings
Proc. of the Australasian conference on Computer Science, 2002

Cited by 2 (0 self)
 Add to MetaCart
Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures (different forms of trees, tries, and hash tables) for the task of managing sets of millions of strings, and have developed new variants of each that are more efficient for this task than previous alternatives. In this paper we test the performance of the same data structures on small sets of strings, in the context of document processing for index construction. Our results show that the new structures, in particular our burst trie, are the most efficient choice for this task, thus demonstrating that they are suitable for managing sets of hundreds to millions of distinct strings, and for input of hundreds to billions of occurrences.