Results 1 - 10
of
18
The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree
- SIGMOD 2004 JUNE 13-18, 2004, PARIS, FRANCE
, 2004
"... We present the Priority R-tree, or PR-tree, which is the first R-tree variant that always answers a window query using O((N/B) 1−1/d + T/B) I/Os, where N is the number of ddimensional (hyper-) rectangles stored in the R-tree, B is the disk block size, and T is the output size. This is provably asymp ..."
Abstract
-
Cited by 46 (5 self)
- Add to MetaCart
We present the Priority R-tree, or PR-tree, which is the first R-tree variant that always answers a window query using O((N/B) 1−1/d + T/B) I/Os, where N is the number of ddimensional (hyper-) rectangles stored in the R-tree, B is the disk block size, and T is the output size. This is provably asymptotically optimal and significantly better than other Rtree variants, where a query may visit all N/B leaves in the tree even when T = 0. We also present an extensive experimental study of the practical performance of the PR-tree using both real-life and synthetic data. This study shows that the PR-tree performs similar to the best known R-tree variants on real-life and relatively nicely distributed data, but outperforms them significantly on more extreme data.
STXXL: Standard template library for XXL data sets
- In: Proc. of ESA 2005. Volume 3669 of LNCS
, 2005
"... for processing huge data sets that can fit only on hard disks. It supports parallel disks, overlapping between disk I/O and computation and it is the first I/O-efficient algorithm library that supports the pipelining technique that can save more than half of the I/Os. STXXL has been applied both in ..."
Abstract
-
Cited by 30 (4 self)
- Add to MetaCart
for processing huge data sets that can fit only on hard disks. It supports parallel disks, overlapping between disk I/O and computation and it is the first I/O-efficient algorithm library that supports the pipelining technique that can save more than half of the I/Os. STXXL has been applied both in academic and industrial environments for a range of problems including text processing, graph algorithms, computational geometry, gaussian elimination, visualization, and analysis of microscopic images, differential cryptographic analysis, etc. The performance of STXXL and its applications is evaluated on synthetic and real-world inputs. We present the design of the library, how its performance features are supported, and demonstrate how the library integrates with STL. KEY WORDS: very large data sets; software library; C++ standard template library; algorithm engineering 1.
Bkd-tree: A dynamic scalable kd-tree
- In Proc. International Symposium on Spatial and Temporal Databases
, 2003
"... ..."
Better external memory suffix array construction
- In: Workshop on Algorithm Engineering & Experiments
, 2005
"... Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications in particular in bioinformatics. However, so far it has looked prohibitive to build suffix arrays for huge inputs that do not fit into main ..."
Abstract
-
Cited by 24 (5 self)
- Add to MetaCart
Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications in particular in bioinformatics. However, so far it has looked prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on the average. Our implementation can construct suffix arrays for inputs of up to 4GBytes in hours on a low cost machine. As a tool of possible independent interest we present a systematic way to design, analyze, and implement pipelined algorithms.
Asynchronous Parallel Disk Sorting
- IN 15TH ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES
, 2003
"... We develop an algorithm for parallel disk sorting, whose I/O cost approaches the lower bound and that guarantees almost perfect overlap between I/O and computation. Previous algorithms have either suboptimal I/O volume or cannot guarantee that I/O and computations can always be overlapped. We give a ..."
Abstract
-
Cited by 22 (8 self)
- Add to MetaCart
We develop an algorithm for parallel disk sorting, whose I/O cost approaches the lower bound and that guarantees almost perfect overlap between I/O and computation. Previous algorithms have either suboptimal I/O volume or cannot guarantee that I/O and computations can always be overlapped. We give an efficient implementation that can (at least) compete with the best practical implementations but gives additional performance guarantees. For the experiments we have configured a state of the art machine that can sustain full bandwidth I/O with eight disks and is very cost effective.
Boxes: Efficient maintenance of order-based labeling for dynamic XML data
- In Proc. of ICDE
, 2005
"... Order-based element labeling for tree-structured XML data is an important technique in XML processing. It lies at the core of many fundamental XML operations such as containment join and twig matching. While labeling for static XML documents is well understood, less is known about how to maintain ac ..."
Abstract
-
Cited by 15 (2 self)
- Add to MetaCart
Order-based element labeling for tree-structured XML data is an important technique in XML processing. It lies at the core of many fundamental XML operations such as containment join and twig matching. While labeling for static XML documents is well understood, less is known about how to maintain accurate labeling for dynamic XML documents, when elements and subtrees are inserted and deleted. Most existing approaches do not work well for arbitrary update patterns; they either produce unacceptably long labels or incur enormous relabeling costs. We present two novel I/O-efficient data structures, W-BOX and B-BOX, that efficiently maintain labeling for large, dynamic XML documents. We show analytically and experimentally that both, despite consuming minimal amounts of storage, gracefully handle arbitrary update patterns without sacrificing lookup efficiency. The two structures together provide a nice tradeoff between update and lookup costs: W-BOX has logarithmic amortized update cost and constant worst-case lookup cost, while B-BOX has constant amortized update cost and logarithmic worst-case lookup cost. We further propose techniques to eliminate the lookup cost for read-heavy workloads. 1.
I/O-efficient batched union-find and its applications to terrain analysis
- In Proc. 22nd Annual Symposium on Computational Geometry
, 2006
"... Despite extensive study over the last four decades and numerous applications, no I/O-efficient algorithm is known for the union-find problem. In this paper we present an I/O-efficient algorithm for the batched (off-line) version of the union-find problem. Given any sequence of N union and find opera ..."
Abstract
-
Cited by 14 (8 self)
- Add to MetaCart
Despite extensive study over the last four decades and numerous applications, no I/O-efficient algorithm is known for the union-find problem. In this paper we present an I/O-efficient algorithm for the batched (off-line) version of the union-find problem. Given any sequence of N union and find operations, where each union operation joins two distinct sets, our algorithm uses O(SORT(N)) = O ( N B log M/B N I/Os, where M is the memory size and B is the disk block size. This bound is asymptotically optimal in the worst case. If there are union operations that join a set with itself, our algorithm uses O(SORT(N) + MST(N)) I/Os, where MST(N) is the number of I/Os needed to compute the minimum spanning tree of a graph with N edges. We also describe a simple and practical O(SORT(N) log ( N M))-I/O algorithm for this problem, which we have implemented. We are interested in the union-find problem because of its applications in terrain analysis. A terrain can be abstracted as a height function defined over R2, and many problems that deal with such functions require a union-find data structure. With the emergence of modern mapping technologies, huge amount of elevation data is being generated that is too large to fit in memory, thus I/O-efficient algorithms are needed to process this data efficiently. In this paper, we study two terrain-analysis problems that benefit from a union-find data structure: (i) computing topological persistence and (ii) constructing the contour tree. We give the first O(SORT(N))-I/O algorithms for these two problems, assuming that the input terrain is represented as a triangular mesh with N vertices. Finally, we report some preliminary experimental results, showing that our algorithms give order-ofmagnitude improvement over previous methods on large data sets that do not fit in memory. 1
I/O-efficient point location using persistent B-trees
- In Proc. Workshop on Algorithm Engineering and Experimentation
, 2003
"... Abstract We present an external planar point location data structure that is I/O-efficient both in theory and practice. The developed structure uses linear space and answers a query in optimal O(logB N) I/Os, where B is the disk block size. It is based on a persistent B-tree, and all previously deve ..."
Abstract
-
Cited by 13 (8 self)
- Add to MetaCart
Abstract We present an external planar point location data structure that is I/O-efficient both in theory and practice. The developed structure uses linear space and answers a query in optimal O(logB N) I/Os, where B is the disk block size. It is based on a persistent B-tree, and all previously developed such structures assume a total order on the elements in the structure. As a theoretical result of independent interest, we show how to remove this assumption. Most previous theoretical I/O-efficient planer point location structures are relatively complicated and have not been implemented. Based on a bucket approach, Vahrenhold and Hinrichs therefore developed a simple and practical, but theoretically non-optimal, heuristic structure. We present an extensive experimental evaluation that shows that on a range of real-world Geographic Information Systems (GIS) data, our structure uses fewer I/Os than the structure of Vahrenhold and Hinrichs to answer a query. On a synthetically generated worst-case dataset, our structure uses significantly fewer I/Os. 1 Introduction The planar point location problem is the problem ofstoring a planar subdivision defined by N segmentssuch that the region containing a query point
Privacy-preserving Queries over Relational Databases
"... Abstract—We explore how Private Information Retrieval (PIR) can help users keep their sensitive information from being leaked in an SQL query. We show how to retrieve data from a relational database with PIR by hiding sensitive constants contained in the predicates of a query. Experimental results a ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
Abstract—We explore how Private Information Retrieval (PIR) can help users keep their sensitive information from being leaked in an SQL query. We show how to retrieve data from a relational database with PIR by hiding sensitive constants contained in the predicates of a query. Experimental results and microbenchmarking tests show our approach incurs reasonable storage overhead for the added privacy benefit and performs between 3 and 343 times faster than previous work. I.
A Status Report on XXL - a Software Infrastructure for Efficient Query Processing
- Data Engineering Bulletin
, 2003
"... XXL is a Java library that contains a rich infrastructure for implementing advanced query processing functionality. The library offers low-level components like access to raw disks as well as high-level ones like a query optimizer. On the intermediate levels, XXL provides a demand-driven cursor alge ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
XXL is a Java library that contains a rich infrastructure for implementing advanced query processing functionality. The library offers low-level components like access to raw disks as well as high-level ones like a query optimizer. On the intermediate levels, XXL provides a demand-driven cursor algebra, a framework for indexing and a powerful package for supporting aggregation. The library is publicly available under GNU LGPL and comes with a full documentation. 1

