Results 1-10 of 32
The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree
 SIGMOD 2004, June 13-18, 2004, Paris, France
, 2004
Cited by 56 (7 self)
We present the Priority R-tree, or PR-tree, which is the first R-tree variant that always answers a window query using $O((N/B)^{1-1/d} + T/B)$ I/Os, where N is the number of d-dimensional (hyper-)rectangles stored in the R-tree, B is the disk block size, and T is the output size. This is provably asymptotically optimal and significantly better than other R-tree variants, where a query may visit all N/B leaves in the tree even when T = 0. We also present an extensive experimental study of the practical performance of the PR-tree using both real-life and synthetic data. This study shows that the PR-tree performs similarly to the best known R-tree variants on real-life and relatively nicely distributed data, but outperforms them significantly on more extreme data.
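To give a feel for the gap between the stated query bound and a scan of all N/B leaves, the following minimal sketch evaluates both quantities. The function names and the sample values of N, B, d, and T are hypothetical, and the constant factor hidden by the O-notation is ignored, so the numbers indicate orders of magnitude only.

```python
def window_query_bound(n: int, b: int, d: int, t: int) -> float:
    """Evaluate (N/B)^(1 - 1/d) + T/B, ignoring the hidden constant factor."""
    return (n / b) ** (1 - 1 / d) + t / b


def leaf_count(n: int, b: int) -> float:
    """Number of leaf blocks N/B that a worst-case query may have to visit."""
    return n / b


if __name__ == "__main__":
    # Hypothetical sizes: 10^8 rectangles, 10^3 per block, 2 dimensions, empty output.
    n, b, d, t = 10**8, 10**3, 2, 0
    print("bound term:", round(window_query_bound(n, b, d, t)))
    print("all leaves:", round(leaf_count(n, b)))
```

With these assumed values the bound term is a few hundred I/Os, while visiting every leaf costs on the order of a hundred thousand.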
STXXL: Standard template library for XXL data sets
 In: Proc. of ESA 2005. Volume 3669 of LNCS
, 2005
Cited by 39 (5 self)
STXXL is a standard template library for processing huge data sets that can fit only on hard disks. It supports parallel disks and overlapping of disk I/O and computation, and it is the first I/O-efficient algorithm library that supports the pipelining technique, which can save more than half of the I/Os. STXXL has been applied in both academic and industrial environments for a range of problems including text processing, graph algorithms, computational geometry, Gaussian elimination, visualization, and analysis of microscopic images, differential cryptographic analysis, etc. The performance of STXXL and its applications is evaluated on synthetic and real-world inputs. We present the design of the library, how its performance features are supported, and demonstrate how the library integrates with STL.
KEY WORDS: very large data sets; software library; C++ standard template library; algorithm engineering
Better external memory suffix array construction
 In: Workshop on Algorithm Engineering & Experiments
, 2005
Cited by 30 (5 self)
Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications, in particular in bioinformatics. However, so far it has looked prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents the design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on the average. Our implementation can construct suffix arrays for inputs of up to 4 GBytes in hours on a low-cost machine. As a tool of possible independent interest, we present a systematic way to design, analyze, and implement pipelined algorithms.
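For readers unfamiliar with the data structure, here is a minimal in-memory sketch of what a suffix array is and how it supports full-text search. It uses a naive sort of all suffixes and is not one of the external-memory construction algorithms the paper describes. The function names are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of `pattern`, found by binary search over the suffix array."""
    m = len(pattern)
    # Leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # First suffix after `start` whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)                                  # [5, 3, 1, 0, 4, 2]
    print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

The naive sort compares whole suffixes and keeps everything in memory, which is the cost that becomes prohibitive for inputs larger than main memory.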
Bkd-tree: A dynamic scalable kd-tree
 In Proc. International Symposium on Spatial and Temporal Databases
, 2003
Asynchronous Parallel Disk Sorting
 In 15th ACM Symposium on Parallelism in Algorithms and Architectures
, 2003
Cited by 22 (8 self)
We develop an algorithm for parallel disk sorting whose I/O cost approaches the lower bound and that guarantees almost perfect overlap between I/O and computation. Previous algorithms either have suboptimal I/O volume or cannot guarantee that I/O and computation can always be overlapped. We give an efficient implementation that can (at least) compete with the best practical implementations but gives additional performance guarantees. For the experiments we have configured a state-of-the-art machine that can sustain full-bandwidth I/O with eight disks and is very cost-effective.
A computational study of external-memory BFS algorithms
 In SODA
, 2006
Cited by 18 (4 self)
Breadth First Search (BFS) traversal is an archetype for many important graph problems. However, computing a BFS level decomposition for massive graphs was considered nonviable so far, because of the large number of I/Os it incurs. This paper presents the first experimental evaluation of recent external-memory BFS algorithms for general graphs. With our STXXL-based implementations exploiting pipelining and disk parallelism, we were able to compute the BFS level decomposition of a web-crawl-based graph of around 130 million nodes and 1.4 billion edges in less than 4 hours using a single disk and 2.3 hours using 4 disks. We demonstrate that some rather simple external-memory algorithms perform significantly better (minutes as compared to hours) than internal-memory BFS, even if more than half of the input resides internally.
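As a point of reference for the term "BFS level decomposition", the sketch below computes it with the textbook in-memory queue-based algorithm. This is the internal-memory baseline, not one of the external-memory algorithms evaluated in the paper. The function name and the tiny example graph are hypothetical.

```python
from collections import deque


def bfs_levels(adj: dict[int, list[int]], source: int) -> list[list[int]]:
    """Group the nodes reachable from `source` by their BFS distance."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in level:          # first visit fixes the BFS level
                level[v] = level[u] + 1
                queue.append(v)
    levels: list[list[int]] = []
    for node, dist in level.items():    # dict preserves discovery order
        while len(levels) <= dist:
            levels.append([])
        levels[dist].append(node)
    return levels


if __name__ == "__main__":
    # Hypothetical undirected graph given as adjacency lists.
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
    print(bfs_levels(graph, 0))   # [[0], [1, 2], [3], [4]]
```

Each `adj.get(u, ())` lookup is a random access into the adjacency structure. When the graph does not fit in main memory, such accesses can each cost a disk I/O, which is the source of the large I/O counts the abstract mentions.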
Boxes: Efficient maintenance of order-based labeling for dynamic XML data
 In Proc. of ICDE
, 2005
Cited by 16 (2 self)
Order-based element labeling for tree-structured XML data is an important technique in XML processing. It lies at the core of many fundamental XML operations such as containment join and twig matching. While labeling for static XML documents is well understood, less is known about how to maintain accurate labeling for dynamic XML documents, when elements and subtrees are inserted and deleted. Most existing approaches do not work well for arbitrary update patterns; they either produce unacceptably long labels or incur enormous relabeling costs. We present two novel I/O-efficient data structures, W-BOX and B-BOX, that efficiently maintain labeling for large, dynamic XML documents. We show analytically and experimentally that both, despite consuming minimal amounts of storage, gracefully handle arbitrary update patterns without sacrificing lookup efficiency. The two structures together provide a nice tradeoff between update and lookup costs: W-BOX has logarithmic amortized update cost and constant worst-case lookup cost, while B-BOX has constant amortized update cost and logarithmic worst-case lookup cost. We further propose techniques to eliminate the lookup cost for read-heavy workloads.
I/O-efficient point location using persistent B-trees
 In Proc. Workshop on Algorithm Engineering and Experimentation
, 2003
Cited by 14 (8 self)
We present an external planar point location data structure that is I/O-efficient both in theory and practice. The developed structure uses linear space and answers a query in optimal $O(\log_B N)$ I/Os, where B is the disk block size. It is based on a persistent B-tree, and all previously developed such structures assume a total order on the elements in the structure. As a theoretical result of independent interest, we show how to remove this assumption. Most previous theoretical I/O-efficient planar point location structures are relatively complicated and have not been implemented. Based on a bucket approach, Vahrenhold and Hinrichs therefore developed a simple and practical, but theoretically non-optimal, heuristic structure. We present an extensive experimental evaluation that shows that on a range of real-world Geographic Information Systems (GIS) data, our structure uses fewer I/Os than the structure of Vahrenhold and Hinrichs to answer a query. On a synthetically generated worst-case dataset, our structure uses significantly fewer I/Os.
1 Introduction
The planar point location problem is the problem of storing a planar subdivision defined by N segments such that the region containing a query point
Privacy-preserving Queries over Relational Databases
Cited by 11 (5 self)
We explore how Private Information Retrieval (PIR) can help users keep their sensitive information from being leaked in an SQL query. We show how to retrieve data from a relational database with PIR by hiding sensitive constants contained in the predicates of a query. Experimental results and microbenchmarking tests show that our approach incurs reasonable storage overhead for the added privacy benefit and performs between 3 and 343 times faster than previous work.
From point cloud to grid DEM: A scalable approach
 In Proc. 12th International Symposium on Spatial Data Handling
, 2006
Cited by 9 (8 self)
Summary. Given a set S of points in $\mathbb{R}^3$ sampled from an elevation function $H: \mathbb{R}^2 \to \mathbb{R}$, we present a scalable algorithm for constructing a grid digital elevation model (DEM). Our algorithm consists of three stages: First, we construct a quad tree on S to partition the point set into a set of non-overlapping segments. Next, for each segment q, we compute the set of points in q and all segments neighboring q. Finally, we interpolate each segment independently using points within the segment and its neighboring segments. Data sets acquired by LIDAR and other modern mapping technologies consist of hundreds of millions of points and are too large to fit in main memory. When processing such massive data sets, the transfer of data between disk and main memory (also called I/O), rather than the CPU time, becomes the performance bottleneck. We therefore present an I/O-efficient algorithm for constructing a grid DEM. Our experiments show that the algorithm scales to data sets much larger than the size of main memory, while existing algorithms do not scale. For example, using a machine with 1 GB RAM, we were able to construct a grid DEM containing 1.3 billion cells (occupying 1.2 GB) from a LIDAR data set of over 390 million points (occupying 20 GB) in about 53 hours. Neither ArcGIS nor GRASS, two popular GIS products, was able to process this data set.
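The three stages can be illustrated with a small in-memory sketch. It makes two simplifying assumptions that are not from the paper: a uniform grid of square segments stands in for the quad tree, and inverse-distance weighting stands in for the interpolation method, which the abstract does not name. All function and parameter names are hypothetical, and nothing here is I/O-efficient.

```python
from collections import defaultdict

Point = tuple[float, float, float]


def build_segments(points: list[Point], seg_size: float) -> dict:
    """Stage 1 (simplified): bucket points into non-overlapping square segments."""
    segments = defaultdict(list)
    for x, y, z in points:
        segments[(int(x // seg_size), int(y // seg_size))].append((x, y, z))
    return segments


def neighborhood_points(segments: dict, key: tuple[int, int]) -> list[Point]:
    """Stage 2: points in a segment plus those in its up to 8 neighbors."""
    i, j = key
    pts: list[Point] = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            pts.extend(segments.get((i + di, j + dj), ()))
    return pts


def idw(pts: list[Point], x: float, y: float) -> float:
    """Stage 3 (assumed method): inverse-distance-weighted height at (x, y)."""
    num = den = 0.0
    for px, py, pz in pts:
        d2 = (px - x) ** 2 + (py - y) ** 2
        if d2 == 0:
            return pz               # query coincides with a sample point
        num += pz / d2
        den += 1.0 / d2
    return num / den


def grid_dem(points: list[Point], seg_size: float, cell_size: float) -> dict:
    """Interpolate cell-center heights for every segment that contains points."""
    segments = build_segments(points, seg_size)
    cells = int(seg_size // cell_size)
    dem = {}
    for (i, j) in list(segments):
        pts = neighborhood_points(segments, (i, j))
        for a in range(cells):
            for b in range(cells):
                cx = i * seg_size + (a + 0.5) * cell_size
                cy = j * seg_size + (b + 0.5) * cell_size
                dem[(i * cells + a, j * cells + b)] = idw(pts, cx, cy)
    return dem


if __name__ == "__main__":
    # Hypothetical samples from a flat surface at height 5.
    samples = [(0.5, 0.5, 5.0), (1.5, 0.5, 5.0), (0.5, 1.5, 5.0), (2.5, 2.5, 5.0)]
    dem = grid_dem(samples, seg_size=2.0, cell_size=1.0)
    print(len(dem), "cells interpolated")
```

Because each segment needs only its own points and those of its neighbors, the segments can be processed one at a time, which is the property the abstract relies on when it interpolates each segment independently.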