Results 1 - 10
of
50
Wavelet-Based Histograms for Selectivity Estimation
- in SIGMOD
, 1998
"... Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P , we need to estimate the fraction of records in the database that satisfy P . Many commercial database systems maintain hist ..."
Abstract
-
Cited by 191 (16 self)
- Add to MetaCart
Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P , we need to estimate the fraction of records in the database that satisfy P . Many commercial database systems maintain histograms to approximate the frequency distribution of values in the attributes of relations. In this paper, we present a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation. Histograms built on the cumulative data distributions give very good approximations with limited space usage. We give fast algorithms for constructing histograms and using them in an on-line fashion for selectivity estimation. Our histograms also provide quick approximate answers to OLAP queries when the exact answers are not required. Our method captures the joint distribution of multiple attri...
Cache-oblivious B-trees
, 2000
"... Abstract. This paper presents two dynamic search trees attaining near-optimal performance on any hierarchical memory. The data structures are independent of the parameters of the memory hierarchy, e.g., the number of memory levels, the block-transfer size at each level, and the relative speeds of me ..."
Abstract
-
Cited by 119 (21 self)
- Add to MetaCart
Abstract. This paper presents two dynamic search trees attaining near-optimal performance on any hierarchical memory. The data structures are independent of the parameters of the memory hierarchy, e.g., the number of memory levels, the block-transfer size at each level, and the relative speeds of memory levels. The performance is analyzed in terms of the number of memory transfers between two memory levels with an arbitrary block-transfer size of B; this analysis can then be applied to every adjacent pair of levels in a multilevel memory hierarchy. Both search trees match the optimal search bound of Θ(1+logB+1 N) memory transfers. This bound is also achieved by the classic B-tree data structure on a two-level memory hierarchy with a known block-transfer size B. The first search tree supports insertions and deletions in Θ(1 + logB+1 N) amortized memory transfers, which matches the B-tree’s worst-case bounds. The second search tree supports scanning S consecutive elements optimally in Θ(1 + S/B) memory transfers and supports insertions and deletions in Θ(1 + logB+1 N + log2 N) amortized memory transfers, matching the performance of the B-tree for B = B Ω(log N log log N).
Synopsis Data Structures for Massive Data Sets
"... Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a f ..."
Abstract
-
Cited by 96 (13 self)
- Add to MetaCart
Abstract. Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this paper, we describe a context for algorithmic work relevant to massive data sets and a framework for evaluating such work. We consider the use of "synopsis" data structures, which use very little space and provide fast (typically approximated) answers to queries. The design and analysis of effective synopsis data structures o er many algorithmic challenges. We discuss a number of concrete examples of synopsis data structures, and describe fast algorithms for keeping them up-to-date in the presence of online updates to the data sets.
Data Cube Approximation and Histograms via Wavelets (Extended Abstract)
- In CIKM
, 1998
"... ) Jeffrey Scott Vitter Center for Geometric Computing and Department of Computer Science Duke University Durham, NC 27708--0129 USA jsv@cs.duke.edu Min Wang y Center for Geometric Computing and Department of Computer Science Duke University Durham, NC 27708--0129 USA minw@cs.duke.edu Bala Iyer ..."
Abstract
-
Cited by 86 (2 self)
- Add to MetaCart
) Jeffrey Scott Vitter Center for Geometric Computing and Department of Computer Science Duke University Durham, NC 27708--0129 USA jsv@cs.duke.edu Min Wang y Center for Geometric Computing and Department of Computer Science Duke University Durham, NC 27708--0129 USA minw@cs.duke.edu Bala Iyer Database Technology Institute IBM Santa Teresa Laboratory P.O. Box 49023 San Jose, CA 95161 USA balaiyer@vnet.ibm.com Abstract There has recently been an explosion of interest in the analysis of data in data warehouses in the field of On-Line Analytical Processing (OLAP). Data warehouses can be extremely large, yet obtaining quick answers to queries is important. In many situations, obtaining the exact answer to an OLAP query is prohibitively expensive in terms of time and/or storage space. It can be advantageous to have fast, approximate answers to queries. In this paper, we present an I/O-efficient technique based upon a multiresolution wavelet decomposition that yields an approximate a...
A Functional Approach to External Graph Algorithms
- Algorithmica
, 1998
"... . We present a new approach for designing external graph algorithms and use it to design simple external algorithms for computing connected components, minimum spanning trees, bottleneck minimum spanning trees, and maximal matchings in undirected graphs and multi-graphs. Our I/O bounds compete w ..."
Abstract
-
Cited by 81 (1 self)
- Add to MetaCart
. We present a new approach for designing external graph algorithms and use it to design simple external algorithms for computing connected components, minimum spanning trees, bottleneck minimum spanning trees, and maximal matchings in undirected graphs and multi-graphs. Our I/O bounds compete with those of previous approaches. Unlike previous approaches, ours is purely functional---without side effects---and is thus amenable to standard checkpointing and programming language optimization techniques. This is an important practical consideration for applications that may take hours to run. 1 Introduction We present a divide-and-conquer approach for designing external graph algorithms, i.e., algorithms on graphs that are too large to fit in main memory. Our approach is simple to describe and implement: it builds a succession of graph transformations that reduce to sorting, selection, and a recursive bucketing technique. No sophisticated data structures are needed. We apply our t...
Linear-time longest-common-prefix computation in suffix arrays and its applications
, 2001
"... Abstract. We present a linear-time algorithm to compute the longest common prefix information in suffix arrays. As two applications of our algorithm, we show that our algorithm is crucial to the effective use of block-sorting compression, and we present a linear-time algorithm to simulate the bottom ..."
Abstract
-
Cited by 67 (2 self)
- Add to MetaCart
Abstract. We present a linear-time algorithm to compute the longest common prefix information in suffix arrays. As two applications of our algorithm, we show that our algorithm is crucial to the effective use of block-sorting compression, and we present a linear-time algorithm to simulate the bottom-up traversal of a suffix tree with a suffix array combined with the longest common prefix information. 1
Sequoia: Programming the Memory Hierarchy
, 2006
"... We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and p ..."
Abstract
-
Cited by 63 (4 self)
- Add to MetaCart
We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.
Fast Priority Queues for Cached Memory
- ACM Journal of Experimental Algorithmics
, 1999
"... This paper advocates the adaption of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithm ..."
Abstract
-
Cited by 46 (6 self)
- Add to MetaCart
This paper advocates the adaption of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs
Comparing hierarchical data in external memory
- In 25th Very Large Data Base Conference (VLDB
, 1999
"... We present an external-memory algorithm for computing a minimum-cost edit script between two rooted, ordered, labeled trees. The I/O, RAM, and CPU costs of our algorithm are, respectively, 4mn+7m+5n, 6S, andO(MN+(M+N)S1:5), whereMandNare the input tree sizes,Sis the block size,m=M=S, andn=N=S. This ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
We present an external-memory algorithm for computing a minimum-cost edit script between two rooted, ordered, labeled trees. The I/O, RAM, and CPU costs of our algorithm are, respectively, 4mn+7m+5n, 6S, andO(MN+(M+N)S1:5), whereMandNare the input tree sizes,Sis the block size,m=M=S, andn=N=S. This algorithm can make effective use of surplus RAM capacity to quadratically reduce I/O cost. We extend to trees the commonly used mapping from sequence comparison problems to shortest-path problems in edit graphs. 1
Fast Concurrent Access to Parallel Disks
- In 11th ACM-SIAM Symposium on Discrete Algorithms
, 1999
"... High performance applications involving large data sets require the efficient and flexible use of multiple disks. In an external memory machine with D parallel, independent disks, only one block can be accessed on each disk in one I/O step. This restriction leads to a load balancing problem that is ..."
Abstract
-
Cited by 44 (11 self)
- Add to MetaCart
High performance applications involving large data sets require the efficient and flexible use of multiple disks. In an external memory machine with D parallel, independent disks, only one block can be accessed on each disk in one I/O step. This restriction leads to a load balancing problem that is perhaps the main inhibitor for adapting single-disk external memory algorithms to multiple disks. This paper shows that this problem can be solved efficiently using a combination of randomized placement, redundancy and an optimal scheduling algorithm. A buffer of O(D) blocks suffices to support efficient writing of arbitrary blocks if blocks are distributed uniformly at random to the disks (e.g., by hashing). If two randomly allocated copies of each block exist, N arbitrary blocks can be read within dN=De + 1 I/O steps with high probability. In addition, the redundancy can be reduced from 2 to 1 + 1=r for any integer r. These results can be used to emulate the simple and powerful "single-disk multi-head" model of external computing [1] on the physically more realistic independent disk model [33] with small constant overhead. This is faster than a lower bound for deterministic emulation [3].

