Results 11 
18 of
18
Duality between prefetching and queued writing with applications to external sorting
 IN EUROPEAN SYMPOSIUM ON ALGORITHMS, VOLUME 2161 OF LECTURE NOTES IN COMPUTER SCIENCE
, 1998
"... Parallel disks promise to be a cost effective means for achieving high bandwidth in applications involving massive data sets, but algorithms for parallel disks can be difficult to devise. To combat this problem, we define a useful and natural duality between writing to parallel disks and the seeming ..."
Abstract

Cited by 17 (6 self)
 Add to MetaCart
Parallel disks promise to be a cost effective means for achieving high bandwidth in applications involving massive data sets, but algorithms for parallel disks can be difficult to devise. To combat this problem, we define a useful and natural duality between writing to parallel disks and the seemingly more difficult problem of prefetching. We first explore this duality for applications involving readonce accesses using parallel disks. We get a simple linear time algorithm for computing optimal prefetch schedules and analyze the efficiency of the resulting schedules for randomly placed data and for arbitrary interleaved accesses to striped sequences. Duality also provides an optimal schedule for the integrated caching and prefetching problem, in which blocks can be accessed multiple times. Another application of this duality gives us the rst parallel disk sorting algorithms that are provably optimal up to lower order terms. One of these algorithms is a simple and practical variant of multiway merge sort, addressing a question that has been open for some time.
Distribution Sort with Randomized Cycling
 In Proceedings of the Twelfth Annual ACMSIAM Symposium on Discrete Algorithms (SODA01
, 2001
"... Paxallel independent disks can enhance the performance of external memory (EM) algorithms, but the programming task is often difficult. In this paper we develop randomized vaxiants of distribution sort for use with paxallel independent disks. We propose a simple vaxiant called randomized cycling dis ..."
Abstract

Cited by 13 (3 self)
 Add to MetaCart
Paxallel independent disks can enhance the performance of external memory (EM) algorithms, but the programming task is often difficult. In this paper we develop randomized vaxiants of distribution sort for use with paxallel independent disks. We propose a simple vaxiant called randomized cycling distribution sort (RCD) and prove that it has optimal expected I/O complexity. The analysis uses a novel reduction to a model with significantly fewer probabilistic interdependencies. Experimental evidence is provided to support its practicality. Other simple vaxiants axe also examined experimentally and appeax to offer similax advantages to RCD. Based upon ideas in RCD we propose general techniques that transpaxently simulate algorithms developed for the unrealistic multihead disk model so that they can be run on the realistic paxallel disk model. The simulation is optimal for two important classes of algorithms: the class of multipass algorithms, which make a complete pass through their data before accessing any element a second time, and the algorithms based upon the wellknown distribution paxadigm of EM computation.
Cleaning Uncertain Data with Quality Guarantees
, 2008
"... Uncertain or imprecise data are pervasive in applications like locationbased services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical ..."
Abstract

Cited by 9 (1 self)
 Add to MetaCart
Uncertain or imprecise data are pervasive in applications like locationbased services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWSquality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWSquality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomialtime solution to achieve an optimal improvement in PWSquality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWSquality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To our best knowledge, this is the first work that develops a quality metric for a probabilistic database, and investigates how such a metric can be used for data cleaning purposes.
A Simple and Efficient Parallel Disk Mergesort
, 2002
"... External sorting—the process of sorting a file that is too large to fit into the computer’s internal memory and must be stored externally on disks—is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up t ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
External sorting—the process of sorting a file that is too large to fit into the computer’s internal memory and must be stored externally on disks—is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls “randomized striping”) as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM’s lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is
Parallel External Memory Graph Algorithms
"... In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one of the privatecache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to solutions for many problems on trees, such as computing the Euler to ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one of the privatecache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to solutions for many problems on trees, such as computing the Euler tour, preorder and postorder numbering of the vertices, the depth of each vertex and the sizes of subtrees rooted at each vertex of the tree. We also study the problems of computing the connected components of a graph and minimum spanning tree of a connected graph. All our solutions provide an optimal speedup of O(p) in parallel I/O complexity compared to the singleprocessor external memory versions of the algorithms. 1
On sorting strings in external memory (extended abstract
 In STOC ’97: Proceedings of the twentyninth annual ACM symposium on Theory of computing
, 1997
"... Abstract. In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many largescale text applications. In the standard unitcost RAM comparison model, the complexity of sorting K strings of total length N ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
Abstract. In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many largescale text applications. In the standard unitcost RAM comparison model, the complexity of sorting K strings of total length N is (K log2 K +N). By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting strings is ( K B logM=B K N
Sorting in parallel externalmemory multicores
, 2007
"... Abstract. In this paper, we introduce a model for multicore architectures, which takes into explicit consideration the cacheoriented nature of inputs and outputs in modern CPUs. In addition, we study the fundamental problem of sorting comparable items using this model. We provide algorithms that ar ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Abstract. In this paper, we introduce a model for multicore architectures, which takes into explicit consideration the cacheoriented nature of inputs and outputs in modern CPUs. In addition, we study the fundamental problem of sorting comparable items using this model. We provide algorithms that are efficient in terms of the number of parallel I/O’s. We also provide lower bounds that show that our algorithms are within a constant factor of optimal, for reasonable values of parameters characterizing the number of processors, the size of each processors memory, the size of cache blocks, and the number of items to be sorted. 1
Machine Models for Query Processing
"... The massive data sets that have to be processed in many application areas are often far too large to fit completely into a computer’s internal memory. When evaluating queries on such large data sets, the resulting communication ..."
Abstract
 Add to MetaCart
The massive data sets that have to be processed in many application areas are often far too large to fit completely into a computer’s internal memory. When evaluating queries on such large data sets, the resulting communication