Results 1 – 8 of 8
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
1994
Abstract

Cited by 61 (19 self)
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix...
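The affine index mapping over GF(2) described in this abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's algorithm: the row-bitmask encoding of the characteristic matrix and the bit-reversal example are our own choices.

```python
# Sketch of a BMMC index mapping y = A*x XOR c over GF(2), where source and
# target indices are (lg N)-bit vectors. A is encoded as a list of rows, each
# row an int bitmask: bit j of A[i] is the matrix entry in row i, column j.

def bmmc_map(x: int, A: list[int], c: int) -> int:
    """Map source index x to target index y = A*x + c over GF(2)."""
    y = 0
    for i, row in enumerate(A):
        bit = bin(row & x).count("1") & 1   # row-vector inner product mod 2
        y |= bit << i
    return y ^ c                            # "+ c" over GF(2) is XOR

# Example: bit reversal on lg N = 3 bits is a BMMC permutation whose
# characteristic matrix is the anti-diagonal identity, with c = 0.
lgN = 3
A_rev = [1 << (lgN - 1 - i) for i in range(lgN)]  # row i selects bit lgN-1-i
perm = [bmmc_map(x, A_rev, 0) for x in range(1 << lgN)]
# Each index's bits are reversed, e.g. 1 (001) maps to 4 (100).
```

Since A_rev is nonsingular over GF(2), the resulting map is a permutation of the N indices, matching the abstract's requirement on the characteristic matrix.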
Speeding up External Mergesort
 IEEE Transactions on Knowledge and Data Engineering
Abstract

Cited by 19 (0 self)
External mergesort is normally implemented so that each run is stored contiguously on disk and blocks of data are read exactly in the order they are needed during merging. We investigate two ideas for improving the performance of external mergesort: interleaved layout and a new reading strategy. Interleaved layout places blocks from different runs in consecutive disk addresses. This is done in the hope that interleaving will reduce seek overhead during merging. The new reading strategy precomputes the order in which data blocks are to be read according to where they are located on disk and when they are needed for merging. Extra buffer space makes it possible to read blocks in an order that reduces seek overhead, instead of reading them exactly in the order they are needed for merging. A detailed simulation model was used to compare the two layout strategies and three reading strategies. The effects of using multiple work disks were also investigated. We found that, in most cases, inte...
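The reading strategy summarized above can be sketched as a scheduling step. This is a simplified illustration under our own assumptions (a fixed-size lookahead window; the function name, addresses, and window policy are invented for the example), not the paper's simulated strategy.

```python
# Sketch of precomputing a read order for external mergesort: given the order
# in which blocks will be consumed during merging and each block's disk
# address, extra buffer space lets us fetch upcoming blocks sorted by disk
# address rather than strictly in consumption order, reducing seek overhead.

def plan_reads(consumption_order, disk_addr, extra_buffers):
    """Return a read schedule: within each window of upcoming blocks,
    read in ascending disk-address order instead of consumption order."""
    window = extra_buffers + 1          # blocks we can hold at once
    schedule = []
    for start in range(0, len(consumption_order), window):
        batch = consumption_order[start:start + window]
        schedule.extend(sorted(batch, key=lambda b: disk_addr[b]))
    return schedule

# Four blocks needed in order B0..B3 but scattered on disk:
addr = {"B0": 900, "B1": 100, "B2": 500, "B3": 200}
order = ["B0", "B1", "B2", "B3"]
print(plan_reads(order, addr, extra_buffers=1))
# With one extra buffer, each consecutive pair is read in address order.
```

With more extra buffers the window grows, and in the limit the whole read sequence approaches a single address-ordered sweep, which is the intuition behind trading buffer space for fewer seeks.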
Efficient parallel algorithms for closest point problems
1994
Abstract

Cited by 10 (1 self)
This dissertation develops and studies fast algorithms for solving closest point problems. Algorithms for such problems have applications in many areas including statistical classification, crystallography, data compression, and finite element analysis. In addition to a comprehensive empirical study of known sequential methods, I introduce new parallel algorithms for these problems that are both efficient and practical. I present a simple and flexible programming model for designing and analyzing parallel algorithms. Also, I describe fast parallel algorithms for nearest-neighbor searching and constructing Voronoi diagrams. Finally, I demonstrate that my algorithms actually obtain good performance on a wide variety of machine architectures. The key algorithmic ideas that I examine are exploiting spatial locality and random sampling. Spatial decomposition allows many concurrent threads to work independently of one another in local areas of a shared data structure. Random sampling provides a simple way to adaptively decompose irregular problems and to balance workload among many threads. Used together, these techniques result in effective algorithms for a wide range of geometric problems. The key ...
Optimal Deterministic Sorting in Parallel Memory Hierarchies
1992
Abstract

Cited by 3 (3 self)
We present a general deterministic sorting strategy that is applicable to a wide variety of parallel memory hierarchies with parallel processors. The simplest incarnation of the strategy is an optimal deterministic algorithm called Balance Sort for external sorting on multiple disks with a single CPU. Balance Sort was the topic of a previous report. This report shows how to adapt Balance Sort to sort deterministically in parallel memory hierarchies. The algorithms so derived will be optimal for all parallel memory hierarchies for which an optimal algorithm is known for a single hierarchy. In the case of D disks, P processors, block size B, and internal memory size M, they are optimal in terms of I/Os for any P ≤ M and DB < M, and in terms of internal processing time except when P > M log min{M/B, log M} / log M and log(M/B) = o(log M).
Minimizing the Input/Output Bottleneck
1992
Abstract
In this paper, we assume that all graphs are undirected, an assumption that may not hold for certain applications such as hypertext and object-oriented databases. One important assumption of our model is that data may be multiply represented in blocks. This is a stronger assumption than that used, for example, by external ...
Vector Layout in Virtual-Memory Systems for Data-Parallel Computing
Abstract
In a data-parallel computer with virtual memory, the way in which vectors are laid out on the disk system affects the performance of data-parallel operations. We present a general method of vector layout called banded layout, in which we divide a vector into bands of a number of consecutive vector elements laid out in column-major order, and we analyze the effect of the band size on the major classes of data-parallel operations. We find that although the best band size varies among the operations, choosing fairly small band sizes—at most a track—works well in general.
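The banded layout described above amounts to an address mapping from a vector index to a location on the disk system. The following is a hedged sketch under our own assumptions: the function name, the (band, disk, offset) encoding, and the interpretation of column-major order within a band are illustrative, not taken from the paper.

```python
# Sketch of banded layout: a vector is split into bands, each band holding
# band_size consecutive elements per disk across num_disks disks, laid out
# column-major (consecutive elements fill one disk's column before moving
# to the next disk).

def banded_address(i, band_size, num_disks):
    """Map vector element i to (band, disk, offset within the band)."""
    band_capacity = band_size * num_disks
    band = i // band_capacity
    within = i % band_capacity
    disk = within // band_size       # which column (disk) of the band
    offset = within % band_size      # position down that column
    return band, disk, offset

# With band_size=4 and 2 disks, elements 0..3 sit on disk 0 of band 0,
# elements 4..7 on disk 1 of band 0, and element 8 starts band 1.
```

A small band size spreads nearby elements across bands quickly, which is the trade-off the abstract analyzes across the classes of data-parallel operations.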
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
Abstract
We give asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on parallel disk systems. In a BMMC permutation on N records, where N is a power of 2, each (lg N)-bit source address x maps to a corresponding (lg N)-bit target address y by the matrix equation y = Ax ⊕ c, where matrix multiplication is performed over GF(2). The characteristic matrix A is (lg N) × (lg N) and nonsingular over GF(2). Under the Vitter-Shriver parallel-disk model with N records, D disks, B records per block, and M records of memory, we show a universal lower bound of ...
with Parallel Block Transfer
Abstract
I wish to thank a number of people without whose help and support this work would never have been finished: Jin Joo Lee, for all her help with the math analysis; Tom Swartz, for his patience; and most of all, my advisor Jeff Vitter. This research was done in partial fulfillment of the requirements for my Master's degree. We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple disk drives for sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P disks can simultaneously transfer a contiguous block of B records. We also introduce parallel variants of the hierarchical memory models of [AAC, ACS] and give optimal algorithms. In our parallel models, there are P hierarchies operating in parallel; communication among the hierarchies takes place at a base memory level. The difficulty in developing optimal algorithms in our two-level and hierarchical models is to cope with the partitioning of memory into P separate physical devices. The popular ...
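The parallel block-transfer model described in this abstract has a simple accounting consequence that is worth making concrete. This is a minimal sketch of the I/O count for one streaming pass, under our assumption of perfectly striped data; the function name and parameters are illustrative.

```python
# In the parallel block-transfer model, one I/O moves a block of B records
# on each of the P disks simultaneously, so a full streaming pass over N
# records striped across the P disks costs ceil(N / (P * B)) parallel I/Os.
import math

def pass_ios(N, P, B):
    """Parallel I/Os for one full pass over N striped records."""
    return math.ceil(N / (P * B))

print(pass_ios(N=10**6, P=8, B=1024))  # 123 parallel I/Os per pass
```

Sorting and FFT in this model cost a number of such passes that depends on N, M, P, and B; the abstract's contribution is matching upper and lower bounds on that count.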