Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
, 1994
Cited by 59 (19 self)
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix...
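As a minimal in-memory sketch of the definition in this abstract (not the paper's I/O-efficient algorithm), the following Python computes a target index as an affine map over GF(2). The names `apply_bmmc` and `bit_reversal_matrix`, and the choice to pack each matrix row into an integer with bit 0 as the least significant address bit, are assumptions made for illustration only.

```python
def apply_bmmc(rows, c, x):
    """Return y = A*x XOR c over GF(2).

    rows[i] is row i of the characteristic matrix A packed into an int
    (bit j of rows[i] is A[i][j]); c is the complement vector, also packed.
    This packing convention is an assumption of this sketch.
    """
    y = 0
    for i, row in enumerate(rows):
        # Dot product over GF(2): parity of the bitwise AND.
        bit = bin(row & x).count("1") & 1
        y |= bit << i
    return y ^ c


def bit_reversal_matrix(n):
    """Characteristic matrix with ones on the antidiagonal:
    target bit i equals source bit n-1-i."""
    return [1 << (n - 1 - i) for i in range(n)]


n = 3
# Bit reversal: antidiagonal A, zero complement vector.
bit_reversal = [apply_bmmc(bit_reversal_matrix(n), 0, x) for x in range(1 << n)]

# Vector reversal: identity A, all-ones complement vector.
identity = [1 << i for i in range(n)]
vector_reversal = [apply_bmmc(identity, (1 << n) - 1, x) for x in range(1 << n)]
```

With n = 3, `bit_reversal` maps source index 1 (binary 001) to target index 4 (binary 100), and `vector_reversal` maps index x to 7 - x. Both are bijections on the 8 indices because their characteristic matrices are nonsingular over GF(2).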
Speeding up External Mergesort
- IEEE Transactions on Knowledge and Data Engineering
Cited by 18 (0 self)
External mergesort is normally implemented so that each run is stored contiguously on disk and blocks of data are read exactly in the order they are needed during merging. We investigate two ideas for improving the performance of external mergesort: interleaved layout and a new reading strategy. Interleaved layout places blocks from different runs in consecutive disk addresses. This is done in the hope that interleaving will reduce seek overhead during merging. The new reading strategy precomputes the order in which data blocks are to be read according to where they are located on disk and when they are needed for merging. Extra buffer space makes it possible to read blocks in an order that reduces seek overhead, instead of reading them exactly in the order they are needed for merging. A detailed simulation model was used to compare the two layout strategies and three reading strategies. The effects of using multiple work disks were also investigated. We found that, in most cases, inte...
Efficient parallel algorithms for closest point problems
, 1994
Cited by 10 (1 self)
This dissertation develops and studies fast algorithms for solving closest point problems. Algorithms for such problems have applications in many areas including statistical classification, crystallography, data compression, and finite element analysis. In addition to a comprehensive empirical study of known sequential methods, I introduce new parallel algorithms for these problems that are both efficient and practical. I present a simple and flexible programming model for designing and analyzing parallel algorithms. Also, I describe fast parallel algorithms for nearest-neighbor searching and constructing Voronoi diagrams. Finally, I demonstrate that my algorithms actually obtain good performance on a wide variety of machine architectures. The key algorithmic ideas that I examine are exploiting spatial locality and random sampling. Spatial decomposition allows many concurrent threads to work independently of one another in local areas of a shared data structure. Random sampling provides a simple way to adaptively decompose irregular problems, and to balance workload among many threads. Used together, these techniques result in effective algorithms for a wide range of geometric problems. The key
Optimal Deterministic Sorting in Parallel Memory Hierarchies
, 1992
Cited by 3 (3 self)
We present a general deterministic sorting strategy that is applicable to a wide variety of parallel memory hierarchies with parallel processors. The simplest incarnation of the strategy is an optimal deterministic algorithm called Balance Sort for external sorting on multiple disks with a single CPU. Balance Sort was the topic of a previous report. This report shows how to adapt Balance Sort to sort deterministically in parallel memory hierarchies. The algorithms so derived will be optimal for all parallel memory hierarchies for which an optimal algorithm is known for a single hierarchy. In the case of D disks, P processors, block size B, and internal memory size M, they are optimal in terms of I/Os for any P ≤ M and DB < M, and in terms of internal processing time except when P > M log min{M/B, log M}/log M and log(M/B) = o(log M).
Minimizing the Input/Output Bottleneck
, 1992
this paper, we assume that all graphs are undirected, an assumption that may not hold for certain applications such as hypertext and object-oriented databases. One important assumption of our model is that data may be multiply represented in blocks. This is a stronger assumption than that used, for example, by external
Vector Layout in Virtual-Memory Systems for Data-Parallel Computing
In a data-parallel computer with virtual memory, the way in which vectors are laid out on the disk system affects the performance of data-parallel operations. We present a general method of vector layout called banded layout, in which we divide a vector into bands of a number of consecutive vector elements laid out in column-major order, and we analyze the effect of the band size on the major classes of data-parallel operations. We find that although the best band size varies among the operations, choosing fairly small band sizes, at most a track, works well in general.
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
We give asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on parallel disk systems. In a BMMC permutation on N records, where N is a power of 2, each (lg N)-bit source address x maps to a corresponding (lg N)-bit target address y by the matrix equation y = Ax ⊕ c, where matrix multiplication is performed over GF(2). The characteristic matrix A is (lg N) × (lg N) and nonsingular over GF(2). Under the Vitter-Shriver parallel-disk model with N records, D disks, B records per block, and M records of memory, we show a universal lower bound of

