Results 1 - 10
of
15
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
, 1994
"... This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an a ..."
Abstract
-
Cited by 59 (19 self)
- Add to MetaCart
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF (2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Graycode permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix...
Early experiences in evaluating the Parallel Disk Model with the ViC* implementation
, 1996
"... Although several algorithms have been developed for the Parallel Disk Model (PDM), few have beenimplemented. Consequently, little has been known about the accuracy of thePDMin measuring I/O time and total running time toperform an out-of-core computation. This paper analyzes timing results on multip ..."
Abstract
-
Cited by 19 (6 self)
- Add to MetaCart
Although several algorithms have been developed for the Parallel Disk Model (PDM), few have beenimplemented. Consequently, little has been known about the accuracy of thePDMin measuring I/O time and total running time toperform an out-of-core computation. This paper analyzes timing results on multiple-disk platforms fortwo PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of thePDM. The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, one (problem size) is a good indicator of I/O time and running time, one (memory size) is a good indicator of I/O time but not necessarily running time, and the other two (block size and number of disks) do not necessarily indicate either I/O or running time. Third, because PDM algorithms tendnottobeI/Obound, using asynchronous I/O can reduce I/O wait times signi cantly. The software interface to the PDM is part of the ViC * run-time library. The interface is a set of wrappers that are designed to be both e cient and portable across several underlying le systems and target machines. 1
Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks (Extended Abstract)
, 1997
"... ) Thomas H. Cormen Jake Wegmann David M. Nicol y Dartmouth College Department of Computer Science Abstract This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiproce ..."
Abstract
-
Cited by 15 (7 self)
- Add to MetaCart
) Thomas H. Cormen Jake Wegmann David M. Nicol y Dartmouth College Department of Computer Science Abstract This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of "minibutterfly " computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations req...
ViC*: A compiler for virtual-memory C
- In Proceedings of the Third International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS ’98
, 1998
"... This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C * with support for out-of-core data. The compiler translates C * programs with shapes declared outofcore, whichdescribe parallel data stored on disk. The compiler output is a SPMD-style program i ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C * with support for out-of-core data. The compiler translates C * programs with shapes declared outofcore, whichdescribe parallel data stored on disk. The compiler output is a SPMD-style program in standard C with I/Oand library calls added to e ciently access out-of-core parallel data. The ViC * compiler also applies several program transformations to improve out-of-core data layout and access. 1
Out-of-Core FFTs with Parallel Disks
- ACM SIGMETRICS Performance Evaluation Review
, 1997
"... We examine approaches to computing the Fast Fourier Transform (FFT) when the data size exceeds the size of main memory. Analytical and experimental evidence shows that relying on native virtual memory with demand paging can yield extremely poor performance. We then present approaches based on minimi ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
We examine approaches to computing the Fast Fourier Transform (FFT) when the data size exceeds the size of main memory. Analytical and experimental evidence shows that relying on native virtual memory with demand paging can yield extremely poor performance. We then present approaches based on minimizing I/O costs with the Parallel Disk Model (PDM). Each of these approaches explicitly plans and performs disk accesses so as to minimize their number. 1 Introduction Although in most cases, Fast Fourier Transforms (FFTs) can be computed entirely in the main memory of a computer, in a few exceptional cases, the input vector is too large to fit. One application that uses very large FFTs is seismic analysis [2]; in one industrial application, an out-of-core one-dimensional FFT is necessary (as part of a higher dimensional FFT) even when the computer memory has 16 gigabytes of available RAM. Another application is in the area of radio astronomy. The High-Speed Data Acquisition and Very Large ...
The Future Fast Fourier Transform?
- SIAM J. SCI. COMPUTING
, 1997
"... It seems likely that improvements in arithmetic speed will continue to outpace advances in communication bandwidth. Furthermore, as more and more problems are working on huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
It seems likely that improvements in arithmetic speed will continue to outpace advances in communication bandwidth. Furthermore, as more and more problems are working on huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does not have su#cient storage capacity. For these reasons, we propose that an inexact DFT such as an approximate matrix-vector approach based on singular values or a variation of the Dutt-Rokhlin fast-multipole-based algorithm may outperform any exact parallel FFT. The speedup may be as large as a factor of three in situations where FFT run time is dominated by communication. For the multipole idea we further propose that a method of "virtual charges" may improve accuracy, and we provide an analysis of the singular values that are needed for the approximate matrix-vector approaches.
Determining an Out-of-Core FFT Decomposition Strategy for Parallel Disks by Dynamic Programming
- ALGORITHMS FOR PARALLEL PROCESSING, VOLUME 105 OF IMA VOLUMES IN MATHEMATICS AND ITS APPLICATIONS
, 1999
"... We present an out-of-core FFT algorithm based on the in-core FFT method developed by Swarztrauber. Our algorithm uses a recursive divide-and-conquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the al ..."
Abstract
-
Cited by 8 (3 self)
- Add to MetaCart
We present an out-of-core FFT algorithm based on the in-core FFT method developed by Swarztrauber. Our algorithm uses a recursive divide-and-conquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the algorithm's I/O complexity on the Parallel Disk Model and show how to use dynamic programming to determine optimal splits at each recursive stage. The algorithm to determine the optimal splits takes only \Theta(lg 2 N) time for an N-point FFT, and it is practical. The out-of-core FFT algorithm itself takes considerably longer.
Parallel Algorithms in External Memory
, 2000
"... External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the basis of input/output (I/O) complexity. Parallel algorithms are designed to efficiently utilize the computing power of multiple processing units, interconnected by a communication mechanism. A popular model for developing and analyzing parallel algorithms is the Bulk Synchronous Parallel (BSP) model due to Valiant. In this work we develop simulation techniques, both randomized and deterministic, which produce efficient EM algorithms from efficient algorithms developed under BSPlike parallel computing models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. ...
Multidimensional, Multiprocessor, Out-of-Core FFTs with Distributed Memory and Parallel Disks (Extended Abstract)
- In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures
, 1999
"... ) Lauren M. Baptist Thomas H. Cormen # {lmb, thc}@cs.dartmouth.edu Dartmouth College Department of Computer Science Abstract We show how to compute multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
) Lauren M. Baptist Thomas H. Cormen # {lmb, thc}@cs.dartmouth.edu Dartmouth College Department of Computer Science Abstract We show how to compute multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. Instead, data reside on a parallel disk system and are brought into memory in sections. We use the Parallel Disk Model for implementation and analysis. Our method is a straightforward out-of-core variant of a wellknown method for in-core, multidimensional FFTs. It performs 1-dimensional FFT computations on each dimension in turn. This method is easy to generalize to any number of dimensions, and it also readily permits the individual dimensions to be of any sizes that are integer powers of 2. The key step is an out-of-core transpose operation that places the data along each dimension into contiguous positions on the parallel disk system so that the...
Towards tiny trusted third parties
, 2005
"... Many security protocols hypothesize the existence of a trusted third party (TTP) to ease handling of computation and data too sensitive for the other parties involved. Subsequent discussion usually dismisses these protocols as hypothetical or impractical, under the assumption that trusted third part ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Many security protocols hypothesize the existence of a trusted third party (TTP) to ease handling of computation and data too sensitive for the other parties involved. Subsequent discussion usually dismisses these protocols as hypothetical or impractical, under the assumption that trusted third parties cannot exist. However, the last decade has seen the emergence of hardware-based devices that, to high assurance, can carry out computation unmolested; emerging research promises more. In theory, such devices can perform the role of a trusted third party in real-world problems. In practice, we have found problems. The devices aspire to be general-purpose processors but are too small to accommodate real-world problem sizes. The small size forces programmers to hand-tune each algorithm anew, if possible, to fit inside the small space without losing security. This tuning heavily uses operations that general-purpose processors do not perform well. Furthermore, perhaps by trying to incorporate too much functionality, current devices are also too expensive to deploy widely. Our current research attempts to overcome these barriers, by focusing on the effective use of tiny TTPs (T3Ps). To eliminate the programming obstacle, we used our experience building hardware TTP apps to design and prototype an efficient way to execute arbitrary programs on T3Ps while preserving the critical trust properties. To eliminate the performance and cost obstacles, we are currently examining the potential hardware design for a T3P optimized for these operations. In previous papers, we reported our work on the programming obstacle. In this paper, we examine the potential hardware designs. We estimate that such a T3P could outperform existing devices by several orders of magnitude, while also having a gate-count of only 30K-60K, one to three orders of magnitude smaller than existing devices. 1

