Results 1–10 of 14
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
, 1994
"... This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bitmatrixmultiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an a ..."
Abstract

Cited by 61 (19 self)
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix...
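The abstract above defines a BMMC permutation as an affine map y = Ax XOR c over GF(2) on the bits of the index. The following is a minimal illustrative sketch of that definition only, not the paper's I/O-efficient algorithm (which factors the characteristic matrix to minimize parallel I/Os); the function name and the 3-bit bit-reversal example are hypothetical, chosen because bit reversal is one of the BMMC special cases the abstract lists.

```python
import numpy as np

def bmmc_permute(index, A, c, n_bits):
    """Map a single integer index through the BMMC permutation defined by
    the nonsingular 0-1 matrix A and complement vector c over GF(2).
    Bit vectors are least-significant-bit first."""
    x = np.array([(index >> i) & 1 for i in range(n_bits)], dtype=np.uint8)
    y = (A.dot(x) + c) % 2  # matrix-vector product and complement over GF(2)
    return int(sum(int(b) << i for i, b in enumerate(y)))

# Example: 3-bit bit reversal is BMMC with A the bit-reversal matrix, c = 0.
A = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]], dtype=np.uint8)
c = np.zeros(3, dtype=np.uint8)
perm = [bmmc_permute(i, A, c, 3) for i in range(8)]
# perm is the bit-reversal permutation of indices 0..7
```

Because A is nonsingular over GF(2), the map is a bijection on the index space, which is what makes it a permutation.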
Early experiences in evaluating the Parallel Disk Model with the ViC* implementation
, 1996
"... Although several algorithms have been developed for the Parallel Disk Model (PDM), few have beenimplemented. Consequently, little has been known about the accuracy of thePDMin measuring I/O time and total running time toperform an outofcore computation. This paper analyzes timing results on multip ..."
Abstract

Cited by 19 (6 self)
Although several algorithms have been developed for the Parallel Disk Model (PDM), few have been implemented. Consequently, little has been known about the accuracy of the PDM in measuring I/O time and total running time to perform an out-of-core computation. This paper analyzes timing results on multiple-disk platforms for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM. The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, one (problem size) is a good indicator of I/O time and running time, one (memory size) is a good indicator of I/O time but not necessarily running time, and the other two (block size and number of disks) do not necessarily indicate either I/O or running time. Third, because PDM algorithms tend not to be I/O bound, using asynchronous I/O can reduce I/O wait times significantly. The software interface to the PDM is part of the ViC* runtime library. The interface is a set of wrappers that are designed to be both efficient and portable across several underlying file systems and target machines.
Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks (Extended Abstract)
, 1997
"... ) Thomas H. Cormen Jake Wegmann David M. Nicol y Dartmouth College Department of Computer Science Abstract This paper extends an earlier outofcore Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four outofcore multiproce ..."
Abstract

Cited by 16 (7 self)
Thomas H. Cormen, Jake Wegmann, David M. Nicol (Dartmouth College, Department of Computer Science). This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of "mini-butterfly" computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations req...
The Future Fast Fourier Transform
 SIAM J. Sci. Computing
, 1999
"... It seems likely that improvements in arithmetic speed will continue to outpace advances in communications bandwidth. Furthermore, as more and more problems are working on huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does ..."
Abstract

Cited by 13 (0 self)
It seems likely that improvements in arithmetic speed will continue to outpace advances in communications bandwidth. Furthermore, as more and more problems are working on huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does not have sufficient storage capacity. For these reasons, we propose that an inexact DFT such as an approximate matrix-vector approach based on singular values or a variation of the Dutt-Rokhlin fast-multipole-based algorithm [9] may outperform any exact parallel FFT. The speedup may be as large as a factor of three in situations where FFT run time is dominated by communication. For the multipole idea we further propose that a method of "virtual charges" may improve accuracy, and we provide an analysis of the singular values that are needed for the approximate matrix-vector approaches.
Out-of-Core FFTs with Parallel Disks
 ACM SIGMETRICS Performance Evaluation Review
, 1997
"... We examine approaches to computing the Fast Fourier Transform (FFT) when the data size exceeds the size of main memory. Analytical and experimental evidence shows that relying on native virtual memory with demand paging can yield extremely poor performance. We then present approaches based on minimi ..."
Abstract

Cited by 10 (1 self)
We examine approaches to computing the Fast Fourier Transform (FFT) when the data size exceeds the size of main memory. Analytical and experimental evidence shows that relying on native virtual memory with demand paging can yield extremely poor performance. We then present approaches based on minimizing I/O costs with the Parallel Disk Model (PDM). Each of these approaches explicitly plans and performs disk accesses so as to minimize their number.

1 Introduction

Although in most cases, Fast Fourier Transforms (FFTs) can be computed entirely in the main memory of a computer, in a few exceptional cases, the input vector is too large to fit. One application that uses very large FFTs is seismic analysis [2]; in one industrial application, an out-of-core one-dimensional FFT is necessary (as part of a higher-dimensional FFT) even when the computer memory has 16 gigabytes of available RAM. Another application is in the area of radio astronomy. The High-Speed Data Acquisition and Very Large ...
Determining an Out-of-Core FFT Decomposition Strategy for Parallel Disks by Dynamic Programming
 Algorithms for Parallel Processing, Volume 105 of IMA Volumes in Mathematics and its Applications
, 1999
"... We present an outofcore FFT algorithm based on the incore FFT method developed by Swarztrauber. Our algorithm uses a recursive divideandconquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the al ..."
Abstract

Cited by 9 (3 self)
We present an out-of-core FFT algorithm based on the in-core FFT method developed by Swarztrauber. Our algorithm uses a recursive divide-and-conquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the algorithm's I/O complexity on the Parallel Disk Model and show how to use dynamic programming to determine optimal splits at each recursive stage. The algorithm to determine the optimal splits takes only Θ(lg² N) time for an N-point FFT, and it is practical. The out-of-core FFT algorithm itself takes considerably longer.
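The abstract above describes choosing, at each recursive stage, how to split an N-point FFT into subproblems by dynamic programming over an I/O-cost recurrence. The sketch below shows only the shape of that optimization under an invented cost model: the memory size M, the function names, and the combine cost (one extra pass per split) are all placeholder assumptions, not the paper's actual recurrence.

```python
from functools import lru_cache

M = 2 ** 10  # hypothetical in-core memory capacity, in records (assumption)

def pass_cost(n):
    # Placeholder cost model: charge one full pass over n records.
    return n

@lru_cache(maxsize=None)
def best_split(lg_n):
    """Return (cost, lg_n1) for an N = 2**lg_n point FFT: either the
    problem fits in core, or we split N = N1 * N2 with N1 = 2**lg_n1
    and recurse on both factors, paying a hypothetical extra pass."""
    n = 2 ** lg_n
    if n <= M:
        return pass_cost(n), None  # base case: solve entirely in memory
    best_cost, best_lg_n1 = float("inf"), None
    for lg_n1 in range(1, lg_n):  # try every factorization N = 2**lg_n1 * 2**(lg_n - lg_n1)
        c1, _ = best_split(lg_n1)
        c2, _ = best_split(lg_n - lg_n1)
        cost = c1 + c2 + pass_cost(n)  # invented combine cost, not the paper's
        if cost < best_cost:
            best_cost, best_lg_n1 = cost, lg_n1
    return best_cost, best_lg_n1

cost, split = best_split(20)  # plan a 2**20-point out-of-core FFT
```

With O(lg N) memoized states and O(lg N) candidate splits per state, this structure runs in O(lg² N) time, consistent with the Θ(lg² N) planning bound the abstract states.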
Towards tiny trusted third parties
, 2005
"... Many security protocols hypothesize the existence of a trusted third party (TTP) to ease handling of computation and data too sensitive for the other parties involved. Subsequent discussion usually dismisses these protocols as hypothetical or impractical, under the assumption that trusted third part ..."
Abstract

Cited by 5 (4 self)
Many security protocols hypothesize the existence of a trusted third party (TTP) to ease handling of computation and data too sensitive for the other parties involved. Subsequent discussion usually dismisses these protocols as hypothetical or impractical, under the assumption that trusted third parties cannot exist. However, the last decade has seen the emergence of hardware-based devices that, to high assurance, can carry out computation unmolested; emerging research promises more. In theory, such devices can perform the role of a trusted third party in real-world problems. In practice, we have found problems. The devices aspire to be general-purpose processors but are too small to accommodate real-world problem sizes. The small size forces programmers to hand-tune each algorithm anew, if possible, to fit inside the small space without losing security. This tuning heavily uses operations that general-purpose processors do not perform well. Furthermore, perhaps by trying to incorporate too much functionality, current devices are also too expensive to deploy widely. Our current research attempts to overcome these barriers, by focusing on the effective use of tiny TTPs (T3Ps). To eliminate the programming obstacle, we used our experience building hardware TTP apps to design and prototype an efficient way to execute arbitrary programs on T3Ps while preserving the critical trust properties. To eliminate the performance and cost obstacles, we are currently examining the potential hardware design for a T3P optimized for these operations. In previous papers, we reported our work on the programming obstacle. In this paper, we examine the potential hardware designs. We estimate that such a T3P could outperform existing devices by several orders of magnitude, while also having a gate count of only 30K–60K, one to three orders of magnitude smaller than existing devices.
Parallel Algorithms in External Memory
, 2000
"... External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the ..."
Abstract

Cited by 4 (1 self)
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the basis of input/output (I/O) complexity. Parallel algorithms are designed to efficiently utilize the computing power of multiple processing units, interconnected by a communication mechanism. A popular model for developing and analyzing parallel algorithms is the Bulk Synchronous Parallel (BSP) model due to Valiant. In this work we develop simulation techniques, both randomized and deterministic, which produce efficient EM algorithms from efficient algorithms developed under BSP-like parallel computing models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. ...
The FFT: an algorithm the whole family can use
, 1999
"... Introduction These days, it is almost beyond belief (at least for many of my students) that there was a time before digital technology. It seems almost everyone knows that somehow all the data whizzing over the internet, bustling through our modems or crashing into our cell phones is ultimately just ..."
Abstract

Cited by 4 (0 self)
Introduction

These days, it is almost beyond belief (at least for many of my students) that there was a time before digital technology. It seems almost everyone knows that somehow all the data whizzing over the internet, bustling through our modems or crashing into our cell phones is ultimately just a sequence of 0's and 1's, a digital sequence, that magically makes the world the convenient high speed place it is today. So much of this magic is due to a family of algorithms that collectively go by the name "The Fast Fourier Transform", or "FFT" to its friends, among which the version published by Cooley and Tukey [5] is the most famous. Indeed, the FFT is perhaps the most ubiquitous algorithm used today in the analysis and manipulation of digital or discrete data. My own research experience with various flavors of the FFT is evidence of the wide range of applicability: electroacoustic music and audio signal processing, medical imaging, image processing, pattern recognition, computat...
Multidimensional, Multiprocessor, Out-of-Core FFTs with Distributed Memory and Parallel Disks (Extended Abstract)
 In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures
, 1999
"... ) Lauren M. Baptist Thomas H. Cormen # {lmb, thc}@cs.dartmouth.edu Dartmouth College Department of Computer Science Abstract We show how to compute multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do ..."
Abstract

Cited by 3 (2 self)
Lauren M. Baptist, Thomas H. Cormen ({lmb, thc}@cs.dartmouth.edu), Dartmouth College, Department of Computer Science. We show how to compute multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. Instead, data reside on a parallel disk system and are brought into memory in sections. We use the Parallel Disk Model for implementation and analysis. Our method is a straightforward out-of-core variant of a well-known method for in-core, multidimensional FFTs. It performs 1-dimensional FFT computations on each dimension in turn. This method is easy to generalize to any number of dimensions, and it also readily permits the individual dimensions to be of any sizes that are integer powers of 2. The key step is an out-of-core transpose operation that places the data along each dimension into contiguous positions on the parallel disk system so that the...