Results 1–10 of 12
The Galley parallel file system
 Parallel Computing
, 1996
"... Most current multiprocessor le systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scienti c applications. Many multiprocessor le systems provide applications with a conventional Unixlike interface, allowing the ..."
Abstract

Cited by 134 (9 self)
Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
, 1994
"... This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bitmatrixmultiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an a ..."
Abstract

Cited by 61 (18 self)
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix re-blocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix...
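The affine map in this abstract is easy to state concretely. The following is a hedged toy sketch (our own code, not the paper's algorithm): it applies y = Ax ⊕ c over GF(2) to an n-bit index, using bit-reversal as an example BMMC permutation.

```python
def bmmc(x, A, c, n):
    """Apply the BMMC map y = A*x XOR c over GF(2) to an n-bit index x.
    A is a list of n rows; row i is an n-bit integer whose set bits
    select which bits of x are XORed together to form bit i of y.
    c is the complement vector, also an n-bit integer."""
    y = 0
    for i in range(n):
        # bit i of y = parity of (row i of A AND x), flipped by bit i of c
        bit = bin(A[i] & x).count("1") & 1
        y |= (bit ^ ((c >> i) & 1)) << i
    return y

# Example: bit-reversal of 3-bit indices is BMMC with an anti-diagonal
# characteristic matrix and a zero complement vector.
A_rev = [0b100, 0b010, 0b001]  # row i selects bit (n-1-i) of x
```

Because A_rev is nonsingular over GF(2), the map is a permutation of the index space, as the class requires.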
Performing Out-of-Core FFTs on Parallel Disk Systems
 Parallel Computing
, 1998
"... The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most onedimensional FFT problems can be solved entirely in main memory, some important classes of applications require outofcore techniques. For these, use of parallel I/O systems ca ..."
Abstract

Cited by 19 (7 self)
The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive with, or better than, those for in-core methods even when they run entirely in memory.
Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks (Extended Abstract)
, 1997
"... ) Thomas H. Cormen Jake Wegmann David M. Nicol y Dartmouth College Department of Computer Science Abstract This paper extends an earlier outofcore Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four outofcore multiproce ..."
Abstract

Cited by 16 (7 self)
Thomas H. Cormen, Jake Wegmann, David M. Nicol (Dartmouth College, Department of Computer Science)
This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of the "mini-butterfly" computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations req...
Modeling and Optimizing I/O Throughput of Multiple Disks on a Bus
 In Proceedings of ACM SIGMETRICS Conference
, 1999
"... In modern I/O architectures, multiple disk drives are attached to each I/O controller. A study of the performance of such architectures under I/Ointensive workloads has revealed a performance impairment that results from a previously unknown form of convoy behavior in disk I/O. In this paper, we de ..."
Abstract

Cited by 15 (6 self)
In modern I/O architectures, multiple disk drives are attached to each I/O controller. A study of the performance of such architectures under I/O-intensive workloads has revealed a performance impairment that results from a previously unknown form of convoy behavior in disk I/O. In this paper, we describe measurements of the read performance of multiple disks that share a SCSI bus under a heavy workload, and develop and validate formulas that accurately characterize the observed performance (to within 12% on several platforms for I/O sizes in the range 16–128 KB). Two terms in the formula clearly characterize the lost performance seen in our experiments. We describe techniques to deal with the performance impairment, via user-level workarounds that achieve greater overlap of bus transfers with disk seeks, and that increase the percentage of transfers that occur at the full bus bandwidth rather than at the lower bandwidth of a disk head. Experiments show bandwidth improvements of 10–20% when using these user-level techniques, but only in the case of large I/Os.
ViC*: A compiler for virtualmemory C*
 In Proceedings of the Third International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS ’98)
, 1998
"... This paper describes the functionality of ViC*, a compiler for a variant of the dataparallel language C* with support for outofcore data. The compiler translates C* programs with shapes declared outofcore, which describe parallel data stored on disk. The compiler output is a SPMDstyle program in ..."
Abstract

Cited by 13 (3 self)
This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C* with support for out-of-core data. The compiler translates C* programs with shapes declared out-of-core, which describe parallel data stored on disk. The compiler output is an SPMD-style program in standard C with I/O and library calls added to efficiently access out-of-core parallel data. The ViC* compiler also applies several program transformations to improve out-of-core data layout and access.
PC-OPT: Optimal Offline Prefetching and Caching for Parallel I/O Systems
 IEEE Transactions on Computers
, 2002
"... We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for parallel disk scheduling. Traditional buffer management algorithms that minimize the number of block misses are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed ..."
Abstract

Cited by 10 (0 self)
We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for parallel disk scheduling. Traditional buffer management algorithms that minimize the number of block misses are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed simultaneously. We show that in the offline case, where a priori knowledge of all the requests is available, PC-OPT performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably offline optimal in the parallel disk model. In the online case, we study the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests. We show that the competitive ratio of PC-OPT, with global L-block lookahead, is Θ(M − L + D) when L ≤ M, and Θ(MD/L) when L > M, where D is the number of disks and M is the buffer size.
An Efficient Algorithm for Out-of-Core Matrix Transposition
"... Efficient transposition of Outofcore matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, in the stateoftheart architectures, memorymemory data transfer time and index computation time are also signi cant components of the overall time ..."
Abstract

Cited by 9 (0 self)
Efficient transposition of out-of-core matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, in state-of-the-art architectures, memory-to-memory data transfer time and index computation time are also significant components of the overall time. In this paper, we propose an algorithm that considers the index computation time and the I/O time and reduces the overall execution time. Our algorithm reduces the total execution time by reducing the number of I/O operations and eliminating the index computation. In doing so, two techniques are employed: writing the data onto disk in predefined patterns and balancing the number of disk read and write operations. The index computation, which is an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into read and write buffers. The expensive in-processor permutation is replaced by data collection from the read buffer into the write buffer. Even though this partitioning may increase the number of I/O operations for some cases, it results in an overall reduction in the execution time due to the elimination of the expensive index computation. Our algorithm is analyzed using the well-known Linear Model and the Parallel Disk Model. The experimental results on Sun Enterprise, SGI R12000, and Pentium III show that our algorithm reduces the ...
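As a minimal in-memory sketch of the tiled access pattern behind out-of-core transposition (our own illustration with invented names; the paper's algorithm additionally eliminates the per-element index arithmetic through its read/write buffer partitioning):

```python
def transpose_tiled(mat, rows, cols, tile):
    """Transpose a rows x cols matrix stored row-major in a flat list,
    moving one tile at a time. Each tile stands in for a disk block:
    an out-of-core algorithm would read a tile, permute it in memory,
    and write it to its transposed position on disk."""
    out = [0] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # move one tile into its transposed location
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c * rows + r] = mat[r * cols + c]
    return out
```

Tiling keeps both the reads and the writes clustered, which is what bounds the number of block transfers in models such as the Parallel Disk Model.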
A Simple and Efficient Parallel Disk Mergesort
, 2002
"... External sorting—the process of sorting a file that is too large to fit into the computer’s internal memory and must be stored externally on disks—is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up t ..."
Abstract

Cited by 8 (0 self)
External sorting—the process of sorting a file that is too large to fit into the computer’s internal memory and must be stored externally on disks—is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls “randomized striping”) as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM’s lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is ...
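As a toy illustration of the forecasting idea (our own sketch with invented names, not the SRM implementation): each unread block's first key serves as its forecast key, and blocks are fetched in increasing forecast-key order, which is exactly the order in which an external merge first needs them.

```python
import heapq

def read_schedule(runs):
    """runs: a list of sorted runs, each run split into sorted blocks
    (lists), with blocks within a run in increasing key order.
    Returns (run, block) pairs in the order the blocks should be read:
    always the unread block with the smallest leading ("forecast") key."""
    heap = [(run[0][0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    order = []
    while heap:
        _, i, b = heapq.heappop(heap)
        order.append((i, b))
        if b + 1 < len(runs[i]):  # forecast the run's next block
            heapq.heappush(heap, (runs[i][b + 1][0], i, b + 1))
    return order
```

With parallel disks, a schedule like this tells the prefetcher which disk's block to issue next so that data arrives before the merge stalls on it.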
Parallel Algorithms in External Memory
, 2000
"... External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the ..."
Abstract

Cited by 4 (1 self)
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the basis of input/output (I/O) complexity. Parallel algorithms are designed to efficiently utilize the computing power of multiple processing units, interconnected by a communication mechanism. A popular model for developing and analyzing parallel algorithms is the Bulk Synchronous Parallel (BSP) model due to Valiant. In this work we develop simulation techniques, both randomized and deterministic, which produce efficient EM algorithms from efficient algorithms developed under BSP-like parallel computing models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. ...