Results 1-10 of 20
The Galley parallel file system
 Parallel Computing
, 1996
Cited by 151 (9 self)
Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
, 1994
Cited by 58 (18 self)
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix...
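The affine map described in this abstract can be illustrated with a small sketch. It assumes the usual reading of a BMMC permutation as y = A x XOR c over GF(2), with a nonsingular n-by-n bit matrix A and a complement vector c. The function name, the row-bitmask encoding of A, and the LSB-first bit order are illustrative choices, not taken from the paper, and the sketch says nothing about the I/O-efficient algorithm itself.

```python
def bmmc_target(x, rows, c, n):
    """Map source index x to its target index y = A x XOR c over GF(2).

    rows[i] is row i of the matrix A packed as an n-bit mask (hypothetical
    encoding); bit 0 of an index is its least significant bit.
    """
    y = 0
    for i in range(n):
        # Dot product over GF(2) is the parity of the bitwise AND.
        bit = bin(rows[i] & x).count("1") & 1
        y |= bit << i
    return y ^ c


n = 3

# Bit-reversal: output bit i equals input bit n-1-i, no complement.
bit_reversal_rows = [1 << (n - 1 - i) for i in range(n)]
bit_reversal = [bmmc_target(x, bit_reversal_rows, 0, n) for x in range(1 << n)]

# Vector-reversal: identity matrix, complement every bit (x -> 2^n - 1 - x).
identity_rows = [1 << i for i in range(n)]
vector_reversal = [bmmc_target(x, identity_rows, (1 << n) - 1, n)
                   for x in range(1 << n)]
```

Because A is nonsingular in both cases, each list is a permutation of the indices 0 through 7.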
Performing out-of-core FFTs on parallel disk systems
 PARALLEL COMPUTING
, 1998
Cited by 20 (7 self)
The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive with, or better than, those for in-core methods even when they run entirely in memory.
A framework for simple sorting algorithms on parallel disk systems
 In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures
, 1998
Cited by 18 (7 self)
In this paper we present a simple parallel sorting algorithm and illustrate its application in general sorting, disk sorting, and hypercube sorting. The algorithm (called the (l, m)-mergesort (LMM)) is an extension of the bitonic and odd-even mergesorts. Literature on parallel sorting is abundant. Many of the algorithms proposed, though being theoretically important, may not perform satisfactorily in practice owing to large constants in their time bounds. The algorithm to be presented in this paper has the potential of being practical. We present an application for the parallel disk sorting problem. The algorithm is asymptotically optimal (assuming that N is a polynomial in M, where N is the number of records to be sorted and M is the internal memory size). The underlying constant is very small. This algorithm performs better than the disk-striped mergesort (DSM) algorithm when the number of disks is large. Our implementation is as simple as that of DSM (requiring no fancy data structures or prefetch techniques). As a second application, we prove that we can get a sparse enumeration sort on the hypercube that is simpler than that of the classical algorithm of Nassimi and Sahni [16]. We also show that Leighton's columnsort algorithm is a special case of LMM.
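For orientation, the classical odd-even mergesort that the abstract names as a starting point can be sketched as below. This is the textbook Batcher odd-even merge, written sequentially and assuming input lengths that are powers of two. It is not the LMM algorithm from the paper, and the function names are illustrative.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length (Batcher's scheme)."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Recursively merge the even-indexed and the odd-indexed elements.
    evens = odd_even_merge(a[0::2], b[0::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    # Final step: compare-exchange odds[i] with evens[i + 1].
    out = [evens[0]]
    for i in range(len(odds) - 1):
        lo, hi = sorted((odds[i], evens[i + 1]))
        out += [lo, hi]
    out.append(odds[-1])
    return out


def odd_even_mergesort(xs):
    """Sort a list whose length is a power of two (assumption of this sketch)."""
    n = len(xs)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    if n <= 1:
        return list(xs)
    mid = n // 2
    return odd_even_merge(odd_even_mergesort(xs[:mid]),
                          odd_even_mergesort(xs[mid:]))
```

The compare-exchange pattern in the final step does not depend on the data, which is what makes this family of merges attractive for parallel and disk-based settings.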
Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks (Extended Abstract)
, 1997
Cited by 16 (7 self)
Thomas H. Cormen, Jake Wegmann, David M. Nicol. Dartmouth College, Department of Computer Science.
This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of "mini-butterfly" computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations req...
Modeling and Optimizing I/O Throughput of Multiple Disks on a Bus
 IN PROCEEDINGS OF ACM SIGMETRICS CONFERENCE
, 1999
Cited by 15 (6 self)
In modern I/O architectures, multiple disk drives are attached to each I/O controller. A study of the performance of such architectures under I/O-intensive workloads has revealed a performance impairment that results from a previously unknown form of convoy behavior in disk I/O. In this paper, we describe measurements of the read performance of multiple disks that share a SCSI bus under a heavy workload, and develop and validate formulas that accurately characterize the observed performance (to within 12% on several platforms for I/O sizes in the range 16-128 KB). Two terms in the formula clearly characterize the lost performance seen in our experiments. We describe techniques to deal with the performance impairment, via user-level workarounds that achieve greater overlap of bus transfers with disk seeks, and that increase the percentage of transfers that occur at the full bus bandwidth rather than at the lower bandwidth of a disk head. Experiments show bandwidth improvements of 10-20% when using these user-level techniques, but only in the case of large I/Os.
ViC*: A compiler for virtual-memory C*
 IN PROCEEDINGS OF THE THIRD INTERNATIONAL WORKSHOP ON HIGH-LEVEL PARALLEL PROGRAMMING MODELS AND SUPPORTIVE ENVIRONMENTS (HIPS ’98)
, 1998
Cited by 13 (3 self)
This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C* with support for out-of-core data. The compiler translates C* programs with shapes declared out-of-core, which describe parallel data stored on disk. The compiler output is an SPMD-style program in standard C with I/O and library calls added to efficiently access out-of-core parallel data. The ViC* compiler also applies several program transformations to improve out-of-core data layout and access.
PC-OPT: Optimal offline prefetching and caching for parallel I/O systems
 IEEE TRANSACTIONS ON COMPUTERS
, 2002
Cited by 13 (0 self)
We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for parallel disk scheduling. Traditional buffer management algorithms that minimize the number of block misses are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed simultaneously. We show that in the offline case, where a priori knowledge of all the requests is available, PC-OPT performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably offline optimal in the parallel disk model. In the online case, we study the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests. We show that the competitive ratio of PC-OPT, with global L-block lookahead, is Θ(M − L + D) when L ≤ M, and Θ(MD/L) when L > M, where the number of disks is D and buffer size is M.
A Simple and Efficient Parallel Disk Mergesort
, 2002
Cited by 11 (1 self)
External sorting—the process of sorting a file that is too large to fit into the computer’s internal memory and must be stored externally on disks—is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls “randomized striping”) as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM’s lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is
An efficient algorithm for out-of-core matrix transposition
 IEEE Transactions on Computers
, 2002