Early experiences in evaluating the Parallel Disk Model with the ViC* implementation (1996)

by Thomas H. Cormen, Melissa Hirschl

Results 1 - 10 of 12

The Galley parallel file system

by Nils Nieuwejaar, David Kotz - Parallel Computing , 1996
Cited by 127 (8 self)
Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.

Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems

by Thomas H. Cormen, Thomas Sundquist, Leonard F. Wisniewski , 1994
Cited by 59 (19 self)
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix...
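To make the definition concrete, the Python sketch below applies a BMMC mapping y = Ax XOR c to n-bit indices and checks that the characteristic matrix A is nonsingular over GF(2), which is what makes the affine map a permutation. The row-bitmask representation and the function names are illustrative assumptions, not the paper's; the sketch shows only the mapping, not the paper's I/O-optimal algorithm or its matrix factorization.

```python
def bmmc_target(x: int, rows: list[int], c: int) -> int:
    """Map an n-bit source index x to the target index A*x XOR c over GF(2).

    rows[i] is row i of the characteristic matrix A as a bitmask (bit j holds A[i][j]).
    """
    y = 0
    for i, row in enumerate(rows):
        # Target bit i is the GF(2) inner product (parity) of row i with x.
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y ^ c  # the complement vector c flips selected target bits


def bit_reversal_rows(n: int) -> list[int]:
    """Antidiagonal matrix: target bit i is source bit n-1-i."""
    return [1 << (n - 1 - i) for i in range(n)]


def identity_rows(n: int) -> list[int]:
    """Identity matrix; with c = 2**n - 1 this gives vector reversal x -> N-1-x."""
    return [1 << i for i in range(n)]


def is_nonsingular(rows: list[int], n: int) -> bool:
    """Gaussian elimination over GF(2); the map is a permutation iff A is nonsingular."""
    rows = list(rows)
    rank = 0
    for col in range(n):
        pivot = next((r for r in range(rank, n) if (rows[r] >> col) & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(n):
            if r != rank and (rows[r] >> col) & 1:
                rows[r] ^= rows[rank]  # row addition over GF(2) is XOR
        rank += 1
    return rank == n


if __name__ == "__main__":
    print([bmmc_target(x, bit_reversal_rows(3), 0) for x in range(8)])
    # -> [0, 4, 2, 6, 1, 5, 3, 7]
```

Bit reversal uses an antidiagonal A with c = 0, while vector reversal uses the identity with c set to all ones; both are special cases of the same affine form.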

Performing out-of-core FFTs on parallel disk systems

by Thomas H. Cormen, David M. Nicol - PARALLEL COMPUTING , 1998
Cited by 17 (7 self)
The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive with, or better than, those for in-core methods even when they run entirely in memory.
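For orientation, here is a minimal in-memory Python sketch of the standard radix-2 Cooley-Tukey FFT that out-of-core methods must reorganize: a bit-reversal permutation followed by lg N levels of butterfly operations. It is not the out-of-core PDM algorithm analyzed in the paper; the function names are illustrative, and a naive DFT is included only to check the result.

```python
import cmath


def bit_reverse_permute(a: list[complex]) -> list[complex]:
    """Copy a into bit-reversed index order (len(a) must be a power of 2)."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, v in enumerate(a):
        rev = 0
        for b in range(bits):
            if (i >> b) & 1:
                rev |= 1 << (bits - 1 - b)
        out[rev] = v
    return out


def fft(a: list[complex]) -> list[complex]:
    """Iterative radix-2 decimation-in-time FFT held entirely in memory."""
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of 2")
    x = bit_reverse_permute(a)
    size = 2
    while size <= n:  # one butterfly level per iteration: lg n levels in all
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = x[start + k]
                t = w * x[start + k + half]
                x[start + k] = u + t          # butterfly: sum output
                x[start + k + half] = u - t   # butterfly: difference output
                w *= w_step
        size *= 2
    return x


def naive_dft(a: list[complex]) -> list[complex]:
    """O(n^2) DFT used only to check fft()."""
    n = len(a)
    return [sum(a[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]
```

Each butterfly level touches every element once, so a naive out-of-core version that pages data in per level would read the whole array lg N times; that per-level pass structure is what makes the I/O organization matter.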

Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks (Extended Abstract)

by Thomas H. Cormen, Jake Wegmann, David M. Nicol , 1997
Cited by 15 (7 self)
This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of the "mini-butterfly" computed in memory and in how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations req...

Modeling and Optimizing I/O Throughput of Multiple Disks on a Bus

by Rakesh Barve, Elizabeth Shriver, Phillip B. Gibbons, Bruce K. Hillyer, Yossi Matias, Jeffrey Scott Vitter - IN PROCEEDINGS OF ACM SIGMETRICS CONFERENCE , 1999
Cited by 14 (6 self)
In modern I/O architectures, multiple disk drives are attached to each I/O controller. A study of the performance of such architectures under I/O-intensive workloads has revealed a performance impairment that results from a previously unknown form of convoy behavior in disk I/O. In this paper, we describe measurements of the read performance of multiple disks that share a SCSI bus under a heavy workload, and develop and validate formulas that accurately characterize the observed performance (to within 12% on several platforms for I/O sizes in the range 16-128 KB). Two terms in the formula clearly characterize the lost performance seen in our experiments. We describe techniques to deal with the performance impairment, via user-level workarounds that achieve greater overlap of bus transfers with disk seeks, and that increase the percentage of transfers that occur at the full bus bandwidth rather than at the lower bandwidth of a disk head. Experiments show bandwidth improvements of 10-20% when using these user-level techniques, but only in the case of large I/Os.

ViC*: A compiler for virtual-memory C

by Alex Colvin, Thomas H. Cormen - In Proceedings of the Third International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS ’98), 1998
Cited by 13 (3 self)
This paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C* with support for out-of-core data. The compiler translates C* programs with shapes declared outofcore, which describe parallel data stored on disk. The compiler output is an SPMD-style program in standard C with I/O and library calls added to efficiently access out-of-core parallel data. The ViC* compiler also applies several program transformations to improve out-of-core data layout and access.
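The abstract does not show ViC*'s generated code or its I/O library. As a rough illustration only, the Python sketch below shows the general strip-mining pattern that compiling for out-of-core data relies on: an elementwise operation on an array that lives on disk is executed one memory-sized chunk at a time. The file layout (a flat file of doubles) and the names out_of_core_map and demo are hypothetical, and real ViC* output is SPMD C, not Python.

```python
import array
import os
import tempfile

ITEM = array.array("d").itemsize  # bytes per stored double


def out_of_core_map(path: str, n: int, chunk: int, fn) -> int:
    """Apply fn to each of n doubles in the file at path, chunk elements at a time.

    Only one chunk is in memory at once; returns the number of chunks processed.
    """
    chunks = 0
    with open(path, "r+b") as f:
        for start in range(0, n, chunk):
            count = min(chunk, n - start)
            f.seek(start * ITEM)
            buf = array.array("d")
            buf.frombytes(f.read(count * ITEM))   # read one memory-sized chunk
            for i in range(count):                # in-core elementwise work
                buf[i] = fn(buf[i])
            f.seek(start * ITEM)
            f.write(buf.tobytes())                # write the chunk back in place
            chunks += 1
    return chunks


def demo(n: int = 10, chunk: int = 4) -> tuple[int, list[float]]:
    """Store 0..n-1 on disk, compute 2*v + 1 out of core, and read the result back."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            array.array("d", [float(i) for i in range(n)]).tofile(f)
        chunks = out_of_core_map(path, n, chunk, lambda v: 2.0 * v + 1.0)
        result = array.array("d")
        with open(path, "rb") as f:
            result.fromfile(f, n)
        return chunks, list(result)
    finally:
        os.remove(path)
```

With 10 elements and a 4-element chunk, demo() processes 3 chunks; the program logic (2v + 1) is unchanged, and only the loop structure and the explicit reads and writes are added, which is the kind of rewriting a compiler can automate.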

An Efficient Algorithm for Out-of-Core Matrix Transposition

by Jinwoo Suh , Viktor K. Prasanna
Cited by 8 (0 self)
Efficient transposition of out-of-core matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, in the state-of-the-art architectures, memory-memory data transfer time and index computation time are also significant components of the overall time. In this paper, we propose an algorithm that considers the index computation time and the I/O time and reduces the overall execution time. Our algorithm reduces the total execution time by reducing the number of I/O operations and eliminating the index computation. In doing so, two techniques are employed: writing the data onto disk in predefined patterns and balancing the number of disk read and write operations. The index computation time, which is an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into read and write buffers. The expensive in-processor permutation is replaced by data collection from the read buffer to the write buffer. Even though this partitioning may increase the number of I/O operations for some cases, it results in an overall reduction in the execution time due to the elimination of the expensive index computation. Our algorithm is analyzed using the well-known Linear Model and the Parallel Disk Model. The experimental results on Sun Enterprise, SGI R12000, and Pentium III show that our algorithm reduces the ...
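For contrast with the approach described, a generic blocked (tiled) out-of-core transposition can be sketched in Python as follows. The "disk" is simulated by a flat list, the tile size is a hypothetical parameter, and transfers counts contiguous segments moved as a crude stand-in for I/O operations. This is not Suh and Prasanna's algorithm: it does no read/write buffer partitioning and still computes indices explicitly, which is exactly the overhead their abstract targets.

```python
def tiled_transpose(disk: list[float], rows: int, cols: int,
                    tile: int) -> tuple[list[float], int]:
    """Transpose a row-major rows x cols matrix stored in `disk`, tile by tile.

    Only one tile x tile buffer is held "in memory" at a time. Returns the
    transposed matrix (cols x rows, row-major) and the number of contiguous
    segment transfers, a rough proxy for I/O operations.
    """
    out = [0.0] * (rows * cols)
    transfers = 0
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            h = min(tile, rows - r0)
            w = min(tile, cols - c0)
            # "Read" the tile: h contiguous row segments of w elements each.
            buf = []
            for r in range(r0, r0 + h):
                start = r * cols + c0   # explicit index computation
                buf.append(disk[start:start + w])
                transfers += 1
            # "Write" the transposed tile: w contiguous segments of h elements.
            for c in range(w):
                start = (c0 + c) * rows + r0
                out[start:start + h] = [buf[r][c] for r in range(h)]
                transfers += 1
    return out, transfers
```

For a 3 x 5 matrix with 2 x 2 tiles this performs 19 segment transfers; a single 8 x 8 tile covering the whole matrix needs only 3 reads and 5 writes, illustrating why larger buffers reduce I/O counts.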

PC-OPT: Optimal Offline Prefetching and Caching for Parallel I/O Systems

by Mahesh Kallahalla, Peter J. Varman - IEEE TRANSACTIONS ON COMPUTERS , 2002
Cited by 8 (0 self)
We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for parallel disk scheduling. Traditional buffer management algorithms that minimize the number of block misses are substantially suboptimal in a parallel I/O system where multiple I/Os can proceed simultaneously. We show that in the offline case, where a priori knowledge of all the requests is available, PC-OPT performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably offline optimal in the parallel disk model. In the online case, we study the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests. We show that the competitive ratio of PC-OPT, with global L-block lookahead, is $\Theta(M - L + D)$ when $L \le M$, and $\Theta(MD/L)$ when $L > M$, where the number of disks is D and the buffer size is M.

A Simple and Efficient Parallel Disk Mergesort

by R. D. Barve, J. S. Vitter , 2002
Cited by 7 (0 self)
External sorting—the process of sorting a file that is too large to fit into the computer’s internal memory and must be stored externally on disks—is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls “randomized striping”) as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM’s lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is
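The block placement behind SRM's "randomized striping" can be sketched as follows: the blocks of each sorted run are striped cyclically across the D disks, starting at a randomly chosen disk. This minimal Python sketch covers only that placement, not the paper's data structures, its lookahead forecasting, or its forecast-and-flush buffer management; the function names are hypothetical.

```python
import random


def srm_layout(run_lengths: list[int], num_disks: int,
               rng: random.Random) -> list[list[tuple[int, int]]]:
    """Place the blocks of each sorted run on disks by randomized striping.

    Returns, per disk, the (run, block) pairs it stores. Each run starts on a
    uniformly random disk and its later blocks follow cyclically.
    """
    disks: list[list[tuple[int, int]]] = [[] for _ in range(num_disks)]
    for run, length in enumerate(run_lengths):
        first = rng.randrange(num_disks)                     # random starting disk
        for b in range(length):
            disks[(first + b) % num_disks].append((run, b))  # cyclic striping
    return disks


def disk_of(disks: list[list[tuple[int, int]]]) -> dict[tuple[int, int], int]:
    """Invert the layout: map each (run, block) to its disk index."""
    return {blk: d for d, blocks in enumerate(disks) for blk in blocks}
```

Because any D consecutive blocks of one run sit on D different disks, they can be fetched in one parallel I/O; the random starting disks are what spread the simultaneous demands of different runs across the disks during merging.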

Parallel Algorithms in External Memory

by David Alexander Hutchinson , 2000
Cited by 4 (1 self)
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. The Parallel Disk Model (PDM) of Vitter and Shriver is widely used to discriminate between external memory algorithms on the basis of input/output (I/O) complexity. Parallel algorithms are designed to efficiently utilize the computing power of multiple processing units, interconnected by a communication mechanism. A popular model for developing and analyzing parallel algorithms is the Bulk Synchronous Parallel (BSP) model due to Valiant. In this work we develop simulation techniques, both randomized and deterministic, which produce efficient EM algorithms from efficient algorithms developed under BSP-like parallel computing models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. ...
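In the PDM, N is the problem size, M the internal memory size, B the block size (all in records), and D the number of disks, with one parallel I/O moving up to D blocks. As an illustration of how the model ranks algorithms by I/O complexity, the Python sketch below evaluates the well-known Vitter-Shriver bounds, Θ(N/(DB)) for scanning and Θ((N/(DB)) log_{M/B}(N/B)) for sorting, with every constant factor set to 1 and the logarithm rounded up to a whole number of merge passes. The function names and these simplifications are assumptions for illustration only.

```python
def scan_ios(n: int, b: int, d: int) -> int:
    """Parallel I/Os to read n records once, with D blocks of B records per I/O."""
    return -(-n // (d * b))  # ceil(N / (D*B)) in integer arithmetic


def merge_passes(n: int, m: int, b: int) -> int:
    """Integer proxy for ceil(log_{M/B}(N/B)): smallest k >= 1 with fan**k >= N/B."""
    fan = m // b
    if fan < 2:
        raise ValueError("need M >= 2B")
    blocks = -(-n // b)  # ceil(N / B)
    k, reach = 1, fan
    while reach < blocks:
        reach *= fan
        k += 1
    return k


def sort_ios_estimate(n: int, m: int, b: int, d: int) -> int:
    """Shape of the PDM sorting bound with all constant factors set to 1."""
    return merge_passes(n, m, b) * scan_ios(n, b, d)
```

For N = 2^30, M = 2^20, B = 2^10, and D = 8, this gives 2 passes and 2^18 parallel I/Os, twice the 2^17 needed to scan the data once.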