Results 1  10
of
33
Algorithms for Parallel Memory I: TwoLevel Memories
, 1992
"... We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our twolevel memory model is n ..."
Abstract

Cited by 236 (32 self)
 Add to MetaCart
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our twolevel memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a largescale uniprocessor system or parallel multiprocessor system with P disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into P separate physical devices. Our algorithms' performance can be significantly better than those obtained by the wellknown but nonopti...
The Uniform Memory Hierarchy Model of Computation
 Algorithmica
, 1992
"... The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performancerelevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use im ..."
Abstract

Cited by 116 (9 self)
 Add to MetaCart
The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performancerelevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved datamovement strategies. A sequential computer's memory is modelled as a sequence hM 0 ; M 1 ; :::i of increasingly large memory modules. Computation takes place in M 0 . Thus, M 0 might model a computer's central processor, while M 1 might be cache memory, M 2 main memory, and so on. For each module M U , a bus B U connects it with the next larger module M U+1 . All buses may be active simultaneously. Data is transferred along a bus in fixedsized blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parameterized by the rate at which the blocksizes i...
Simple Randomized Mergesort on Parallel Disks
 PARALLEL COMPUTING
, 1996
"... We consider the problem of sorting a file of N records on the Ddisk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or fr ..."
Abstract

Cited by 63 (11 self)
 Add to MetaCart
We consider the problem of sorting a file of N records on the Ddisk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecastandflush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting [Knu73], our algorithm is able to read in, at any time, the "right" block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the "right" blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamenta...
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
, 1994
"... This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bitmatrixmultiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an a ..."
Abstract

Cited by 61 (19 self)
 Add to MetaCart
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bitmatrixmultiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF (2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bitreversal permutations, vectorreversal permutations, hypercube permutations, matrix reblocking, Graycode permutations, and inverse Graycode permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bitpermute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linearalgebra techniques to factor the characteristic matrix...
Towards a theory of cacheefficient algorithms
 PROCEEDINGS OF THE SYMPOSIUM ON DISCRETE
, 2000
"... We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitterâ€™s I/O model, enables us to establish useful relationships betw ..."
Abstract

Cited by 47 (3 self)
 Add to MetaCart
We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitterâ€™s I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cacheefficient algorithms in the singlelevel cache model for fundamental problems like sorting, FFT, and an important subclass of permutations. We also analyze the averagecase cache behavior of mergesort, show that ignoring associativity concerns could lead to inferior performance, and present supporting experimental evidence. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic
Modeling Parallel Computers as Memory Hierarchies
 In Proc. Programming Models for Massively Parallel Computers
, 1993
"... A parameterized generic model that captures the features of diverse computer architectures would facilitate the development of portable programs. Specific models appropriate to particular computers are obtained by specifying parameters of the generic model. A generic model should be simple, and for ..."
Abstract

Cited by 45 (6 self)
 Add to MetaCart
A parameterized generic model that captures the features of diverse computer architectures would facilitate the development of portable programs. Specific models appropriate to particular computers are obtained by specifying parameters of the generic model. A generic model should be simple, and for each machine that it is intended to represent, it should have a reasonably accurate specific model. The Parallel Memory Hierarchy (PMH) model of computation uses a single mechanism to model the costs of both interprocessor communication and memory hierarchy traffic. A computer is modeled as a tree of memory modules with processors at the leaves. All data movement takes the form of block transfers between children and their parents. This paper assesses the strengths and weaknesses of the PMH model as a generic model. 1 Introduction The raw computing power of multiprocessor computers is exploding. The challenge is to create software that can take advantage of this computing power. The diversit...
Efficient ExternalMemory Data Structures and Applications
, 1996
"... In this thesis we study the Input/Output (I/O) complexity of largescale problems arising e.g. in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/Oefficient algorithms for them. A general theme in our work is to design I/Oeffic ..."
Abstract

Cited by 38 (12 self)
 Add to MetaCart
In this thesis we study the Input/Output (I/O) complexity of largescale problems arising e.g. in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/Oefficient algorithms for them. A general theme in our work is to design I/Oefficient algorithms through the design of I/Oefficient data structures. One of our philosophies is to try to isolate all the I/O specific parts of an algorithm in the data structures, that is, to try to design I/O algorithms from internal memory algorithms by exchanging the data structures used in internal memory with their external memory counterparts. The results in the thesis include a technique for transforming an internal memory tree data structure into an external data structure which can be used in a batched dynamic setting, that is, a setting where we for example do not require that the result of a search operation is returned immediately. Using this technique we develop batched dynamic external versions of the (onedimensional) rangetree and the segmenttree and we develop an external priority queue. Following our general philosophy we show how these structures can be used in standard internal memory sorting algorithms
Cacheoblivious algorithms and data structures
 IN LECTURE NOTES FROM THE EEF SUMMER SCHOOL ON MASSIVE DATA SETS
, 2002
"... A recent direction in the design of cacheefficient and diskefficient algorithms and data structures is the notion of cache obliviousness, introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999. Cacheoblivious algorithms perform well on a multilevel memory hierarchy without knowing any pa ..."
Abstract

Cited by 36 (3 self)
 Add to MetaCart
A recent direction in the design of cacheefficient and diskefficient algorithms and data structures is the notion of cache obliviousness, introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999. Cacheoblivious algorithms perform well on a multilevel memory hierarchy without knowing any parameters of the hierarchy, only knowing the existence of a hierarchy. Equivalently, a single cacheoblivious algorithm is efficient on all memory hierarchies simultaneously. While such results might seem impossible, a recent body of work has developed cacheoblivious algorithms and data structures that perform as well or nearly as well as standard externalmemory structures which require knowledge of the cache/memory size and block transfer size. Here we describe several of these results with the intent of elucidating the techniques behind their design. Perhaps the most exciting of these results are the data structures, which form general building blocks immediately
Disk Resident Arrays: An ArrayOriented I/O Library for OutofCore Computations
, 1996
"... In outofcore computations, disk storage is treated as another level in the memory hierarchy, below cache, local memory, and (in a parallel computer) remote memories. However, the tools used to manage this storage are typically quite different from those used to manage access to local and remote me ..."
Abstract

Cited by 31 (10 self)
 Add to MetaCart
In outofcore computations, disk storage is treated as another level in the memory hierarchy, below cache, local memory, and (in a parallel computer) remote memories. However, the tools used to manage this storage are typically quite different from those used to manage access to local and remote memory. This disparity complicates implementation of outofcore algorithms and hinders portability. We describe a programming model that addresses this problem. This model allows parallel programs to use essentially the same mechanisms to manage the movement of data between any two adjacent levels in a hierarchical memory system. We take as our starting point the Global Arrays sharedmemory model and library, which support a variety of operations on distributed arrays, including transfer between local and remote memories. We show how this model can be extended to support explicit transfer between global memory and secondary storage, and we define a Disk Resident Arrays library that supports s...
ExternalMemory Algorithms with Applications in Geographic Information Systems
 Algorithmic Foundations of GIS
, 1997
"... In the design of algorithms for largescale applications it is essential to consider the problem of minimizing Input/Output (I/O) communication. Geographical information systems (GIS) are good examples of such largescale applications as they frequently handle huge amounts of spatial data. In this n ..."
Abstract

Cited by 27 (9 self)
 Add to MetaCart
In the design of algorithms for largescale applications it is essential to consider the problem of minimizing Input/Output (I/O) communication. Geographical information systems (GIS) are good examples of such largescale applications as they frequently handle huge amounts of spatial data. In this note we survey the recent developments in externalmemory algorithms with applications in GIS. First we discuss the AggarwalVitter I/Omodel and illustrate why normal internalmemory algorithms for even very simple problems can perform terribly in an I/Oenvironment. Then we describe the fundamental paradigms for designing I/Oefficient algorithms by using them to design efficient sorting algorithms. We then go on and survey externalmemory algorithms for computational geometry problems  with special emphasis on problems with applications in GIS  and techniques for designing such algorithms: Using the orthogonal line segment intersection problem we illustrate the distributionsweeping and ...