Results 11 - 20
of
120
Models of Parallel Computation: A Survey and Synthesis
- INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES
, 1995
"... In the realm of sequential computing the random access machine has successufully provided an underlying model of computation that promoted consistency and coordination among algorithm developers, computer architects and language experts. In the realm of parallel computing, however, there has been no ..."
Abstract
-
Cited by 48 (0 self)
- Add to MetaCart
In the realm of sequential computing the random access machine has successufully provided an underlying model of computation that promoted consistency and coordination among algorithm developers, computer architects and language experts. In the realm of parallel computing, however, there has been no similar success. The need for such a unifying parallel model or set of models is heightened by the greater demand for performance and the greater diversity among machines. Yet the modeling of parallel computing still seems to be mired in controversy and chaos. This paper is an excerpt from a study which presents broad range of models of parallel computation and the different roles they serve in algorithm, language and machine design. The objective is to better understand which model characteristics are important to each design community in order to elucidate the requirements of a unifying paradigm. As an impetus for discussion, we conclude by suggesting a model of parallel computation which...
Fast Parallel Sorting Under LogP: Experience with the CM-5
- IEEE Transactions on Parallel and Distributed Systems
, 1996
"... In this paper, the LogP model is used to analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort). LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of pr ..."
Abstract
-
Cited by 45 (10 self)
- Add to MetaCart
In this paper, the LogP model is used to analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort). LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes and input sets. The sensitivity of the algorithms is evaluated by varying the distribution of key values and the rank ordering of the input. The LogP model is shown to be a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. Using an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes for all four algorithms. Communication performance is oblivious to the distribution of the keys values, whereas the local sort performance is not. The communication phases in radix and sample sort are sensitive to the ordering of keys, because certain layouts result in contention. 1
Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems
- ACM TRANSACTIONS ON COMPUTER SYSTEMS
, 1998
"... In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing natural ..."
Abstract
-
Cited by 44 (2 self)
- Add to MetaCart
In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing naturally-occurring local events and their corresponding implicit information, i.e., information available outside of a defined interface. Many systems, particularly in distributed and networked environments, have leveraged implicit control to simplify the implementation of services with autonomous components. To concretely demonstrate the advantages of implicit control, we propose and implement implicit coscheduling, an algorithm for dynamically coordinating the time...
Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?
, 1999
"... There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style fo ..."
Abstract
-
Cited by 41 (11 self)
- Add to MetaCart
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism as a convenient style, and the shared-memory abstraction as an easyto-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations. In this paper we consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation. In particular, can a shared-memory model be as effective as, say, the BSP? As a candidate for a bridging model, we introduce the Queuing Shared-Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP, and on a related model, the (d, x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios
Using Network Interface Support to Avoid Asynchronous Protocol Processing in Shared Virtual Memory Systems
- In Proceedings of the 26th International Symposium on Computer Architecture
, 1999
"... The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardwarecoherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper sho ..."
Abstract
-
Cited by 40 (7 self)
- Add to MetaCart
The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardwarecoherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechan...
Radix Sort For Vector Multiprocessors
- In Proceedings Supercomputing '91
, 1991
"... We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve a ..."
Abstract
-
Cited by 39 (6 self)
- Add to MetaCart
We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second. Our sorting algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2. To develop our version we introduce three general techniques for mapping data-parallel algorithms ontovector multiprocessors. These techniques allow us to fully vectorize and parallelize the algorithm. The paper also derives equations that model the performance of our algorithm on the Y-MP. These equations are then used t...
Designing Efficient Sorting Algorithms for Manycore GPUs
, 2009
"... We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix ..."
Abstract
-
Cited by 34 (2 self)
- Add to MetaCart
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23 % faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed onchip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be wellsuited for other manycore processors.
A Quantitative Comparison of Parallel Computation Models
, 1996
"... This paper experimentally validates performance related issues for parallel computation models on several parallel platforms (a MasPar MP-1 with 1024 processors, a 64-node GCel and a CM-5 of 64 processors). Our work consists of three parts. First, there is an evaluation part in which we investigate ..."
Abstract
-
Cited by 33 (5 self)
- Add to MetaCart
This paper experimentally validates performance related issues for parallel computation models on several parallel platforms (a MasPar MP-1 with 1024 processors, a 64-node GCel and a CM-5 of 64 processors). Our work consists of three parts. First, there is an evaluation part in which we investigate whether the models correctly predict the execution time of an algorithm implementation. Unlike previous work, which mostly demonstrated a close match between the measured and predicted running times, this paper shows that there are situations in which the models do not precisely predict the actual execution time of an algorithm implementation. Second, there is a comparison part in which the models are contrasted with each other in order to determine which model induces the fastest algorithms. Finally, there is an efficiency validation part in which the performance of the model derived algorithms are compared with the performance of highly optimized library routines to show the effectiveness ...
Accounting for memory bank contention and delay in high-bandwidth multiprocessors
- In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1997
"... Abstract—For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object ..."
Abstract
-
Cited by 29 (5 self)
- Add to MetaCart
Abstract—For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object of this paper is to provide a simple model (with only a few parameters) for the design and analysis of irregular parallel algorithms that will give a reasonable characterization of performance on such machines. For this purpose, we extend Valiant’s bulk-synchronous parallel (BSP) model with two parameters: a parameter for memory bank delay, the minimum time for servicing requests at a bank, and a parameter for memory bank expansion, the ratio of the number of banks to the number of processors. We call this model the (d, x)-BSP. We show experimentally that the (d, x)-BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machine-specific details of these machines. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)-BSP as a bridging model for emulating a very high-level abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)-BSP.
Type systems for distributed data structures
- In the 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL
, 2000
"... Abstract Distributed-memory programs are often written using a global address space: any process can name any memory location on any processor. Some languages completely hide the distinction between local and remote memory, simplifying the programming model at some performance cost. Other languages ..."
Abstract
-
Cited by 29 (6 self)
- Add to MetaCart
Abstract Distributed-memory programs are often written using a global address space: any process can name any memory location on any processor. Some languages completely hide the distinction between local and remote memory, simplifying the programming model at some performance cost. Other languages give the programmer more explicit control, offering better potential performance but sacrificing both soundness and ease of use. Through a series of progressively richer type systems, we formalize the complex issues surrounding sound computation with explicitly distributed data structures. We then illustrate how type inference can subsume much of this complexity, letting programmers work at whatever level of detail is needed. Experiments conducted with the Titanium programming language show that this can result in easier development and significant performance improvements over manual optimization of local and global memory. 1 Introduction While there have been many efforts to design distributed, parallel programming languages, none has been completely satisfactory. Many approaches present the illusion of a single shared, global address space. While easy for programmers to understand, this approach hides the real structure of memory, making it difficult to exploit locality of data. In complex applications where local memory accesses may be orders of magnitude faster than remote accesses, this can seriously harm performance, development time, or both.

