Results 1 - 10 of 22
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
, 1981
Cited by 84 (2 self)
In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
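The semantics of the primitive can be illustrated in software. The following is a minimal sketch, assuming that replace-add atomically adds an increment to a shared variable and returns the updated value; the `SharedCounter` class and the `run_workers` helper are hypothetical names, and the lock is only a software stand-in for the hardware atomicity the abstract describes, not the paper's implementation.

```python
import threading


class SharedCounter:
    """Hypothetical shared variable supporting a replace-add style operation."""

    def __init__(self, value=0):
        self._value = value
        # Assumption: a lock stands in for the atomicity that the paper
        # obtains in hardware; it serializes the concurrent updates.
        self._lock = threading.Lock()

    def replace_add(self, delta):
        """Atomically add delta and return the updated value (assumed semantics)."""
        with self._lock:
            self._value += delta
            return self._value


def run_workers(num_workers=8):
    """Each worker issues one replace-add of 1 and records the value it got back."""
    counter = SharedCounter(0)
    tickets = [None] * num_workers

    def worker(index):
        tickets[index] = counter.replace_add(1)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Every worker receives a distinct value, whatever the interleaving.
    return sorted(tickets)
```

Because each caller gets a distinct result, one shared variable can hand out unique tickets or queue slots without a test-and-set retry loop, which is the kind of coordination the abstract refers to.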
Scan Primitives for GPU Computing
- GRAPHICS HARDWARE 2007
, 2007
Cited by 70 (4 self)
The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
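As a point of reference for what a segmented scan computes, here is a minimal sequential sketch, assuming an inclusive sum scan in which a head flag of 1 marks the first element of each segment. The function name and the flag convention are assumptions made for illustration; this is not the paper's parallel CUDA formulation.

```python
def segmented_inclusive_scan(values, head_flags):
    """Sequential reference for an inclusive segmented sum scan.

    head_flags[i] is truthy when values[i] starts a new segment
    (an assumed convention, not taken from the paper).
    """
    if len(values) != len(head_flags):
        raise ValueError("values and head_flags must have the same length")
    result = []
    running = 0
    for value, is_head in zip(values, head_flags):
        # Restart the running sum at each segment head; otherwise accumulate.
        running = value if is_head else running + value
        result.append(running)
    return result


# Three segments: [1, 2, 3], [4, 5], [6]
print(segmented_inclusive_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1]))
# prints [1, 3, 6, 4, 9, 6]
```

A data-parallel implementation would produce the same output; the sequential loop is only useful as a correctness oracle.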
Parallel Algorithms with Processor Failures and Delays
, 1995
Cited by 40 (8 self)
We study efficient deterministic parallel algorithms on two models: restartable fail-stop CRCW PRAMs and asynchronous PRAMs. In the first model, synchronous processors are subject to arbitrary stop failures and restarts determined by an on-line adversary and involving loss of private but not shared memory; the complexity measures are completed work (where processors are charged for completed fixed-size update cycles) and overhead ratio (completed work amortized over necessary work and failures). In the second model, the result of the computation is a serialization of the actions of the processors determined by an on-line adversary; the complexity measure is total work (number of steps taken by all processors). Despite their differences the two models share key algorithmic techniques. We present new algorithms for the Write-All problem (in which P processors write ones into an array of size N) for the two models. These algorithms can be used to implement a simulation strategy for any N ...
A comparison of shared and nonshared memory models of parallel computation
- Proceedings of the IEEE
, 1991
Cited by 23 (4 self)
Four algorithms are analyzed in the shared and nonshared (distributed) memory models of parallel computation. The analysis shows that the shared memory model predicts optimality for algorithms and programming styles that cannot be realized on any physical parallel computers. Programs based on these techniques are inferior to programs written in the nonshared memory model. The “unit” cost charged for a reference to shared memory is argued to be the source of the shared memory model’s inaccuracy. The implications of these observations are discussed.
Interconnection networks using shuffles
- IEEE Computers
, 1981
Cited by 9 (0 self)
techniques allow several processors within a multiprocessing
Evolving Information Processing Organizations
- Tournament Selection and the Effects of Noise”, Complex Systems 9
, 1995
Cited by 8 (0 self)
The organization of information processing resources is a central question in economic, organizational, and computational theory. Recent work by Radner (1992) and others has developed a simple theoretical framework and some useful formal mathematical results about the behavior of such systems. Here, we follow a complementary computational approach that allows us to pursue questions concerning the impact of coordination and various exogenous conditions facing the organization. We find that organizations demonstrate "order for free," that is, given a simple structural framework and a set of standard operating procedures, even randomly generated organizations imply well-defined patterns of behavior. Using a genetic algorithm, we also show that simple evolutionary processes allow organizations to "learn" better structures. I am grateful to Jody Lutz and, especially, Hollis Schuler for research assistance, and to Dean Behrens, Kathleen Carley, and Robyn Dawes for useful comments. I would ...
Communicators: Object-Based Multiparty Interactions for Parallel Programming
, 1991
Cited by 5 (1 self)
Contemporary parallel programming languages often provide only few low-level primitives for pairwise communication and synchronization. These primitives are not always suitable for the interactions being programmed. Programming would be easier if it was possible to tailor communication and synchronization mechanisms to fit the needs of the application, much as abstract data types are used to create application-specific data structures and operations. This should also include the possibility of expressing interactions among multiple processes at once. Communicators support this paradigm by creating abstract communication objects that provide a framework for interprocess multiparty interactions. The behavior of these objects is defined in terms of interactions, in which multiple processes can enrole. Interactions are performed when all the roles are filled by ready processes. Nondeterminism is used when the order of interaction performance is immaterial. Interactions can also ...
Cycles in Networks
, 1993
Cited by 5 (0 self)
We study the presence of cycles and long paths in graphs that have been proposed as interconnection networks for parallel architectures. The study surveys and complements known results. 1 Introduction This paper is devoted to studying embeddings of the simplest possible guest graphs, the path P_N and the cycle C_N, in graphs that have been proposed as interconnection networks for parallel architectures. In addition to their intrinsic interest, in terms of the development of algorithms on parallel architectures, these two guest graphs are important because of the fact that many structurally richer graphs can be constructed from paths and cycles by various product constructions. A few of the results we present are original; several appear in the literature and are duly cited; many belong to the folklore of the field. Indeed this paper is motivated by a desire to find a single repository for this important, yet scattered material. Before proceeding further, we define formally the ...
Strategic Directions in Computer Architecture
- ACM Computing Surveys
, 1996
Cited by 4 (0 self)
Looking back on the last 30 years, we have seen the remarkable developments in semiconductor technology enabling the implementation of ideas that were previously
Empirical Study Design in the Area of High Performance Computing (HPC)
- IN PROCEEDINGS OF INTERNATIONAL SYMPOSIUM ON EMPIRICAL SOFTWARE ENGINEERING. NOOSA HEADS
Cited by 4 (4 self)
The development of High-Performance Computing (HPC) programs is crucial to progress in many fields of scientific endeavor. We have run initial studies of the productivity of HPC developers and of techniques for improving that productivity, which have not previously been the subject of significant study. Because of key differences between development for HPC and for more conventional software engineering applications, this work has required the tailoring of experimental designs and protocols. A major contribution of our work is to begin to quantify the code development process in a specialized area that has previously not been extensively studied. Specifically, we present an analysis of the domain of High-Performance Computing for the aspects that would impact experimental design; show how those aspects are reflected in experimental design for this specific area; and demonstrate how we are using such experimental designs to build up a body of knowledge specific to the domain. Results to date build confidence in our approach by showing that there are no significant differences across studies comparing subjects with similar experience tackling similar problems, while there are significant differences in performance and effort among the different parallel models applied.

