Shasta: A Low Overhead, Software-Only Approach . . . .
- In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996
Cited by 207 (5 self)
This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient cache co...
The Nexus Approach to Integrating Multithreading and Communication
- Journal of Parallel and Distributed Computing, 1996
Cited by 205 (35 self)
Lightweight threads have an important role to play in parallel systems: they can be used to exploit shared-memory parallelism, to mask communication and I/O latencies, to implement remote memory access, and to support task-parallel and irregular applications. In this paper, we address the question of how to integrate threads and communication in high-performance distributed-memory systems. We propose an approach based on global pointer and remote service request mechanisms, and explain how these mechanisms support dynamic communication structures, asynchronous messaging, dynamic thread creation and destruction, and a global memory model via interprocessor references. We also explain how these mechanisms can be implemented in various environments. Our global pointer and remote service request mechanisms have been incorporated in a runtime system called Nexus that is used as a compiler target for parallel languages and as a substrate for higher-level communication libraries. We report th...
Hiding Communication Latency and Coherence Overhead in Software DSMs
, 1996
Cited by 43 (6 self)
In this paper we propose the use of a PCI-based programmable protocol controller for hiding communication and coherence overheads in software DSMs. Our protocol controller provides three different types of overhead tolerance: a) moving basic communication and coherence tasks away from computation processors; b) prefetching of diffs; and c) generating and applying diffs with hardware assistance. We evaluate the isolated and combined impact of these features on the performance of TreadMarks. We also compare performance against two versions of the Shrimp-based AURC protocol. Using detailed execution-driven simulations of a 16-node network of workstations, we show that the greatest performance benefits provided by our protocol controller come from our hardware-supported diffs. Reducing the burden of communication and coherence transactions on the computation processor is also beneficial but to a smaller extent. Prefetching is not always profitable. Our results show that our protocol contr...
The Nexus Task-parallel Runtime System
- In Proc. 1st Intl. Workshop on Parallel Processing, 1994
Cited by 43 (5 self)
A runtime system provides a parallel language compiler with an interface to the low-level facilities required to support interaction between concurrently executing program components. Nexus is a portable runtime system for task-parallel programming languages. Distinguishing features of Nexus include its support for multiple threads of control, dynamic processor acquisition, dynamic address space creation, a global memory model via interprocessor references, and asynchronous events. In addition, it supports heterogeneity at multiple levels, allowing a single computation to utilize different programming languages, executables, processors, and network protocols. Nexus is currently being used as a compiler target for two task-parallel languages: Fortran M and Compositional C++. In this paper, we present the Nexus design, outline techniques used to implement Nexus on parallel computers, show how it is used in compilers, and compare its performance with that of another runtime system.
Implementation of an efficient parallel BDD package
- In DAC, 1996
Cited by 39 (0 self)
Large BDD applications push computing resources to their limits. One solution to overcoming resource limitations is to distribute the BDD data structure across multiple networked workstations. This paper presents an efficient parallel BDD package for a distributed environment such as a network of workstations (NOW) or a distributed memory parallel computer. The implementation exploits a number of different forms of parallelism that can be found in depth-first algorithms. Significant effort is made to limit the communication overhead, including a two-level distributed hash table and an uncomputed cache. The package simultaneously executes multiple threads of computation on a distributed BDD.
Nexus: Runtime Support for Task-Parallel Programming Languages
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, 1994
Cited by 31 (3 self)
A runtime system provides a parallel language compiler with an interface to the low-level facilities required to support interaction between concurrently executing program components. Nexus is a portable runtime system for task-parallel programming languages. Distinguishing features of Nexus include its support for multiple threads of control, dynamic processor acquisition, dynamic address space creation, a global memory model via interprocessor references, and asynchronous events. In addition, it supports heterogeneity at multiple levels, allowing a single computation to utilize different programming languages, executables, processors, and network protocols. Nexus is currently being used as a compiler target for two task-parallel languages: Fortran M and Compositional C++. In this paper, we present the Nexus design, outline techniques used to implement Nexus on parallel computers, show how it is used in compilers, and compare its performance with that of another runtime system...
Code transformations to improve memory parallelism
- In Proceedings of the 32nd Annual International Symposium on Microarchitecture, 1999
Cited by 27 (1 self)
Current microprocessors incorporate techniques to exploit instruction-level parallelism (ILP). However, previous work has shown that these ILP techniques are less effective in removing memory stall time than CPU time, making the memory system a greater bottleneck in ILP-based systems than in previous-generation systems. These deficiencies arise largely because applications present limited opportunities for an out-of-order issue processor to overlap multiple read misses, the dominant source of memory stalls. This work proposes code transformations to increase parallelism in the memory system by overlapping multiple read misses within the same instruction window, while preserving cache locality. We present an analysis and transformation framework suitable for compiler implementation. Our simulation experiments show execution time reductions averaging 20% in a multiprocessor and 30% in a uniprocessor. A substantial part of these reductions comes from increases in memory parallelism. We see similar benefits on a Convex Exemplar.
The Sensitivity of Communication Mechanisms to Bandwidth and Latency
Cited by 13 (0 self)
The goal of this paper is to gain insight into the relative performance of communication mechanisms as bisection bandwidth and network latency vary. We compare shared memory with and without prefetching, message passing with interrupts and with polling, and bulk transfer via DMA. We present two sets of experiments involving four irregular applications on the MIT Alewife multiprocessor. First, we introduce I/O cross-traffic to vary bisection bandwidth. Second, we change processor clock speeds to vary relative network latency. We establish a framework from which to understand a range of results. On Alewife, shared memory provides good performance, even on producer-consumer applications with little data reuse. On machines with lower bisection bandwidth and higher network latency, however, message-passing mechanisms become important. In particular, the high communication volume of shared memory threatens to become difficult to support on future machines without expensive, high-dimensional networks. Furthermore, the round-trip nature of shared memory may not be able to tolerate the latencies of future networks.
Mianjin is Gardens Point: A Parallel Language Taming Asynchronous Communication
- In Fourth Australasian Conference on Parallel and Real-Time Systems (PART'97), 1997
Cited by 11 (9 self)
The Gardens language, Mianjin, supports task-based parallel computation, which utilises Active Messages-style asynchronous communication. Unlike raw Active Messages, which is a library, this is achieved safely and without degrading performance by enforcing all necessary restrictions statically at the language level. Also supported is the definition of communication abstractions. To do this, Mianjin utilises a partitioned address space where local objects are distinguished from those made globally accessible; this is achieved via type annotations. Type annotations are also used to distinguish those routines which may perform communications from those which will not, a distinction important for the static safety of an Active Message implementation. A special parameter mechanism is used to implement and enforce the conditions necessary for safe reply (read) operations. 1 Introduction. This paper describes the Gardens programming language, Mianjin, and its support for global objects...
The Data Mover: A Machine-independent Abstraction for Managing Customized Data Motion
, 1999
Cited by 11 (2 self)
This paper discusses an abstraction, called the Data Mover, for expressing machine-independent customized communication algorithms in a variety of block-structured applications. The Data Mover enables its user to express data motion using intuitive geometric operations that encapsulate the low-level details of the underlying communication. Communication patterns are expressed as collective operations, and are restricted to movement of rectangular array sections. We describe the Data Mover model of communication, and present performance for various applications. The Data Mover currently serves as useful middleware for application library designers, but defines a simple machine-independent interface suitable as a target for a compiler or compiler run-time library.

