Results 1 - 10
of
136
Orca: A language for parallel programming of distributed systems
- IEEE Transactions on Software Engineering
, 1992
"... Orca is a language for implementing parallel applications on loosely coupled distributed systems. Unlike most languages for distributed programming, it allows processes on different machines to share data. Such data are encapsulated in data-objects, which are instances of user-defined abstract data ..."
Abstract
-
Cited by 307 (43 self)
- Add to MetaCart
Orca is a language for implementing parallel applications on loosely coupled distributed systems. Unlike most languages for distributed programming, it allows processes on different machines to share data. Such data are encapsulated in data-objects, which are instances of user-defined abstract data types. The implementation of Orca takes care of the physical distribution of objects among the local memories of the processors. In particular, an implementation may replicate and/or migrate objects in order to decrease access times to objects and increase parallelism. This paper gives a detailed description of the Orca language design and motivates the design choices. Orca is intended for applications programmers rather than systems programmers. This is reflected in its design goals to provide a simple, easy to use language that is type-secure and provides clean semantics. The paper discusses three example parallel applications in Orca, one of which is described in detail. It also describes one of the existing implementations, which is based on reliable broadcasting. Performance measurements of this system are given for three parallel applications. The measurements show that significant speedups can be obtained for all three applications. Finally, the paper compares Orca with several related languages and systems. 1.
CRL: High-Performance All-Software Distributed Shared Memory
, 1995
"... This paper introduces the C Region Library (CRL), a new all-software distributed shared memory (DSM) system. CRL requires no special compiler, hardware, or operating system support beyond the ability to send and receive messages. It provides a simple, portable shared address space programming model ..."
Abstract
-
Cited by 191 (11 self)
- Add to MetaCart
This paper introduces the C Region Library (CRL), a new all-software distributed shared memory (DSM) system. CRL requires no special compiler, hardware, or operating system support beyond the ability to send and receive messages. It provides a simple, portable shared address space programming model that is capable of delivering good performance on a wide range of multiprocessor and distributed system architectures. We have developed CRL implementations for two platforms: the CM-5, a commercial multicomputer, and the MIT Alewife machine, an experimental multiprocessor offering efficient support for both message passing and shared memory. We present results for up to 128 processors on the CM-5 and up to 32 processors on Alewife. In a set of controlled experiments, we demonstrate that CRL is the first all-software DSM system capable of delivering performance competitive with hardware DSMs. CRL achieves speedups within 30% of those provided by Alewife's native support for shared memory, eve...
The M-Machine Multicomputer
, 1995
"... The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M-Machine computing nodes are con- nected with a 3-D mesh network; each node is a multithreaded pr ..."
Abstract
-
Cited by 100 (13 self)
- Add to MetaCart
The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M-Machine computing nodes are con- nected with a 3-D mesh network; each node is a multithreaded processor incorporating 12 function units, on-chip cache, and local memory. The multiple function units are used to exploit both instruction-level and thread-level parallelism. A user accessible message passing system yields fast communication and synchronization between nodes. RapM access to remote memory is provided transparently to the user with a combination of hardware and software mechanisms. This paper presents the architecture of the M-Machine and describes how its mechanisms attempt to maximize both single thread performance and overall system throughput. The architecture is complete and the MAP chip, which will serve as the M-Machine processing node, is currently being implemented.
Improving Release-Consistent Shared Virtual Memory using Automatic Update
- IN THE 2ND IEEE SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE
, 1996
"... Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach a ..."
Abstract
-
Cited by 82 (20 self)
- Add to MetaCart
Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach and the hardware approach that uses directory-based caches. Automatic update is a simple communication mechanism, implemented in the SHRIMP multicomputer, that forwards local writes to remote memory transparently. In this paper we propose a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications. We compare the performance of this protocol against a software-only LRC implementation on several Splash2 applications and show that the AURC approach can substantially improve the performance of LRC. For 16 processors, the average speedup has increased from 5.9 under LRC, to 8.3 under AURC.
Brazos: A Third Generation DSM System
- IN PROCEEDINGS OF THE 1ST USENIX WINDOWS NT SYMPOSIUM
, 1997
"... Brazos is a third generation distributed shared memory (DSM) system designed for x86 machines running Microsoft Windows NT 4.0. Brazos is unique among existing systems in its use of selective multicast, a software-only implementation of scope consistency, and several adaptive runtime performance tun ..."
Abstract
-
Cited by 67 (11 self)
- Add to MetaCart
Brazos is a third generation distributed shared memory (DSM) system designed for x86 machines running Microsoft Windows NT 4.0. Brazos is unique among existing systems in its use of selective multicast, a software-only implementation of scope consistency, and several adaptive runtime performance tuning mechanisms. The Brazos runtime system is multithreaded, allowing the overlap of computation with the long communication latencies typically associated with software DSM systems. Brazos also supports multithreaded user-code execution, allowing programs to take advantage of the local tightly-coupled shared memory available on multiprocessor PC servers, while transparently interacting with remote "virtual" shared memory. Brazos currently runs on a cluster of Compaq Proliant 1500 multiprocessor servers connected by a 100 Mbps FastEthernet. This paper describes the Brazos design and implementation, and compares its performance running five scientific applications to the performance of Solaris...
ASHs: Application-Specific Handlers for High-Performance Messaging
- IN ACM COMMUNICATION ARCHITECTURES, PROTOCOLS, AND APPLICATIONS (SIGCOMM ’96
, 1996
"... Application-specific safe message handlers (ASHs) are designed to provide applications with hardware-level network performance. ASHs are user-written code fragments that safely and efficiently execute in the kernel in response to message arrival. ASHs can direct message transfers (thereby eliminatin ..."
Abstract
-
Cited by 59 (11 self)
- Add to MetaCart
Application-specific safe message handlers (ASHs) are designed to provide applications with hardware-level network performance. ASHs are user-written code fragments that safely and efficiently execute in the kernel in response to message arrival. ASHs can direct message transfers (thereby eliminating copies) and send messages (thereby reducing send-response latency). In addition, the ASH system provides support for dynamic integrated layer processing (thereby eliminating duplicate message traversals) and dynamic protocol composition (thereby supporting modularity). ASHs provide this high degree of flexibility while still providing network performance as good as, or (if they exploit application-specific knowledge) even better than, hard-wired in-kernel implementations. A combination of user-level microbenchmarks and end-to-end system measurements using TCP demonstrate the benefits of the ASH system.
Safemem: Exploiting ECC-memory for detecting memory leaks and memory corruption during production runs
- In Proceedings of the 11th International Symposium on High-Performance Computer Architecture
, 2005
"... Memory leaks and memory corruption are two major forms of software bugs that severely threaten system availability and security. According to the US-CERT Vulnerability Notes Database, 68 % of all reported vulnerabilities in 2003 were caused by memory leaks or memory corruption. Dynamic monitoring to ..."
Abstract
-
Cited by 59 (11 self)
- Add to MetaCart
Memory leaks and memory corruption are two major forms of software bugs that severely threaten system availability and security. According to the US-CERT Vulnerability Notes Database, 68 % of all reported vulnerabilities in 2003 were caused by memory leaks or memory corruption. Dynamic monitoring tools, such as the state-of-the-art Purify, are commonly used to detect memory leaks and memory corruption. However, most of these tools suffer from high overhead, with up to a 20 times slowdown, making them infeasible to be used for production-runs. This paper proposes a tool called SafeMem to detect memory leaks and memory corruption on-the-fly during production-runs. This tool does not rely on any new hardware support. Instead, it makes a novel use of existing ECC memory technology and exploits intelligent dynamic memory usage behavior analysis to detect memory leaks and corruption. We have evaluated SafeMem with seven real-world applications that contain memory leak or memory corruption bugs. SafeMem detects all tested bugs with low overhead (only 1.6%-14.4%), 2-3 orders of magnitudes smaller than Purify. Our results also show that ECCprotection is effective in pruning false positives for memory leak detection, and in reducing the amount of memory waste (by a factor of 64-74) used for memory monitoring in memory corruption detection compared to page-protection. 1
Software-Extended Coherent Shared Memory: Performance and Cost
"... This paper evaluates the tradeoffs involved in the design of the software-extended memory system of Alewife, a multiprocessor architecture that implements coherentsharedmemory through a combination of hardware and software mechanisms. For each block of memory, Alewife implements between zero and fiv ..."
Abstract
-
Cited by 54 (5 self)
- Add to MetaCart
This paper evaluates the tradeoffs involved in the design of the software-extended memory system of Alewife, a multiprocessor architecture that implements coherentsharedmemory through a combination of hardware and software mechanisms. For each block of memory, Alewife implements between zero and five coherence directory pointers in hardwareand allows software to handle requests when the pointers are exhausted. The software includes a flexible coherence interface that facilitates protocol software implementation. This interface is indispensable for conducting experiments and has proven important for implementing enhancements to the basic system. Simulations of a
Designing Memory Consistency Models for Shared-Memory Multiprocessors
, 1993
"... The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations ..."
Abstract
-
Cited by 51 (8 self)
- Add to MetaCart
The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations exploited by uniprocessors. For higher performance, several alternative models have been proposed. However, many of these are hardware-centric in nature and difficult to program. Further, the multitude of many seemingly unrelated memory models inhibits portability. We use a 3P criteria of programmability, portability, and performance to assess memory models, and find current models lacking in one or more of these criteria. This thesis establishes a unifying framework for reasoning about memory models that leads to models that adequately satisfy the 3P criteria. The first contribution of this thesis is a programmer-centric methodology, called sequential consistency normal form (SCNF), for specifying memory models. This methodology is based on the observation that performance enhancing optimizations can be allowed without violating sequential consistency if the system is given some information about the program. An SCNF model is a contract between the system and the programmer, where the system guarantees both high performance and sequential consistency only if the programmer provides certain information about the program. Insufficient information gives lower performance, but incorrect information
Computation Migration: Enhancing Locality for Distributed-Memory Parallel Systems
"... We describe computation migration, a new technique that is based on compile-time program transformations, for accessing remote data in a distributed-memory parallel system. In contrast with RPC-style access, where the access is performed remotely, and with data migration, where the data is moved so ..."
Abstract
-
Cited by 47 (4 self)
- Add to MetaCart
We describe computation migration, a new technique that is based on compile-time program transformations, for accessing remote data in a distributed-memory parallel system. In contrast with RPC-style access, where the access is performed remotely, and with data migration, where the data is moved so that it is local, computation migration moves part of the current thread to the processor where the data resides. The access is performed at the remote processor, and the migrated thread portion continues to run on that same processor; this makes subsequent accesses in the thread portion local. We describe an implementation of computation migration that consists of two parts: an implementation that migrates single activation frames, and a high-level language annotation that allows a programmer to express when migration is desired. We performed experiments using two applications; these experiments demonstrate that computation migration is a valuable alternative to RPC and data migration.

