Results 1 - 10
of
21
Software Caching and Computation Migration in Olden
, 1995
"... The goal of the Olden project is to build a system that provides parallelism for general purpose C programs with minimal programmer annotations. We focus on programs using dynamic structures such as trees, lists, and DAGs. We demonstrate that providing both software caching and computation migratio ..."
Abstract
-
Cited by 84 (0 self)
- Add to MetaCart
The goal of the Olden project is to build a system that provides parallelism for general purpose C programs with minimal programmer annotations. We focus on programs using dynamic structures such as trees, lists, and DAGs. We demonstrate that providing both software caching and computation migration can improve the performance of these programs, and provide a compile-time heuristic that selects between them for each pointer dereference. We have implemented a prototype system on the Thinking Machines CM-5. We describe our implementation and report on experiments with ten benchmarks.
Improving Release-Consistent Shared Virtual Memory using Automatic Update
- IN THE 2ND IEEE SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE
, 1996
"... Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach a ..."
Abstract
-
Cited by 82 (20 self)
- Add to MetaCart
Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach and the hardware approach that uses directory-based caches. Automatic update is a simple communication mechanism, implemented in the SHRIMP multicomputer, that forwards local writes to remote memory transparently. In this paper we propose a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications. We compare the performance of this protocol against a software-only LRC implementation on several Splash2 applications and show that the AURC approach can substantially improve the performance of LRC. For 16 processors, the average speedup has increased from 5.9 under LRC, to 8.3 under AURC.
Home-based Shared Virtual Memory
, 1998
"... In this dissertation, I investigate how to improve the performance of shared virtual memory (SVM) by examining consistency models, protocols, hardware support and applications. The main conclusion of this research is that the performance of shared virtual memory can be significantly improved when pe ..."
Abstract
-
Cited by 51 (4 self)
- Add to MetaCart
In this dissertation, I investigate how to improve the performance of shared virtual memory (SVM) by examining consistency models, protocols, hardware support and applications. The main conclusion of this research is that the performance of shared virtual memory can be significantly improved when performance-enhancing techniques from all these areas are combined. This dissertation proposes home-based lazy release consistency as a simple, effective, and scalable way to build shared virtual memory systems. In home-based protocols each shared page has a home to which all writes are propagated and from which all copies are derived. Two home-based protocols are described, implemented and evaluated on two hardware and software platforms: Automatic Update Release Consistency (AURC), which requires hardware support for fine-grained remote writes (automatic updates), and Homebased Lazy Release Consistency (HLRC), which is implemented exclusively in software. The dissertation investigates the ...
Designing Memory Consistency Models for Shared-Memory Multiprocessors
, 1993
"... The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations ..."
Abstract
-
Cited by 51 (8 self)
- Add to MetaCart
The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations exploited by uniprocessors. For higher performance, several alternative models have been proposed. However, many of these are hardware-centric in nature and difficult to program. Further, the multitude of many seemingly unrelated memory models inhibits portability. We use a 3P criteria of programmability, portability, and performance to assess memory models, and find current models lacking in one or more of these criteria. This thesis establishes a unifying framework for reasoning about memory models that leads to models that adequately satisfy the 3P criteria. The first contribution of this thesis is a programmer-centric methodology, called sequential consistency normal form (SCNF), for specifying memory models. This methodology is based on the observation that performance enhancing optimizations can be allowed without violating sequential consistency if the system is given some information about the program. An SCNF model is a contract between the system and the programmer, where the system guarantees both high performance and sequential consistency only if the programmer provides certain information about the program. Insufficient information gives lower performance, but incorrect information
Using Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems
- In Proceedings of the Ninth International Parallel Processing Symposium
, 1994
"... The cost of a cache miss depends heavily on the location of the main memory that backs the missing line. For certain applications, this cost is a major factor in overall performance. We report on the utility of OS-based page placement as a mechanism to increase the frequency with which cache fills a ..."
Abstract
-
Cited by 43 (11 self)
- Add to MetaCart
The cost of a cache miss depends heavily on the location of the main memory that backs the missing line. For certain applications, this cost is a major factor in overall performance. We report on the utility of OS-based page placement as a mechanism to increase the frequency with which cache fills access local memory in distributed shared memory multiprocessors. Even with the very simple policy of first-use placement, we find significant improvements over round-robin placement for many applications on both hardware- and software-coherent systems. For most of our applications, first-use placement allows 35 to 75 percent of cache fills to be performed locally, resulting in performance improvements of up to 40 percent with respect to round-robin placement. We were surprised to find no performance advantage in more sophisticated policies, including page migration and page replication. In fact, in many cases the performance of our applications suffered under these policies. 1 Introduction ...
VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks
, 1997
"... Recent technological advances have produced network interfaces that provide users with very low-latency access to the memory of remote machines. We examine the impact of such networks on the implementation and performance of software DSM. Specifically, we compare two DSM systems—Cashmere and TreadMa ..."
Abstract
-
Cited by 41 (11 self)
- Add to MetaCart
Recent technological advances have produced network interfaces that provide users with very low-latency access to the memory of remote machines. We examine the impact of such networks on the implementation and performance of software DSM. Specifically, we compare two DSM systems—Cashmere and TreadMarks—on a 32-processor DEC Alpha cluster connected by a Memory Channel network. Both Cashmere and TreadMarks use virtual memory to maintain coherence on pages, and both use lazy, multi-writer release consistency. The systems differ dramatically, however, in the mechanisms used to track sharing information and to collect and merge concurrent updates to a page, with the result that Cashmere communicates much more frequently, and at a much finer grain. Our principal conclusion is that low-latency networks make DSM based on fine-grain communication competitive with more coarse-grain approaches,but that further hardware improvements will be needed before such systems can provide consistently superior performance. In our experiments, Cashmere scales slightly better than TreadMarks for applications with false sharing. At the same time, it is severely constrained by limitations of the current Memory Channel hardware. In general, performance is better for TreadMarks.
Using Memory-Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory
- In The 2nd IEEE Symposium on High-Performance Computer Architecture
, 1996
"... Shared memory is widely believed to provide an easier programming model than message passing for expressing parallel algorithms. Distributed Shared Memory (DSM) systems provide the illusion of shared memory on top of standard message passing hardware at very low implementation cost, but provide acce ..."
Abstract
-
Cited by 34 (7 self)
- Add to MetaCart
Shared memory is widely believed to provide an easier programming model than message passing for expressing parallel algorithms. Distributed Shared Memory (DSM) systems provide the illusion of shared memory on top of standard message passing hardware at very low implementation cost, but provide acceptable performance for only a limited class of applications. We argue that the principal sources of overhead in DSM systems can be dramatically reduced with modest amounts of hardware support (substantially less than is required for hardware cache coherence). Specifically, we present and evaluate a family of protocols designed to exploit hardware support for a global, but noncoherent, physical address space. We consider systems both with and without remote cache fills, fine-grain access faults, "doubled" writes to local and remote memory, and merging write buffers. We also consider varying levels of latency and bandwidth. We evaluate our protocols using execution driven simulation, comparing...
Shared Virtual Memory: Progress and Challenges
- Proceedings of the IEEE
, 1999
"... This paper is a survey of the first 12 years of research in SVM, placing the multi-track flow of ideas and results obtained so far in a comprehensive framework. The contributions indicated in Figure 1 are classified in four categories, each belonging primarily to one layer: relaxed consistency model ..."
Abstract
-
Cited by 33 (4 self)
- Add to MetaCart
This paper is a survey of the first 12 years of research in SVM, placing the multi-track flow of ideas and results obtained so far in a comprehensive framework. The contributions indicated in Figure 1 are classified in four categories, each belonging primarily to one layer: relaxed consistency models, protocol laziness, architectural support, and applications and application-driven research. A section of the paper is devoted to each category. The last section discusses other important emerging issues related to SVM: the alternative of fine-grained software coherence, hybrid protocols that implement software shared memory across multiple hardware-coherent multiprocessors, and scalability. The paper summarizes comparative performance results from the literature, discusses their limitations, places existing protocols in a framework based on laziness, and identifies the lessons learned so far and some key outstanding questions.
Software Cache Coherence for Large Scale Multiprocessors
- In Proceedings of the First International Symposium on High Performance Computer Architecture
, 1994
"... Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and ..."
Abstract
-
Cited by 23 (9 self)
- Add to MetaCart
Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms for message-passing machines have not provided a performance-competitive solution. We claim that an intermediate hardware option---memory-mapped network interfaces that support a global physical address space---can provide most of the performance benefits of hardware cache coherence. We present a software coherence protocol that runs on this class of machines and greatly narrows the performance gap between hardware and software coherence. We compare the performance of the protocol to that of existing software and hardware alternatives and evaluate the tradeoffs among various cache-write policies. We also observe that simple program changes can greatly...
Multiprocessor Cache Coherence Based on Virtual Memory Support
, 1995
"... : Virtual memory based cache coherence is a mechanism that relies only on hardware that already exists on the microprocessors of a shared memory multiprocessor system, yet dynamically detects and resolves potential cache inconsistencies using virtualmemory techniques. The key feature of the approac ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
: Virtual memory based cache coherence is a mechanism that relies only on hardware that already exists on the microprocessors of a shared memory multiprocessor system, yet dynamically detects and resolves potential cache inconsistencies using virtualmemory techniques. The key feature of the approach is that the virtual memory translation hardware on each processor is used to detect shared accesses that could lead to memory incoherencies, and VM page fault handlers execute the appropriate actions to maintain cache coherence. VM-based cache coherence basically trades off design simplicity against increased software overheads. The work presented in this paper evaluates this tradeoff. We show that VM-based cache coherence performs well for scientific applications that require significant aggregate memory bandwidth. ffl Keywords: shared memory, multiprocessors, cache coherence, virtual memory, performance evaluation. ffl Biographies: Karin Petersen is a Member of the Research Staff at Xe...

