Results 1 - 10
of
73
Fine-grain Access Control for Distributed Shared Memory
- In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI
, 1994
"... This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require ..."
Abstract
-
Cited by 160 (26 self)
- Add to MetaCart
This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is...
Pointer Analysis for Multithreaded Programs
- ACM SIGPLAN 99
, 1999
"... This paper presents a novel interprocedural, flow-sensitive, and context-sensitive pointer analysis algorithm for multithreaded programs that may concurrently update shared pointers. For each pointer and each program point, the algorithm computes a conservative approximation of the memory locations ..."
Abstract
-
Cited by 125 (13 self)
- Add to MetaCart
This paper presents a novel interprocedural, flow-sensitive, and context-sensitive pointer analysis algorithm for multithreaded programs that may concurrently update shared pointers. For each pointer and each program point, the algorithm computes a conservative approximation of the memory locations to which that pointer may point. The algorithm correctly handles a full range of constructs in multithreaded programs, including recursive functions, function pointers, structures, arrays, nested structures and arrays, pointer arithmetic, casts between pointer variables of different types, heap and stack allocated memory, shared global variables, and thread-private global variables. We have implemented the algorithm in the SUIF compiler system and used the implementation to analyze a sizable set of multithreaded programs written in the Cilk multithreaded programming language. Our experimental results show that the analysis has good precision and converges quickly for our set of Cilk programs.
Software Caching and Computation Migration in Olden
, 1995
"... The goal of the Olden project is to build a system that provides parallelism for general purpose C programs with minimal programmer annotations. We focus on programs using dynamic structures such as trees, lists, and DAGs. We demonstrate that providing both software caching and computation migratio ..."
Abstract
-
Cited by 84 (0 self)
- Add to MetaCart
The goal of the Olden project is to build a system that provides parallelism for general purpose C programs with minimal programmer annotations. We focus on programs using dynamic structures such as trees, lists, and DAGs. We demonstrate that providing both software caching and computation migration can improve the performance of these programs, and provide a compile-time heuristic that selects between them for each pointer dereference. We have implemented a prototype system on the Thinking Machines CM-5. We describe our implementation and report on experiments with ten benchmarks.
Efficient Support for Irregular Applications on Distributed-Memory Machines
, 1995
"... Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and a ..."
Abstract
-
Cited by 81 (12 self)
- Add to MetaCart
Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance. This paper explores three issues -- partitioning, mutual exclusion, and data transfer -- crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running in three alternative systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, Transparent Shared Memory (TSM), and eXtensible Shared Memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems were somewhat (13%) to significantly faster (991%) than TSM.
Memory sharing predictor: The key to a speculative coherent DSM
- In Proceedings of the 26th annual international symposium on Computer architecture
, 1999
"... Recent research advocates using general message predictors to learn and predict the coherence activity in distributed shared memory (DSM). By accurately predicting a message and timely invoking the necessary coherence actions, a DSM can hide much of the remote access latency. This paper proposes the ..."
Abstract
-
Cited by 57 (6 self)
- Add to MetaCart
Recent research advocates using general message predictors to learn and predict the coherence activity in distributed shared memory (DSM). By accurately predicting a message and timely invoking the necessary coherence actions, a DSM can hide much of the remote access latency. This paper proposes the Memory Sharing Predictors (MSPs), pattern-based predictors that significantly improve prediction accuracy and implementation cost over general message predictors. An MSP is based on the key observation that to hide the remote access latency, a predictor must accurately predict only the remote memory accesses (i.e., request messages) and not the subsequent coherence messages invoked by an access. Simulation results indicate that MSPs improve prediction accuracy over general message predictors from 81 % to 93 % while requiring less storage overhead. This paper also presents the first design and evaluation for a speculative coherent DSM using pattern-based predictors. We identify simple techniques and mechanisms to trigger prediction timely and perform speculation for remote read accesses. Our speculation hardware readily works with a conventional full-map write-invalidate coherence protocol without any modifications. Simulation results indicate that performing speculative read requests alone reduces execution times by 12 % in our shared-memory applications. 1
Using Prediction to Accelerate Coherence Protocols
, 1998
"... Most large shared-memory multiprocessors use directory protocols to keep per-processor caches coherent. Some memory references in such systems, however, suffer long latencies for misses to remotely cached blocks. To ameliorate this latency, researchers have augmented standard coherence protocols wit ..."
Abstract
-
Cited by 52 (4 self)
- Add to MetaCart
Most large shared-memory multiprocessors use directory protocols to keep per-processor caches coherent. Some memory references in such systems, however, suffer long latencies for misses to remotely cached blocks. To ameliorate this latency, researchers have augmented standard coherence protocols with optimizations for specific sharing patterns, such as read-modify-write, producer-consumer, and migratory sharing. This paper seeks to replace these directed solutions with general prediction logic that monitors coherence activity and triggers appropriate coherence actions. This paper takes the first step toward using general prediction to accelerate coherence protocols by developing and evaluating the Cosmos coherence message predictor. Cosmos predicts the source and type of the next coherence message for a cache block using logic that is an extension of Yeh and Patt's two-level PAp branch predictor. For five scientific applications running on 16 processors, Cosmos has prediction accuracie...
Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves
- IEEE Transactions on Parallel and Distributed Systems
, 1995
"... We discuss Inverse Spacefilling Partitioning (ISP), a partitioning strategy for nonuniform scientific computations running on distributed memory MIMD parallel computers. We consider the case of a dynamic workload distributed on a uniform mesh, and compare ISP against Orthogonal Recursive Bisectio ..."
Abstract
-
Cited by 51 (2 self)
- Add to MetaCart
We discuss Inverse Spacefilling Partitioning (ISP), a partitioning strategy for nonuniform scientific computations running on distributed memory MIMD parallel computers. We consider the case of a dynamic workload distributed on a uniform mesh, and compare ISP against Orthogonal Recursive Bisection (ORB) and a Median of Medians variant of ORB, ORB-MM. We present two results. First, ISP and ORB-MM are superior to ORB in rendering balanced workloads---because they are more finegrained ---and incur communication overheads that are comparable to ORB. Second, ISP is more attractive than ORB-MM from a software engineering standpoint because it avoids elaborate bookkeeping. Whereas ISP partitionings can be described succinctly as logically contiguous segments of the line, ORB-MM's partitionings are inherently unstructured. We describe the general d-dimensional ISP algorithm and report empirical results with two- and three-dimensional, non-hierarchical particle methods. Scott B. Bad...
Where is Time Spent in Message-Passing and Shared-Memory Programs?
- In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI
, 1994
"... Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-memory programs running on similar hardware. To ensu ..."
Abstract
-
Cited by 51 (4 self)
- Add to MetaCart
Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-memory programs running on similar hardware. To ensure that our measurements are comparable, we produced two carefully tuned versions of each program and measured them on closely-related simulators of a message-passing and a shared-memory machine, both of which are based on same underlying hardware assumptions. We examined the behavior and performance of each program carefully. Although the cost of computation in each pair of programs was similar, synchronization and communication differed greatly. We found that message-passing's advantage over shared-memory is not clear-cut. Three of the four shared-memory programs ran at roughly the same speed as their message-passing equivalent, even though their communication patterns were different. 1 In...
Dynamic Feedback: An Effective Technique for Adaptive Computing
, 1997
"... This paper presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated co ..."
Abstract
-
Cited by 51 (3 self)
- Add to MetaCart
This paper presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. Each sampling phase measures the overhead of each version in the current environment. Each production phase uses the version with the least overhead in the previous sampling phase. The computation periodically resamples to adjust dynamically to changes in the environment.
Coherent network interfaces for fine-grain communication
- In Proceedings of the 23rd Annual International Symposium on Computer Architecture
, 1996
"... Historically, processor accesses to memory-mapped device registers huve been marked uncachable to insure their visibili ~ to the device. The ubiquity of snooping cache coherence, howeveg makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coheren ..."
Abstract
-
Cited by 48 (14 self)
- Add to MetaCart
Historically, processor accesses to memory-mapped device registers huve been marked uncachable to insure their visibili ~ to the device. The ubiquity of snooping cache coherence, howeveg makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e. g., for polling). This paper begins an exploration of network inter-jtces (NIs) that use coherence—coherent network interfaces (CNIs)--to improve communication performance, We restrict this study to NI/ CNIS that reside on coherent memoty or I/O buses, to NVCNIS that are much simpler than processors, and to the pe~ormance of&egrain messagingfiom user process to user process. Our jirst contribution is to develop and optimize two mechanisms that CNIS use to communicate with processors. A cachable device register—derived from cachable control registers [39)40]— is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue. Our second contribution is a taxonomy and comparison of four CNIS with a more conventional NI. Microbenchmark results show that CNIS can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37 % and 125 % respectively on the memory bus and 74 % and 123 % respectively on a coherent 1/0 bus. Experiments with jive macrobenchmarks show that CNIS can improve the pe~ormance by 17-5370 on the memory bus and 30-88 % on the I/O bus.

