Results 1 - 10
of
13
Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors
- Journal of Parallel and Distributed Computing
, 1991
"... The large latency of memory accesses is a major obstacle in obtaining high processor utilization in large scale shared-memory multiprocessors. Although the provision of coherent caches in many recent machines has alleviated the problem somewhat, cache misses still occur frequently enough that they s ..."
Abstract
-
Cited by 263 (17 self)
- Add to MetaCart
The large latency of memory accesses is a major obstacle in obtaining high processor utilization in large scale shared-memory multiprocessors. Although the provision of coherent caches in many recent machines has alleviated the problem somewhat, cache misses still occur frequently enough that they significantly lower performance. In this paper we evaluate the effectiveness of non-binding software-controlled lyrefetching, as proposed in the Stanford DASH Multiprocessor, to address this problem. The prefetches are non-binding in the sense that the prefetched data is brought to a cache close to the processor, but is still available to the cache coherence protocol to keep it consistent. Prefetching is software-controlled since the program must explicitly issue prefetch instructions.
Weak Ordering -- A New Definition
, 1990
"... A memory model for a shared memory, multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency. This model guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater perfor ..."
Abstract
-
Cited by 213 (12 self)
- Add to MetaCart
A memory model for a shared memory, multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency. This model guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater performance potential. Weak ordering was first defined by Dubois, Scheurich and Briggs in terms of a set of rules for hardware that have to be made visible to software. The central hypothesis of this work is that programmers prefer to reason about sequentially consistent memory, rather than having to think about weaker memory, or even write buffers. Following this hypothesis, we re-define weak ordering as a contract between software and hardware. By this contract, software agrees to some formally specified constraints, and hardware agrees to appear sequentially consistent to at least the software that obeys those constraints. We illustrate the power of the new definition with a set of software constraints that forbid data races and an imple-mentation for cache-coherent systems chat is not allowed by the old definition.
Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors
"... The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, al ..."
Abstract
-
Cited by 146 (9 self)
- Add to MetaCart
The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, allowing very limited buffering, to release consistency on the other end, allowing extensive buffering and pipelining. The processor consistency and weak consistency models fall in between. The advantage of the less strict models is increased performance potential. The disadvantage is increased hardware complexity and a more complex programming model. To make an informed decision on the above tradeoff requires performance data for the various models. This paper addresses the issue of performance benefits from the above four consistency models. Our results are based on simulation studies done for three applications. The results show that in an environment where processor reads are blocking and writes are buffered, a significant performance increase is achieved from allowing reads to bypass previous writes. Pipelining of writes, which determines the rate at which writes are retired from the write buffer, is of secondary importance. As a result, we show that the sequential consistency model performs poorly relative to all other models, while the processor consistency model provides most of the benefits of the weak and release consistency models.
Two Techniques to Enhance the Performance of Memory Consistency Models
- In Proceedings of the 1991 International Conference on Parallel Processing
, 1991
"... The memory consistency model supported by a multiprocessor directly affects its performance. Thus, several attempts have been made to relax the consistency models to allow for more buffering and pipelining of memory accesses. Unfortunately, the potential increase in performance afforded by relaxing ..."
Abstract
-
Cited by 123 (5 self)
- Add to MetaCart
The memory consistency model supported by a multiprocessor directly affects its performance. Thus, several attempts have been made to relax the consistency models to allow for more buffering and pipelining of memory accesses. Unfortunately, the potential increase in performance afforded by relaxing the consistency model is accompanied by a more complex programming model. This paper introduces two general implementation techniques that provide higher performance for all the models. The first technique involves prefetching values for accesses that are delayed due to consistency model constraints. The second technique employs speculative execution to allow the processor to proceed even though the consistency model requires the memory accesses to be delayed. When combined, the above techniques alleviate the limitations imposed by a consistency model on buffering and pipelining of memory accesses, thus significantly reducing the impact of the memory consistency model on performance. 1 Intro...
SPAID: Software Prefetching in Pointer- and Call-Intensive Environments
- In Proceedings of the 28th annual international symposium on Microarchitecture
, 1995
"... Software prefetching, typically in the context of numericor loop-intensive benchmarks, has been proposed as one remedy for the performance bottleneck imposed on computer systems by the cost of servicing cache misses. This paper proposes a new heuristic--SPAID--for utilizing prefetch instructions in ..."
Abstract
-
Cited by 58 (3 self)
- Add to MetaCart
Software prefetching, typically in the context of numericor loop-intensive benchmarks, has been proposed as one remedy for the performance bottleneck imposed on computer systems by the cost of servicing cache misses. This paper proposes a new heuristic--SPAID--for utilizing prefetch instructions in pointer- and call-intensive environments. We use trace-driven cache simulation of a number of pointer- and call-intensive benchmarks to evaluate the benefits and implementation trade-offs of SPAID. Our results indicate that a significant proportion of the cost of data cache misses can be eliminated or reduced with SPAID without unduly increasing memory traffic. 1. Introduction It is well known that processor clock speeds are increasing exponentially over time, while memory speeds are not increasing nearly as rapidly [RD94]. The computing industry has reached the point where system performance is dominated by the cost of servicing cache misses. To address this problem, several instruction s...
Designing Memory Consistency Models for Shared-Memory Multiprocessors
, 1993
"... The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations ..."
Abstract
-
Cited by 51 (8 self)
- Add to MetaCart
The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations exploited by uniprocessors. For higher performance, several alternative models have been proposed. However, many of these are hardware-centric in nature and difficult to program. Further, the multitude of many seemingly unrelated memory models inhibits portability. We use a 3P criteria of programmability, portability, and performance to assess memory models, and find current models lacking in one or more of these criteria. This thesis establishes a unifying framework for reasoning about memory models that leads to models that adequately satisfy the 3P criteria. The first contribution of this thesis is a programmer-centric methodology, called sequential consistency normal form (SCNF), for specifying memory models. This methodology is based on the observation that performance enhancing optimizations can be allowed without violating sequential consistency if the system is given some information about the program. An SCNF model is a contract between the system and the programmer, where the system guarantees both high performance and sequential consistency only if the programmer provides certain information about the program. Insufficient information gives lower performance, but incorrect information
The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching
- In Proceedings of the 10th International Parallel Processing Symposium
, 1996
"... Unimodular transformations, tiling, and software prefetching are loop optimizations known to be effective in increasing parallelism, reducing cache miss rates, and eliminating processor stall time. Although these optimizations individually are quite effective, there is the expectation that even bett ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
Unimodular transformations, tiling, and software prefetching are loop optimizations known to be effective in increasing parallelism, reducing cache miss rates, and eliminating processor stall time. Although these optimizations individually are quite effective, there is the expectation that even better improvements can be obtained by combining them together. In this paper we show that indeed this is the case when unimodular transformations are combined with either tiling or software prefetching. However, our results also show that although combining tiling with prefetching tends to improve the performance of tiling alone, it is also the case that in some situations tiling can degrade the cache performance of software prefetching. The reasons for this unexpected behavior are three fold: 1) tiling introduces interference misses inside the localized space which are difficult to characterize with current techniques based on locality analysis; 2) prefetch predicates are computed using only e...
An Overview of Mermera: A System and Formalism for Non-coherent Distributed Parallel Memory
- IN PROC. 26TH HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES, MAUI, HAWAII
, 1993
"... Several non-coherent memories have been proposed to address the problem of poor scalability of traditional coherent shared memory systems. However, such memories make programming much more difficult than coherent memory. We give an overview of Mermera a system that gives the programmer the choice of ..."
Abstract
-
Cited by 15 (5 self)
- Add to MetaCart
Several non-coherent memories have been proposed to address the problem of poor scalability of traditional coherent shared memory systems. However, such memories make programming much more difficult than coherent memory. We give an overview of Mermera a system that gives the programmer the choice of coherent and non-coherent behavior in the same program. We sketch a formal model that describes Mermera's non-coherent behavior. This model helps us identify a new non-coherent behavior that we call Local Consistency. We also report our measurements from a pilot implementation on a BBN Butterfly which show the response time of pipelined-RAM operations to be an order of magnitude faster than that of coherent operations. Thus, we begin to illuminate a new trade-off in which the programmer can significantly improve shared memory performance at the cost of tolerating varying degrees of non-coherence.
Improving Processor Performance by Dynamically Pre-Processing the Instruction Stream
, 1998
"... The exponentially increasing gap between processors and off-chip memory, as measured in processor cycles, is rapidly turning memory latency into a major processor performance bottleneck. Traditional solutions, such as employing multiple levels of caches, are expensive and do not work well with som ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
The exponentially increasing gap between processors and off-chip memory, as measured in processor cycles, is rapidly turning memory latency into a major processor performance bottleneck. Traditional solutions, such as employing multiple levels of caches, are expensive and do not work well with some applications. We evaluate a technique, called runahead pre-processing, that can significantly improve processor performance. The basic idea behind runahead is to use the processor pipeline to pre-process instructions during cache miss cycles, instead of stalling. The pre-processed instructions are used to generate highly accurate instruction and data stream prefetches, while all of the pre-processed instruction results are discarded after the cache miss has been serviced: this allows us to achieve a form of very aggressive speculation with a simple in-order...
Weak Ordering - A New Definition And Some Implications
, 1989
"... This paper is primarily concerned with the programmer's model of a shared memory system and its implications on hardware design and performance. A model for correct behavior of programs commonly (and often implicitly) assumed by programmers is that of sequential consistency, formally defined by Lamp ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
This paper is primarily concerned with the programmer's model of a shared memory system and its implications on hardware design and performance. A model for correct behavior of programs commonly (and often implicitly) assumed by programmers is that of sequential consistency, formally defined by Lamport [Lam79] as follows: [A system is sequentially consistent if] the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. While the definition leaves the specific interpretation of the term operations undefined, for a shared memory system, it is usually assumed to refer to memory operations or accesses (e.g., reads and writes). Thus, stated simply for a shared memory system, the above definition translates into the following two conditions - (1) all memory accesses appear to execute atomically in some total order, and (2) all memory accesses of a single process appear to execute in program order. Uniprocessor systems offer the model of sequential consistency almost naturally and without much compromise in performance. In the simplest of architectures, where a processor is allowed to issue a memory access only after the previous access in program order is complete, a total order of memory accesses can be obtained based on the wall-clock time of their issue or execution. More sophisticated architectures allow overlap of instruction execution, out-of-order memory accesses, write buffers, caches (which may be lock-up free [Kro81]), etc. In these machines, an ordering of memory accesses based on wall-clock time of issue or execution may violate program order, but interlock logic assures that accesses appear...

