Results 1 - 10
of
43
Shared memory consistency models: A tutorial
- IEEE Computer
, 1996
"... Parallel systems that support the shared memory abstraction are becoming widely accepted in many areas of computing. Writing correct and efficient programs for such systems requires a formal specification of memory semantics, called a memory consistency model. The most intuitive model—sequential con ..."
Abstract
-
Cited by 297 (8 self)
- Add to MetaCart
Parallel systems that support the shared memory abstraction are becoming widely accepted in many areas of computing. Writing correct and efficient programs for such systems requires a formal specification of memory semantics, called a memory consistency model. The most intuitive model—sequential consistency—greatly restricts the use of many performance optimizations commonly used by uniprocessor hardware and compiler designers, thereby reducing the benefit of using a multiprocessor. To alleviate this problem, many current multiprocessors support more relaxed consistency models. Unfortunately, the models supported by various systems differ from each other in subtle yet important ways. Furthermore, precisely defining the semantics of each model often leads to complex specifications that are difficult to understand for typical users and builders of computer systems. The purpose of this tutorial paper is to describe issues related to memory consistency models in a way that would be understandable to most computer professionals. We focus on consistency models proposed for hardware-based shared-memory systems. Many of these models are originally specified with an emphasis on the system optimizations they allow. We retain the system-centric emphasis, but use uniform and simple terminology to describe the different models. We also briefly discuss an alternate programmer-centric view that describes the models in terms of program behavior rather than specific system optimizations. 1
Memory bandwidth limitations of future microprocessors
- IN PROCEEDINGS OF THE 23RD ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 1996
"... This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for ..."
Abstract
-
Cited by 184 (10 self)
- Add to MetaCart
This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches—implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
Reducing Memory Latency via Non-blocking and Prefetching Caches
, 1992
"... Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploit ..."
Abstract
-
Cited by 150 (2 self)
- Add to MetaCart
Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data in the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimizations to enhance the effectiveness of non-blocking caches. Results from instruction level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency...
Job Scheduling in Multiprogrammed Parallel Systems
, 1997
"... Scheduling in the context of parallel systems is often thought of in terms of assigning tasks in a program to processors, so as to minimize the makespan. This formulation assumes that the processors are dedicated to the program in question. But when the parallel system is shared by a number of us ..."
Abstract
-
Cited by 145 (15 self)
- Add to MetaCart
Scheduling in the context of parallel systems is often thought of in terms of assigning tasks in a program to processors, so as to minimize the makespan. This formulation assumes that the processors are dedicated to the program in question. But when the parallel system is shared by a number of users, this is not necessarily the case. In the context of multiprogrammed parallel machines, scheduling refers to the execution of threads from competing programs. This is an operating system issue, involved with resource allocation, not a program development issue. Scheduling schemes for multiprogrammed parallel systems can be classified as one or two leveled. Single-level scheduling combines the allocation of processing power with the decision of which thread will use it. Two level scheduling decouples the two issues: first, processors are allocated to the job, and then the job's threads are scheduled using this pool of processors. The processors of a parallel system can be shared i...
Complexity/Performance Tradeoffs with Non-Blocking Loads
, 1994
"... Non-blocking loads are a very effective technique for tolerating the cache-miss latency on data cache references. We describe several methods for implementing non-blocking loads. A range of resulting hardware complexity/performance tradeoffs are investigated using an object-code translation and inst ..."
Abstract
-
Cited by 86 (9 self)
- Add to MetaCart
Non-blocking loads are a very effective technique for tolerating the cache-miss latency on data cache references. We describe several methods for implementing non-blocking loads. A range of resulting hardware complexity/performance tradeoffs are investigated using an object-code translation and instrumentation system. We have investigated the SPEC92 benchmarks and have found that for the integer benchmarks, a simple hit-under-miss implementation achieves almost all of the available performance improvement for relatively little cost. However, for most of the numeric benchmarks, more expensive implementations are worthwhile. The results also point out the importance of using a compiler capable of scheduling load instructions for cache misses rather than cache hits in nonblocking systems. This Research Report is a preprint of a paper to appear at the 21st Annual International Symposium on Computer Architecture. d i g i t a l Western Research Laboratory 250 University Avenue Palo Alto, ...
Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors
, 1998
"... Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized to perform well on scientific and engineering wo ..."
Abstract
-
Cited by 81 (4 self)
- Add to MetaCart
Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized to perform well on scientific and engineering workloads. Given the radically different behavior of database workloads (especially OLTP), it is important to re-evaluate key system design decisions in the context of this important class of applications. This paper examines the behavior of database workloads on shared-memory multiprocessors with aggressive out-of-order processors, and considers simple optimizations that can provide further performance improvements. Our study is based on detailed simulations of the Oracle commercial database engine. The results show that the combination of out-of-order execution and multiple instruction issue is indeed effective in improving performance of database workloads, providing gains of 1.5 and 2.6 t...
Memory Consistency Models for Shared-Memory Multiprocessors
- WRL RESEARCH REPORT
, 1995
"... The memory consistency model for a shared-memory multiprocessor specifies the behavior of memory with respect to read and write operations from multiple processors. As such, the memory model influences many aspects of system design, including the design of programming languages, compilers, and the u ..."
Abstract
-
Cited by 61 (1 self)
- Add to MetaCart
The memory consistency model for a shared-memory multiprocessor specifies the behavior of memory with respect to read and write operations from multiple processors. As such, the memory model influences many aspects of system design, including the design of programming languages, compilers, and the underlying hardware. Relaxed models that impose fewer memory ordering constraints offer the potential for higher performance by allowing hardware and software to overlap and reorder memory operations. However, fewer ordering guarantees can compromise programmability and portability. Many of the previously proposed models either fail to provide reasonable programming semantics or are biased toward programming ease at the cost of sacrificing performance. Furthermore, the lack of consensus on an acceptable model hinders software portability across different systems. This dissertation focuses on providing a balanced solution that directly addresses the trade-off between programming ease and performance. To address programmability, we propose an alternative method for specifying memory behavior that presents a higher level abstraction to the programmer. We show that with only a few types of information supplied by the
Designing Memory Consistency Models for Shared-Memory Multiprocessors
, 1993
"... The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations ..."
Abstract
-
Cited by 51 (8 self)
- Add to MetaCart
The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations exploited by uniprocessors. For higher performance, several alternative models have been proposed. However, many of these are hardware-centric in nature and difficult to program. Further, the multitude of many seemingly unrelated memory models inhibits portability. We use a 3P criteria of programmability, portability, and performance to assess memory models, and find current models lacking in one or more of these criteria. This thesis establishes a unifying framework for reasoning about memory models that leads to models that adequately satisfy the 3P criteria. The first contribution of this thesis is a programmer-centric methodology, called sequential consistency normal form (SCNF), for specifying memory models. This methodology is based on the observation that performance enhancing optimizations can be allowed without violating sequential consistency if the system is given some information about the program. An SCNF model is a contract between the system and the programmer, where the system guarantees both high performance and sequential consistency only if the programmer provides certain information about the program. Insufficient information gives lower performance, but incorrect information
The Declining Effectiveness of Dynamic Caching for General-Purpose Microprocessors
, 1995
"... The computational power of commodity general-purpose microprocessors is racing to truly amazing levels. As peak levels of performance rise, the building of memory systems that can keep pace becomes increasingly problematic. We claim that in addition to the latency associated with waiting for operand ..."
Abstract
-
Cited by 49 (7 self)
- Add to MetaCart
The computational power of commodity general-purpose microprocessors is racing to truly amazing levels. As peak levels of performance rise, the building of memory systems that can keep pace becomes increasingly problematic. We claim that in addition to the latency associated with waiting for operands, the bandwidth of the memory system, especially that across the chip boundary, will become a progressively greater limit to high performance. After describing the current state of microsolutions aimed at alleviating the memory bottleneck, this paper postulates that dynamic caches themselves use memory inefficiently and will impede attempts to solve the memory problem. We present an analysis of several important algorithms, which shows that increasing levels of integration will not result in computational requirements outstripping off-chip bandwidth needs, thereby preserving the memory bottleneck. We then present results from two sets of simulations, which measured both the efficiency with which current caching techniques use memory (generally less than 20%), and how well (or poorly) caches reduce traffic to main memory (cache sizes up to 2000 times worse than optimal). We then discuss how two classes of techniques, (i) decoupling memory operations from computation, and (ii) explicit compiler management of the memory hierarchy, provide better long-term solutions to lowering a program's memory latencies and bandwidth requirements. Finally, we describe Galileo, a new project that will attempt to provide a long-term solution to the pernicious memory bottleneck.
An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors
, 1997
"... The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consisten ..."
Abstract
-
Cited by 43 (13 self)
- Add to MetaCart
The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consistency (RC) can significantly outperform the conceptually simpler model of sequential consistency (SC). Current and next-generation multiprocessors use commodity microprocessors that aggressively exploit instruction-level parallelism (ILP) using methods such as multiple issue, dynamic scheduling, and non-blocking reads. For such processors, researchers have conjectured that two techniques, hardware-controlled non-binding prefetching and speculative reads, have the potential to equalize the hardware performance of memory consistency models. These techniques have recently begun to appear in commercial microprocessors, and re-open the question of whether the performance benefits of release consiste...

