Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors
Journal of Parallel and Distributed Computing, 1991
Cited by 264 (17 self)
The large latency of memory accesses is a major obstacle in obtaining high processor utilization in large-scale shared-memory multiprocessors. Although the provision of coherent caches in many recent machines has alleviated the problem somewhat, cache misses still occur frequently enough that they significantly lower performance. In this paper we evaluate the effectiveness of non-binding software-controlled prefetching, as proposed in the Stanford DASH Multiprocessor, to address this problem. The prefetches are non-binding in the sense that the prefetched data is brought to a cache close to the processor, but is still available to the cache coherence protocol to keep it consistent. Prefetching is software-controlled since the program must explicitly issue prefetch instructions.
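To make the idea of a non-binding, software-issued prefetch concrete, here is a minimal sketch in Python. It is a toy model, not the DASH mechanism: the function `run_loop`, its parameters, and the assumptions that every prefetch arrives in time and that the cache never evicts are all hypothetical simplifications.

```python
def run_loop(n_blocks, prefetch_distance, invalidated=frozenset()):
    """Count demand misses for a loop that touches blocks 0..n_blocks-1 once each.

    Assumptions (illustrative only): a prefetch issued `prefetch_distance`
    iterations early always arrives in time, and the cache never evicts.
    `invalidated` lists blocks that another processor writes after they were
    prefetched but before this processor uses them.
    """
    cache = set()
    misses = 0
    for i in range(n_blocks):
        # Software-issued prefetch: the program explicitly requests a block
        # it will need `prefetch_distance` iterations from now.
        if prefetch_distance > 0 and i + prefetch_distance < n_blocks:
            cache.add(i + prefetch_distance)
        # Non-binding: the prefetched copy stays visible to the coherence
        # protocol, so a remote write can still invalidate it before use.
        if i in invalidated:
            cache.discard(i)
        if i not in cache:
            misses += 1  # demand miss: the processor would stall here
            cache.add(i)
    return misses
```

In this model `run_loop(100, 0)` misses on every block, while `run_loop(100, 4)` misses only on the first four blocks, which no earlier iteration could have prefetched. Passing `invalidated={10}` adds one more miss, showing that a non-binding prefetch gives up some benefit rather than returning stale data.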
Effective Hardware-based Data Prefetching for High-performance Processors
IEEE Transactions on Computers, 1995
Cited by 180 (2 self)
Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a Reference Prediction Table (RPT) organized as an instruction cache. The three designs differ mostly on the timing of the prefetching. In the simplest scheme (basic), prefetches can be generated one iteration ahead of actual use. The lookahead variation takes advantage of a lookahead program counter that ideally stays one memory latency time ahead of the real program counter and that is used as the control mechanism to generate the prefetches. Finally the correlated scheme uses a more sophisticated design to detect patterns across loop levels. These designs are evaluated by simulating the ten SPEC benchmarks on a cycle-by-cycle basis. The results show that 1) the three hardware prefetching schemes all yield significant reductions in the data access penalty when compared with regular caches, 2) the benefits are greater when the hardware assist augments small on-chip caches, and 3) the lookahead scheme is the preferred one cost-performance wise.
Index Terms: Prefetching, hardware function unit, reference prediction, branch prediction, data cache, cycle-by-cycle simulations.
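The abstract describes a table, indexed like an instruction cache, that tracks the access pattern of each memory instruction. The sketch below illustrates that idea in Python under loud assumptions: it is a simplified stride detector, not the paper's exact state machine. The class name, the three state labels, and the rule "prefetch once the same non-zero stride is seen twice in a row" are hypothetical choices, and the table is an unbounded dictionary rather than a fixed-size cache.

```python
class ReferencePredictionTable:
    """Toy per-instruction stride predictor (a simplification, not the paper's design)."""

    def __init__(self):
        # Maps the program counter of a load/store to
        # [previous address, last observed stride, state label].
        self.entries = {}

    def access(self, pc, addr):
        """Record that instruction `pc` touched `addr`.

        Returns the address to prefetch for the next iteration, or None
        when no stable stride has been observed yet.
        """
        entry = self.entries.get(pc)
        if entry is None:
            # First time this instruction is seen: nothing to predict from.
            self.entries[pc] = [addr, 0, "initial"]
            return None
        prev_addr, stride, _ = entry
        new_stride = addr - prev_addr
        if new_stride == stride and stride != 0:
            # Same non-zero stride twice in a row: predict one iteration ahead.
            self.entries[pc] = [addr, stride, "steady"]
            return addr + stride
        # Stride changed (or is zero): remember it but do not prefetch yet.
        self.entries[pc] = [addr, new_stride, "transient"]
        return None
```

For a single instruction walking an array with a 4-byte stride (addresses 100, 104, 108, 112), the first two accesses only train the entry, and the third and fourth return 112 and 116. This corresponds roughly to the "one iteration ahead" timing of the basic scheme; the lookahead and correlated variations are not modeled here.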
False Sharing and Spatial Locality in Multiprocessor Caches
IEEE Transactions on Computers, 1992
Cited by 115 (4 self)
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in this paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We...
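False sharing, as defined in the abstract, can be illustrated with a small counting model. The Python sketch below assumes a simple write-invalidate scheme in which each block has at most one owner; the function names, the 4-word block size in the usage note, and the padding-based layout are hypothetical and are not the paper's programmer-transparent optimization.

```python
def coherence_misses(writes, block_words):
    """Count writes that find their block owned by a different processor.

    `writes` is a sequence of (processor, word_address) pairs. Under the
    assumed write-invalidate model, each such write forces a coherence
    transaction even if the two processors never touch the same word.
    """
    owner = {}  # block number -> processor that wrote it last
    misses = 0
    for proc, word in writes:
        block = word // block_words
        if block in owner and owner[block] != proc:
            misses += 1
        owner[block] = proc
    return misses


def interleaved_writes(word_of, rounds):
    """Each processor repeatedly updates its own word, in interleaved order."""
    return [(p, word_of[p]) for _ in range(rounds) for p in sorted(word_of)]
```

With 4-word blocks, placing two processors' private counters at words 0 and 1 (`{0: 0, 1: 1}`) puts them in the same block, so nearly every write in `interleaved_writes(..., 10)` is a coherence miss. Moving the second counter to word 4 (`{0: 0, 1: 4}`) places it in a different block and the count drops to zero, at the cost of unused space.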
Data Prefetching for High-Performance Processors
1993
Cited by 36 (1 self)
Recent technological advances are such that the gap between processor cycle times and memory cycle times is growing. Techniques to reduce or tolerate large memory latencies become essential for achieving high processor utilization. In this dissertation, we propose and evaluate data prefetching techniques that address the data access penalty problems. First, we propose a hardware-based data prefetching approach for reducing memory latency. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. It includes three variations of the design of the RPT and associated logic: generic design, a lookahead mechanism, and a correlated scheme. They differ mostly on the timing of the prefetching. We evaluate the three schemes by ...
Delayed Consistency And Its Effects On The Miss Rate Of Parallel Programs
In Supercomputing'91 Proceedings, 1991
Cited by 27 (4 self)
In cache-based multiprocessors a protocol must maintain coherence among replicated copies of shared writable data. In delayed consistency protocols the effects of outgoing and incoming invalidations or updates are delayed. Delayed coherence can reduce processor blocking time as well as the effects of false sharing. In this paper, we introduce several implementations of delayed consistency for cache-based systems in the framework of a weakly-ordered consistency model. A performance comparison of the delayed protocols with the corresponding On-the-Fly (non-delayed) consistency protocol is made, through execution-driven simulations of four parallel algorithms. The results show that, for parallel programs in which false sharing is a problem, significant reductions in the data miss rate can be obtained with just a small increase in the cost and complexity of the cache system.
1.0 Introduction
The design of shared memory multiprocessors that can scale up to large n...
Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support
1992
Cited by 24 (3 self)
This paper presents a software cache coherence scheme that uses virtual memory (VM) support to maintain cache coherency for shared memory multiprocessors and requires no special hardware to do so. Traditional VM translation hardware in each processor is used to detect memory access attempts that would violate cache coherence and system software is used to enforce coherence. The implementation of this class of coherence schemes is extremely economical: it requires neither special multiprocessor hardware nor compiler support, and easily incorporates different consistency models. We evaluated two consistency models for the VM-based approach: sequential consistency and lazy release consistency. The VM-based schemes are compared with a bus-based snoopy caching architecture, and our trace-driven simulation results show that the VM-based cache coherence schemes are practical for small-scale, shared memory multiprocessors.
Keywords: shared memory, multiprocessors, cache coherence, memory manag...
Multiprocessor Cache Coherence Based on Virtual Memory Support
1995
Cited by 17 (1 self)
Virtual memory based cache coherence is a mechanism that relies only on hardware that already exists on the microprocessors of a shared memory multiprocessor system, yet dynamically detects and resolves potential cache inconsistencies using virtual memory techniques. The key feature of the approach is that the virtual memory translation hardware on each processor is used to detect shared accesses that could lead to memory incoherencies, and VM page fault handlers execute the appropriate actions to maintain cache coherence. VM-based cache coherence basically trades off design simplicity against increased software overheads. The work presented in this paper evaluates this tradeoff. We show that VM-based cache coherence performs well for scientific applications that require significant aggregate memory bandwidth.
Keywords: shared memory, multiprocessors, cache coherence, virtual memory, performance evaluation.
Biographies: Karin Petersen is a Member of the Research Staff at Xe...
Design and Analysis of a Scalable Cache Coherence Scheme based on Clocks and Timestamps
1992
Cited by 15 (0 self)
this paper, we restrict ourselves to a study of caching of shared variables. The presence of multiple private caches introduces the well-known cache coherence problem [7]. Hardware based protocols to solve the cache coherence problem are well understood in a shared-bus environment (e.g., [17, 22, 32, 37]). However these solutions cannot be extended to the dance-hall multiprocessors since they make use of the instantaneous broadcast and "snoopy" mechanisms provided by the shared-bus. Software-assisted [10, 25, 27, 33, 38, 40] and directory-based [1, 4, 7, 36, 41] schemes are usually advocated in such an environment. In this paper, we propose a software-assisted cache coherence scheme which overcomes some of the inefficiencies of previous approaches by using a combination of a compile-time marking of references and a hardware-based local incoherence detection scheme. We also give a performance evaluation of our proposed scheme.
In Section 2, we give the notation used throughout the paper. Section 3 reviews previous software-assisted approaches to enforcing cache coherence. In Section 4, a complete description of our approach is given. A correctness proof of our proposed scheme is given elsewhere [29] and is omitted here. Section 5 gives a quantitative comparison of our scheme with previous approaches. Section 6 provides some concluding remarks.
2 Definitions
Programs written for shared-memory multiprocessors may use explicit parallel constructs or may be conventional sequential programs transformed into equivalent parallel ones by a restructuring compiler or a preprocessor like Parafrase [24, 39], PFC [3] or PTRAN [2]. The parallelism is constrained by data dependences: flow-dependence, anti-dependence, and
A Preliminary Evaluation of Cache-Miss-Initiated Prefetching Techniques in Scalable Multiprocessors
1994
Cited by 13 (2 self)
this paper we use execution-driven simulation of parallel programs to evaluate these tradeoffs for scalable multiprocessors with high network bandwidth and latency. In particular, we consider the effect on application performance of three different cache-miss-initiated prefetching techniques: (1) large cache blocks, which fetch multiple addresses within a single block, (2) sequential prefetching, which fetches multiple consecutive blocks, and (3) hybrid prefetching, a novel technique combining hardware and software support for stride-directed prefetching. Our results show that block sizes between 16 and 128 bytes provide the best performance for our applications; larger blocks either increase the miss rate or incur an increase in the miss penalty that dominates any improvement in the miss rate. Our results also show that sequential and hybrid prefetching perform better than prefetching via large cache blocks, and that hybrid prefetching performs at least as well as sequential prefetching. In fact, hybrid prefetching can perform as well as software prefetching, given sufficient bandwidth and regular memory addressing. Based on these results, we conclude that among the cache-miss-initiated prefetching techniques we consider, hybrid prefetching is the only strategy that can offer significant performance improvements for scalable multiprocessors.
The remainder of this paper is organized as follows. In section 2 we describe in detail each of the cache-miss-initiated techniques we consider. In section 3 we describe our simulation methodology, performance metrics, and application workload. We present our experimental results in section 4, and our conclusions in section 5.
2 Cache-Miss-Initiated Prefetching Techniques
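The first two techniques named in the abstract, large cache blocks and sequential prefetching, can be compared with a small miss-counting model. The Python sketch below is only an illustration: `count_misses` and `prefetch_degree` are hypothetical names, the cache is assumed never to evict, and miss penalty and network bandwidth, which the abstract says determine the real tradeoff, are ignored. Hybrid (stride-directed) prefetching is not modeled.

```python
def count_misses(addresses, block_bytes, prefetch_degree=0):
    """Count cache misses for a stream of byte addresses.

    On every miss the missing block is fetched together with the next
    `prefetch_degree` consecutive blocks (miss-initiated sequential
    prefetching). With prefetch_degree=0 only the missing block is fetched,
    so a larger `block_bytes` models the "large cache blocks" alternative.
    """
    cache = set()
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        if block not in cache:
            misses += 1
            cache.update(range(block, block + prefetch_degree + 1))
    return misses
```

For a unit-stride stream such as `range(0, 256, 4)`, 16-byte blocks give 16 misses, while either 64-byte blocks or 16-byte blocks with `prefetch_degree=3` give 4. For a stream with a 64-byte stride, `range(0, 1024, 64)`, sequential prefetching of 16-byte blocks removes no misses in this model, which hints at why a stride-directed scheme is of interest.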
The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor
IEEE Trans. on Parallel and Distributed Systems, 1994
Cited by 10 (2 self)
Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5 to 30 percent compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on th...
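The difference between scheduling blocks of consecutive iterations and interleaving iterations across processors can be sketched as follows. Both functions are hypothetical illustrations of generic static scheduling policies, written in Python rather than Fortran; they are not the scheduling or adaptive prefetching code evaluated in the paper.

```python
def block_schedule(n_iters, n_procs):
    """Give each processor one contiguous chunk of iterations."""
    chunk = -(-n_iters // n_procs)  # ceiling division
    return [
        list(range(p * chunk, min((p + 1) * chunk, n_iters)))
        for p in range(n_procs)
    ]


def cyclic_schedule(n_iters, n_procs):
    """Deal iterations out one at a time, round-robin."""
    return [list(range(p, n_iters, n_procs)) for p in range(n_procs)]
```

If iteration `i` touches array element `i`, a block schedule makes each processor walk consecutive elements, so prefetching the next few single-word blocks on a miss fetches data that the same processor will use. Under a cyclic schedule the neighbouring elements belong to other processors, so the same prefetch would mostly bring in data used elsewhere.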

