Results 1 - 10
of
19
Design and Evaluation of a Compiler Algorithm for Prefetching
- in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1992
"... Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memo ..."
Abstract
-
Cited by 451 (21 self)
- Add to MetaCart
Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a result, care must be taken to ensure that such overheads do not exceed the benefits.
Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors
- Journal of Parallel and Distributed Computing
, 1991
"... The large latency of memory accesses is a major obstacle in obtaining high processor utilization in large scale shared-memory multiprocessors. Although the provision of coherent caches in many recent machines has alleviated the problem somewhat, cache misses still occur frequently enough that they s ..."
Abstract
-
Cited by 264 (17 self)
- Add to MetaCart
The large latency of memory accesses is a major obstacle in obtaining high processor utilization in large scale shared-memory multiprocessors. Although the provision of coherent caches in many recent machines has alleviated the problem somewhat, cache misses still occur frequently enough that they significantly lower performance. In this paper we evaluate the effectiveness of non-binding software-controlled lyrefetching, as proposed in the Stanford DASH Multiprocessor, to address this problem. The prefetches are non-binding in the sense that the prefetched data is brought to a cache close to the processor, but is still available to the cache coherence protocol to keep it consistent. Prefetching is software-controlled since the program must explicitly issue prefetch instructions.
Two Techniques to Enhance the Performance of Memory Consistency Models
- In Proceedings of the 1991 International Conference on Parallel Processing
, 1991
"... The memory consistency model supported by a multiprocessor directly affects its performance. Thus, several attempts have been made to relax the consistency models to allow for more buffering and pipelining of memory accesses. Unfortunately, the potential increase in performance afforded by relaxing ..."
Abstract
-
Cited by 124 (5 self)
- Add to MetaCart
The memory consistency model supported by a multiprocessor directly affects its performance. Thus, several attempts have been made to relax the consistency models to allow for more buffering and pipelining of memory accesses. Unfortunately, the potential increase in performance afforded by relaxing the consistency model is accompanied by a more complex programming model. This paper introduces two general implementation techniques that provide higher performance for all the models. The first technique involves prefetching values for accesses that are delayed due to consistency model constraints. The second technique employs speculative execution to allow the processor to proceed even though the consistency model requires the memory accesses to be delayed. When combined, the above techniques alleviate the limitations imposed by a consistency model on buffering and pipelining of memory accesses, thus significantly reducing the impact of the memory consistency model on performance. 1 Intro...
Comparative Evaluation of Latency Reducing and Tolerating Techniques
- In Proceedings of the 18th Annual International Symposium on Computer Architecture
, 1991
"... Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization in large-scale shared-memory multiprocessors. In this paper, we consider four architectural techniques that address the latency problem: (i) hardware coherent caches, (ii) relaxe ..."
Abstract
-
Cited by 103 (6 self)
- Add to MetaCart
Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization in large-scale shared-memory multiprocessors. In this paper, we consider four architectural techniques that address the latency problem: (i) hardware coherent caches, (ii) relaxed memory consistency, (iii) softwarecontrolled prefetching, and (iv) multiple-context support. While some studies of benefits of the individual techniques have been done, no study evaluates all of the techniques within a consistent framework. This paper attempts to remedy this by providing a comprehensive evaluation of the benefits of the four techniques, both individually and in combinations, using a consistent set of architectural assumptions. The results in this paper have been obtained using detailed simulations of a large-scale shared-memory multiprocessor. Our results show that caches and relaxed consistency uniformly improve performance. The improvements due to prefetching and multiple contexts are sizeable, but are much more applicationdependent. Combinations of the various techniques generally attain better performance than each one on its own. Overall, we show that using suitable combinations of the techniques, performance can be improved by 4 to 7 times.
Compiler-directed Data Prefetching in Multiprocessors with Memory Hierarchies
- In International Conference on Supercomputing
, 1990
"... Memory hierarchies are used by multiprocessor systems to reduce large memory access times. It is necessary to automatically manage such a hierarchy, to obtain effective memory utilization. In this paper, we discuss the various issues involved in obtaining an optimal memory management strategy for a ..."
Abstract
-
Cited by 87 (7 self)
- Add to MetaCart
Memory hierarchies are used by multiprocessor systems to reduce large memory access times. It is necessary to automatically manage such a hierarchy, to obtain effective memory utilization. In this paper, we discuss the various issues involved in obtaining an optimal memory management strategy for a memory hierarchy. We present an algorithm for finding the earliest point in a program that a block of data can be prefetched. This determination is based on the control and data dependences in the program. Such a method is an integral part of more general memory management algorithms. We demonstrate our method's potential by using static analysis to estimate the performance improvement afforded by our prefetching strategy and to analyze the reference patterns in a set of Fortran benchmarks. We also study the effectiveness of prefetching in a realistic shared-memory system using an RTL-level simulator and real codes. This differs from previous studies by considering prefetching benefits in th...
Improving Memory-System Performance of Sparse Matrix-Vector Multiplication
- IBM Journal of Research and Development
, 1997
"... Sparse Matrix-Vector Multiplication is an important kernel that often runs inefficiently on superscalar RISC processors. This paper describe techniques that increase instruction-level parallelism and improve performance. The techniques include reordering to reduce cache misses originally due to Das ..."
Abstract
-
Cited by 61 (0 self)
- Add to MetaCart
Sparse Matrix-Vector Multiplication is an important kernel that often runs inefficiently on superscalar RISC processors. This paper describe techniques that increase instruction-level parallelism and improve performance. The techniques include reordering to reduce cache misses originally due to Das et al., blocking to reduce load instructions, and prefetching to prevent multiple load-store units from stalling simulteneously. The techniques improve performnance from about 40 Mflops (on a well-ordered matrix) to over 100 Mflops on a 266 Mflops machine. The techniques are applicable to other superscalar RISC processors as well and have improved performance on a Sun UltraSparc I workstation, for example. 1 Introduction Sparse matrix-vector multiplication is an important computational kernel in many iterative linear solvers (see [5], for example). Unfortunately, on many computers this kernel runs slowly relative to other numerical codes, such as dense matrix computations. This paper propos...
A Timestamp-based Cache Coherence Scheme
, 1989
"... this paper, we propose a software-assisted cache coherence scheme which overcomes some of the inefficiencies of previous approaches by using a combination of a compile-time marking of references and a hardware-based local incoherence detection scheme. In section 2, we give the notation used througho ..."
Abstract
-
Cited by 29 (1 self)
- Add to MetaCart
this paper, we propose a software-assisted cache coherence scheme which overcomes some of the inefficiencies of previous approaches by using a combination of a compile-time marking of references and a hardware-based local incoherence detection scheme. In section 2, we give the notation used throughout the paper. Section 3 reviews previous software-assisted methods to enforcing cache coherence. In section 4, a complete description of our approach is given along with a correctness proof. Section 5 gives a qualitative comparison of our scheme and the directory-based approaches. Section 6 provides some concluding remarks. Definitions In conventional programs, there are four kinds of data dependences : flow-dependence, antidependence, output-dependence and input-dependence [14]. Let r and r
Evaluating the Impact of Advanced Memory Systems on Compiler-Parallelized Codes
, 1995
"... Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advancedmemory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-size ..."
Abstract
-
Cited by 19 (7 self)
- Add to MetaCart
Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advancedmemory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from the SPEC, NAS, PERFECT, and RICEPS benchmark suites, we measure statistics such as speedups, memory costs, causes of cache misses, cache line utilization, and data traffic. This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512 byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharin...
Design and Analysis of a Scalable Cache Coherence Scheme based on Clocks and Timestamps
, 1992
"... this paper, we restrict ourselves to a study of caching of shared variables. The presence of multiple private caches introduces the well-known cache coherence problem [7]. Hardware based protocols to solve the cache coherence problem are well understood in a shared-bus environment (e.g., [17, 22, 32 ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
this paper, we restrict ourselves to a study of caching of shared variables. The presence of multiple private caches introduces the well-known cache coherence problem [7]. Hardware based protocols to solve the cache coherence problem are well understood in a shared-bus environment (e.g., [17, 22, 32, 37]). However these solutions cannot be extended to the dance-hall multiprocessors since they make use of the instantaneous broadcast and "snoopy" mechanisms provided by the shared-bus. Software-assisted [10, 25, 27, 33, 38, 40] and directory-based [1, 4, 7, 36, 41] schemes are usually advocated in such an environment. In this paper, we propose a software-assisted cache coherence scheme which overcomes some of the inefficiencies of previous approaches by using a combination of a compile-time marking of references and a hardware-based local incoherence detection scheme. We also give a performance evaluation of our proposed scheme. In Section 2, we give the notation used throughout the paper. Section 3 reviews previous software-assisted approaches to enforcing cache coherence. In Section 4, a complete description of our approach is given. A correctness proof of our proposed scheme is given elsewhere [29] and is omitted here. Section 5 gives a quantitative comparison of our scheme with previous approaches. Section 6 provides some concluding remarks. 2 Definitions Programs written for shared-memory multiprocessors may use explicit parallel constructs or may be conventional sequential programs transformed into equivalent parallel ones by a restructuring compiler or a preprocessor like Parafrase [24, 39], PFC [3] or PTRAN [2]. The parallelism is constrained by data dependences : flow-dependence, anti-dependence, and
An Integrated Hardware/Software Solution for Effective Management of Local Storage in High-Performance Systems
- In Proceedings of the International Conference on Parallel Processing, volume II
, 1991
"... The potential of high-performance systems, especially vector and parallel machines, is generally limited by the bandwidth between processors and memory. To achieve the performance of which these machines should be capable, greater emphasis must be placed on optimizing array accesses. We propose a pr ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
The potential of high-performance systems, especially vector and parallel machines, is generally limited by the bandwidth between processors and memory. To achieve the performance of which these machines should be capable, greater emphasis must be placed on optimizing array accesses. We propose a practical, integrated hardware/software strategy for increasing the effectiveness of local storage management. Our scheme provides many of the advantages of both compile-time and run-time memory management techniques. In this paper, we describe our local storage facility: the priority data cache. We also describe compile-time techniques for easily and effectively utilizing this level of local storage. 1 Introduction The potential of high-performance systems is generally limited by the bandwidth between processors and memory. To achieve the performance of which these machines should be capable, greater emphasis must be placed on optimizing array accesses. At compile time, many safe and profita...

