Results 1 -
8 of
8
Reducing Memory Latency via Non-blocking and Prefetching Caches
, 1992
"... Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploit ..."
Abstract
-
Cited by 150 (2 self)
- Add to MetaCart
Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data in the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimizations to enhance the effectiveness of non-blocking caches. Results from instruction level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency...
Models of Machines and Computation for Mapping in Multicomputers
, 1993
"... It is now more than a quarter of a century since researchers started publishing papers on mapping strategies for distributing computation across the computation resource of multiprocessor systems. There exists a large body of literature on the subject, but there is no commonly-accepted framework ..."
Abstract
-
Cited by 76 (1 self)
- Add to MetaCart
It is now more than a quarter of a century since researchers started publishing papers on mapping strategies for distributing computation across the computation resource of multiprocessor systems. There exists a large body of literature on the subject, but there is no commonly-accepted framework whereby results in the field can be compared. Nor is it always easy to assess the relevance of a new result to a particular problem. Furthermore, changes in parallel computing technology have made some of the earlier work of less relevance to current multiprocessor systems. Versions of the mapping problem are classified, and research in the field is considered in terms of its relevance to the problem of programming currently available hardware in the form of a distributed memory multiple instruction stream multiple data stream computer: a multicomputer.
Data Prefetching for High-Performance Processors
, 1993
"... Recent technological advances are such that the gap between processor cycle times and memory cycle times is growing. Techniques to reduce or tolerate large memory latencies become essential for achieving high processor utilization. In this dissertation, we propose and evaluate data prefetching tech ..."
Abstract
-
Cited by 36 (1 self)
- Add to MetaCart
Recent technological advances are such that the gap between processor cycle times and memory cycle times is growing. Techniques to reduce or tolerate large memory latencies become essential for achieving high processor utilization. In this dissertation, we propose and evaluate data prefetching techniques that address the data access penalty problems. First, we propose a hardware-based data prefetching approach for reducing memory latency. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. It includes three variations of the design of the RPT and associated logic: generic design, a lookahead mechanism, and a correlated scheme. They differ mostly on the timing of the prefetching. We evaluate the three schemes by ...
Designing Programming Languages for Analyzability: A Fresh Look at Pointer Data Structures
- In Proceedings of the IEEE 1992 International Conference on Programming Languages
, 1992
"... In this paper we propose a programming language mechanism and associated compiler techniques which significantly enhance the analyzability of pointerbased data structures frequently used in non-scientific programs. Our approach is based on exploiting two important properties of pointer data structur ..."
Abstract
-
Cited by 20 (5 self)
- Add to MetaCart
In this paper we propose a programming language mechanism and associated compiler techniques which significantly enhance the analyzability of pointerbased data structures frequently used in non-scientific programs. Our approach is based on exploiting two important properties of pointer data structures: structural inductivity and speculative traversability. Structural inductivity facilitates the application of a static interference analysis method for such pointer data structures based on path matrices, and speculative traversability is utilized by a novel loop unrolling technique for while loops that exploit fine-grain parallelism by speculatively traversing such data structures. The effectiveness of this approach is demonstrated by applying it to a collection of loops found in typical nonscientific C programs. 1 Introduction In the past decade, the dramatic improvement of VLSI technology has led to modern high-performance microprocessors that support some level of fine-grain paralle...
Compiler Support for Software Prefetching
, 1998
"... Due to the growing disparity between processor speed and main memory speed, techniques that improve cache utilization and hide memory latency are often needed to help applications achieve peak performance. Compiler-directed software prefetching is a hybrid software/hardware strategy that addresses t ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
Due to the growing disparity between processor speed and main memory speed, techniques that improve cache utilization and hide memory latency are often needed to help applications achieve peak performance. Compiler-directed software prefetching is a hybrid software/hardware strategy that addresses this need. In this form of prefetching, the compiler inserts cache prefetch instructions into a program during the compilation process. During the program's execution, the hardware executes the prefetch instructions in parallel with other operations, bringing data items into the cache prior to the point where they are actually used, eliminating processor stalls due to cache misses. In this dissertation, we focus on the compiler's role in software prefetching. In a set of experimental studies, we evaluate the performance of current software prefetching strategies, first for sequential benchmark programs running on a simulated uniprocessor machine, and then for a set of parallel benchmarks on a...
An Experimental Evaluation of List Scheduling
- Dept. of Computer Science, Rice University
, 1998
"... While altering the scope of instruction scheduling has a rich heritage in compiler literature, instruction scheduling algorithms have received little coverage in recent times. The widely held belief is that greedy heuristic techniques such as list scheduling are "good" enough for most practical ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
While altering the scope of instruction scheduling has a rich heritage in compiler literature, instruction scheduling algorithms have received little coverage in recent times. The widely held belief is that greedy heuristic techniques such as list scheduling are "good" enough for most practical purposes. The evidence supporting this belief is largely anecdotal with a few exceptions.
GNU Instruction Scheduler: Ailments and Cures in Context of Superscalarity
, 1997
"... In the past, the GNU C compiler (GCC) has been successfully ported to several superscalar microprocessors. For that purpose, the instruction timing of the target processor usually had been modeled in a straightforward manner. Unfortunately, in our experience, this is likely to lead astray the instru ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
In the past, the GNU C compiler (GCC) has been successfully ported to several superscalar microprocessors. For that purpose, the instruction timing of the target processor usually had been modeled in a straightforward manner. Unfortunately, in our experience, this is likely to lead astray the instruction scheduler. In this paper we describe some of our experiments that revealed such flaws, concerning the DEC Alpha 21064 as well as other superscalar RISC processors. We analyze the circumstances that led to poorly scheduled code, and demonstrate how the machine description supplied for a superscalar processor can be modified to fix some of these problems without hampering the portability of the GCC. On the other hand we show situations for which we do not have a solution within the given framework. I. Introduction T HE GNU C Compiler (GCC) [12] has been designed to combine high portability with the generation of fast executing programs. To render the GCC portable, all machine-dependent...
September 1998An Experimental Evaluation of List Scheduling ⋆
"... Abstract. While altering the scope of instruction scheduling has a rich heritage in compiler literature, instruction scheduling algorithms have received little coverage in recent times. The widely held belief is that greedy heuristic techniques such as list scheduling are “good ” enough for most pra ..."
Abstract
- Add to MetaCart
Abstract. While altering the scope of instruction scheduling has a rich heritage in compiler literature, instruction scheduling algorithms have received little coverage in recent times. The widely held belief is that greedy heuristic techniques such as list scheduling are “good ” enough for most practical purposes. The evidence supporting this belief is largely anecdotal with a few exceptions. In this paper we examine some hard evidence in support of list scheduling. To this end we present two alternative algorithms to list scheduling that use randomization: randomized backward forward list scheduling, and iterative repair. Using these alternative algorithms we are better able to examine the conditions under which list scheduling performs well and poorly. Specifically, we explore the efficacy of list scheduling in light of available parallelism, the list scheduling priority heuristic, and number of functional units. While the generic list scheduling algorithm does indeed perform quite well overall, there are important situations which may warrant the use of alternate algorithms. 1

