Reducing Memory Latency via Non-blocking and Prefetching Caches
, 1992
Cited by 150 (2 self)
Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data into the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimizations to enhance the effectiveness of non-blocking caches. Results from instruction-level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency...
ADAPTIVE OPTIMIZATION FOR SELF: RECONCILING HIGH PERFORMANCE WITH EXPLORATORY PROGRAMMING
, 1994
Cited by 95 (6 self)
Object-oriented programming languages confer many benefits, including abstraction, which lets the programmer hide
the details of an object’s implementation from the object’s clients. Unfortunately, crossing abstraction boundaries
often incurs a substantial run-time overhead in the form of frequent procedure calls. Thus, pervasive use of abstraction,
while desirable from a design standpoint, may be impractical when it leads to inefficient programs.
Aggressive compiler optimizations can reduce the overhead of abstraction. However, the long compilation times
introduced by optimizing compilers delay the programming environment's responses to changes in the program.
Furthermore, optimization also conflicts with source-level debugging. Thus, programmers are caught on the horns of
two dilemmas: they have to choose between abstraction and efficiency, and between responsive programming environments
and efficiency. This dissertation shows how to reconcile these seemingly contradictory goals by performing
optimizations lazily.
Four new techniques work together to achieve high performance and high responsiveness:
• Type feedback achieves high performance by allowing the compiler to inline message sends based on information
extracted from the runtime system. On average, programs run 1.5 times faster than the previous SELF system;
compared to a commercial Smalltalk implementation, two medium-sized benchmarks run about three times faster.
This level of performance is obtained with a compiler that is both simpler and faster than previous SELF compilers.
• Adaptive optimization achieves high responsiveness without sacrificing performance by using a fast nonoptimizing
compiler to generate initial code while automatically recompiling heavily used parts of the program
with an optimizing compiler. On a previous-generation workstation like the SPARCstation-2, fewer than 200
pauses exceeded 200 ms during a 50-minute interaction, and 21 pauses exceeded one second. On a current-generation
workstation, only 13 pauses exceed 400 ms.
• Dynamic deoptimization shields the programmer from the complexity of debugging optimized code by
transparently recreating non-optimized code as needed. No matter whether a program is optimized or not, it can
always be stopped, inspected, and single-stepped. Compared to previous approaches, deoptimization allows more
debugging while placing fewer restrictions on the optimizations that can be performed.
• Polymorphic inline caching generates type-case sequences on-the-fly to speed up messages sent from the same
call site to several different types of object. More significantly, the caches collect concrete type information for the
optimizing compiler.
With better performance yet good interactive behavior, these techniques make exploratory programming possible
both for pure object-oriented languages and for application domains requiring higher ultimate performance, reconciling
exploratory programming, ubiquitous abstraction, and high performance.
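The polymorphic inline caching idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the SELF implementation: the class name `PolymorphicInlineCache`, the `max_entries` limit, and the `Circle`/`Square` demo types are all hypothetical, and a Python dictionary stands in for the generated type-case sequence.

```python
class PolymorphicInlineCache:
    """Per-call-site cache mapping receiver type to the resolved method (illustrative only)."""

    def __init__(self, selector, max_entries=4):
        self.selector = selector
        self.max_entries = max_entries  # hypothetical size limit, not from the source
        self.entries = {}               # receiver type -> function found by full lookup
        self.misses = 0

    def send(self, receiver, *args):
        rtype = type(receiver)
        method = self.entries.get(rtype)
        if method is None:
            # Miss: do the full lookup, then extend the type case for this call site.
            self.misses += 1
            method = getattr(rtype, self.selector)
            if len(self.entries) < self.max_entries:
                self.entries[rtype] = method
        return method(receiver, *args)

    def observed_types(self):
        """Concrete receiver types seen at this site, usable as type feedback."""
        return list(self.entries)


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r  # integer arithmetic keeps the demo exact


class Square:
    def __init__(self, s):
        self.s = s

    def area(self):
        return self.s * self.s


site = PolymorphicInlineCache("area")
areas = [site.send(shape) for shape in (Circle(1), Square(2), Circle(2), Square(3))]
```

Only the first send for each receiver type misses; later sends hit the cached entry, and `site.observed_types()` exposes the kind of concrete type information the abstract says the optimizing compiler consumes.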
The detection and elimination of useless misses in multiprocessors
- In Proceedings of the 20th International Symposium on Computer Architecture
, 1993
Cited by 64 (3 self)
In this paper we introduce a classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting program execution. Based on the new classification we evaluate miss reduction techniques in hardware, based on delaying and combining invalidations. We compare the effectiveness of five different protocols for combining invalidations leading to useless misses for cache-based multiprocessors and for multiprocessors with virtual shared memory. In cache-based systems these techniques are very effective and lead to miss rates which are close to the minimum. In virtual shared memory systems, the techniques are also effective but leave room for additional improvements.
Data Prefetching for High-Performance Processors
, 1993
Cited by 36 (1 self)
Recent technological advances are such that the gap between processor cycle times and memory cycle times is growing. Techniques to reduce or tolerate large memory latencies become essential for achieving high processor utilization. In this dissertation, we propose and evaluate data prefetching techniques that address the data access penalty problems. First, we propose a hardware-based data prefetching approach for reducing memory latency. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. It includes three variations of the design of the RPT and associated logic: generic design, a lookahead mechanism, and a correlated scheme. They differ mostly on the timing of the prefetching. We evaluate the three schemes by ...
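As a rough illustration of the reference prediction table described above, the sketch below keeps one entry per memory instruction, indexed and tagged by instruction address in the manner of a direct-mapped instruction cache, and predicts the next address once the same stride has been seen twice in a row. The class name, the table size, and the two-hit confirmation rule are assumptions made for this example; it does not reproduce the generic, lookahead, or correlated designs evaluated in the dissertation.

```python
class ReferencePredictionTable:
    """Simplified stride predictor keyed by instruction address (illustrative only)."""

    def __init__(self, size=64):
        self.size = size  # hypothetical number of table slots
        self.table = {}   # slot index -> (tag, prev_addr, stride, confirmed)

    def access(self, pc, addr):
        """Record a memory access by the instruction at pc; return a prefetch address or None."""
        slot = pc % self.size
        entry = self.table.get(slot)
        if entry is None or entry[0] != pc:
            # First sighting, or another instruction owns the slot: allocate, no prediction.
            self.table[slot] = (pc, addr, 0, False)
            return None
        _, prev_addr, stride, _ = entry
        new_stride = addr - prev_addr
        # Predict only when the same non-zero stride repeats.
        confirmed = new_stride == stride and stride != 0
        self.table[slot] = (pc, addr, new_stride, confirmed)
        return addr + new_stride if confirmed else None


rpt = ReferencePredictionTable()
# One load instruction walking an array of 8-byte elements.
predictions = [rpt.access(0x400, addr) for addr in (1000, 1008, 1016, 1024)]
# predictions == [None, None, 1024, 1032]
```

The first two accesses only train the entry; from the third access on, the table issues a prefetch for the next element, and an irregular address resets the confirmation.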
Essential Misses And Data Traffic In Coherence Protocols
, 1995
Cited by 14 (6 self)
In this paper we introduce a classification of misses and of components of the data traffic in shared-memory multiprocessors based on inter-processor communication. We consider protocols with invalidations, updates and prefetches in systems with infinite and finite caches. We identify the set of essential misses and the essential traffic, i.e., the smallest set of misses and the smallest amount of traffic necessary for correct execution. The rest of the misses and of the data traffic is non-essential and could be ignored without affecting the correctness of program execution. To illustrate the classification of misses and traffic, we apply it to a set of parallel scientific programs and observe the overhead created by different hardware mechanisms when block sizes and cache sizes are varied. Keywords: Shared-memory multiprocessor, virtual shared memory, cache coherence, latency tolerance, performance evaluation, execution-driven simulations. 1. This research was partially supported by ...
Improving Processor Performance by Dynamically Pre-Processing the Instruction Stream
, 1998
Cited by 9 (0 self)
The exponentially increasing gap between processors and off-chip memory, as measured in processor cycles, is rapidly turning memory latency into a major processor performance bottleneck. Traditional solutions, such as employing multiple levels of caches, are expensive and do not work well with some applications. We evaluate a technique, called runahead pre-processing, that can significantly improve processor performance. The basic idea behind runahead is to use the processor pipeline to pre-process instructions during cache miss cycles, instead of stalling. The pre-processed instructions are used to generate highly accurate instruction and data stream prefetches, while all of the pre-processed instruction results are discarded after the cache miss has been serviced: this allows us to achieve a form of very aggressive speculation with a simple in-order...
Specialized Caches To Improve Data Access Performance
, 1993
Cited by 5 (0 self)
High performance processor organizations place large demands on the data memory hierarchy. The data bandwidth requirements of a processor can be a serious performance constraint. As a result, cache memory is a very important element in the implementation of a high performance computer. Caching a subset of main memory in faster memory provides the processor with high bandwidth and low latency data. The utility of caching is determined by the cache size, organization, and speed. The focus of this dissertation is to improve the performance of a data cache by the addition of small specialized caches. By analyzing the spatial and temporal locality in the data reference stream at various points in the data memory hierarchy, we have developed a mixture of specialized caches to improve the data memory hierarchy. We have developed and analyzed write caches, tag caches, subword caches, fetch caches, and a two-level windowed register file. Each specialized cache reduces the bandwidth and/or laten...
Area And Performance Analysis Of Processor Configurations With Scaling Of Technology
, 1994
Cited by 4 (0 self)
As integrated circuit density increases, computer architects face the interesting problem of how best to utilize the available die size given cost and performance constraints. Traditionally, area partitioning and floor-planning have been done in an ad hoc fashion based on intuition and experience of the designers. This paper proposes a systematic methodology for correlating area and performance as the designer increases the transistor count of a given sub-unit. Specifically, we investigate the performance of three possible processor configurations, and present performance results as the minimum feature size is reduced. Key Words and Phrases: cache, technology scaling, bus traffic, area modeling, superscalar, multiprocessor. Copyright © 1994 by Steve Fu and Michael Flynn.
Systematic Objective-driven Computer Architecture Optimization
- in Proc. 16th Conference on Advanced Research in VLSI (ARVLSI'95)
, 1995
Cited by 3 (1 self)
Computer designers now have more transistors and architectural alternatives than at any time. Computer-aided design tools automate much of the physical design process. However, few tools have been developed to help the computer architect specify near-optimal microarchitectural configurations in the early design stages. Such tools are needed to systematically guide the early design specifications subject to multiple objectives such as cost, performance, and power consumption. This paper illustrates an objective-driven microarchitectural design methodology that couples the specification design phase with an optimization technique. The design of a memory hierarchy with multiple performance objectives is used as a case study. This is a directed search problem with a high dimensionality. We show that the genetic algorithm, a global optimization technique based on the metaphor of natural selection and survival of the fittest, is an ideal candidate for such an objective-driven search in a hig...
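To make the idea of a genetic-algorithm search over memory-hierarchy configurations concrete, here is a toy sketch. Everything in it is an assumption for illustration: the three parameters, their candidate values, and above all the `cost` function, which is a made-up weighted objective and not a cache model from the paper.

```python
import random

# Hypothetical design space: cache size (KB), associativity, block size (bytes).
SIZES_KB = [4, 8, 16, 32, 64, 128]
WAYS = [1, 2, 4, 8]
BLOCKS_B = [16, 32, 64, 128]
GENES = [SIZES_KB, WAYS, BLOCKS_B]


def cost(cfg):
    """Made-up combined objective (lower is better); stands in for cost and performance."""
    size, ways, block = cfg
    miss_term = 100.0 / (size ** 0.5) + 8.0 / ways + abs(block - 64) / 16.0
    area_term = 0.05 * size * (1 + 0.1 * ways)
    return miss_term + area_term


def evolve(generations=30, pop_size=12, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.choice(options) for options in GENES) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]  # selection: the fitter half survives unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            for i, options in enumerate(GENES):
                if rng.random() < mutation_rate:
                    child[i] = rng.choice(options)            # random mutation
            children.append(tuple(child))
        pop = survivors + children
    return min(pop, key=cost)


best = evolve()
```

Because the fitter half of each generation is carried over unchanged, the best configuration found never gets worse from one generation to the next; a real study would replace `cost` with simulation-driven objectives.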
Performance Aspects Of Computers With Graphical User Interfaces
, 1993
this memory behavior for extensive low-level traces collected on a DECstation 3100 workstation. This discussion motivates the development of different caching strategies studied in Chapter 7. Finally, Chapter 8 summarizes the contributions of this research.

