Results 1 - 10
of
22
Hints for Computer Systems Design
- 9th ACM Symposium on Operating Systems Principles, Bretton Woods
, 1983
"... Studying the design and implementation of a number of computer has led to some general hints for system design. They are described here and illustrated by many examples, ranging from hardware such as the Alto and the Dorado to application programs such as Bravo and Star. 1. ..."
Abstract
-
Cited by 166 (0 self)
- Add to MetaCart
Studying the design and implementation of a number of computer has led to some general hints for system design. They are described here and illustrated by many examples, ranging from hardware such as the Alto and the Dorado to application programs such as Bravo and Star. 1.
Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency
- in Proceedings of the 28th Annual International Symposium on Microarchitecture
, 1995
"... Untolerated load instruction latencies often have a significant impact on overall program performance. As one means of mitigating this effect, we present an aggressive hardware-based mechanism that provides effective support for reducing the latency of load instructions. Through the judicious use of ..."
Abstract
-
Cited by 67 (5 self)
- Add to MetaCart
Untolerated load instruction latencies often have a significant impact on overall program performance. As one means of mitigating this effect, we present an aggressive hardware-based mechanism that provides effective support for reducing the latency of load instructions. Through the judicious use of instruction predecode, base register caching, and fast address calculation, it becomes possible to complete load instructions up to two cycles earlier than traditional pipeline designs. For a pipeline with one cycle data cache access, this results in what we term a zero-cycle load. A zero-cycle load produces a result prior to reaching the execute stage of the pipeline, allowing subsequent dependent instructions to issue unfettered by load dependencies. Programs executing on processors with support for zero-cycle loads experience significantly fewer pipeline stalls due to load instructions and increased overall performance. We present two pipeline designs supporting zero-cycle loads: one for...
Memory Dependence Prediction
, 1998
"... As the existing techniques that empower the modern high-performance processors are being refined and as the underlying technology trade-offs change, new bottlenecks are exposed and new challenges are raised. This thesis introduces a new tool, Memory Dependence Prediction that can be useful in combat ..."
Abstract
-
Cited by 36 (5 self)
- Add to MetaCart
As the existing techniques that empower the modern high-performance processors are being refined and as the underlying technology trade-offs change, new bottlenecks are exposed and new challenges are raised. This thesis introduces a new tool, Memory Dependence Prediction that can be useful in combating these bottlenecks and meeting the new challenges. Memory dependence prediction is a technique to guess whether a load or a store will experience a dependence. Memory dependence prediction exploits regularity in the memory dependence stream of ordinary programs, a phenomenon which is also identified in this thesis. To demonstrate the utility of memory dependence prediction this thesis also presents the following three novel microarchitectural techniques: 1. Dynamic Speculation/Synchronization of Memory Dependences: this thesis demonstrates that to exploit parallelism over larger regions of code waiting to determine the dependences a load has is not the best performing option. Higher performance is possible if memory dependence speculation is used especially if memory dependence prediction is used to guide this speculation.
Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor
- In Proc. 26th Ann. Int’l Symp. on Computer Architecture
, 1998
"... Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add to the hardware complexity significantly. This paper takes ..."
Abstract
-
Cited by 28 (4 self)
- Add to MetaCart
Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add to the hardware complexity significantly. This paper takes an alternative or complementary approach for providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and/or hardware, partitions the memory stream into two independent streams early in the processor pipeline, and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to program's local variables that are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC95 benchmark suite, it is shown that local variable accesses constitute a large portion of all the memory references, while their reference space is ...
Region-Based Caching: An Energy-Delay Efficient Memory Architecture for Embedded Processors
- In CASES ’00: Proceedings of the 2000 international conference on Compilers, architecture, and
, 2000
"... Abstract Power consumption has been a major concern in designing microprocessors for portable systems such as notebook computers, hand-held computing and personal telecommunication devices. As these devices increase in popularity and are used in a wider range of applications, a low power design beco ..."
Abstract
-
Cited by 27 (7 self)
- Add to MetaCart
Abstract Power consumption has been a major concern in designing microprocessors for portable systems such as notebook computers, hand-held computing and personal telecommunication devices. As these devices increase in popularity and are used in a wider range of applications, a low power design becomes more critical. In this paper, we propose a new microarchitectural data cache design called region-based caching that can reduce power consumption. Power savings is achieved by re-organizing the the first level cache to more efficiently exploit memory reference characteristics produced by programming language semantics. These characteristics enable the cache to be partitioned by memory region (stack, global, heap), reducing power consumption, while retaining comparable performance to a conventional cache design. Applications from the MediaBench benchmark suite indicate that a design with two additional small region-based caches results in 66 % reduction in average in energy-delay product. 1.
Early Load Address Resolution Via Register Tracking
, 2000
"... Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in the Intel's IA32 architecture where lack of registers results in increased number of memory accesses. This paper presents novel, non-speculative technique that partially hides the i ..."
Abstract
-
Cited by 26 (1 self)
- Add to MetaCart
Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in the Intel's IA32 architecture where lack of registers results in increased number of memory accesses. This paper presents novel, non-speculative technique that partially hides the increasing loadto -use latency, by allowing the early issue of load instructions. Early load address resolution relies on register tracking to safely compute the addresses of memory references in the front-end part of the processor pipeline. Register tracking enables decode-time computation of register values by tracking simple operations of the form regimmediate. Register tracking may be performed in any pipeline stage following instruction decode and prior to execution. Several tracking schemes are proposed in this paper: . Stack pointer tracking allows safe early resolution of stack references by keeping track of the value of the ESP register (the stack pointer). About 25% of all loads are stack loads and 95% of these loads may be resolved in the front-end. . Absolute address tracking allows the early resolution of constant-address loads. . Displacement-based tracking tackles all loads with addresses of the form regimmediate by tracking the values of all general-purpose registers. This class corresponds to 82% of all loads, and about 65% of these loads can be safely resolved in the front-end pipeline. The paper describes the tracking schemes, analyzes their performance potential in a deeply pipelined processor and discusses the integration of tracking with memory disambiguation. 1.
L1 Data Cache Decomposition for Energy Efficiency
- In Proceedings of the International Symposium on Low Power Electronics and Design
, 2001
"... ABSTRACT The L1 data cache is a time-critical module and, at the same time,a major consumer of energy. To reduce its energy-delay product, we apply two principles of low-power design: specialize part of thecache structure and break the cache down into smaller caches. To this end, we propose a new L1 ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
ABSTRACT The L1 data cache is a time-critical module and, at the same time,a major consumer of energy. To reduce its energy-delay product, we apply two principles of low-power design: specialize part of thecache structure and break the cache down into smaller caches. To this end, we propose a new L1 data cache structure that combinesa Specialized Stack Cache (SSC) and a Pseudo Set-Associative Cache (PSAC). Individually, our SSC and PSAC designs have alower energy-delay product than previously-proposed related designs. In addition, their combined operation is very effective. Rela-tive to a conventional 2-way 32 KB data cache, a design containing a 4-way 32 KB PSAC and a 512 B SSC reduces the energy-delayproduct of several applications by an average of 44%. 1.
The Store-Load Address Table and Speculative Register Promotion
- In International Symposium on Microarchitecture
, 2000
"... Register promotion is an optimization that allocates a value to a register for a region of its lifetime where it is provably not aliased. Conventional compiler analysis cannot always prove that a value is free of aliases, and thus promotion cannot always be applied. This paper proposes a new hardwar ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
Register promotion is an optimization that allocates a value to a register for a region of its lifetime where it is provably not aliased. Conventional compiler analysis cannot always prove that a value is free of aliases, and thus promotion cannot always be applied. This paper proposes a new hardware structure, the store-load address table (SLAT), which watches both load and store instructions to see if they conflict with entries loaded into the SLAT by explicit software mapping instructions. One use of the SLAT is to allow values to be promoted to registers when they cannot be proven to be promotable by conventional compiler analysis. We call this new optimization speculative register promotion. Using this technique, a value can be promoted to a register and aliased loads and stores to that value's home memory location are caught and the proper fixup is performed. This paper will: a) describe the SLAT hardware and software; b) demonstrate that conventional register promotion is often ...
Implementation of Stack-Based Languages on Register Machines
, 1996
"... Languages with programmer-visible stacks (stack-based languages) are used widely, as intermediate languages (e.g., JavaVM, FCode), and as languages for human programmers (e.g., Forth, PostScript). However, the prevalent computer architecture is the register machine. This poses the problem of efficie ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
Languages with programmer-visible stacks (stack-based languages) are used widely, as intermediate languages (e.g., JavaVM, FCode), and as languages for human programmers (e.g., Forth, PostScript). However, the prevalent computer architecture is the register machine. This poses the problem of efficiently implementing stack-based languages on register machines. A straight-forward implementation of the stack consists of a memory area that contains the stack items, and a pointer to the top-of-stack item. The basic
Access region locality for highbandwidth processor memory system design
- In Proceedings of MICRO-32
, 1999
"... This paper explores an important behavior of memory access instructions, called access region locality. Unlike the traditional temporal and spatial data locality that focuses on individual memory locations and how accesses to the locations are inter-related, the access region locality concerns with ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
This paper explores an important behavior of memory access instructions, called access region locality. Unlike the traditional temporal and spatial data locality that focuses on individual memory locations and how accesses to the locations are inter-related, the access region locality concerns with each static memory instruction and its range of access locations at run time. We consider program’s data, heap, and stack regions in this paper. Our experimental study using a set of SPEC95 benchmark programs shows that most memory reference instructions access a single region at run time. Also shown is that it is possible to predict the access region of a memory instruction accurately at run time by scrutinizing the addressing mode of the instruction and the past access region history of it. An important implication of the access region locality is that two memory accesses to different regions are data independent. Utilizing this property, we evaluate a novel processor memory system and pipeline design which can provide high data cache bandwidth without increasing the critical path of the processor, by decoupling the memory instructions that access the program’s stack at an early pipeline stage. Experimental results indicate that the proposed technique achieves comparable or better performance than a conventional memory design with a heavily multi-ported data cache that can lead to much higher hardware complexity.

