Results 1 - 10
of
146
Measuring Experimental Error in Microprocessor Simulation
, 2001
"... We measure the ex4fl840E468 error that arises from the use of non-validated simulators in computer architecture research, with the goal of increasing the rigor of simulation -based studies. We describe the methodology that we used to validate a microprocessor simulator against a Compaq DS-10L work ..."
Abstract
-
Cited by 125 (28 self)
- Add to MetaCart
We measure the ex4fl840E468 error that arises from the use of non-validated simulators in computer architecture research, with the goal of increasing the rigor of simulation -based studies. We describe the methodology that we used to validate a microprocessor simulator against a Compaq DS-10L workstation, which contains an Alpha 21264 processor. Our evaluation suite consists of a set of 21 microbenchmarks that stress different aspects of the 21264 microarchitecture. Using the microbenchmark suite as the set of workloads, we describe how we reduced our simulator error to an arithmetic mean of 2%, and include details about the specific aspects of the pipeline that required ex8 a care to reduce the error. We show how these low-level optimizations reduce average error from 40% to less than 20% on macrobenchmarks drawn from the SPEC2000 suite. Finally, we exI837 the degree to which performance optimizations are stable across different simulators, showing that researchers would draw differ...
Cherry: Checkpointed Early Resource Recycling in Out-of-Order Microprocessors
- In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture
, 2002
"... This paper presents CHeckpointed Early Resource RecYcling (Cherry), a hybrid mode of execution based on ROB and checkpointing that decouples resource recycling and instruction retirement. Resources are recycled early, resulting in a more efficient utilization. Cherry relies on state checkpointing an ..."
Abstract
-
Cited by 116 (12 self)
- Add to MetaCart
(Show Context)
This paper presents CHeckpointed Early Resource RecYcling (Cherry), a hybrid mode of execution based on ROB and checkpointing that decouples resource recycling and instruction retirement. Resources are recycled early, resulting in a more efficient utilization. Cherry relies on state checkpointing and rollback to service exceptions for instructions whose resources have been recycled. Cherry leverages the ROB to (1) not require in-order execution as a fallback mechanism, (2) allow memory replay traps and branch mispredictions without rolling back to the Cherry checkpoint, and (3) quickly fall back to conventional out-of-order execution without rolling back to the checkpoint or flushing the pipeline. We present a Cherry implementation with early recycling at three different points of the execution engine: the load queue, the store queue, and the register file. We report average speedups of 1.06 and 1.26 in SPECint and SPECfp applications, respectively, relative to an aggressive conventional architecture. We also describe how Cherry and speculative multithreading can be combined and complement each other. 1
Power and Energy Reduction Via Pipeline Balancing
- In International Symposium on Computer Architecture
, 2001
"... Minimizing power dissipation is an important design require-ment for both portable and non-portable systems. In this work, we propose an architectural solution to the power problem that retains performance while reducing power The technique, known as Pipeline Balancing (PLB), dynamically tunes the r ..."
Abstract
-
Cited by 111 (4 self)
- Add to MetaCart
(Show Context)
Minimizing power dissipation is an important design require-ment for both portable and non-portable systems. In this work, we propose an architectural solution to the power problem that retains performance while reducing power The technique, known as Pipeline Balancing (PLB), dynamically tunes the resources of a general purpose processor to the needs of the program by mon-itoring performance within each program. We analyze metrics for triggering PLB, and detail instruction queue design and energy savings based on an extension of the Alpha 21264 processor Us-ing a detailed simulator, we present component and full chip power and energy savings for single and multi-threaded execution. Re-sults show an issue queue and execution unit power reduction of up to 23 % and 13%, respectively, with an average performance loss of 1 % to 2 %. 1.
A Large, Fast Instruction Window for Tolerating Cache Misses
"... Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle ti ..."
Abstract
-
Cited by 109 (1 self)
- Add to MetaCart
Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism. This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long latency operation (e.g., cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small, issue queue to a much larger waiting instruction buffer (WIB). When the long latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
Continual flow pipelines
- In International Conference on Architectural Support for Programming Languages and Operating Systems
, 2004
"... Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multip ..."
Abstract
-
Cited by 107 (0 self)
- Add to MetaCart
(Show Context)
Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multiple of these to be placed on the same die for high throughput while dynamically adapting for future applications? Conventional approaches for high single-thread performance rely on large and complex cores to sustain a large instruction window for memory tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP) as a new nonblocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve benefits of a large instruction window, inefficiencies in management of both the scheduler and register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps key processor structures affecting cycle time and power (scheduler, register file), and die size (second level cache) small. The memory latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores for single-thread applications.
Reducing the Complexity of the Register File in Dynamic Superscalar Processors
, 2001
"... Dynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window as most in-flight instructions require a new physical register a ..."
Abstract
-
Cited by 96 (1 self)
- Add to MetaCart
(Show Context)
Dynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window as most in-flight instructions require a new physical register at dispatch. A large multiported register file helps improve the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, especially in future wire-limited technologies. In this paper, we propose a register file organization that reduces register file size and port requirements for a given amount of ILP. We use a two-level register file organization to reduce register file size requirements, and a banked organization to reduce port requirements. We demonstrate empirically that the resulting register file organizations have reduced latency and (in the case of the banked organization) energy requirements for similar instructions per cycle (IPC) performance and improved instructions per second (IPS) performance in comparison to a conventional monolithic register file. The choice of organization is dependent on design goals.
Reducing register ports for higher speed and lower energy
- In Proceedings of the International Symposium on Microarchitecture
, 2002
"... The key issues for register file design in high-performance processors are access time and energy. While previous work has focused on reducing the number of registers, we propose to reduce the number of register ports through two proposals, one for reads and the other for writes. For reads, we propo ..."
Abstract
-
Cited by 90 (2 self)
- Add to MetaCart
(Show Context)
The key issues for register file design in high-performance processors are access time and energy. While previous work has focused on reducing the number of registers, we propose to reduce the number of register ports through two proposals, one for reads and the other for writes. For reads, we propose bypass hint to reduce register port requirements by avoiding unnecessary register file reads for cases where values are bypassed. Current processors are unable to avoid these unnecessary reads due to timing constraints. For writes, we use register file banking. Current banking schemes assign different banks to instructions that are renamed together, which does not necessarily avoid conflicts among instructions that writeback together. We use decoupled rename, a technique which separates dependence and physical tagging of register operands. Decoupled rename allows us to perform physical register allocation just before writeback, avoiding bank conflicts. Our results show that combining bypass hint and write banking, our 1cycle register file with 6 read ports, and two 4-write-ported banks achieves a 9 % processor energy-delay savings over a system using a perfectly-pipelined, 2-cycle register file with 16 read ports and 8 write ports. 1
Banked Multiported Register Files for High-Frequency Superscalar Microprocessors
- In International Symposium on Computer Architecture
, 2003
"... Multiported register files are a critical component of high-performance superscalar microprocessors. Conventional multiported structures can consume significant power and die area. We examine the designs of banked multiported register files that employ multiple interleaved banks of fewer ported regi ..."
Abstract
-
Cited by 64 (5 self)
- Add to MetaCart
Multiported register files are a critical component of high-performance superscalar microprocessors. Conventional multiported structures can consume significant power and die area. We examine the designs of banked multiported register files that employ multiple interleaved banks of fewer ported register cells to reduce power and area. Banked register files designs have been shown to provide sufficient bandwidth for a superscalar machine, but previous designs had complex control structures that would likely limit cycle time and add to design complexity. We develop a banked register file with much simpler and faster control logic while only slightly increasing the number of ports per bank. We present area, delay, and energy numbers extracted from layouts of the banked register file. For a four-issue superscalar processor, we show that we can reduce area by a factor of three, access time by 20%, and energy by 40%, while decreasing IPC by less than 5%.
Half-Price Architecture
- In Proceedings of the International Symposium on Computer Architecture
, 2003
"... Current-generation microprocessors are designed to process instructions with one and two source operands at equal cost. Handling two source operands requires multiple ports for each instruction in structures--such as the register file and wakeup logic--which are often in the processor critical timi ..."
Abstract
-
Cited by 40 (4 self)
- Add to MetaCart
(Show Context)
Current-generation microprocessors are designed to process instructions with one and two source operands at equal cost. Handling two source operands requires multiple ports for each instruction in structures--such as the register file and wakeup logic--which are often in the processor critical timing paths. We argue that these structures are overdesigned since only a small fraction of instructions require two source operands to be processed simultaneously. [n this paper, we propose the half-price architecture that judiciously removes this overdesign by restricting the processor capability to handle two source operands in certain timing-critical cases. Two techniques are proposed and evaluated: one for the wakeup logic is sequential wakeup, which decouples half of the tag matching logic from the wakeup bus to reduce the load capacitance of the bus. The other technique for the register file is sequential register access, which halves the register read ports by sequentially accessing two values using a single port when needed. We show that a pipeline that optimizes scheduling and register access for a single operand achieves nearly the same performance as an ideal base machine that fully handles two operands, with 2.2% (worst case 4.8%) IPC degradation.