Results 1 - 10 of 11
Register packing: Exploiting narrow-width operands for reducing register file pressure
- In Proc. of the 37th Annual International Symposium on Microarchitecture, 2004
"... A large percentage of computed results have fewer significant bits compared to the full width of a register. We exploit this fact to pack multiple results into a single physical register to reduce the pressure on the register file in a superscalar processor. Two schemes for dynamically packing multi ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
A large percentage of computed results have fewer significant bits than the full width of a register. We exploit this fact to pack multiple results into a single physical register to reduce the pressure on the register file in a superscalar processor. Two schemes for dynamically packing multiple "narrow-width" results into partitions within a single register are evaluated. The first scheme is conservative and allocates a full-width register for a computed result. If the computed result turns out to be narrow, the result is reallocated to partitions within a common register, freeing up the full-width register. The second scheme allocates register partitions based on a prediction of the width of the result and reallocates register partitions when the actual result width is higher than predicted. If the actual width is narrower than predicted, the excess partitions are freed. A detailed evaluation of our schemes shows that average IPC gains of up to 15% can be realized across the SPEC 2000 benchmarks on a somewhat register-constrained datapath.
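The conservative scheme lends itself to a compact illustration. Below is a minimal C sketch of the narrow-width test and the repacking step the abstract describes; the 16-bit partition size, the occupancy bitmap, and all identifiers are illustrative assumptions, not the paper's hardware design.

/* Sketch of the conservative register-packing scheme: a result is first
 * written to a full-width register; if it turns out to be narrow, it is
 * moved into a 16-bit partition of a shared register and the full-width
 * register can be freed.  All structures here are illustrative. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PARTITION_BITS 16
#define PARTITIONS_PER_REG 4          /* 64-bit register, 16-bit slices */

/* A value is "narrow" if it sign-extends correctly from 16 bits. */
static bool is_narrow(int64_t v)
{
    return v == (int64_t)(int16_t)v;
}

typedef struct {
    uint64_t slices;                  /* packed narrow values          */
    uint8_t  used;                    /* bitmap of occupied partitions */
} packed_reg_t;

/* Try to move a computed result into a partition; returns the partition
 * index, or -1 if the value is too wide or the register is full. */
static int repack(packed_reg_t *r, int64_t result)
{
    if (!is_narrow(result))
        return -1;                    /* keep the full-width register  */
    for (int p = 0; p < PARTITIONS_PER_REG; p++) {
        if (!(r->used & (1u << p))) {
            r->used |= 1u << p;
            r->slices &= ~(0xFFFFull << (p * PARTITION_BITS));
            r->slices |= ((uint64_t)result & 0xFFFF) << (p * PARTITION_BITS);
            return p;                 /* full-width register freed     */
        }
    }
    return -1;
}

int main(void)
{
    packed_reg_t r = {0};
    printf("partition for 42:    %d\n", repack(&r, 42));        /* 0  */
    printf("partition for -7:    %d\n", repack(&r, -7));        /* 1  */
    printf("partition for 1<<40: %d\n", repack(&r, 1ll << 40)); /* -1 */
    return 0;
}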
TMA: A Trap-Based Memory Architecture
- In ICS, 2006
"... The advances in semiconductor technology have set the shared-memory server trend towards processors with multiple cores per die and multiple threads per core. We believe that this technology shift forces a reevaluation of how to interconnect multiple such chips to form larger systems. This paper arg ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
(Show Context)
Advances in semiconductor technology have set the shared-memory server trend towards processors with multiple cores per die and multiple threads per core. We believe that this technology shift forces a reevaluation of how to interconnect multiple such chips to form larger systems. This paper argues that by adding support for coherence traps in future chip multiprocessors, large-scale server systems can be built at much lower cost, owing to shorter design and verification time and faster time to market compared with a traditional all-hardware counterpart. In the proposed trap-based memory architecture (TMA), software trap handlers are responsible for obtaining read/write permission, whereas the coherence trap hardware is responsible for the actual permission check. In this paper we evaluate a TMA implementation (called TMA Lite) with a minimal amount of hardware extensions, all contained within the processor. The proposed mechanisms for coherence trap processing should not affect the critical path and have a negligible cost in terms of area and power for most processor designs. Our evaluation is based on detailed full-system simulation using out-of-order processors with one or two dual-threaded cores per die as processing nodes. The results show that a TMA-based distributed shared memory system can perform on par with a highly optimized hardware-based design.
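The TMA division of labor can be sketched in a few lines: a per-line permission check (performed by hardware in the paper, simulated here) that traps to a software handler which obtains permission. The data structures and the trivial handler below are assumptions for illustration; the actual coherence protocol is elided.

/* Illustrative sketch of the TMA split: a (simulated) permission check
 * runs on every memory operation and traps to a software handler,
 * which obtains permission from the line's home node. */
#include <stdint.h>
#include <stdio.h>

typedef enum { PERM_NONE, PERM_READ, PERM_WRITE } perm_t;

#define LINES 1024
static perm_t perm_tag[LINES];        /* per-cache-line permission tags */
static uint64_t memory[LINES];

/* Software trap handler: in a real system this would run the coherence
 * protocol (messages to the home node); here it simply grants access. */
static void coherence_trap(int line, perm_t needed)
{
    fprintf(stderr, "trap: acquiring %s permission for line %d\n",
            needed == PERM_WRITE ? "write" : "read", line);
    perm_tag[line] = needed;
}

/* "Hardware" permission check on a load: trap if permission is missing. */
static uint64_t coherent_load(int line)
{
    if (perm_tag[line] < PERM_READ)
        coherence_trap(line, PERM_READ);
    return memory[line];
}

static void coherent_store(int line, uint64_t v)
{
    if (perm_tag[line] < PERM_WRITE)
        coherence_trap(line, PERM_WRITE);
    memory[line] = v;
}

int main(void)
{
    coherent_store(3, 42);            /* traps for write permission */
    printf("line 3 = %llu\n", (unsigned long long)coherent_load(3));
    return 0;
}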
Serializing Instructions in System-Intensive Workloads: Amdahl’s Law Strikes Again
- In Proc. of the 14th International Symposium on High-Performance Computer Architecture (HPCA '08), 2008
"... Serializing instructions (SIs), such as writes to control registers, have many complex dependencies, and are difficult to execute out-of-order (OoO). To avoid unnecessary complexity, processors often serialize the pipeline to maintain sequential semantics for these instructions. We observe frequent ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Serializing instructions (SIs), such as writes to control registers, have many complex dependencies and are difficult to execute out-of-order (OoO). To avoid unnecessary complexity, processors often serialize the pipeline to maintain sequential semantics for these instructions. We observe frequent SIs across several system-intensive workloads and three ISAs: SPARC V9, X86-64, and PowerPC. As explained by Amdahl's Law, these SIs, which create serial regions within the instruction-level parallel execution of a single thread, can have a significant impact on performance. For the SPARC ISA (after removing TLB and register window effects), we show that operating system (OS) code incurs an 8–45% performance drop from SIs. We observe that the values produced by most control register writes are quickly consumed, but the writes are often effectively useless (EU), i.e., they do not actually change the execution of the consuming instructions. We propose EU prediction, which allows younger instructions to proceed, possibly reading a stale value, and yet still execute correctly. This technique improves the performance of OS code by 6–35% and overall performance by 2–12%.
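A rough sketch of what an EU predictor could look like, assuming a PC-indexed table of saturating counters and approximating "effectively useless" by value equality (a sufficient but narrower condition than the paper's "consumers unaffected"); all structure sizes and names are illustrative.

/* Minimal sketch of EU ("effectively useless") prediction for control-
 * register writes.  If a write is predicted EU, younger instructions
 * proceed with the stale value; the prediction is verified when the
 * write finally executes, and a misprediction forces squash/replay. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define TABLE_SIZE 256
static uint8_t eu_counter[TABLE_SIZE];   /* 2-bit saturating counters */

static bool predict_eu(uint64_t pc)
{
    return eu_counter[pc % TABLE_SIZE] >= 2;
}

/* Called when the write executes.  Here "EU" is approximated by the
 * new value equaling the old one, so the stale value read by younger
 * instructions was in fact correct. */
static bool verify_and_train(uint64_t pc, uint64_t old_val, uint64_t new_val)
{
    uint8_t *c = &eu_counter[pc % TABLE_SIZE];
    bool was_eu = (old_val == new_val);
    if (was_eu && *c < 3) (*c)++;
    if (!was_eu && *c > 0) (*c)--;
    return was_eu;                       /* false => squash and replay */
}

int main(void)
{
    uint64_t pc = 0x400123;
    /* Writes that keep re-installing the same control value train the
     * predictor toward "effectively useless". */
    for (int i = 0; i < 4; i++)
        verify_and_train(pc, 0x8, 0x8);
    printf("predict EU: %s\n", predict_eu(pc) ? "yes" : "no");
    return 0;
}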
In-line interrupt handling and lockup free translation lookaside buffers (TLBs)
- IEEE Trans. on Computers
"... This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of the University of Maryland’s products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of the University of Maryland’s products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
Selective Writeback: Improving Processor Performance and Energy Efficiency
- In Proceedings of the 1st Watson Conference on Interaction between Architecture, Circuits and Compilers (P=ac²), Yorktown Heights, 2004
"... A significant fraction of the result values in today's superscalar microprocessors are delivered to their consumers via forwarding and are never read out from the destination registers. Such transient values are kept in the register file solely for the purpose of recovering the processor state ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
A significant fraction of the result values in today's superscalar microprocessors are delivered to their consumers via forwarding and are never read out of the destination registers. Such transient values are kept in the register file solely for the purpose of recovering the processor state on interrupts or exceptions. In this paper, we propose a simple technique to identify such transient register values and avoid their writeback into the register file. Our scheme results in significant performance improvement, as high as 40% for some benchmarks and 12% on average, because the register file is utilized more efficiently. Energy savings of 27% within the register file are also achieved because far fewer writes to the register file are performed.
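A toy decision function for the writeback filter the abstract describes, under two simplifying assumptions: a value is treated as transient when all its consumers were served via forwarding and the destination register has already been renamed again. Recovery on interrupts and exceptions, which the paper must still support, is not modeled here.

/* Illustrative writeback filter in the spirit of selective writeback:
 * a result skips the register-file write when every consumer already
 * received it on the forwarding network and the destination register
 * has been renamed again, so no later instruction can read it. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  pending_consumers;  /* consumers not yet served by forwarding   */
    bool renamed_again;      /* a younger def of the same arch reg exists */
} result_status_t;

/* Decision made at the writeback stage. */
static bool needs_writeback(const result_status_t *r)
{
    return r->pending_consumers > 0 || !r->renamed_again;
}

int main(void)
{
    result_status_t transient = { .pending_consumers = 0,
                                  .renamed_again = true };
    result_status_t live      = { .pending_consumers = 1,
                                  .renamed_again = false };
    printf("transient value written back: %s\n",
           needs_writeback(&transient) ? "yes" : "no");   /* no  */
    printf("live value written back:      %s\n",
           needs_writeback(&live) ? "yes" : "no");        /* yes */
    return 0;
}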
Exploring Processor Design Options for Java-Based Middleware
"... Java-based middleware is a rapidly growing workload for high-end server processors, particularly Chip Multiprocessors (CMP). To help architects design future microprocessors to run this important new workload, we provide a detailed characterization of two popular Java server benchmarks, ECperf and S ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Java-based middleware is a rapidly growing workload for high-end server processors, particularly chip multiprocessors (CMPs). To help architects design future microprocessors to run this important new workload, we provide a detailed characterization of two popular Java server benchmarks, ECperf and SPECjbb2000. We first estimate the amount of instruction-level parallelism in these workloads by simulating a very wide-issue processor with perfect caches and perfect branch predictors. We then identify performance bottlenecks for these workloads on a more realistic processor by selectively idealizing individual processor structures. Finally, we combine our findings on available ILP in Java middleware with results from previous papers that characterize the availability of TLP to investigate the optimal balance between ILP and TLP in CMPs. We find that, like other commercial workloads, Java middleware has only a small amount of instruction-level parallelism, even when run on very aggressive processors. When run on processors resembling currently available processors, the performance of Java middleware is limited by frequent traps, address translation, and stalls in the memory system. We find that SPECjbb2000 differs from ECperf in two meaningful ways: (1) the performance of ECperf is affected much more by cache and TLB misses during instruction fetch, and (2) SPECjbb2000 has more memory-level parallelism.
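The "selective idealization" methodology reduces to a simple experimental loop: idealize one structure at a time and attribute the resulting IPC delta to it. The sketch below is hypothetical, with made-up deltas standing in for full-system simulation runs.

/* Sketch of selective idealization: measure the IPC gained by making
 * each processor structure perfect, one at a time.  simulate() is a
 * stand-in for a full-system simulation; the deltas are invented. */
#include <stdio.h>

enum { ICACHE, DCACHE, BPRED, TLB, NSTRUCT };

static double simulate(const int ideal[NSTRUCT])
{
    static const double delta[NSTRUCT] = { 0.15, 0.25, 0.10, 0.12 };
    double ipc = 0.6;                    /* realistic baseline */
    for (int s = 0; s < NSTRUCT; s++)
        if (ideal[s]) ipc += delta[s];
    return ipc;
}

int main(void)
{
    static const char *name[NSTRUCT] = { "icache", "dcache", "bpred", "tlb" };
    int base[NSTRUCT] = {0};
    for (int s = 0; s < NSTRUCT; s++) {
        int cfg[NSTRUCT] = {0};
        cfg[s] = 1;                      /* idealize one structure */
        printf("perfect %-6s: +%.2f IPC\n", name[s],
               simulate(cfg) - simulate(base));
    }
    return 0;
}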
IN
"... An exponentially increasing demand for online services continues pushing server performance into the forefront of computer architecture. While the diversity and complexity of server workloads places demands on many aspects of server processors, the memory system has been among the key exposed bottle ..."
Abstract
- Add to MetaCart
An exponentially increasing demand for online services continues to push server performance to the forefront of computer architecture. While the diversity and complexity of server workloads place demands on many aspects of server processors, the memory system has been among the key exposed bottlenecks. In particular, long-latency instruction accesses have long been recognized as one of the key factors limiting the performance of servers. Server workloads span multiple application binaries, shared libraries, and operating system modules, which together comprise hundreds of kilobytes to megabytes of code. While steady technological improvements have enabled growth in the total on-chip cache capacity, cache access latency constraints preclude building L1 instruction caches large enough to capture the instruction working sets of server workloads, leaving L1 instruction-cache misses as a major bottleneck. In this work, we make the observation that instruction-cache misses repeat in long recurring sequences that we call Temporal Instruction Streams. Temporal instruction streams comprise sequences of tens to thousands of instruction-cache blocks which recur frequently during program execution. The stability and length of the instruction streams lend themselves well to prediction, allowing accurate prediction of long sequences of upcoming instruction accesses once a previously
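A toy version of the record-and-replay idea behind temporal streams: log instruction-cache miss addresses, and when a new miss matches an earlier log entry, prefetch the blocks that followed it. The log size, replay depth, and linear search are illustrative simplifications of a real design's index structures.

/* Toy sketch of temporal-stream prediction: i-cache miss addresses are
 * recorded in a circular history log; a miss that matches an earlier
 * entry replays the blocks that followed it as prefetches. */
#include <stdint.h>
#include <stdio.h>

#define LOG_SIZE 4096
#define REPLAY_DEPTH 4

static uint64_t miss_log[LOG_SIZE];
static int log_head;

static void prefetch(uint64_t block)
{
    printf("prefetch block 0x%llx\n", (unsigned long long)block);
}

static void on_icache_miss(uint64_t block)
{
    /* Replay the stream that followed the most recent prior occurrence. */
    for (int i = 1; i < LOG_SIZE; i++) {
        int idx = (log_head - i + LOG_SIZE) % LOG_SIZE;
        if (miss_log[idx] == block) {
            for (int d = 1; d <= REPLAY_DEPTH; d++)
                prefetch(miss_log[(idx + d) % LOG_SIZE]);
            break;
        }
    }
    miss_log[log_head] = block;
    log_head = (log_head + 1) % LOG_SIZE;
}

int main(void)
{
    uint64_t stream[] = { 0x100, 0x140, 0x180, 0x1c0, 0x200 };
    for (int rep = 0; rep < 2; rep++)    /* second pass hits the log */
        for (int i = 0; i < 5; i++)
            on_icache_miss(stream[i]);
    return 0;
}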
Amdahl’s Law Strikes Again
"... To maintain a reasonable level of complexity, processor implementations contain Serializing Instructions (SIs) — instructions, such as those that write control registers, that cannot be executed out-of-order (OoO). Maintaining sequential semantics may force SIs to serialize the pipeline and execute ..."
Abstract
- Add to MetaCart
(Show Context)
To maintain a reasonable level of complexity, processor implementations contain Serializing Instructions (SIs): instructions, such as those that write control registers, that cannot be executed out-of-order (OoO). Maintaining sequential semantics may force SIs to serialize the pipeline and execute as the only instruction in the window. We examine the frequency of SIs in three ISAs, SPARC V9, X86-64, and PowerPC, for several system-intensive workloads. Across ISAs, we observe 2–8 SIs per thousand instructions for most workloads. As explained by Amdahl's Law, such frequent SIs, which create serial regions within the instruction-level parallel execution of a single thread, can have a significant impact on performance. For the SPARC ISA (after removing TLB and register window effects), we observe a 4–17% performance difference between a modest out-of-order processor and a hypothetical processor which idealizes serializing instructions. We examine the consumption of values produced by several SIs and observe that most values are consumed, but that the values are Effectively Useless (EU), i.e., they do not actually change the execution of the consuming instructions. To improve the performance of such SIs, we propose EU prediction, which can allow younger instructions to proceed, possibly reading a stale value, and yet still execute correctly. This simple technique improves the performance of five of our seven workloads by 8–12%.
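A back-of-the-envelope instance of the Amdahl's Law argument. The SI rate comes from the abstract; the stall penalty and baseline IPC below are assumed for illustration, not taken from the paper.

% N = instruction count, r = SI rate, S = full-pipeline stall per SI.
\[
  \text{slowdown} \;=\; \frac{N/\mathrm{IPC} + r\,N\,S}{N/\mathrm{IPC}}
                  \;=\; 1 + r\,S\,\mathrm{IPC}
\]
% With r = 5/1000 (within the reported 2--8 per thousand), an assumed
% stall of S = 20 cycles, and an assumed baseline IPC of 2:
\[
  1 + 0.005 \times 20 \times 2 \;=\; 1.2
\]

That is, roughly 20% more execution cycles from the serial regions alone, the same order of magnitude as the 4–17% gap the abstract reports.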
In-line Interrupt Handling and Lockup Free TLBs
"... The effects of the general-purpose precise interrupt mecha-nisms in use for the past few decades have received very little attention. When modern out-of-order processors handle inter-rupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructi ..."
Abstract
- Add to MetaCart
The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructions. In doing so, the CPU ends up flushing many instructions that have been brought into the reorder buffer. In particular, these instructions may have reached a very deep stage in the pipeline, representing significant work that is wasted. In addition, an overhead of several cycles and wasted energy (per exception detected) can be expected in re-fetching and re-executing the flushed instructions. This thesis concentrates on improving the performance of precisely handling software-managed translation lookaside buffer (TLB) interrupts, one of the most frequently occurring interrupts. The thesis presents a
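For context, a sketch of the kind of short, frequent handler at issue: a software-managed TLB refill walking a toy page table. A real refill handler is typically only tens of instructions long, which is why flushing a deep pipeline to run it wastes far more work than the handler itself performs. The structure below is illustrative, not the thesis's mechanism.

/* Software-managed TLB refill on a miss, with a toy one-level page
 * table.  The refill is the "interrupt" the thesis wants to handle
 * in-line rather than by flushing the pipeline. */
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

typedef struct { uint64_t vpn, pfn; int valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];
static uint64_t page_table[1 << 8];     /* toy single-level table */

/* Software TLB miss handler: only a handful of instructions, yet it
 * triggers a full pipeline flush under precise interrupt handling. */
static void tlb_refill(uint64_t vaddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    e->vpn = vpn;
    e->pfn = page_table[vpn % (1 << 8)];
    e->valid = 1;
}

static uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    if (!e->valid || e->vpn != vpn)
        tlb_refill(vaddr);              /* TLB miss interrupt */
    return (e->pfn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1));
}

int main(void)
{
    page_table[5] = 0x42;
    printf("paddr = 0x%llx\n",
           (unsigned long long)translate((uint64_t)5 << PAGE_SHIFT));
    return 0;
}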