Results 1 - 10 of 56
The predictability of data values
- In Proceedings of the 30th International Symposium on Microarchitecture, 1997
ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors
- 1997
Cited by 118 (2 self)
Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software...
Target Prediction for Indirect Jumps
- In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997
Cited by 80 (3 self)
As the issue rate and pipeline depth of high-performance superscalar processors increase, the amount of speculative work issued also increases. Because speculative work must be thrown away in the event of a branch misprediction, wide-issue, deeply pipelined processors must employ accurate branch predictors to effectively exploit their performance potential. Many existing branch prediction schemes are capable of accurately predicting the direction of conditional branches. However, these schemes are ineffective in predicting the targets of indirect jumps, achieving an average prediction accuracy of 51.8% for the SPECint95 benchmarks. In this paper, we propose a new prediction mechanism, the target cache, for predicting indirect jump targets. For the perl and gcc benchmarks, this mechanism reduces the indirect jump misprediction rate by 93.4% and 63.3%, respectively, and the overall execution time by 14% and 5%, respectively.
Analysis of Branch Prediction via Data Compression
- In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996
Cited by 79 (3 self)
Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction schemes. Knowing this theoretical basis helps us to evaluate how good a prediction scheme is and how much we can expect to improve its accuracy.
Exploiting Instruction Level Parallelism in Processors by Caching Scheduled Groups
- 1996
Cited by 65 (0 self)
Modern processors employ a large amount of hardware to dynamically detect parallelism in single-threaded programs and maintain the sequential semantics implied by these programs. The complexity of some of this hardware diminishes the gains due to parallelism because of longer clock period or increased pipeline latency of the machine. In this paper we propose a processor implementation which dynamically schedules groups of instructions while executing them on a fast simple engine and caches them for repeated execution on a fast VLIW-type engine. Our experiments show that scheduling groups spanning several basic blocks and caching these scheduled groups results in significant performance gain over fill buffer approaches for a standard VLIW cache. This concept, which we call DIF (Dynamic Instruction Formatting), unifies and extends principles underlying several schemes being proposed today to reduce superscalar processor complexity. This paper examines various issues in designing such a p...
rePLay: A Hardware Framework for Dynamic Optimization
- IEEE Transactions on Computers, 2001
Cited by 58 (5 self)
In this paper, we propose a new processor framework that supports dynamic optimization. The rePLay Framework embeds an optimization engine atop a high-performance execution engine. The heart of the rePLay Framework is the concept of a frame. Frames are large, single-entry, single-exit optimization regions spanning many basic blocks in the program's dynamic instruction stream, yet containing only a single flow of control. This atomic property of frames increases the flexibility in applying optimizations. To support frames, rePLay includes a hardware-based recovery mechanism that rolls back the architectural state to the beginning of a frame if, for example, an early exit condition is detected. This mechanism permits the optimizer to make speculative, aggressive optimizations upon frames. In this paper, we investigate some of the underlying phenomena that support rePLay. Primarily, we evaluate rePLay's region formation strategy. A rePLay configuration with a 256-entry frame cache, using a 74KB frame constructor and frame sequencer, achieves an average frame size of 88 Alpha AXP instructions with 68 percent coverage of the dynamic istream, an average frame completion rate of 97.81 percent, and a frame predictor accuracy of 81.26 percent. These results soundly demonstrate that the frames upon which the optimizations are performed are large and stable. Using the most frequently initiated frames from rePLay executions as samples, we also highlight possible strategies for the rePLay optimization engine. Coupled with the high coverage of frames achieved through the dynamic frame construction, the success of these optimizations demonstrates the significance of the rePLay Framework. We believe that the concept of frames, along with the mechanisms and strategies outlined in this paper, will play an important role in future processor architecture. Index Terms: High-performance microarchitecture, dynamic optimization, trace caches.
Branch prediction, instruction-window size, and cache size: Performance tradeoffs and simulation techniques
- IEEE Transactions on Computers, 1999
Cited by 57 (18 self)
Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Tradeoffs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or over-state the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed tradeoffs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected.
Transparent dynamic optimization: The design and implementation of Dynamo
- 1999
Cited by 49 (4 self)
Keywords: dynamic optimization, compiler, trace selection, binary translation. © Copyright Hewlett-Packard Company 1999. Dynamic optimization refers to the runtime optimization of a native program binary. This report describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HP-UX operating system. Dynamo does not depend on any special programming language,
Profile-based dynamic voltage and frequency scaling for a multiple clock domain microprocessor
- In Proceedings of the International Symposium on Computer Architecture, 2003
Cited by 45 (8 self)
A Multiple Clock Domain (MCD) processor addresses the challenges of clock distribution and power dissipation by dividing a chip into several (coarse-grained) clock domains, allowing frequency and voltage to be reduced in domains that are not currently on the application’s critical path. Given a reconfiguration mechanism capable of choosing appropriate times and values for voltage/frequency scaling, an MCD processor has the potential to achieve significant energy savings with low performance degradation. Early work on MCD processors evaluated the potential for energy savings by manually inserting reconfiguration instructions into applications, or by employing an oracle driven by off-line analysis of (identical) prior program runs. Subsequent work developed a hardware-based on-line mechanism that averages 75–85% of the energy-delay improvement achieved via off-line analysis. In this paper we consider the automatic insertion of reconfiguration instructions into applications, using profile-driven binary rewriting. Profile-based reconfiguration introduces the need for “training runs” prior to production use of a given application, but avoids the hardware complexity of on-line reconfiguration. It also has the potential to yield significantly greater energy savings. Experimental results (training on small data sets and then running on larger, alternative data sets) indicate that the profile-driven approach is more stable than hardware-based reconfiguration, and yields virtually all of the energy-delay improvement achieved via off-line analysis.
Variable Length Path Branch Prediction
- In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 1998
Cited by 41 (2 self)
Jared Stark, Marius Evers, Yale N. Patt. Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, Michigan 48109-2122. {starkj,olaf,patt}@eecs.umich.edu
Accurate branch prediction is required to achieve high performance in deeply pipelined, wide-issue processors. Recent studies have shown that conditional and indirect (or computed) branch targets can be accurately predicted by recording the path, which consists of the target addresses of recent branches, leading up to the branch. In current path-based branch predictors, the N most recent target addresses are hashed together to form an index into a table, where N is some fixed integer. The inde...
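As a rough illustration of the fixed-length scheme this abstract describes, the sketch below keeps the N most recent branch target addresses and hashes them into a table index. The abstract does not specify the hash, so the rotate-and-XOR function, the class name FixedLengthPathIndex, and the parameters n and index_bits are all assumptions made for illustration, not the authors' design.

```python
from collections import deque


class FixedLengthPathIndex:
    """Hypothetical sketch: form a table index from the last N branch targets."""

    def __init__(self, n: int, index_bits: int) -> None:
        self.index_bits = index_bits
        # deque(maxlen=n) discards the oldest target once N are held.
        self.targets = deque(maxlen=n)

    def record(self, target: int) -> None:
        """Append the target address of the branch that just executed."""
        self.targets.append(target)

    def index(self) -> int:
        """Hash the recorded path into a value in [0, 2**index_bits)."""
        mask = (1 << self.index_bits) - 1
        h = 0
        for t in self.targets:  # iterate from oldest to newest
            # Rotate left by one bit so the order of targets affects the result.
            h = ((h << 1) | (h >> (self.index_bits - 1))) & mask
            # Assumed hash step: XOR in the target's low bits,
            # skipping two bits on the assumption of 4-byte alignment.
            h ^= (t >> 2) & mask
        return h


path = FixedLengthPathIndex(n=2, index_bits=4)
for target in (0x100, 0x10, 0x24):
    path.record(target)
slot = path.index()  # only the last two targets (0x10, 0x24) contribute
```

Because N is fixed, every branch is indexed with the same path length, which is the limitation the paper's title suggests it sets out to relax.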

