Results 1 - 10 of 113
Execution-based Prediction Using Speculative Slices.
In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001
"... Abstract instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups up to 43 percent over an aggressive baseline machine. To benefit from branch predictions ge ..."
Cited by 173 (6 self)
Abstract: … instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups of up to 43 percent over an aggressive baseline machine. To benefit from branch predictions generated by speculative slices, the predictions must be bound to specific dynamic branch instances. We present a technique that invalidates predictions when it can be determined (by monitoring the program's execution path) that they will not be used. This enables the remaining predictions to be correctly correlated.
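The listing does not spell out the paper's hardware mechanism, so the following is only a minimal behavioral sketch, in Python, of the stated idea: bind each slice-generated prediction to a specific dynamic branch instance and drop it once the observed execution path shows that instance will never be reached. The buffer organization, the kill-PC tagging, and every PC value below are illustrative assumptions, not the paper's design.

    from collections import deque

    class SlicePredictionBuffer:
        """Sketch: predictions posted by speculative slices, tagged with the PC of
        the branch they target plus 'kill' PCs whose appearance on the main
        thread's path means the targeted dynamic instance will never execute."""

        def __init__(self):
            self.pending = deque()                  # oldest prediction first

        def post(self, branch_pc, taken, kill_pcs):
            self.pending.append({"pc": branch_pc, "taken": taken, "kill": set(kill_pcs)})

        def lookup(self, fetch_pc):
            """Called for every PC fetched on the main thread."""
            # Invalidate predictions whose target instance can no longer occur.
            self.pending = deque(p for p in self.pending if fetch_pc not in p["kill"])
            # Consume the oldest prediction bound to this branch PC, if any.
            for i, p in enumerate(self.pending):
                if p["pc"] == fetch_pc:
                    del self.pending[i]
                    return p["taken"]
            return None                             # fall back to the baseline predictor

    # Illustrative PCs only:
    buf = SlicePredictionBuffer()
    buf.post(branch_pc=0x400, taken=True, kill_pcs=[0x500])
    print(buf.lookup(0x100))                        # None: no prediction bound here
    print(buf.lookup(0x400))                        # True: consumed by this dynamic instance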
Dynamic Branch Prediction with Perceptrons
"... This paper presents a new method for branch prediction. The key idea is to use one of the simplest possible neural networks, the perceptron, as an alternative to the commonly used two-bit counters. Our predictor achieves increased accuracy by making use of long branch histories, which are possible b ..."
Cited by 164 (20 self)
Abstract: This paper presents a new method for branch prediction. The key idea is to use one of the simplest possible neural networks, the perceptron, as an alternative to the commonly used two-bit counters. Our predictor achieves increased accuracy by making use of long branch histories, which are possible because the hardware resources for our method scale linearly with the history length. By contrast, other purely dynamic schemes require exponential resources. We describe our design and evaluate it with respect to two well-known predictors. We show that for a 4K-byte hardware budget our method improves misprediction rates for the SPEC 2000 benchmarks by 10.1% over the gshare predictor. Our experiments also provide a better understanding of the situations in which traditional predictors do and do not perform well. Finally, we describe techniques that allow our complex predictor to operate in one cycle.
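Since both perceptron papers in this listing rest on the same mechanism, a small Python model of it may help: compute y as the bias plus the dot product of the weights with the global history, predict taken when y is non-negative, and train on a misprediction or whenever |y| falls within the threshold theta = 1.93h + 14 given in the paper. The table size, history length, and PC hashing below are illustrative choices, not the evaluated configuration.

    # Software model of a perceptron branch predictor (sizing is illustrative).
    HIST_LEN   = 32
    NUM_PERCEP = 1024
    THETA      = int(1.93 * HIST_LEN + 14)

    weights = [[0] * (HIST_LEN + 1) for _ in range(NUM_PERCEP)]   # w[0] is the bias
    history = [1] * HIST_LEN                                      # +1 = taken, -1 = not taken

    def predict(pc):
        w = weights[pc % NUM_PERCEP]
        y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
        return y >= 0, y

    def update(pc, taken, y):
        w = weights[pc % NUM_PERCEP]
        t = 1 if taken else -1
        # Train only on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= THETA:
            w[0] += t
            for i in range(HIST_LEN):
                w[i + 1] += t * history[i]
        history.pop(0)
        history.append(t)

    # Usage: for each dynamic branch, predict, resolve the branch, then update.
    pred, y = predict(0x40321c)
    update(0x40321c, taken=True, y=y)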
Selective value prediction
In Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999
"... Value Prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines sel ..."
Cited by 138 (13 self)
Abstract: Value Prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines selective techniques for using value prediction in the presence of predictor capacity constraints and reasonable misprediction penalties. We examine prediction and confidence mechanisms in light of these constraints, and we minimize capacity conflicts through instruction filtering. The latter technique filters which instructions put values into the value prediction table. We examine filtering techniques based on instruction type, as well as giving priority to instructions belonging to the longest data dependence path in the processor's active instruction window. We apply filtering both to the producers of predicted values and the consumers. In addition, we examine the benefit of using different confidence levels for instructions using predicted values on the longest dependence path.
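As a rough illustration of the selective approach, the Python sketch below combines an instruction-type filter (which decides what may allocate into the value table) with a per-entry confidence counter gating which predictions consumers may use. The last-value predictor, table size, confidence width, and the filtered instruction classes are assumptions for illustration; the paper evaluates several predictor, confidence, and filtering variants.

    # Selective value prediction sketch: filter allocations by instruction type
    # and only hand out predictions once an entry's confidence is high enough.
    TABLE_SIZE = 4096
    CONF_MAX, CONF_USE = 7, 5          # 3-bit saturating confidence, "use" threshold

    table = {}                          # index -> {"value": ..., "conf": ...}

    def allowed(inst_type):
        # Instruction-type filter: only selected classes may enter the table,
        # which reduces capacity conflicts.
        return inst_type in ("load", "long_latency_alu")

    def predict(pc, inst_type):
        if not allowed(inst_type):
            return None
        e = table.get(pc % TABLE_SIZE)
        if e is not None and e["conf"] >= CONF_USE:
            return e["value"]           # consumers may execute speculatively with this
        return None

    def train(pc, inst_type, actual_value):
        if not allowed(inst_type):
            return
        e = table.setdefault(pc % TABLE_SIZE, {"value": actual_value, "conf": 0})
        if e["value"] == actual_value:
            e["conf"] = min(CONF_MAX, e["conf"] + 1)
        else:
            e["value"], e["conf"] = actual_value, 0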
Neural Methods for Dynamic Branch Prediction
ACM Transactions on Computer Systems, 2002
"... This paper presents a new method for branch prediction that is highly accurate. The key idea is to use one of the simplest possible neural methods, the perceptron, as an alternative to the commonly used two-bit counters. The source of our predictor's accuracy is its ability to use long history ..."
Cited by 95 (10 self)
Abstract: This paper presents a new method for branch prediction that is highly accurate. The key idea is to use one of the simplest possible neural methods, the perceptron, as an alternative to the commonly used two-bit counters. The source of our predictor's accuracy is its ability to use long history lengths, because the hardware resources for our method scale linearly, rather than exponentially, with the history length.
The impact of delay on the design of branch predictors
In Proceedings of the 33rd Annual International Symposium on Microarchitecture, 2000
"... Modern microprocessors employ increasingly complicated branch predictors to achieve instruction fetch bandwidth that is sufficient for wide out-of-order execution cores. While existing predictors can still be accessed in a single clock cycle, recent studies show that slower wires and faster clock ra ..."
Cited by 89 (10 self)
Abstract: Modern microprocessors employ increasingly complicated branch predictors to achieve instruction fetch bandwidth that is sufficient for wide out-of-order execution cores. While existing predictors can still be accessed in a single clock cycle, recent studies show that slower wires and faster clock rates will require multi-cycle access times to large on-chip structures, such as branch prediction tables. Thus, future branch predictors must consider not only area and accuracy, but also delay. This paper explores these tradeoffs in designing branch predictors and shows that increased accuracy alone cannot overcome the penalties in delay that arise with larger predictor structures. We evaluate three schemes for accommodating delay: a caching approach, an overriding approach, and a cascading lookahead approach. While we use a common branch predictor, gshare, as the prediction component, these schemes can be constructed using most types of predictors.
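Of the three schemes, the overriding organization is perhaps the easiest to sketch: a small predictor answers in a single cycle, a larger and slower predictor answers a few cycles later, and on disagreement the late answer overrides the early one, trading a short re-fetch penalty for a full misprediction. In the Python sketch below the component predictors are plain bimodal tables rather than the paper's gshare components, and all latencies are assumed values.

    # Overriding predictor sketch: fast small table now, slow large table later.
    PIPELINE_DELAY     = 3     # assumed extra cycles for the large predictor
    OVERRIDE_PENALTY   = PIPELINE_DELAY
    MISPREDICT_PENALTY = 20    # assumed full misprediction penalty, in cycles

    small = [1] * 1024         # 2-bit counters, initialized weakly not-taken
    large = [1] * 65536

    def counter_predict(tbl, pc):
        return tbl[pc % len(tbl)] >= 2

    def counter_update(tbl, pc, taken):
        i = pc % len(tbl)
        tbl[i] = min(3, tbl[i] + 1) if taken else max(0, tbl[i] - 1)

    def fetch_branch(pc, actual_taken):
        """Return the cycles lost on this branch under the overriding scheme."""
        fast = counter_predict(small, pc)        # available immediately
        slow = counter_predict(large, pc)        # available PIPELINE_DELAY cycles later
        counter_update(small, pc, actual_taken)
        counter_update(large, pc, actual_taken)
        penalty = OVERRIDE_PENALTY if fast != slow else 0   # the late answer wins
        if slow != actual_taken:
            penalty += MISPREDICT_PENALTY
        return penalty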
Understanding the Backward Slices of Performance Degrading Instructions
In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000
"... For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fract ..."
Cited by 85 (3 self)
Abstract: For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contributes a large fraction of such performance-degrading events. This paper analyzes the dynamic instruction stream leading up to these performance-degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance-degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices. This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance-degrading instruction.
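To make "backward slice" concrete, the short Python function below walks a dynamic trace backwards from a performance-degrading instruction and collects every instruction that transitively produced one of its register inputs. It follows register dependences only; the paper's slices also account for memory and control dependences, and the trace format and PCs here are made up for illustration.

    def backward_slice(trace, target_index):
        """Each trace entry is (pc, dest_regs, src_regs).  Return the indices of
        the entries the target instruction transitively depends on."""
        slice_ixs = {target_index}
        needed = set(trace[target_index][2])          # source registers still unresolved
        for i in range(target_index - 1, -1, -1):
            pc, dests, srcs = trace[i]
            if needed & set(dests):                   # produces a value we still need
                slice_ixs.add(i)
                needed -= set(dests)
                needed |= set(srcs)                   # now its inputs are needed too
        return sorted(slice_ixs)

    # Example: r3 = r1 + r2; r4 = r3 * r3; a "problem" load depends on r4.
    trace = [
        (0x10, ["r1"], []),
        (0x14, ["r2"], []),
        (0x18, ["r3"], ["r1", "r2"]),
        (0x1c, ["r5"], []),                # not in the slice
        (0x20, ["r4"], ["r3", "r3"]),
        (0x24, ["r6"], ["r4"]),            # the performance-degrading load
    ]
    print(backward_slice(trace, 5))        # -> [0, 1, 2, 4, 5]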
Full-System Timing-First Simulation
In Proceedings of the 2002 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 2002
"... Computer system designers often evaluate future design alternatives with detailed simulators that strive for functional fidelity (to execute relevant workloads) and performance fidelity (to rank design alternatives). Trends toward multithreaded architectures, more complex micro-architectures, a ..."
Cited by 82 (13 self)
Abstract: Computer system designers often evaluate future design alternatives with detailed simulators that strive for functional fidelity (to execute relevant workloads) and performance fidelity (to rank design alternatives). Trends toward multithreaded architectures, more complex micro-architectures, and richer workloads make authoring detailed simulators increasingly difficult. To manage simulator complexity, this paper advocates decoupled simulator organizations that separate functional and performance concerns. Furthermore, we define an approach, called timing-first simulation, that uses an augmented timing simulator to execute instructions important to performance in conjunction with a functional simulator to ensure correctness. This design simplifies software development, leverages existing simulators, and can model microarchitecture timing in detail. We describe …
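The timing-first organization can be sketched in a few lines: the detailed timing simulator retires an instruction first, a trusted functional simulator then executes the same instruction, and on any mismatch the timing simulator's architectural state is repaired from the functional one. The toy register-file-only simulators below are stand-ins invented for this example, not the simulators used in the paper.

    class FunctionalSim:
        """Trusted, simple simulator: architectural state is just a register file."""
        def __init__(self, program):
            self.program, self.pc, self.regs = program, 0, {}
        def execute_next(self):
            op, dst, a, b = self.program[self.pc]
            self.regs[dst] = (self.regs.get(a, 0) + b) if op == "addi" else 0
            self.pc += 1
            return (self.pc, dst, self.regs[dst])
        def architectural_state(self):
            return (self.pc, dict(self.regs))

    class TimingSim(FunctionalSim):
        """Stand-in for the detailed timing simulator; imagine cycle accounting here."""
        def retire_next(self):
            return self.execute_next()
        def load_architectural_state(self, state):
            self.pc, self.regs = state[0], dict(state[1])

    def timing_first_step(tsim, fsim):
        t = tsim.retire_next()             # timing simulator goes first
        f = fsim.execute_next()            # functional simulator checks it
        if t != f:                         # divergence: repair the timing simulator
            tsim.load_architectural_state(fsim.architectural_state())

    prog = [("addi", "r1", "r0", 5), ("addi", "r2", "r1", 7)]
    tsim, fsim = TimingSim(list(prog)), FunctionalSim(list(prog))
    for _ in prog:
        timing_first_step(tsim, fsim)
    print(fsim.regs)                       # {'r1': 5, 'r2': 12}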
Dynamic metrics for Java
In Proceedings of the 18th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2003
"... ..."
(Show Context)
Adaptive Cache Compression for High-Performance Processors
In Proceedings of the International Symposium on Computer Architecture (ISCA), 2004
"... Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access laten ..."
Cited by 76 (4 self)
Abstract: Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop an adaptive policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement, where each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression eliminated (or could have eliminated) a miss or incurred an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks, due to unnecessary decompression overhead, degrading performance by up to 18%. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression, while never degrading performance by more than 0.4%.
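The abstract describes the adaptive policy precisely enough to sketch its core: with room for eight compressed but only four uncompressed lines per set, a hit at LRU stack depth 4 to 7 is a hit that only compression could have provided, while a hit at depth 0 to 3 to a compressed line paid a needless decompression; a single global saturating counter accumulates these costs and benefits and steers future allocations. The counter width, penalty values, and decision threshold in the Python sketch below are assumed, not the paper's.

    # Global compression-policy counter sketch (all constants are assumptions).
    COUNTER_MAX = (1 << 19) - 1            # assumed saturating-counter range
    counter = 0                            # > 0 biases toward compressed allocation
    L2_MISS_PENALTY = 400                  # assumed cycles saved by avoiding an L2 miss
    DECOMPRESSION_LATENCY = 5              # assumed cycles wasted decompressing

    def update_policy(stack_depth, line_was_compressed):
        """Call on each L2 reference; stack_depth is the hit's LRU depth, None on a miss."""
        global counter
        if stack_depth is not None and 4 <= stack_depth < 8:
            # Hit only possible because compression kept extra lines: credit the benefit.
            counter = min(COUNTER_MAX, counter + L2_MISS_PENALTY)
        elif stack_depth is not None and stack_depth < 4 and line_was_compressed:
            # Would have hit even uncompressed: decompression was pure overhead.
            counter = max(-COUNTER_MAX, counter - DECOMPRESSION_LATENCY)

    def should_compress_on_allocate():
        return counter > 0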
Computation spreading: Employing hardware migration to specialize CMP cores on-the-fly
In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006
"... In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among di ..."
Cited by 71 (8 self)
Abstract: In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45–65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources, causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors. We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS-intensive server applications: separating application-level (user) computation from the OS calls it makes. When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27–58%, private L2 load misses by 0–19%, and branch mispredictions by 9–25%.
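As a minimal illustration of the user/OS separation described above, the Python snippet below maps each thread's user-level fragments to one pool of cores and the OS fragments triggered by its system calls to another, so each core repeatedly executes similar code. The core counts and the modulo mapping are illustrative; they are not the two assignment policies evaluated in the paper.

    # Illustrative CSP-style assignment: user fragments and OS fragments go to
    # separate core pools, so cores become temporally specialized.
    USER_CORES = [0, 1, 2, 3]
    OS_CORES   = [4, 5, 6, 7]

    def assign_core(thread_id, in_os):
        pool = OS_CORES if in_os else USER_CORES
        return pool[thread_id % len(pool)]

    # A thread entering and leaving the OS migrates between the two pools:
    tid = 9
    print(assign_core(tid, in_os=False))   # user fragment -> core 1
    print(assign_core(tid, in_os=True))    # system call   -> core 5
    print(assign_core(tid, in_os=False))   # back to user  -> core 1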