Results 1 - 10
of
218
Highly Accurate Data Value Prediction using Hybrid Predictors
, 1997
"... Data dependences (data flow constraints) present a major hurdle to the amount of instruction-level parallelism that can be exploited from a program. Recent work has suggested that the limits imposed by data dependences can be overcome to some extent with the use of data value prediction. That is, wh ..."
Abstract
-
Cited by 182 (3 self)
- Add to MetaCart
Data dependences (data flow constraints) present a major hurdle to the amount of instruction-level parallelism that can be exploited from a program. Recent work has suggested that the limits imposed by data dependences can be overcome to some extent with the use of data value prediction. That is, when an instruction is fetched, its result can be predicted so that subsequent instructions that depend on the result can use this pre- dicted value. l/Vhen the correct result becomes avail- able, all instructions that are data dependent on that prediction can be validated. This paper investigates a variety of techniques to carry out highly accurate data value predictions. The first technique investigates the potential of monitoring the strides by which the results produced by different instances of an instruction change. The second technique investigates the potential of pattern-based two-level prediction schemes. Simulation results of these two schemes show improvements over the existing method of predicting the last outcome. In particular, some benchmarks show improvement with the stride-based predictor and others show improvement with the pattern-based predictor. To do uniformly well across benchmarks, we combine these two predictors to form a hybrid predictor. Simulation analysis of the hybrid predictor shows its overall prediction accuracy to be better than that of the component predictors across all benchmarks.
Dynamic Instruction Reuse
, 1997
"... This paper introduces the concept of dynamic instruction reuse. Empirical observations suggest that many instructions, and groups of instructions, having the same inputs, are executed dynamically. Such instructions do not have to be executed repeatedly --- their results can be obtained from a buffer ..."
Abstract
-
Cited by 164 (4 self)
- Add to MetaCart
This paper introduces the concept of dynamic instruction reuse. Empirical observations suggest that many instructions, and groups of instructions, having the same inputs, are executed dynamically. Such instructions do not have to be executed repeatedly --- their results can be obtained from a buffer where they were saved previously. This paper presents three hardware schemes for exploiting the phenomenon of dynamic instruction reuse, and evaluates their effectiveness using execution-driven simulation. We find that in some cases over 50% of the instructions can be reused. The speedups so obtained, though less striking than the percentage of instructions reused, are still quite significant. 1 Introduction There are three parameters that influence the execution time of a program. Microarchitecture has concentrated on two of them: (i) the number of instructions executed per clock cycle, i.e., the IPC, and (ii) the clock cycle time. The third parameter: (iii) the total number of instructi...
Phase Tracking and Prediction
, 2003
"... In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye view of the behavior of a program at these speeds can be a difficult task when all that is available is cycle by cycle examination. In many programs, behavior is anything but steady state, and understa ..."
Abstract
-
Cited by 157 (19 self)
- Add to MetaCart
In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye view of the behavior of a program at these speeds can be a difficult task when all that is available is cycle by cycle examination. In many programs, behavior is anything but steady state, and understanding the patterns of behavior, at run-time, can unlock a multitude of optimization opportunities.
Slipstream processors: improving both performance and fault tolerance
- In Proceedings of the ninth international conference on Architectural
"... Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computat ..."
Abstract
-
Cited by 145 (6 self)
- Add to MetaCart
Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computation and computation related to highly-predictable control flow. The shortened program is run concurrently with the full program on a chip multiprocessor or simultaneous multithreaded processor, with two key advantages: 1) Improved single-program performance. The shorter program speculatively runs ahead of the full program and supplies the full program with control and data flow outcomes. The full program executes efficiently due to the communicated outcomes, at the same time validating the speculative, shorter program. The two programs combined run faster than the original program alone. Detailed simulations of an example implementation show an average improvement of 7 % for the SPEC95 integer benchmarks. 2) Fault tolerance. The shorter program is a subset of the full program and this partial-redundancy is transparently leveraged for detecting and recovering from transient hardware faults. 1.
Selective value prediction
- In 26th Annual International Symposium on Computer Architecture
, 1999
"... Value Prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines sel ..."
Abstract
-
Cited by 118 (14 self)
- Add to MetaCart
Value Prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines selective techniques for using value prediction in the presence of predictor capacity constraints and reasonable misprediction penalties. We examine prediction and confidence mechanisms in light of these constraints, and we minimize capacity conflicts through instruction filtering. The latter technique filters which instructions put values into the value prediction table. We examine filtering techniques based on instruction type, as well as giving priority to instructions belonging to the longest data dependence path in the processor’s active instruction window. We apply filtering both to the producers of predicted values and the consumers. In addition, we examine the benefit of using different confidence levels for instructions using predicted values on the longest dependence path. 1
Value Profiling
- In MICRO-97
, 1997
"... Identifying variables as invariant or constant at compile-time allows the compiler to perform optimizations including constant folding, code specialization, and partial evaluation. Some variables, which cannot be labeled as constants, may exhibit semi-invariant behavior. A semiinvariant variable is ..."
Abstract
-
Cited by 99 (5 self)
- Add to MetaCart
Identifying variables as invariant or constant at compile-time allows the compiler to perform optimizations including constant folding, code specialization, and partial evaluation. Some variables, which cannot be labeled as constants, may exhibit semi-invariant behavior. A semiinvariant variable is one that cannot be identified as a constant at compile-time, but has a high degree of invariant behavior at run-time. If run-time information was available to identify these variables as semi-invariant, they could then benefit from invariant-based compiler optimizations. In this paper we examine the invariance found from profiling instruction values, and show that many instructions have semi-invariant values even across different inputs. We also investigate the ability to estimate the invariance for all instructions in a program from only profiling load instructions. In addition, we propose a new type of profiling called Convergent Profiling. Estimating the invariance from loads and converg...
Improving the Accuracy and Performance of Memory Communication Through Renaming
- in 30th Annual International Symposium on Microarchitecture
, 1997
"... As processors continue to exploit more instruction level parallelism, a greater demand is placed on reducing the effects of memory access latency. In this paper, we introduce a novel modification of the processor pipeline called memory renaming. Memory renaming applies register access techniques to ..."
Abstract
-
Cited by 94 (5 self)
- Add to MetaCart
As processors continue to exploit more instruction level parallelism, a greater demand is placed on reducing the effects of memory access latency. In this paper, we introduce a novel modification of the processor pipeline called memory renaming. Memory renaming applies register access techniques to load instructions, reducing the effect of delays caused by the need to calculate effective addresses for the load and all preceding stores before the data can be fetched. Memory renaming allows the processor to speculatively fetch values when the producer of the data can be reliably determined without the need for an effective address. This work extends previous studies of data value and dependence speculation. When memory renaming is added to the processor pipeline, renaming can be applied to 30% to 50% of all memory references, translating to an overall improvement in execution time of up to 41%. Furthermore, this improvement is seen across all memory segments -- including the heap segment...
Reconfigurable Caches and their Application to Media Processing
- In Proceedings of the 27th Annual International Symposium on Computer Architecture
, 2000
"... High performance general-purpose processors are increasingly being used for a variety of application domains -- scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features that use a significant fraction of the on-chip tra ..."
Abstract
-
Cited by 91 (4 self)
- Add to MetaCart
High performance general-purpose processors are increasingly being used for a variety of application domains -- scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features that use a significant fraction of the on-chip transistors are applicable across these different domains. For example, current processor designs often devote the largest fraction of on-chip transistors (up to 80%) to caches. Many workloads, however, do not make effective use of large caches; e.g., media processing workloads which often have streaming data access patterns and large working sets. This paper proposes a new reconfigurable cache design. This design enables the cache SRAM arrays to be dynamically divided into multiple partitions that can be used for different processor activities. These activities can benefit applications that would otherwise not use the storage allocated to large conventional caches. Our design involves relativ...
Streamlining Inter-operation Memory Communication via Data Dependence Prediction
- In Proceedings of the 30th International Symposium on Microarchitecture
, 1997
"... We revisit memory hierarchy design viewing memory as an inter-operation communication agent. This perspective leads to the development of novel methods of performing inter-operation memory communication: (1) We use data dependence prediction to identify and link dependent loads and stores so that th ..."
Abstract
-
Cited by 89 (0 self)
- Add to MetaCart
We revisit memory hierarchy design viewing memory as an inter-operation communication agent. This perspective leads to the development of novel methods of performing inter-operation memory communication: (1) We use data dependence prediction to identify and link dependent loads and stores so that they can communicate speculatively without incurring the overhead of address calculation, disambiguation and data cache access. (2) We also use data dependence prediction to convert, DEFstore-load-USE chains within the instruction window into DEF-USE chains prior to address calculation and disambiguation. (3) We use true and output data dependence status prediction to introduce and manage a small storage structure called the Transient Value Cache (TVC). The TVC captures memory values that are short-lived. It also captures recently stored values that are likely to be accessed soon. Accesses that are serviced by the TVC do not have to be serviced by other parts of the memory hierarchy, e.g., the data cache. The first two techniques are aimed at reducing the effective communication latency whereas the last technique is aimed at reducing data cache bandwidth requirements. Experimental analysis of the proposed techniques shows that: (i) the proposed speculative communication methods correctly handle a large fraction of memory dependences, and (ii) a large number of the loads and stores do not have to ever reach the data cache when the TVC is in place. 1

