The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
HPCA-4, 1998
Cited by 210 (8 self)
As we look to the future, and the prospect of a billion transistors on a chip, it seems inevitable that microprocessors will exploit having multiple parallel threads. To achieve the full potential of these "single-chip multiprocessors," however, we must find a way to parallelize non-numeric applications. Unfortunately, compilers have had little success in parallelizing non-numeric codes due to their complex access patterns. This paper explores the potential for using thread-level data speculation (TLDS) to overcome this limitation by allowing the compiler to view parallelization solely as a cost/benefit tradeoff, rather than something which is likely to violate program correctness. Our experimental results demonstrate that with realistic compiler support, TLDS can offer significant program speedups. We also demonstrate that through modest hardware extensions, a generic single-chip multiprocessor could support TLDS by augmenting its cache coherence scheme to detect dependence violations, and by using the primary data caches to buffer speculative state.
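The abstract's idea of detecting dependence violations can be illustrated with a minimal sketch. It assumes each speculative thread (an "epoch", numbered in original program order) records the addresses it has loaded, and that a store from an earlier epoch to one of those addresses means the later epoch consumed a stale value. The class and method names are hypothetical, and the paper's actual mechanism extends cache coherence rather than using software sets.

```python
class SpeculationTracker:
    """Toy model of read-after-write violation detection between epochs."""

    def __init__(self):
        # epoch number -> set of addresses that epoch has loaded speculatively
        self.read_sets = {}

    def load(self, epoch, addr):
        """Record that `epoch` speculatively loaded `addr`."""
        self.read_sets.setdefault(epoch, set()).add(addr)

    def store(self, epoch, addr):
        """Return the later epochs that already read `addr` and must be squashed."""
        return sorted(
            later
            for later, reads in self.read_sets.items()
            if later > epoch and addr in reads
        )


tracker = SpeculationTracker()
tracker.load(2, 0x10)            # epoch 2 runs ahead and reads 0x10
print(tracker.store(1, 0x10))    # [2]: epoch 1 writes it afterwards, so epoch 2 is squashed
```

This sketch deliberately ignores the case where the later epoch wrote the address itself before reading it, which would not be a violation.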
Dynamic Speculation and Synchronization of Data Dependencies
1997
Cited by 165 (21 self)
Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the...
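As a rough illustration of point (i), predicting whether a load is likely to mis-speculate, the sketch below keeps a saturating counter per load address. This is an assumed, simplified design for illustration only; the names, the threshold, and the counter width are hypothetical and are not taken from the paper.

```python
class DependencePredictor:
    """Per-load saturating counters: a high count means 'do not speculate'."""

    def __init__(self, threshold=2, max_count=3):
        self.threshold = threshold
        self.max_count = max_count
        self.counters = {}  # load PC -> saturating mis-speculation counter

    def should_speculate(self, load_pc):
        """Speculate unless this load has mis-speculated often enough recently."""
        return self.counters.get(load_pc, 0) < self.threshold

    def update(self, load_pc, misspeculated):
        """Raise the counter on a mis-speculation, lower it otherwise."""
        count = self.counters.get(load_pc, 0)
        if misspeculated:
            count = min(count + 1, self.max_count)
        else:
            count = max(count - 1, 0)
        self.counters[load_pc] = count
```

A load that is predicted not to speculate would instead wait for (synchronize with) the store it depends on, which corresponds to point (ii) of the abstract.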
Value Profiling
In MICRO-97, 1997
Cited by 99 (5 self)
Identifying variables as invariant or constant at compile-time allows the compiler to perform optimizations including constant folding, code specialization, and partial evaluation. Some variables, which cannot be labeled as constants, may exhibit semi-invariant behavior. A semi-invariant variable is one that cannot be identified as a constant at compile-time, but has a high degree of invariant behavior at run-time. If run-time information were available to identify these variables as semi-invariant, they could then benefit from invariant-based compiler optimizations. In this paper we examine the invariance found from profiling instruction values, and show that many instructions have semi-invariant values even across different inputs. We also investigate the ability to estimate the invariance for all instructions in a program from only profiling load instructions. In addition, we propose a new type of profiling called Convergent Profiling. Estimating the invariance from loads and converg...
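One plausible way to quantify the "degree of invariant behavior" of an instruction is the fraction of its executions that produce one of its few most frequent values. The sketch below assumes that definition; the function name, the `top_n` parameter, and the sample trace are hypothetical and not taken from the paper.

```python
from collections import Counter


def invariance(values, top_n=1):
    """Fraction of observed values covered by the top_n most frequent ones."""
    if not values:
        return 0.0
    counts = Counter(values)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(values)


# Hypothetical trace of results produced by a single load instruction.
trace = [8, 8, 8, 3, 8, 8, 8, 8, 5, 8]
print(invariance(trace))  # 0.8
```

An instruction scoring close to 1.0 under this measure would be a candidate for the invariant-based optimizations the abstract mentions, such as code specialization on the dominant value.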
Value Profiling and Optimization
1999
Cited by 63 (5 self)
Variables and instructions that have invariant or predictable values at run-time, but cannot be identified as such using compiler analysis, can benefit from value-based compiler optimizations. Value-based optimizations include all optimizations based on a predictable value or range of values for a variable or instruction at run-time. These include constant propagation, code specialization, optimizations assuming the value predictability of an instruction, continuous optimization, and partial evaluation. This paper explores...
Compiler Optimization of Scalar Value Communication Between Speculative Threads
In Proceedings of the 10th ASPLOS, 2002
Cited by 56 (17 self)
While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In this paper, we focus on one important limitation of program performance under TLS, which is stalls due to forwarding scalar values between threads that would otherwise cause frequent data dependences. We present and evaluate dataflow algorithms for three increasingly aggressive instruction scheduling techniques that reduce the critical forwarding path introduced by the synchronization associated with this data forwarding. In addition, we contrast our compiler techniques with related hardware-only approaches. With our most aggressive compiler and hardware techniques, we improve performance under TLS by 6.2% to 28.5% for 6 of 14 applications, and by at least 2.7% for half of the other applications.
Value Locality And Speculative Execution
1997
Cited by 51 (1 self)
This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism without violating program correctness. Value locality is a program attribute that describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Most modern processors already exploit value locality through the use of control speculation (i.e. branch prediction), which seeks to predict the future values of condition code bits and branch-target addresses based on previously-seen values. Experimental results indicate that value locality exists for condition codes and branch target addresses, and for general-purpose ...
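The definition of value locality given above (recurrence of a previously-seen value within a storage location) can be estimated from a trace. The sketch below assumes a trace of `(location, value)` pairs and a per-location history of the last `depth` distinct values; the function name, the trace format, and the choice to count a location's first access as a miss are all illustrative assumptions.

```python
from collections import defaultdict, deque


def value_locality(accesses, depth=1):
    """Fraction of accesses whose value matches one of the last `depth`
    distinct values seen at the same storage location."""
    history = defaultdict(lambda: deque(maxlen=depth))
    hits = 0
    total = 0
    for location, value in accesses:
        seen = history[location]
        total += 1
        if value in seen:
            hits += 1
            seen.remove(value)  # re-appended below to mark it most recent
        seen.append(value)      # the oldest value is dropped once depth is exceeded
    return hits / total if total else 0.0


# Hypothetical trace: two storage locations, "r1" and "r2".
trace = [("r1", 0), ("r1", 0), ("r1", 4), ("r1", 0), ("r2", 7), ("r2", 7)]
print(value_locality(trace, depth=2))  # 0.5
```

A deeper history can only raise the measured locality, at the cost of more state per location.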
The Declining Effectiveness of Dynamic Caching for General-Purpose Microprocessors
1995
Cited by 47 (7 self)
The computational power of commodity general-purpose microprocessors is racing to truly amazing levels. As peak levels of performance rise, the building of memory systems that can keep pace becomes increasingly problematic. We claim that in addition to the latency associated with waiting for operands, the bandwidth of the memory system, especially that across the chip boundary, will become a progressively greater limit to high performance. After describing the current state of microsolutions aimed at alleviating the memory bottleneck, this paper postulates that dynamic caches themselves use memory inefficiently and will impede attempts to solve the memory problem. We present an analysis of several important algorithms, which shows that increasing levels of integration will not result in computational requirements outstripping off-chip bandwidth needs, thereby preserving the memory bottleneck. We then present results from two sets of simulations, which measured both the efficiency with which current caching techniques use memory (generally less than 20%), and how well (or poorly) caches reduce traffic to main memory (cache sizes up to 2000 times worse than optimal). We then discuss how two classes of techniques, (i) decoupling memory operations from computation, and (ii) explicit compiler management of the memory hierarchy, provide better long-term solutions to lowering a program's memory latencies and bandwidth requirements. Finally, we describe Galileo, a new project that will attempt to provide a long-term solution to the pernicious memory bottleneck.
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization
In International Symposium on Computer Architecture, 2005
Cited by 40 (11 self)
The load-store unit is a performance-critical component of a dynamically-scheduled processor. It is also a complex and non-scalable component. Several recently proposed techniques use some form of speculation to simplify the load-store unit and check this speculation by re-executing some of the loads prior to commit. We call such techniques load optimizations. One recent load optimization improves load queue (LQ) scalability by using re-execution rather than associative search to check speculative intra- and inter-thread memory ordering. A second technique improves store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it and re-executing loads to check that speculation. A third technique speculatively removes redundant loads from the execution engine; re-execution detects false eliminations. Unfortunately, the benefits of a load optimization are often mitigated by re-execution itself. Re-execution contends for cache bandwidth with store commit, and serializes load re-execution with subsequent store commit. If a given load optimization requires a sufficient number of load re-executions, the aggregate re-execution cost may overwhelm the benefits of the technique entirely and even cause drastic slowdowns. Store Vulnerability Window (SVW) is a new mechanism that significantly reduces the re-execution requirements of a given load optimization. SVW is based on monotonic store sequence numbering and an adaptation of Bloom filtering. The cost of a typical SVW implementation is a 1KB buffer and a 16-bit field per LQ entry. Across the three optimizations we study, SVW reduces re-executions by an average of 85%. This reduction relieves cache port contention and removes many of the dynamic serialization events that contribute the bulk of re-execution’s cost, allowing these load optimizations to perform up to their full potential. For the speculative SQ, this means the chance to perform at all, as without SVW it posts significant slowdowns.
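The combination of monotonic store sequence numbers and a Bloom-style filter can be sketched as follows. This is a simplified model under stated assumptions, not the paper's hardware design: it assumes a small table indexed by a hash of the address that holds the sequence number of the last committed store mapping to that entry, and that each load remembers the newest store sequence number it is known to be safe against. All names, the table size, and the hash are hypothetical, and sequence-number wraparound is ignored.

```python
class StoreVulnerabilityFilter:
    """Toy filter deciding whether a load needs re-execution before commit."""

    def __init__(self, entries=512):
        self.table = [0] * entries  # last store sequence number per hashed address
        self.next_ssn = 1           # monotonically increasing store sequence number

    def _index(self, addr):
        # Assumed hash: drop the low 3 bits, then take the table index.
        return (addr >> 3) % len(self.table)

    def commit_store(self, addr):
        """Assign the next sequence number to a committing store and record it."""
        ssn = self.next_ssn
        self.next_ssn += 1
        self.table[self._index(addr)] = ssn
        return ssn

    def latest_ssn(self):
        """Sequence number of the youngest committed store (0 if none)."""
        return self.next_ssn - 1

    def must_reexecute(self, addr, load_ssn):
        """Re-execute only if a store newer than `load_ssn` may have hit `addr`."""
        return self.table[self._index(addr)] > load_ssn
```

Because distinct addresses can share a table entry, the filter can report false positives (an unnecessary re-execution) but, in this model, never skips a re-execution that a newer store to the same address would require.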
Exploiting Instruction Level Parallelism in the Presence of Conditional Branches
1996
Cited by 37 (3 self)
Wide-issue superscalar and VLIW processors utilize instruction-level parallelism (ILP) to achieve high performance. However, if insufficient ILP is found, the performance potential of these processors suffers dramatically. Branch instructions, which are one of the major limitations to exploiting ILP, enforce strict ordering conditions in programs to ensure correct execution. Therefore, it is difficult to achieve the desired overlap of instruction execution with branches in the instruction stream. Effectively exploiting ILP in the presence of branches requires efficient handling of branches and the dependences they impose. This dissertation investigates two techniques for exposing and enhancing ILP in the presence of branches: speculative execution and predicated execution. Speculative execution enables an ILP compiler to remove dependences between instructions and prior branches. In this manner, the execution of instructions and predicted future instructions may be overlapped. Compiler-controlled speculative execution is employed using an efficient structure called the superblock. The formation and optimization of superblocks increase ILP along important execution paths by systematically removing constraints due to unimportant paths. In conjunction with superblock optimizations, speculative execution is utilized to remove control dependences in the superblock

