Results 1 - 4 of 4
Speculative Versioning Cache - In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, 1998
"... Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation which enables a load or store to be speculatively exe ..."
Abstract - Cited by 207 (8 self)
Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction-level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation, which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB), uses a centralized buffer to support speculative versions. Our proposal, called the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. A preliminary evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance, and that private-cache solutions trade off hit rate for hit latency.
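As a rough illustration of the versioning problem this abstract describes, the sketch below models a single memory location with per-task speculative versions: loads return the closest preceding version in program order, commits promote a version to architectural state, and a store that arrives after a logically later task has already loaded the location flags that task for squashing. The class and method names are illustrative assumptions, not the SVC (or ARB) hardware design.

# Minimal sketch (not the SVC or ARB design) of speculative versioning for
# one memory location; tasks are numbered in program order.
class SpeculativeVersions:
    def __init__(self, committed_value=0):
        self.committed = committed_value  # architectural (non-speculative) value
        self.versions = {}                # task_id -> speculatively stored value
        self.readers = set()              # task_ids that speculatively loaded this location

    def store(self, task_id, value):
        # Any later task that already loaded this location saw a stale value
        # and must be squashed (memory dependence violation).
        violators = {t for t in self.readers if t > task_id}
        self.versions[task_id] = value
        return violators

    def load(self, task_id):
        # Return the version created by the closest task at or before this
        # one in program order; fall back to the committed value.
        self.readers.add(task_id)
        preceding = [t for t in self.versions if t <= task_id]
        return self.versions[max(preceding)] if preceding else self.committed

    def commit(self, task_id):
        # The task is now non-speculative: its version becomes architectural.
        if task_id in self.versions:
            self.committed = self.versions.pop(task_id)
        self.readers.discard(task_id)

    def squash(self, task_id):
        # Discard a misspeculated task's version and its read record.
        self.versions.pop(task_id, None)
        self.readers.discard(task_id)

loc = SpeculativeVersions(committed_value=10)
print(loc.load(task_id=2))              # 10: no earlier speculative version yet
print(loc.store(task_id=1, value=42))   # {2}: task 2 loaded a stale value
print(loc.load(task_id=3))              # 42: closest preceding version is task 1's
loc.commit(task_id=1)                   # task 1's value becomes architectural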
Trace processors - In Proceedings of the 30th International Symposium on Microarchitecture, 1997
"... ..."
(Show Context)
The stampede approach to thread-level speculation - ACM Transactions on Computer Systems, 2005
"... Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multipr ..."
Abstract - Cited by 71 (9 self)
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software to allow single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, the program performance of two of the 13 general-purpose applications studied improves by 86% and 56%, four others improve by more than 8%, and the average improvement across all applications is 16%, confirming that TLS is a promising way ...
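The sketch below illustrates, in simplified form, the idea of building TLS on invalidation-based coherence: each epoch's cache lines carry speculatively-loaded/modified bits, and an invalidation arriving from a logically earlier epoch on a speculatively loaded line forces a squash. The data structures and names are assumptions for illustration, not the protocol proposed in the article.

# Simplified sketch of TLS violation detection layered on invalidations.
from dataclasses import dataclass, field

@dataclass
class Line:
    value: int = 0
    spec_loaded: bool = False    # line was read speculatively by this epoch
    spec_modified: bool = False  # line was written speculatively by this epoch

@dataclass
class Epoch:
    number: int                                 # position in program order
    cache: dict = field(default_factory=dict)   # addr -> Line (private cache)
    violated: bool = False

    def spec_load(self, addr, memory):
        line = self.cache.setdefault(addr, Line(memory.get(addr, 0)))
        line.spec_loaded = True
        return line.value

    def spec_store(self, addr, value, all_epochs):
        line = self.cache.setdefault(addr, Line())
        line.value, line.spec_modified = value, True
        # Invalidation-based coherence: a write notifies the other caches.
        for other in all_epochs:
            if other is not self:
                on_invalidation(other, self.number, addr)

def on_invalidation(receiver, sender_number, addr):
    # If a logically earlier epoch writes a line this epoch already read
    # speculatively, this epoch consumed a stale value: squash and restart.
    line = receiver.cache.get(addr)
    if line and line.spec_loaded and sender_number < receiver.number:
        receiver.violated = True
        receiver.cache.clear()   # discard all speculative state

memory = {0x40: 7}
e1, e2 = Epoch(1), Epoch(2)
print(e2.spec_load(0x40, memory))   # 7: epoch 2 optimistically reads early
e1.spec_store(0x40, 99, [e1, e2])   # earlier epoch writes the same line
print(e2.violated)                  # True: epoch 2 must be squashed and rerun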
Trace processors: Exploiting hierarchy and speculation, 1999
"... In high-performance processors, increasing the number of instructions fetched and executed in parallel is becoming increasingly complex, and the peak bandwidth is often underutilized due to control and data dependences. A trace processor 1) efficiently sequences through programs in large units, call ..."
Abstract - Cited by 10 (1 self)
In high-performance processors, increasing the number of instructions fetched and executed in parallel is becoming increasingly complex, and the peak bandwidth is often underutilized due to control and data dependences. A trace processor 1) efficiently sequences through programs in large units, called traces, and allocates trace-sized units of work to distributed processing elements (PEs), and 2) uses aggressive speculation to partially alleviate the effects of control and data dependences. A trace is a dynamic sequence of instructions, typically 16 to 32 instructions in length, which embeds any number of taken or not-taken branch instructions. The hierarchical, trace-based approach to increasing parallelism overcomes basic inefficiencies of managing fetch and execution resources on an individual instruction basis. This thesis shows that the trace processor is a good microarchitecture for implementing wide-issue machines. Three key points support this conclusion. 1. Trace processors perform better than wide-issue superscalar counterparts because they deliver high instruction throughput without significantly increasing cycle time. The ...
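To make the notion of a trace concrete, the sketch below chops a dynamic instruction stream into traces of a bounded length for dispatch to processing elements. The length-only termination rule is a simplification assumed here, not the thesis's actual trace-selection heuristics.

# Simplified sketch of grouping a dynamic instruction stream into traces.
MAX_TRACE_LEN = 16   # traces are typically 16 to 32 instructions long

def build_traces(dynamic_stream):
    # dynamic_stream: iterable of (pc, is_branch) tuples in execution order.
    # Branches (taken or not) are simply embedded in the trace; in this
    # simplified model only the length cap terminates a trace.
    traces, current = [], []
    for instr in dynamic_stream:
        current.append(instr)
        if len(current) == MAX_TRACE_LEN:
            traces.append(current)
            current = []
    if current:
        traces.append(current)
    return traces

# Each trace is then dispatched as a single unit of work to one of the
# distributed processing elements, rather than scheduling instructions
# one at a time.
stream = [(0x1000 + 4 * i, i % 5 == 4) for i in range(40)]
print([len(t) for t in build_traces(stream)])   # [16, 16, 8]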