Runahead execution: An alternative to very large instruction windows for out-of-order processors
In HPCA-9, 2003
Cited by 123 (19 self)
Today’s high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large, in terms of both design complexity and power consumption. And, the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor, without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations, allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel® Pentium® 4 processor, having a 128-entry instruction window, adding runahead execution improves the IPC (Instructions Per Cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.
Handling Long-latency Loads in a Simultaneous Multithreading Processor
2001
Cited by 93 (11 self)
Simultaneous multithreading architectures have been defined previously with fully shared execution resources. When one thread in such an architecture experiences a very long-latency operation, such as a load miss, the thread will eventually stall, potentially holding resources which other threads could be using to make forward progress. This paper shows that in many cases it is better to free the resources associated with a stalled thread rather than keep that thread ready to immediately begin execution upon return of the loaded data. Several possible architectures are examined, and some simple solutions are shown to be very effective, achieving speedups close to 6.0 in some cases, and averaging 15% speedup with four threads and over 100% speedup with two threads running. Response times are cut in half for several workloads in open system experiments.
Dynamic Speculative Precomputation
2001
Cited by 77 (10 self)
A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, that have been extracted directly from the program they are trying to accelerate. Proposed techniques typically require manual user intervention to extract and optimize instruction sequences. This paper proposes Dynamic Speculative Precomputation, which performs all necessary instruction analysis, extraction, and optimization through the use of back-end instruction analysis hardware, located off the processor's critical path. For a set of memory-limited benchmarks, an average speedup of 14% is achieved when constructing simple p-slices, and this gain grows to 33% when making use of aggressive optimizations.
Design and Evaluation of Compiler Algorithms for Pre-Execution
In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2002
Cited by 40 (8 self)
Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.
Guided Region Prefetching: A Cooperative Hardware/Software Approach
In Proceedings of the 30th International Symposium on Computer Architecture, 2003
Cited by 39 (5 self)
Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective 22% and 21% gains over no prefetching, but SRP incurs 180% extra memory traffic, nearly tripling bandwidth requirements. GRP achieves performance close to SRP, but with a mere eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compiler-based program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache under 20%.
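The traffic figures quoted in this abstract can be checked against each other with a few lines of arithmetic. The sketch below uses only the percentages stated above; the variable names are illustrative and the baseline of 1.0 (no prefetching) is an assumption made for the calculation.

```python
# Figures quoted in the abstract above.
srp_extra_traffic = 1.80   # SRP: 180% extra memory traffic over no prefetching
grp_extra_traffic = 0.23   # GRP: 23% extra memory traffic over no prefetching

# Total traffic relative to an assumed no-prefetching baseline of 1.0.
srp_total = 1.0 + srp_extra_traffic   # 2.8x the baseline, "nearly tripling"
grp_total = 1.0 + grp_extra_traffic   # 1.23x the baseline

# "A mere eighth of the extra prefetching traffic": 180% / 8 = 22.5%,
# which is consistent with the quoted 23% once rounded.
one_eighth_of_srp = srp_extra_traffic / 8
ratio = grp_extra_traffic / srp_extra_traffic
```

Under these numbers GRP gives up about one percentage point of speedup (21% versus 22%) in exchange for roughly an eighth of SRP's extra memory traffic.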
Dynamically Allocating Processor Resources between Nearby and Distant ILP
2001
Cited by 38 (2 self)
Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such as the register file, issue queue, and reorder buffer. Simultaneously, cycle time constraints limit the sizes of these structures, resulting in conflicting design requirements. In this paper, we present a novel microarchitecture designed to overcome the limitations of a register file size dictated by cycle time constraints. Available registers are dynamically allocated between the primary program thread and a future thread. The future thread executes instructions when the primary thread is limited by resource availability. The future thread is not constrained by in-order commit requirements. It is therefore able to examine a much larger instruction window and jump far ahead to execute ready instructions. Results are communicated back to the primary thread by warming up the register file, instruction cache, data cache, and instruction reuse buffer, and by resolving branch mispredicts early. The proposed microarchitecture is able to get an overall speedup of 1.17 over the base processor for our benchmark set, with speedups of up to 1.64.
Mitosis compiler: An infrastructure for speculative threading based on pre-computation slices
In PLDI ’05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, 2005
Cited by 37 (4 self)
Speculative parallelization can provide significant sources of additional thread-level parallelism, especially for irregular applications that are hard to parallelize by conventional approaches. In this paper, we present the Mitosis compiler, which partitions applications into speculative threads, with special emphasis on applications for which conventional parallelizing approaches fail. The management of inter-thread data dependences is crucial for the performance of the system. The Mitosis framework uses a pure software approach to predict/compute the thread’s input values. This software approach is based on the use of pre-computation slices (p-slices), which are built by the Mitosis compiler and added at the beginning of the speculative thread. P-slices must compute thread input values accurately but they do not need to guarantee correctness, since the underlying architecture can detect and recover from misspeculations. This allows the compiler to use aggressive/unsafe optimizations to significantly reduce their overhead. The most important optimizations included in the Mitosis compiler and presented in this paper are branch pruning, memory and register dependence speculation, and early thread squashing. Performance evaluation of Mitosis compiler/architecture shows an average speedup of 2.2.
Dual-core execution: building a highly scalable single-thread instruction window
2005
Cited by 33 (2 self)
Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memory-intensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-of-order execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor.
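The front processor's handling of cache-missing loads, as described in this abstract, can be illustrated with a toy model. This is a minimal sketch and not the paper's design: the two-opcode instruction format, the register names, and the `run_front` function are all hypothetical, and the queue, the back processor, and branch handling are left out entirely.

```python
INV = object()  # marker for an invalid (not yet known) register value


def run_front(program, regs, memory, cached):
    """Run a toy front processor; return the miss addresses it started."""
    prefetched = []
    for op, dst, a, b in program:
        if op == "load":                 # dst = memory[regs[a] + b]
            base = regs[a]
            if base is INV:
                regs[dst] = INV          # address unknown, nothing to fetch
                continue
            addr = base + b
            if addr in cached:
                regs[dst] = memory[addr]
            else:
                if addr not in prefetched:
                    prefetched.append(addr)  # start the miss, do not wait
                regs[dst] = INV              # invalid value instead of a stall
        elif op == "add":                # dst = regs[a] + regs[b]
            x, y = regs[a], regs[b]
            regs[dst] = INV if (x is INV or y is INV) else x + y
    return prefetched


regs = {"r1": 0, "r2": 64}
memory = {0: 7, 64: 5, 128: 9}
cached = {0}                             # only address 0 starts in the cache
program = [
    ("load", "r3", "r1", 0),    # hit: r3 gets a real value
    ("load", "r4", "r2", 0),    # miss at 64: r4 becomes INV
    ("add", "r5", "r3", "r4"),  # depends on r4, so r5 becomes INV
    ("load", "r6", "r2", 64),   # independent miss at 128, also started
    ("load", "r7", "r4", 0),    # address depends on INV, cannot be started
]
issued = run_front(program, regs, memory, cached)
```

In this example the two independent misses (addresses 64 and 128) are both started without waiting, which is the overlap that warms the cache for the trailing processor; the load whose address depends on an invalid value produces no prefetch.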
Enhancing Memory Level Parallelism via Recovery-Free Value Prediction
In Proceedings of the 17th International Conference on Supercomputing, 2003
Cited by 27 (3 self)
The ever-increasing computational power of contemporary microprocessors reduces the execution time spent on arithmetic computations (i.e., the computations not involving slow memory operations such as cache misses) significantly. Therefore, for memory intensive workloads, it becomes more important to overlap multiple cache misses than to overlap slow memory operations with other computations. In this paper, we propose a novel technique to parallelize sequential cache misses, thereby increasing memory-level parallelism (MLP). Our idea is based on the value prediction, which was proposed originally as an instruction-level-parallelism (ILP) optimization to break true data dependencies. In this paper, we advocate value prediction in its capability to enhance MLP instead of ILP. We propose to use value prediction and value speculative execution only for prefetching so that the complex prediction validation and misprediction recovery mechanisms are avoided and only minor changes in the microarchitecture are needed. The same hardware modifications also enable aggressive memory disambiguation for prefetching. The experimental results show that our technique enhances MLP effectively and achieves significant speedups even with a simple stride value predictor.
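For readers unfamiliar with the "simple stride value predictor" mentioned at the end of this abstract, the general technique can be sketched as follows. This is a generic textbook-style illustration, not the predictor evaluated in the paper: the class name, the table keyed by load address (PC), and the unbounded dictionary are assumptions made for brevity.

```python
class StridePredictor:
    """Per-load stride value predictor (illustrative, keyed by load PC)."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        """Return the predicted next value for this load, or None if unseen."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        last_value, stride = entry
        return last_value + stride

    def update(self, pc, actual):
        """Train on the value the load actually returned."""
        entry = self.table.get(pc)
        stride = 0 if entry is None else actual - entry[0]
        self.table[pc] = (actual, stride)


predictor = StridePredictor()
predictions = []
for value in [100, 108, 116, 124]:   # values returned by one load over time
    predictions.append(predictor.predict(0x40))
    predictor.update(0x40, value)
next_prediction = predictor.predict(0x40)
```

When predictions like these are used only to compute prefetch addresses, as the abstract proposes, a wrong guess (such as the second prediction here, made before the stride is learned) costs at most a useless prefetch, so no validation or recovery logic is needed.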
Speculative Precomputation: Exploring the Use of Multithreading for Latency
Intel Technology Journal, 2002
Cited by 24 (4 self)
Speculative Precomputation (SP) is a technique to improve the latency of single-threaded applications by utilizing idle multithreading hardware resources to perform aggressive long-range data prefetches. Instead of trying to explicitly parallelize a single-threaded application, SP does the following:
• Targets only a small set of static load instructions, called delinquent loads, which incur the most performance degrading cache miss penalties.

