Results 1 - 10 of 36
Dynamic Speculation and Synchronization of Data Dependences
In Proc. 24th International Symposium on Computer Architecture, 1997. Cited by 185 (22 self).
Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.
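A minimal software sketch of the two mechanisms the abstract names, assuming a simple table that learns (load, store) pairs from past mis-speculations and then delays (synchronizes) repeat offenders until the paired store has executed; the class, its unbounded dictionaries, and the method names are illustrative stand-ins, not the hardware structures proposed in the paper.

# Python sketch (illustrative only)
class DependenceSpeculationTable:
    def __init__(self):
        self.pairs = {}                # load PC -> store PC learned from past mis-speculations
        self.completed_stores = set()  # stores that have executed in the current window

    def record_misspeculation(self, load_pc, store_pc):
        # The load ran ahead of a store it actually depended on: learn the pair.
        self.pairs[load_pc] = store_pc

    def store_executed(self, store_pc):
        self.completed_stores.add(store_pc)

    def load_must_wait(self, load_pc):
        # Predict: a load that mis-speculated before waits (synchronizes) until
        # its paired store has executed; all other loads speculate freely.
        store_pc = self.pairs.get(load_pc)
        return store_pc is not None and store_pc not in self.completed_stores

# Example: the load at 0x40 once mis-speculated past the store at 0x30.
table = DependenceSpeculationTable()
table.record_misspeculation(load_pc=0x40, store_pc=0x30)
assert table.load_must_wait(0x40)       # synchronize: paired store still pending
table.store_executed(0x30)
assert not table.load_must_wait(0x40)   # safe to issue the load now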
Trace processors
In Proceedings of the 30th International Symposium on Microarchitecture, 1997.
A Chip-Multiprocessor Architecture with Speculative Multithreading
IEEE Trans. on Computers, 1999.
The STAMPede Approach to Thread-Level Speculation
ACM Transactions on Computer Systems, 2005. Cited by 71 (9 self).
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software that allows single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, the program performance of two of the 13 general-purpose applications studied improves by 86% and 56%, four others improve by more than 8%, and the average improvement across all applications is 16%, confirming that TLS is a promising way ...
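A rough sketch of the violation-detection idea behind the coherence extension the abstract describes, assuming each speculative thread (epoch) records the addresses it has speculatively loaded and an invalidation caused by a store from a logically earlier epoch squashes any later epoch that already read that address; the class name and data structures are illustrative simplifications, not the paper's protocol.

# Python sketch (illustrative only)
class EpochViolationTracker:
    def __init__(self, num_epochs):
        self.spec_loads = [set() for _ in range(num_epochs)]  # addresses speculatively loaded per epoch
        self.violated = [False] * num_epochs

    def speculative_load(self, epoch, addr):
        self.spec_loads[epoch].add(addr)

    def store(self, epoch, addr):
        # A store sends an invalidation; any logically later epoch that already
        # speculatively loaded the address read stale data and must be squashed.
        for later in range(epoch + 1, len(self.spec_loads)):
            if addr in self.spec_loads[later]:
                self.violated[later] = True

# Example: epoch 2 reads address 0x100 early, then epoch 1 writes it.
tracker = EpochViolationTracker(num_epochs=4)
tracker.speculative_load(2, 0x100)
tracker.store(1, 0x100)
assert tracker.violated[2]   # epoch 2 must be squashed and re-executed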
Task Selection for a Multiscalar Processor
In Proceedings of the 31st Annual International Symposium on Microarchitecture, 1998. Cited by 70 (6 self).
The Multiscalar architecture advocates a distributed processor organization and task-level speculation to exploit high degrees of instruction level parallelism (ILP) in sequential programs without impeding improvements in clock speeds. The main goal of this paper is to understand the key implications of the architectural features of distributed processor organization and task-level speculation for compiler task selection from the point of view of performance. We identify the fundamental performance issues to be: control flow speculation, data communication, data dependence speculation, load imbalance, and task overhead. We show that these issues are intimately related to a few key characteristics of tasks: task size, inter-task control flow, and inter-task data dependence. We describe compiler heuristics to select tasks with favorable characteristics. We report experimental results to show that the heuristics are successful in boosting overall performance by establishing larger ILP win...
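The abstract ties performance to a few task characteristics (task size, inter-task control flow, inter-task data dependence); the toy scoring function below, with made-up weights and a made-up target task size, sketches how a compiler heuristic might rank candidate partitions on those characteristics. It is an illustrative assumption, not the paper's heuristics.

# Python sketch (illustrative only)
from collections import namedtuple

# size: dynamic instructions; successors: distinct inter-task control-flow targets;
# cross_deps: values produced in this task and consumed by later tasks.
Task = namedtuple("Task", "size successors cross_deps")

def task_cost(task, target_size=30, w_size=1.0, w_flow=4.0, w_deps=6.0):
    # Penalize tasks that are too small (overhead) or too large (load imbalance),
    # hard-to-predict inter-task control flow, and cross-task data dependences.
    return (w_size * abs(task.size - target_size)
            + w_flow * max(0, task.successors - 1)
            + w_deps * task.cross_deps)

def select_partition(candidate_partitions):
    # Each candidate is a list of Task tuples covering the same program region;
    # pick the partition with the lowest total estimated cost.
    return min(candidate_partitions, key=lambda tasks: sum(task_cost(t) for t in tasks))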
Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor
In Proc. of the 1998 International Conference on Supercomputing. Cited by 64 (9 self).
Chip-multiprocessors (CMP) are a promising approach for exploiting the increasing transistor count on a chip. To allow sequential applications to be executed on this architecture, current proposals incorporate hardware support to exploit speculative parallelism. However, these proposals either require re-compilation of the source program or use substantial hardware that tailors the architecture for speculative execution, thereby resulting in wasted resources when running parallel applications. In this paper, we present a CMP architecture that has hardware and software support for speculative execution of sequential binaries without the need for source re-compilation. The software support enables identification of threads from sequential binaries, while a modest amount of hardware allows register-level communication and enforces true inter-thread memory dependences. We evaluate this architecture and show that it is able to deliver high performance.
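A minimal sketch of the register-level communication the abstract mentions, assuming a producer thread forwards each register after its last write and a consumer must stall until the value arrives; the class and method names are hypothetical, and the stall is modeled as an exception for brevity.

# Python sketch (illustrative only)
class RegisterForwardingUnit:
    def __init__(self):
        self.values = {}   # register name -> value forwarded by the predecessor thread

    def forward(self, reg, value):
        # Producer side: called after the thread's last write to `reg`.
        self.values[reg] = value

    def receive(self, reg):
        # Consumer side: in hardware the thread would stall; here the stall is
        # modeled as an exception for brevity.
        if reg not in self.values:
            raise RuntimeError("stall: register %s not yet forwarded" % reg)
        return self.values[reg]

# Example: the predecessor thread forwards r3 once it knows the final value.
rfu = RegisterForwardingUnit()
rfu.forward("r3", 42)
assert rfu.receive("r3") == 42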
Min-Cut Program Decomposition for Thread-Level Speculation
2004. Cited by 62 (2 self).
With billion-transistor chips on the horizon, single-chip multiprocessors (CMPs) are likely to become commodity components. Speculative CMPs use hardware to enforce dependence, allowing the compiler to improve performance by speculating on ambiguous dependences without absolute guarantees of independence. The compiler is responsible for decomposing a sequential program into speculatively parallel threads, while considering multiple performance overheads related to data dependence, load imbalance, and thread prediction. Although the decomposition problem lends itself to a min-cut-based approach, the overheads depend on the thread size, requiring the edge weights to be changed as the algorithm progresses. The changing weights make our approach different from graph-theoretic solutions to the general problem of task scheduling. One recent work uses a set of heuristics, each targeting a specific overhead in isolation, and gives precedence to thread prediction, without comparing the performance of the threads resulting from each heuristic. By contrast, our method uses a sequence of balanced min-cuts that give equal consideration to all the overheads, and adjusts the edge weights after every cut. This method achieves a geometric-mean speedup of 74% for floating-point programs and 23% for integer programs on a four-processor chip, improving on the 52% and 13% achieved by the previous heuristics.
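A simplified sketch of the iterative idea, assuming the program region is a linear chain of basic blocks: each candidate cut's cost combines the data dependences that would cross the current cut set with a load-imbalance penalty, and costs are recomputed after every cut because they depend on the thread sizes the cuts create. The linear-chain model, cost terms, and weight are illustrative assumptions, not the paper's balanced min-cut formulation.

# Python sketch (illustrative only)
def decompose(block_sizes, deps, num_threads, imbalance_weight=1.0):
    # block_sizes: instruction count per basic block in program order.
    # deps: (producer_block, consumer_block) data dependences, producer < consumer.
    # Returns cut points; a cut at position c separates blocks [..c-1] and [c..].
    cuts = []
    for _ in range(num_threads - 1):
        best = None
        for point in range(1, len(block_sizes)):
            if point in cuts:
                continue
            trial = sorted(cuts + [point])
            # "Edge weight": dependences crossing any cut, recomputed every
            # iteration because it depends on the current cut set.
            crossing = sum(1 for p, c in deps if any(p < cut <= c for cut in trial))
            # Load imbalance of the threads this trial cut set would create.
            bounds = [0] + trial + [len(block_sizes)]
            sizes = [sum(block_sizes[a:b]) for a, b in zip(bounds, bounds[1:])]
            cost = crossing + imbalance_weight * (max(sizes) - min(sizes))
            if best is None or cost < best[0]:
                best = (cost, point)
        cuts.append(best[1])
    return sorted(cuts)

# Example: five blocks, one dependence from block 1 to block 3, three threads.
print(decompose([5, 3, 8, 2, 6], deps=[(1, 3)], num_threads=3))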
Compiling for the Multiscalar Architecture
1998. Cited by 53 (2 self).
High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of real-world software that runs on these computers. State-of-the-art microprocessors exploit instruction-level parallelism (ILP) to achieve high performance on such applications by searching for independent instructions in a dynamic window of instructions and executing them on a wide-issue pipeline. Increasing the window size and the issue width to extract more ILP may hinder achieving high clock speeds, limiting overall performance. The Multiscalar architecture employs multiple small windows and many narrow-issue processing units to exploit ILP at high clock speeds. Sequential programs are partitioned into code fragments called tasks, which are speculatively executed in parallel. Inter-task register dependences are honored via communication and synchronization, and inter-task control flow and memory depe...
Control Flow Speculation in Multiscalar Processors
1997. Cited by 42 (5 self).
The Multiscalar architecture executes a single sequential program following multiple flows of control. In the Multiscalar hardware, a global sequencer, with help from the compiler, takes large steps through the program's control flow graph (CFG) speculatively, starting a new thread of control (task) at each step. This is inter-task control flow speculation. Within a task, traditional control flow speculation is used to extract instruction level parallelism. This is intra-task control flow speculation. This paper focuses on mechanisms to implement inter-task control flow speculation (task prediction) in a Multiscalar implementation. This form of speculation has fundamental differences from traditional branch prediction. We look in detail at the issues of prediction automata, history generation, and target buffers. We present implementations in each of these areas that offer good accuracy, size, and performance characteristics. Keywords: Multiscalar Architecture, Control-flow Speculation, ...
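A toy model of history-based task prediction, assuming the predictor is indexed by the last few task targets (path history) and returns the most frequently seen successor; the table organization, history length, and update policy are illustrative stand-ins for the prediction automata and target buffers studied in the paper.

# Python sketch (illustrative only)
class TaskPredictor:
    def __init__(self, history_len=4):
        self.history_len = history_len
        self.history = []   # most recent task targets (path history)
        self.table = {}     # history pattern -> {next task target: count}

    def predict(self):
        counts = self.table.get(tuple(self.history), {})
        return max(counts, key=counts.get) if counts else None

    def update(self, actual_target):
        # Train on the task target chosen by the resolved inter-task control
        # flow, then shift it into the history.
        key = tuple(self.history)
        bucket = self.table.setdefault(key, {})
        bucket[actual_target] = bucket.get(actual_target, 0) + 1
        self.history = (self.history + [actual_target])[-self.history_len:]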
Memory Dependence Prediction
1998. Cited by 39 (5 self).
As the existing techniques that empower modern high-performance processors are refined and as the underlying technology trade-offs change, new bottlenecks are exposed and new challenges are raised. This thesis introduces a new tool, memory dependence prediction, that can be useful in combating these bottlenecks and meeting the new challenges. Memory dependence prediction is a technique to guess whether a load or a store will experience a dependence. It exploits regularity in the memory dependence stream of ordinary programs, a phenomenon which is also identified in this thesis. To demonstrate the utility of memory dependence prediction, this thesis also presents the following three novel microarchitectural techniques: 1. Dynamic speculation/synchronization of memory dependences: this thesis demonstrates that, to exploit parallelism over larger regions of code, waiting to determine the dependences a load has is not the best-performing option. Higher performance is possible if memory dependence speculation is used, especially if memory dependence prediction is used to guide this speculation.
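A minimal sketch of a per-load memory dependence predictor, assuming a saturating counter per load PC that is trained on observed dependences and consulted to decide between speculating and waiting; the thresholds and unbounded table are illustrative assumptions, not the thesis's designs.

# Python sketch (illustrative only)
class MemoryDependencePredictor:
    def __init__(self, threshold=2, maximum=3):
        self.counters = {}          # load PC -> saturating counter
        self.threshold = threshold
        self.maximum = maximum

    def predict_dependence(self, load_pc):
        # At or above threshold: predict the load will depend on an in-flight
        # store, so wait/synchronize instead of speculating.
        return self.counters.get(load_pc, 0) >= self.threshold

    def train(self, load_pc, had_dependence):
        c = self.counters.get(load_pc, 0)
        self.counters[load_pc] = min(c + 1, self.maximum) if had_dependence else max(c - 1, 0)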