Results 1 - 10 of 44
Simultaneous Multithreading: Maximizing On-Chip Parallelism
, 1995
"... This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar’s multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide s ..."
Abstract
-
Cited by 823 (48 self)
- Add to MetaCart
(Show Context)
This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar’s multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.
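As a rough illustration of the core mechanism described above (not the paper's own simulator), the sketch below shows a single issue cycle drawing ready instructions from several threads to fill a superscalar's issue slots; the issue width, round-robin order, and instruction lists are hypothetical.

```python
# Hypothetical sketch: simultaneous multithreading fills a wide issue window
# from several threads in the same cycle, instead of from one thread only.

ISSUE_WIDTH = 8  # assumed issue slots per cycle

def issue_cycle(threads):
    """threads: list of lists, each holding one thread's ready instructions."""
    issued = []
    cursors = [0] * len(threads)
    t = 0
    # Round-robin over threads so no single thread monopolizes the slots.
    while len(issued) < ISSUE_WIDTH and any(
            cursors[i] < len(threads[i]) for i in range(len(threads))):
        if cursors[t] < len(threads[t]):
            issued.append((t, threads[t][cursors[t]]))
            cursors[t] += 1
        t = (t + 1) % len(threads)
    return issued

# Example: four threads with varying amounts of ready work in one cycle.
ready = [["add", "mul"], ["ld"], ["add", "ld", "st"], ["br"]]
print(issue_cycle(ready))  # up to 8 instructions drawn from all four threads
```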
Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading
- ACM Transactions on Computer Systems
, 1997
"... This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is i ..."
Abstract
-
Cited by 147 (17 self)
- Add to MetaCart
This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably.
Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors
, 1996
"... Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited ..."
Abstract
-
Cited by 59 (3 self)
- Add to MetaCart
Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, we propose a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. We describe two different implementations of informing memory operations---one based on a cache-outcome condition code and another based on low-overhead traps---and find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.
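To make the mechanism concrete, here is a minimal sketch of the semantics an informing load gives to software: the branch-and-link target (a miss handler) runs only when the reference misses. The toy cache model and handler below are illustrative assumptions, not either of the implementations described in the paper.

```python
# Hypothetical sketch of the *semantics* of an informing load: a normal load
# plus a branch-and-link that is taken only when the reference misses in cache.

cache = set()          # addresses currently resident (toy cache model)
miss_count = {}        # per-address miss statistics kept by software

def miss_handler(addr):
    # Low-overhead software reaction to a miss: here we just record it,
    # but a runtime could instead trigger prefetching or adapt data layout.
    miss_count[addr] = miss_count.get(addr, 0) + 1

def informing_load(memory, addr):
    if addr not in cache:          # cache outcome drives the branch-and-link
        miss_handler(addr)         # handler runs only on a miss
        cache.add(addr)            # line is filled as part of servicing the miss
    return memory[addr]            # the load itself always returns the data

memory = {0x100: 42, 0x140: 7}
informing_load(memory, 0x100)      # miss: handler runs
informing_load(memory, 0x100)      # hit: handler skipped
print(miss_count)                  # {256: 1}
```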
Thread-Sensitive Scheduling for SMT Processors
, 2000
"... A simultaneous-multithreaded (SMT) processor executes multiple instructions from multiple threads every cycle. As a result, threads on SMT processors -- unlike those on traditional shared-memory machines -- simultaneously share all low-level hardware resources in a single CPU. Because of this fine-g ..."
Abstract
-
Cited by 55 (2 self)
- Add to MetaCart
(Show Context)
A simultaneous-multithreaded (SMT) processor executes multiple instructions from multiple threads every cycle. As a result, threads on SMT processors -- unlike those on traditional shared-memory machines -- simultaneously share all low-level hardware resources in a single CPU. Because of this fine-grained resource sharing, SMT threads have the ability to interfere or conflict with each other, as well as to share these resources to mutual benefit. This paper examines thread-sensitive scheduling for SMT processors. When more threads exist than hardware execution contexts, the operating system is responsible for selecting which threads to execute at any instant, inherently deciding which threads will compete for resources. Thread-sensitive scheduling uses thread-behavior feedback to choose the best set of threads to execute together, in order to maximize processor throughput. We introduce several thread-sensitive scheduling schemes and compare them to traditional oblivious schemes, such as round-robin. Our measurements show how these scheduling algorithms impact performance and the utilization of low-level hardware resources. We also demonstrate how thread-sensitive scheduling algorithms can be tuned to trade off performance and fairness. For the workloads we measured, we show that an IPC-based thread-sensitive scheduling algorithm can achieve speedups over oblivious schemes of 7% to 15%, with minimal hardware costs.
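A minimal sketch of one IPC-feedback co-scheduling scheme of the kind the abstract describes (the paper's actual algorithms may differ): when runnable threads outnumber hardware contexts, pick the group with the highest predicted combined IPC. The additive IPC estimate, context count, and thread names are assumptions.

```python
# Hypothetical sketch of an IPC-feedback co-scheduler for an SMT processor.

from itertools import combinations

HW_CONTEXTS = 4  # assumed number of SMT hardware contexts

def pick_coschedule(observed_ipc):
    """observed_ipc: dict thread_id -> IPC measured the last time it ran."""
    runnable = list(observed_ipc)
    if len(runnable) <= HW_CONTEXTS:
        return set(runnable)
    # Choose the group of threads whose predicted combined throughput is highest.
    best = max(combinations(runnable, HW_CONTEXTS),
               key=lambda group: sum(observed_ipc[t] for t in group))
    return set(best)

ipc_feedback = {"A": 1.9, "B": 0.6, "C": 1.2, "D": 0.4, "E": 1.5, "F": 0.9}
print(pick_coschedule(ipc_feedback))  # e.g. {'A', 'C', 'E', 'F'}
```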
A memory-level parallelism aware fetch policy for SMT processors
- In HPCA
, 2007
"... A thread executing on a simultaneous multithreading (SMT) processor that experiences a long-latency load will eventually stall while holding execution resources. Existing long-latency load aware SMT fetch policies limit the amount of resources allocated by a stalled thread by identifying long-latenc ..."
Abstract
-
Cited by 26 (7 self)
- Add to MetaCart
A thread executing on a simultaneous multithreading (SMT) processor that experiences a long-latency load will eventually stall while holding execution resources. Existing long-latency load aware SMT fetch policies limit the amount of resources allocated by a stalled thread by identifying long-latency loads and preventing the given thread from fetching more instructions; in some implementations, instructions beyond the long-latency load may even be flushed, which frees allocated resources. This paper proposes an SMT fetch policy that takes into account the available memory-level parallelism (MLP) in a thread. The key idea proposed in this paper is that in case of an isolated long-latency load, i.e., there is no MLP, the thread should be prevented from allocating additional resources. However, in case multiple independent long-latency loads overlap, i.e., there is MLP, the thread should allocate as many resources as needed in order to fully expose the available MLP. The proposed MLP-aware fetch policy achieves better performance for MLP-intensive threads on an SMT processor and achieves a better overall balance between performance and fairness than previously proposed fetch policies.
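A hedged sketch of the fetch-gating decision the abstract describes: gate a thread stalled on an isolated long-latency load, but keep fetching while independent long-latency loads can still overlap. The predictor inputs, sizing rule, and function names are hypothetical, not the paper's exact policy.

```python
# Hypothetical sketch of an MLP-aware fetch-gating decision for one SMT thread.

def fetch_allowed(outstanding_llloads, predicted_mlp, allocated_entries, mlp_window):
    """
    outstanding_llloads: long-latency loads currently in flight for the thread
    predicted_mlp:       predicted number of independent long-latency loads
    allocated_entries:   ROB/issue-queue entries the thread already holds
    mlp_window:          entries assumed necessary to expose one more overlapping load
    """
    if outstanding_llloads == 0:
        return True        # no long-latency miss outstanding: fetch normally
    if predicted_mlp <= 1:
        return False       # isolated miss: stop allocating resources
    # MLP present: keep fetching until enough entries are held to expose all
    # predicted independent misses, then stop.
    return allocated_entries < predicted_mlp * mlp_window

print(fetch_allowed(1, 1, 40, 32))  # False: isolated miss, thread is gated
print(fetch_allowed(1, 3, 40, 32))  # True: more fetch needed to expose the MLP
```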
Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy
- In Proceedings of the 37th International Symposium on Microarchitecture. IEEE
, 2004
"... A simultaneous multithreading (SMT) processor can issue instructions from several threads every cycle, allowing it to effectively hide various instruction latencies; this effect increases with the number of simultaneous contexts supported. However, each added context on an SMT processor incurs a cos ..."
Abstract
-
Cited by 21 (3 self)
- Add to MetaCart
(Show Context)
A simultaneous multithreading (SMT) processor can issue instructions from several threads every cycle, allowing it to effectively hide various instruction latencies; this effect increases with the number of simultaneous contexts supported. However, each added context on an SMT processor incurs a cost in complexity, which may lead to an increase in pipeline length or a decrease in the maximum clock rate. This paper presents new designs for multithreaded processors which combine a conservative SMT implementation with a coarse-grained multithreading capability. By presenting more virtual contexts to the operating system and user than are supported in the core pipeline, the new designs can take advantage of the memory parallelism present in workloads with many threads, while avoiding the performance penalties inherent in a many-context SMT processor design. A design with 4 virtual contexts, but which is based on a 2-context SMT processor core, gains an additional 26% throughput when 4 threads are run together.
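A minimal sketch of the balanced-multithreading idea, assuming a 2-context SMT core and 4 OS-visible virtual contexts: on a long memory stall, the stalled virtual context is swapped out coarsely and a ready one takes its hardware context. The data structures and thread names are illustrative only.

```python
# Hypothetical sketch: more virtual contexts than SMT hardware contexts,
# with coarse-grained swapping on long memory stalls.

from collections import deque

HW_CONTEXTS = 2           # assumed SMT core width in threads
active = deque()          # threads currently occupying hardware contexts
waiting = deque()         # ready virtual contexts not in the core

def admit(thread):
    (active if len(active) < HW_CONTEXTS else waiting).append(thread)

def on_long_stall(thread):
    """Called when `thread` misses to memory and will stall for a long time."""
    if thread in active and waiting:
        active.remove(thread)
        waiting.append(thread)              # park the stalled virtual context
        active.append(waiting.popleft())    # coarse-grained swap-in of a ready one

for t in ["T0", "T1", "T2", "T3"]:          # 4 virtual contexts, 2 hardware contexts
    admit(t)
on_long_stall("T0")
print(list(active), list(waiting))          # ['T1', 'T2'] ['T3', 'T0']
```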
OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
- In ASPLOS
, 2013
"... Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, available hardware resources of a GPGPU ar ..."
Abstract
-
Cited by 18 (4 self)
- Add to MetaCart
(Show Context)
Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunity in improving performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies. In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and as a result are inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency hiding capability. The third scheme, bank-level parallelism aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide 33% average performance improvement compared to the commonly-employed round-robin warp scheduling policy.
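As one hedged illustration of the first scheme (CTA-aware two-level warp scheduling), the sketch below prioritizes ready warps whose CTA belongs to a small active group, falling back to any ready warp when the active CTAs are stalled; the group size, warp ids, and CTA names are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of CTA-aware two-level warp selection on one GPU core.

ACTIVE_CTAS = 2   # assumed number of CTAs prioritized per core

def select_warp(warps, cta_of, ready):
    """
    warps:  list of warp ids in scheduler priority order
    cta_of: dict warp id -> CTA id
    ready:  set of warp ids that can issue this cycle
    """
    active_group = sorted({cta_of[w] for w in warps})[:ACTIVE_CTAS]
    # First level: prefer ready warps whose CTA is in the active group,
    # so other CTAs' working sets stay out of the cache for now.
    for w in warps:
        if w in ready and cta_of[w] in active_group:
            return w
    # Second level: fall back to any ready warp (e.g., active CTAs all stalled).
    for w in warps:
        if w in ready:
            return w
    return None

cta_of = {0: "CTA0", 1: "CTA0", 2: "CTA1", 3: "CTA2", 4: "CTA3"}
print(select_warp([0, 1, 2, 3, 4], cta_of, ready={1, 3, 4}))  # 1 (from active CTA0)
```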
Informing Memory Operations: Memory Performance Feedback Mechanisms and Their Applications
- ACM Transactions on Computer Systems
, 1998
"... this article proposes a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. This article de ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
This article proposes a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. This article describes two different implementations of informing memory operations. One is based on a cache-outcome condition code, and the other is based on low-overhead traps. We find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and we look at cache coherence with fine-grained access control as a case study.
Execution and Cache Performance of the Scheduled Dataflow Architecture
- Journal of Universal Computer Science, Special Issue on Multithreaded Processors and Chip Multiprocessors
, 2000
"... Abstract: This paper presents an evaluation of our Scheduled Dataflow (SDF) Processor. Recent focus in the field of new processor architectures is mainly on VLIW (e.g. IA-64), superscalar and superspeculative architectures. This trend allows for better performance at the expense of an increased hard ..."
Abstract
-
Cited by 13 (11 self)
- Add to MetaCart
This paper presents an evaluation of our Scheduled Dataflow (SDF) Processor. Recent focus in the field of new processor architectures is mainly on VLIW (e.g. IA-64), superscalar and superspeculative architectures. This trend allows for better performance at the expense of increased hardware complexity and a brute-force solution to the memory-wall problem. Our research substantially deviates from this trend by exploring a simpler, yet powerful execution paradigm that is based on dataflow concepts. A program is partitioned into functional execution threads, which are perfectly suited for our non-blocking multithreaded architecture. In addition, all memory accesses are decoupled from the thread's execution. Data is pre-loaded into the thread's context (registers), and all results are post-stored after the completion of the thread's execution. The decoupling of memory accesses from thread execution requires a separate unit to perform the necessary pre-loads and post-stores, and to control the allocation of hardware thread contexts to enabled threads. Our analytical analysis of the architecture showed that we could achieve better performance than other classical dataflow architectures (i.e., ETS), hybrid models (e.g., EARTH) and decoupled multithreaded architectures (e.g., the Rhamma processor). This paper analyzes the architecture using an instruction-set-level simulator for a variety of benchmark programs. We compared the execution cycles required for programs on SDF with the execution cycles required by the programs on DLX (or MIPS). Then we investigated the expected cache-memory performance by collecting address traces from programs and using a trace-driven cache simulator (Dinero-IV). We present these results in this paper.
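A minimal sketch of the pre-load / execute / post-store split described above, assuming hypothetical helper names; the real SDF pipeline uses separate hardware units and register contexts rather than Python dictionaries.

```python
# Hypothetical sketch: decoupling a non-blocking thread's memory accesses into
# a pre-load phase, a register-only execution phase, and a post-store phase.

def preload(memory, frame_addrs):
    """Memory unit: copy a thread's inputs from memory into its register context."""
    return {reg: memory[addr] for reg, addr in frame_addrs.items()}

def execute(body, registers):
    """Execution unit: runs entirely on registers, so it never blocks on memory."""
    return body(registers)

def poststore(memory, results):
    """Memory unit: write the thread's results back after execution finishes."""
    memory.update(results)

memory = {"x": 3, "y": 4}
regs = preload(memory, {"r1": "x", "r2": "y"})            # pre-load phase
out = execute(lambda r: {"z": r["r1"] * r["r2"]}, regs)    # non-blocking execution
poststore(memory, out)                                     # post-store phase
print(memory["z"])  # 12
```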
Limits on the Performance Benefits of Multithreading and Prefetching
- In Proc. Int'l Conf. on the Measurement and Modeling of Computer Systems
, 1996
"... This paper presents new analytical models of the performance benefits of multithreading and prefetching, and experimental measurements of parallel applications on the MIT Alewife multiprocessor. For the first time, both techniques are evaluated on a real machine as opposed to simulations. The models ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
(Show Context)
This paper presents new analytical models of the performance benefits of multithreading and prefetching, and experimental measurements of parallel applications on the MIT Alewife multiprocessor. For the first time, both techniques are evaluated on a real machine as opposed to simulations. The models determine the region in the parameter space where the techniques are most effective, while the measurements determine the region where the applications lie. We find that these regions do not always overlap significantly. The multithreading model shows that only 2--4 contexts are necessary to maximize this technique's potential benefit in current multiprocessors. Multithreading improves execution time by less than 10% for most of the applications that we examined. The model also shows that multithreading can significantly improve the performance of the same applications in multiprocessors with longer latencies. Reducing context-switch overhead is not crucial. The software prefetching model s...
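The abstract's conclusion that only 2-4 contexts are needed can be reproduced with a back-of-the-envelope utilization model; the closed form below is a common textbook-style approximation, not necessarily the exact model in the paper, and the parameter values are made up.

```python
# Hypothetical sketch: a simple closed-form model of multithreaded processor
# utilization (not necessarily the paper's exact model). With run length R
# cycles between long-latency events, remote latency L, and context-switch
# cost C, utilization grows roughly linearly with the number of contexts N
# until the latency is hidden, then saturates at R / (R + C).

def utilization(n_contexts, run_length, latency, switch_cost):
    linear = n_contexts * run_length / (run_length + latency)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)

# Example with made-up numbers: utilization stops improving after ~2 contexts,
# echoing the observation that only a few contexts help on such machines.
for n in range(1, 7):
    print(n, round(utilization(n, run_length=50, latency=60, switch_cost=10), 2))
```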