Results 1 - 10
of
34
Handling Long-latency Loads in a Simultaneous Multithreading Processor
, 2001
"... Simultaneous multithreading architectures have been defined previously with fully shared execution resources. When one thread in such an architecture experiences a very longlatency operation, such as a load miss, the thread will eventually stall, potentially holding resources which other threads cou ..."
Abstract
-
Cited by 93 (11 self)
- Add to MetaCart
Simultaneous multithreading architectures have been defined previously with fully shared execution resources. When one thread in such an architecture experiences a very longlatency operation, such as a load miss, the thread will eventually stall, potentially holding resources which other threads could be using to make forward progress. This paper shows that in many cases it is better to free the resources associated with a stalled thread rather than keep that thread ready to immediately begin execution upon return of the loaded data. Several possible architectures are examined, and some simple solutions are shown to be very effective, achieving speedups close to 6.0 in some cases, and averaging 15% speedup with four threads and over 100% speedup with two threads running. Response times are cut in half for several workloads in open system experiments. 1
Decoupled Software Pipelining with the Synchronization Array
- In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
, 2004
"... Despite the success of instruction-level parallelism (ILP) optimizations in increasing the performance of microprocessors, certain codes remain elusive. In particular, codes containing recursive data structure (RDS) traversal loops have been largely immune to ILP optimizations, due to the fundamenta ..."
Abstract
-
Cited by 46 (17 self)
- Add to MetaCart
Despite the success of instruction-level parallelism (ILP) optimizations in increasing the performance of microprocessors, certain codes remain elusive. In particular, codes containing recursive data structure (RDS) traversal loops have been largely immune to ILP optimizations, due to the fundamental serialization and variable latency of the loop-carried dependence through a pointer-chasing load. To address these and other situations, we introduce decoupled software pipelining (DSWP), a technique that statically splits a single-threaded sequential loop into multiple non-speculative threads, each of which performs useful computation essential for overall program correctness. The resulting threads execute on thread-parallel architectures such as simultaneous multithreaded (SMT) cores or chip multiprocessors (CMP), expose additional instruction level parallelism, and tolerate latency better than the original single-threaded RDS loop. To reduce overhead, these threads communicate using a synchronization array, a dedicated hardware structure for pipelined inter-thread communication. DSWP used in conjunction with the synchronization array achieves an 11 % to 76 % speedup in the optimized functions on both statically and dynamically scheduled processors. 1.
An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture
- In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems
, 2000
"... This paper presents the first analysis of operating system execution on a simultaneous multithreaded (SMT) processor. While SMT has been studied extensively over the past 6 years, previous research has focused entirely on user-mode execution. However, many of the applications most amenable to multit ..."
Abstract
-
Cited by 42 (3 self)
- Add to MetaCart
This paper presents the first analysis of operating system execution on a simultaneous multithreaded (SMT) processor. While SMT has been studied extensively over the past 6 years, previous research has focused entirely on user-mode execution. However, many of the applications most amenable to multithreading technologies spend a significant fraction of their time in kernel code. A full understanding of the behavior of such workloads therefore requires execution and measurement of the operating system, as well as the application itself. To carry out this study, we (1) modified the Digital Unix 4.0d operating system to run on an SMT CPU, and (2) integrated our SMT Alpha instruction set simulator into the SimOS simulator to provide an execution environment. For an OS-intensive workload, we ran the multithreaded Apache Web server on an 8-context SMT. We compared Apache's user- and kernel-mode behavior to a standard multiprogrammed SPECInt workload, and compared the SMT processor to an out-...
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
- In HPCA-11
, 2005
"... Memory system optimizations have been well studied on single-threaded systems; however, the wide use of simultaneous multithreading (SMT) techniques raises questions over their effectiveness in the new context. In this study, we thoroughly evaluate contemporary multi-channel DDR SDRAM and Rambus DRA ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
Memory system optimizations have been well studied on single-threaded systems; however, the wide use of simultaneous multithreading (SMT) techniques raises questions over their effectiveness in the new context. In this study, we thoroughly evaluate contemporary multi-channel DDR SDRAM and Rambus DRAM systems in SMT systems, and search for new thread-aware DRAM optimization techniques. Our major findings are: (1) in general, increasing the number of threads tends to increase the memory concurrency and thus the pressure on DRAM systems, but some exceptions do exist; (2) the application performance is sensitive to memory channel organizations, e.g. independent channels may outperform ganged organizations by up to 90%; (3) the DRAM latency reduction through improving row buffer hit rates becomes less effective due to the increased bank contentions; and (4) thread-aware DRAM access scheduling schemes may improve performance by up to 30 % on workload mixes of memory-intensive applications. In short, the use of SMT techniques has somewhat changed the context of DRAM optimizations but does not make them obsolete. 1
Architectural support for enhanced smt job scheduling
- In Proc. of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT’04
, 2004
"... By converting thread-level parallelism to instruction level parallelism, Simultaneous Multithreaded (SMT) processors are emerging as effective ways to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached as most modern operating syste ..."
Abstract
-
Cited by 27 (2 self)
- Add to MetaCart
By converting thread-level parallelism to instruction level parallelism, Simultaneous Multithreaded (SMT) processors are emerging as effective ways to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached as most modern operating systems use existing single-thread or multiprocessor algorithms to schedule threads, neglecting contention for resources between threads. To date, even the best SMT scheduling algorithms simply try to group threads for co-residency based on each thread’s expected resource utilization but do not take into account variance in thread behavior. As such, we introduce architectural support that enables new thread scheduling algorithms to group threads for co-residency based on fine-grain memory system activity information. The proposed memory monitoring framework centers on the concept of a cache activity vector, which exposes runtime cache resource information to the operating system to improve job scheduling. Using this scheduling technique, we experimentally evaluate the overall performance improvement of workloads on an SMT machine compared against the most recent Linux job scheduler. This work is first motivated with experiments in a simulated environment, then validated on a Hyperthreading-enabled Intel Pentium-4 Xeon microprocessor running a modified version of the latest Linux Kernel.
Physical experimentation with prefetching helper threads on intel’s hyper-threaded processors
- InCGO ’04: Proceedings of the international
, 2004
"... Pre-execution techniques have received much attention as an effective way of prefetching cache blocks to tolerate the everincreasing memory latency. A number of pre-execution techniques based on hardware, compiler, or both have been proposed and studied extensively by researchers. They report promis ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
Pre-execution techniques have received much attention as an effective way of prefetching cache blocks to tolerate the everincreasing memory latency. A number of pre-execution techniques based on hardware, compiler, or both have been proposed and studied extensively by researchers. They report promising results on simulators that model a Simultaneous Multithreading (SMT) processor. In this paper, we apply the helper threading idea on a real multithreaded machine, i.e., Intel Pentium 4 processor with Hyper-Threading Technology, and show that indeed it can provide wall-clock speedup on real silicon. To achieve further performance improvements via helper threads, we investigate three helper threading scenarios that are driven by automated compiler infrastructure, and identify several key challenges and opportunities for novel hardware and software optimizations. Our study shows a program behavior changes dynamically during execution. In addition, the organizations of certain critical hardware structures in the hyper-threaded processors are either shared or partitioned in the multi-threading mode and thus, the tradeoffs regarding resource contention can be intricate. Therefore, it is essential to judiciously invoke helper threads by adapting to the dynamic program behavior so that we can alleviate potential performance degradation due to resource contention. Moreover, since adapting to the dynamic behavior requires frequent thread synchronization, having light-weight thread synchronization mechanisms is important. 1.
Helper Threads via Virtual Multithreading On An Experimental Itanium 2 Processor-based Platform
- In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems
, 2004
"... Helper threading is a technology to accelerate a program by exploiting a processor’s multithreading capability to run “assist ” threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads to prefetch hard-to-predict delinquent data access ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
Helper threading is a technology to accelerate a program by exploiting a processor’s multithreading capability to run “assist ” threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads to prefetch hard-to-predict delinquent data accesses. In order to apply this technique to processors that do not have built-in hardware support for multithreading, we introduce virtual multithreading (VMT), a novel form of switch-on-event user-level multithreading, capable of fly-weight multiplexing of event-driven thread executions on a single processor without additional operating system support. The compiler plays a key role in minimizing synchronization cost by judiciously partitioning register usage among the user-level threads. The VMT approach makes it possible to launch dynamic helper thread instances in response
Multigrain parallel Delaunay mesh generation: Challenges and opportunities for multithreaded architectures
- In Proceedings of the 19th annual international conference on Supercomputing
, 2005
"... Given the importance of parallel mesh generation in large-scale scientific applications and the proliferation of multilevel SMTbased architectures, it is imperative to obtain insight on the interaction between meshing algorithms and these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM ..."
Abstract
-
Cited by 12 (6 self)
- Add to MetaCart
Given the importance of parallel mesh generation in large-scale scientific applications and the proliferation of multilevel SMTbased architectures, it is imperative to obtain insight on the interaction between meshing algorithms and these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level and fine-grain at the element level. This multigrain data parallel approach targets clusters built from low-end, commercially available SMTs. Our experimental evaluation shows that current SMTs are not capable of executing fine-grain parallelism in PCDM. However, experiments on a simulated SMT indicate that with modest hardware support it is possible to exploit fine-grain parallelism opportunities. The exploitation of fine-grain parallelism results to higher performance than a pure MPI implementation and closes the gap between the performance of PCDM and the state-of-the-art sequential mesher on a single physical processor. Our findings extend to other adaptive and irregular multigrain, parallel algorithms. 1
Improving Server Software Support for Simultaneous Multithreaded Processors
- In Proceedings of the ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’03
, 2003
"... Simultaneous multithreading (SMT) represents a fundamental shift in processor capability. SMT's ability to execute multiple threads simultaneously within a single CPU offers tremendous potential performance benefits. However, the structure and behavior of software affects the extent to which this po ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Simultaneous multithreading (SMT) represents a fundamental shift in processor capability. SMT's ability to execute multiple threads simultaneously within a single CPU offers tremendous potential performance benefits. However, the structure and behavior of software affects the extent to which this potential can be achieved. Consequently, just like the earlier arrival of multiprocessors, the advent of SMT processors prompts a needed re-evaluation of software that will run on them. This evaluation is complicated, since SMT adopts architectural features and operating costs of both its predecessors (uniprocessors and multiprocessors). The crucial task for researchers is to determine which software structures and policies - multi- processor, uniprocessor, or neither - are most appropriate for SMT.
A Study of Source-Level Compiler Algorithms for Automatic Construction of Pre-Execution Code
- ACM TRANS. COMPUT. SYST
, 2004
"... This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, program slicing removes non-critical code for computing cache-missing memory references. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, speculative loop parallelization generates thread-level parallelism to tolerate the latency of blocking loads. In addition, we present four "reduced" compilers that employ less aggressive algorithms to simplify compiler implementation. Our reduced compilers rely on back-end code optimizations rather than program slicing to remove non-critical code, and use compile-time heuristics rather than profiling to approximate runtime information (e.g., cache-miss and loop-trip counts) We

