Results 1 - 8 of 8
Bosschere. Detecting the existence of coarse-grain parallelism in general-purpose programs
- In Proceedings of the First Workshop on Programmability Issues for Multi-Core Computers, MULTIPROG-1
Cited by 3 (2 self)
Abstract. With the rise of chip-multiprocessors, the problem of parallelizing general-purpose programs has once again been placed on the research agenda. In the 1980s and early 1990s, great successes were obtained in extracting parallelism from the inner loops of scientific computations. General-purpose programs, however, stayed out of reach due to the complexity of their control flow and data dependences. More recently, thread-level speculation (TLS) has been touted as the definitive solution for general-purpose programs. TLS again targets inner loops. The program complexity issue is handled by checking and resolving dependences at runtime using complex hardware support. However, results so far have been disappointing and limit studies predict very low potential speedups, in one study just 18%. In this paper we advocate a completely different approach. We show that significant amounts of coarse-grain parallelism exist in the outer program loops, even in general-purpose programs. This coarse-grain parallelism can be exploited efficiently on CMPs without additional hardware support. This paper presents a technique to extract coarse-grain parallelism from the outer program loops. Application of this technique to the MiBench and SPEC CPU2000 benchmarks shows that significant amounts of outer-loop parallelism exist. This leads to a speedup of 5.18 for bzip2 compression and 11.8 for an MPEG2 encoder on a Sun UltraSPARC T1 CMP. The parallelization effort was limited to 10 to 20 person-hours per benchmark while we had no prior knowledge of the programs.
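As a rough illustration of what outer-loop, coarse-grain parallelism can look like for a compressor, the sketch below splits the input into blocks and compresses each block as an independent outer-loop iteration. It is not the technique from the paper: the function name `compress_blocks_in_parallel`, the block size, and the worker count are all assumptions made for this example, and it relies on Python's `bz2` module rather than the benchmark's source code.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_blocks_in_parallel(data, block_size=64 * 1024, workers=4):
    """Compress fixed-size blocks independently and concatenate the streams.

    Hypothetical helper for illustration only; block_size and workers
    are arbitrary choices, not values taken from the paper.
    """
    # Outer loop: one iteration per block of the input.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No block depends on another, so the iterations can run concurrently.
        compressed = list(pool.map(bz2.compress, blocks))
    # Each block becomes its own bz2 stream; the streams are simply concatenated.
    return b"".join(compressed)
```

The output is a sequence of independent bz2 streams, which `bz2.decompress` accepts as multi-stream input, so a round trip recovers the original bytes. The compressed size may differ from a single-stream compression of the same data.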
Bosschere. Function level parallelism driven by data dependencies
- In Workshop on Design, Architecture and Simulation of Chip MultiProcessors
, 2006
Cited by 2 (0 self)
With the rise of chip multiprocessors (CMPs), the amount of parallel computing power will increase significantly in the near future. However, most programs are sequential in nature and have not been explicitly parallelized, so they cannot exploit these parallel resources. Automatic parallelization of sequential, non-regular codes is very hard, as illustrated by the lack of solutions after more than 30 years of research on the topic. The question remains whether there is parallelism in sequential programs that can be detected automatically and, if so, how much parallelism there is. In this paper, we propose a framework for extracting potential parallelism from programs. Applying this framework to sequential programs can teach us how much parallelism is present in a program, but also tells us what the most appropriate parallel construct for a program is, e.g. a pipeline, master/slave work distribution, etc. Our framework is profile-based, implying that it is not safe. It builds two new graph representations of the profile data: the interprocedural data flow graph and the data sharing graph. These graphs show the data flow between functions and the data structures facilitating this data flow, respectively. We apply our framework to the SPECcpu2000 bzip2 benchmark, achieving a speedup of 3.74 for the compression part and a global speedup of 2.45 on a quad-processor system.
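The two speedup figures can be related through Amdahl's law. The sketch below is a back-of-the-envelope reading, not a result from the paper: it assumes that the compression part is the only accelerated portion and that everything else runs at its sequential speed. Both function names are hypothetical.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: speedup of the whole run when `fraction` of the
    sequential time is accelerated by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def implied_fraction(local_speedup, global_speedup):
    """Invert Amdahl's law to get the accelerated fraction of sequential time."""
    return (1.0 - 1.0 / global_speedup) / (1.0 - 1.0 / local_speedup)


# Figures quoted in the abstract; the interpretation is an assumption.
fraction = implied_fraction(local_speedup=3.74, global_speedup=2.45)
```

Under these assumptions, `fraction` comes out at roughly 0.81, meaning about four fifths of the sequential run time would have to fall in the accelerated part for the two reported numbers to agree.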
Toward Thread-Level Speculation for Coarse-Grained Parallelism with Regular Access Patterns
Cited by 2 (0 self)
Abstract. Recent work on transactional memory (TM) bears promise to exploit multicore capabilities. TM extensions for thread-level speculative parallelism (TLS) have predominantly focused on integer benchmarks with short critical sections and exploit limited on-chip buffering space to store shadow values needed to potentially abort transactions. In contrast, scientific codes generally provide coarse-grained parallel regions with potentially shared memory accesses, which do not fit into size-limited shadow buffers. Hence, such codes represent a mismatch for TM-TLS. This work contributes mechanisms to speculatively parallelize scientific codes with dense, non-scalar data references, exploiting compilation techniques and runtime enhancements coupled with minor hardware enhancements to transparently support TLS. A method to efficiently detect access violations to shared memory in speculatively parallelized regions is developed, much like TM, yet with data footprints of arbitrarily large size. The mechanism for violation detection is based on runtime software and optional hardware support to efficiently capture regular access traces. Experimental evaluations assess the speculation overhead in the presence and absence of access violations, considering an environment with and without hardware support. The results show that this method is competitive with explicit parallelization or auto-parallelization, yet can be applied even when data dependency checks remain inconclusive at compilation time.
Performance Evaluation of Dynamic Speculative Multithreading . . .
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 2009
Cited by 2 (1 self)
Thread-level parallelism (TLP) has been extensively studied in order to overcome the limitations of exploiting instruction-level parallelism (ILP) on high-performance superscalar processors. One promising method of exploiting TLP is Dynamic Speculative Multithreading (D-SpMT), which extracts multiple threads from a sequential program without compiler support or instruction set extensions. This paper introduces Cascadia, a D-SpMT multicore architecture that provides multi-grain thread-level support and is used to evaluate the performance of several benchmarks. Cascadia applies a unique sustainable IPC (sIPC) metric on a comprehensive loop tree to select the best performing nested loop level to multithread. This paper also discusses the relationships that loops have with one another, in particular, how loop-nesting levels can be extended through procedures. In addition, a detailed study is provided on the effects that thread granularity and inter-thread dependencies have on the entire system.
Fast Track: Supporting Unsafe Optimizations with Software Speculation
The use of multi-core, multi-processor machines is opening new opportunities for software speculation, where program code is speculatively executed to improve performance at the additional cost of monitoring and error recovery. In this paper we describe a new system that uses software speculation to support unsafely optimized code. We open a fast, unsafe track of execution but run the correct code on other processors to ensure correctness. We have developed an analytical model to measure the effect of major parameters including the speed of the fast track, its success rate, and its overheads. We have implemented a prototype and verified the correctness and performance using a synthetic benchmark on a 4-CPU machine.
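The fast-track idea described above can be sketched in a few lines: an unsafe shortcut and the correct code run side by side, and the shortcut's answer is accepted only if the correct code agrees. This is a toy illustration under stated assumptions, not the system from the paper. The names `fast_max`, `safe_max`, and `run_with_fast_track` are invented for this example, and Python threads are used only to show the control flow, not to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def safe_max(values):
    """Normal track: correct for any non-empty list."""
    return max(values)


def fast_max(values):
    """Fast track: unsafe shortcut, correct only if values is sorted ascending."""
    return values[-1]


def run_with_fast_track(fast, safe, arg):
    """Run both tracks concurrently; return (result, fast_track_was_correct)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fast_future = pool.submit(fast, arg)
        safe_future = pool.submit(safe, arg)
        fast_result = fast_future.result()
        safe_result = safe_future.result()
    # The safe result is authoritative; the flag records whether speculation held.
    return safe_result, fast_result == safe_result
```

In this sketch the caller always waits for the safe track, so it only demonstrates the correctness check. A real speculative system would let later work proceed on the fast result and roll back when the check fails.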
Exploring Speculative Parallelism in SPEC2006
The computer industry has adopted multi-threaded and multicore architectures as the clock rate increase stalled in the early 2000s. It was hoped that the continuous improvement of single-program performance could be achieved through these architectures. However, traditional parallelizing compilers often fail to effectively parallelize general-purpose applications, which typically have complex control flow and excessive pointer usage. Recently, hardware techniques such as Transactional Memory (TM) and Thread-Level Speculation (TLS) have been proposed to simplify the task of parallelization by using speculative threads. The potential of speculative parallelism in general-purpose applications like SPEC CPU 2000 has been well studied and shown to be moderately successful. Preliminary work examining the potential parallelism in SPEC2006 deployed parallel threads with a restrictive TLS execution model and limited compiler support, and thus only showed limited performance potential. In this paper, we first analyze the cross-iteration dependence behavior of SPEC 2006 benchmarks and show that more parallelism potential is available in SPEC 2006 benchmarks, compared to SPEC2000. We further use a state-of-the-art profile-driven TLS compiler to identify loops that can be speculatively parallelized. Overall, we found that with optimal loop selection we can potentially achieve an average speedup of 60% on four cores over what could be achieved by a traditional parallelizing compiler such as Intel’s ICC compiler. We also found that an additional 11% improvement can be potentially obtained on selected benchmarks using 8 cores when we extend TLS on multiple loop levels as opposed to restricting to a single loop level.
OF THE UNIVERSITY OF MINNESOTA BY
, 2009
The computer industry has adopted multi-threaded (Simultaneous Multi-Threading (SMT)) and multi-core (Chip Multiprocessor) architectures as the clock rate increase stalled in the early 2000s. It was hoped that the continuous improvement of single-program performance could be achieved through these architectures. However, traditional parallelizing compilers often fail to effectively parallelize general-purpose applications, which typically have complex control flow and excessive pointer usage. Thread-Level Speculation (TLS) has been proposed to simplify the task of parallelization by using speculative threads. Though the performance of TLS has been well studied in the past, its power consumption, power efficiency and thermal behavior are not well understood. Also, previous work on TLS has concentrated on multi-core based architectures and relatively little has been done on supporting TLS on multi-threaded architectures. With increasing multi-threaded/multi-core design choices, it is important to understand the benefits of the different types of architectures. The goal of this dissertation is to develop architecture techniques to efficiently implement TLS in future multi-threaded/multi-core processors. The dissertation proposes a novel cache-based
Compiler Assisted Out-Of-Order Instruction Commit
, November 18, 2010
This paper proposes an out-of-order instruction commit mechanism using a novel compiler/architecture interface. The compiler provides information about instruction “blocks” and the processor uses the block information to decide which instructions can be committed out of order and when. Some blocks are guaranteed to be data-independent blocks, which allows instructions from different such blocks to be committed simultaneously and out of order. Other blocks have data or control dependencies and require in-order execution and in-order commit. Micro-architectural support required for the new commit mode is built on top of the standard, ROB-based commit and includes out-of-order instruction commit, early register release, support for committing loads and stores out of order, and exception handling. All of these are driven by the block information, which simplifies the hardware. Results for a 4-wide processor model based on the Alpha 21264 and a set of 6 SPEC2000 and 2006 benchmarks show that, on average, 52% of instructions are committed out of order, resulting in 10% to 26% speedups over in-order commit with minimal hardware overhead.

