Results 1 - 10 of 62
Automatic thread extraction with decoupled software pipelining
- In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, 2005
Abstract - Cited by 101 (18 self)
Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation, which suggest a promising future for this approach.
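The pipeline decomposition DSWP describes can be illustrated with a small sketch (hypothetical code, not the authors' compiler output): a loop body is split into two stages that run in separate threads, decoupled by a bounded queue, so the stages overlap across iterations.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the value stream

def dswp_pipeline(values):
    """Run a loop body as two decoupled pipeline stages.

    Stage 1 (producer thread) performs the first half of each
    iteration; stage 2 (this thread) performs the second half, fed
    through a bounded queue so the stages execute concurrently.
    """
    q = queue.Queue(maxsize=8)  # bounded inter-stage queue

    def stage1():
        for v in values:
            q.put(v * 2)        # first half of the loop body
        q.put(SENTINEL)

    t = threading.Thread(target=stage1)
    t.start()
    out = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        out.append(item + 1)    # second half of the loop body
    t.join()
    return out

print(dswp_pipeline([1, 2, 3]))  # → [3, 5, 7]
```

The bounded queue is what provides the decoupling: stage 1 can run ahead of stage 2 by up to the queue depth, tolerating inter-core latency without stalling either stage.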
POSH: A TLS compiler that exploits program structure
- In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006
Abstract - Cited by 65 (7 self)
As multi-core architectures with Thread-Level Speculation (TLS) are becoming better understood, it is important to focus on TLS compilation. TLS compilers are interesting in that, while they do not need to fully prove the independence of concurrent tasks, they make choices of where and when to generate speculative tasks that are crucial to overall TLS performance. This paper presents POSH, a new, fully automated TLS compiler built on top of gcc. POSH is based on two design decisions. First, to partition the code into tasks, it leverages the code structures created by the programmer, namely subroutines and loops. Second, it uses a simple profiling pass to discard ineffective tasks. With the code generated by POSH, a simulated TLS chip multiprocessor with 4 superscalar cores delivers an average speedup of 1.30 for the SPECint 2000 applications. Moreover, an estimated 26% of this speedup is a result of the implicit data prefetching provided by squashed tasks.
Speculative decoupled software pipelining
- In Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques, 2007
Abstract - Cited by 52 (12 self)
In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. To avoid burdening programmers with the responsibility of parallelizing their applications, some researchers have advocated automatic thread extraction. A recently proposed technique, Decoupled Software Pipelining (DSWP), has demonstrated promise by partitioning loops into long-running, fine-grained threads organized into a pipeline. Using a pipeline organization and execution decoupled by inter-core communication queues, DSWP offers increased execution efficiency that is largely independent of inter-core communication latency. This paper proposes adding speculation to DSWP and evaluates an automatic approach for its implementation. By speculating past infrequent dependences, the benefit of DSWP is increased by making it applicable to more loops, facilitating better balanced threads, and enabling parallelized loops to be run on more cores. Unlike prior speculative threading proposals, speculative DSWP focuses on breaking dependence recurrences. By speculatively breaking these recurrences, instructions that were formerly restricted to a single thread to ensure decoupling are now free to span multiple threads. Using an initial automatic compiler implementation and a validated processor model, this paper demonstrates significant gains using speculation for 4-core chip multiprocessor models running a variety of codes.
Uncovering hidden loop level parallelism in sequential applications
- In Proceedings of the 14th International Symposium on High-Performance Computer Architecture, 2008
Abstract - Cited by 43 (6 self)
As multicore systems become the dominant mainstream computing technology, one of the most difficult challenges the industry faces is the software. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores, but single-threaded applications realize little to no gains from additional cores. One solution to this problem is automatic parallelization, which frees the programmer from the difficult task of parallel programming and offers hope for handling the vast amount of legacy single-threaded software. There is a long history of automatic parallelization for scientific applications, but the techniques have generally failed in the context of general-purpose software. Thread-level speculation overcomes the problem of memory dependence analysis by speculating on unlikely dependences that serialize execution. However, this approach has led to only modest performance gains. In this paper, we take another look at exploiting loop-level parallelism in single-threaded applications. We show that substantial amounts of loop-level parallelism are available in general-purpose applications, but the parallelism lurks beneath the surface and is often obfuscated by a small number of data and control dependences. We adapt and extend several code transformations from the instruction-level and scientific parallelization communities to uncover the hidden parallelism. Our results show that 61% of the dynamic execution of studied benchmarks can be parallelized with our techniques, compared to 27% using traditional thread-level speculation techniques, resulting in a speedup of 1.84 on a four-core system compared to 1.41 without transformations.
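One classic transformation in the family the abstract alludes to can be sketched concretely (a generic, illustrative example, not the paper's implementation): a loop-carried dependence on an accumulator is broken by giving each worker a private partial result, then reducing the partials at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, n_chunks=4):
    """Expose loop-level parallelism in a reduction.

    Each chunk keeps a private partial sum, so there is no
    cross-iteration dependence inside a chunk; the partials are
    combined in a short sequential step at the end.
    """
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as ex:
        partials = list(ex.map(sum, chunks))
    return sum(partials)

print(parallel_sum(range(100)))  # → 4950
```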
Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory
- In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI), 2009
Abstract - Cited by 36 (3 self)
Multicore designs have emerged as the mainstream design paradigm for the microprocessor industry. Unfortunately, providing multiple cores does not directly translate into performance for most applications. The industry has already fallen short of the decades-old trend of doubling performance every 18 months. An attractive approach for exploiting multiple cores is to rely on tools, both compilers and runtime optimizers, to automatically extract threads from sequential applications. However, despite decades of research on automatic parallelization, most techniques are only effective in the scientific and data-parallel domains, where array-dominated codes can be precisely analyzed by the compiler. Thread-level speculation offers the opportunity to expand parallelization to general-purpose programs, but at the cost of expensive hardware support. In this paper, we focus on providing low-overhead software support for exploiting speculative parallelism. We propose STMlite, a light-weight software transactional memory model that is customized to facilitate profile-guided automatic loop parallelization. STMlite eliminates a considerable amount of checking and locking overhead in conventional software transactional memory models by decoupling the commit phase from main transaction execution. Further, strong atomicity requirements for generic transactional memories are unnecessary within a stylized automatic parallelization framework. STMlite enables sequential applications to extract meaningful performance gains on commodity multicore hardware.
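The overheads STMlite targets are easiest to see against a minimal lazy-versioning software transaction (an illustrative sketch with invented names, not STMlite's actual API): reads log per-location versions, writes are buffered, and commit validates the read log before publishing the write buffer.

```python
class Memory:
    """Shared state: values plus a per-location version number."""
    def __init__(self):
        self.data = {}
        self.version = {}

class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_log = {}   # addr -> version observed at read time
        self.write_buf = {}  # addr -> speculative value (lazy versioning)

    def read(self, addr):
        if addr in self.write_buf:          # read-your-own-write
            return self.write_buf[addr]
        self.read_log[addr] = self.mem.version.get(addr, 0)
        return self.mem.data.get(addr, 0)

    def write(self, addr, val):
        self.write_buf[addr] = val          # buffered until commit

    def commit(self):
        # Validate: every location read must still be at the version
        # we saw; otherwise the caller re-executes the loop body.
        for addr, ver in self.read_log.items():
            if self.mem.version.get(addr, 0) != ver:
                return False
        for addr, val in self.write_buf.items():
            self.mem.data[addr] = val
            self.mem.version[addr] = self.mem.version.get(addr, 0) + 1
        return True
```

A conflict shows up as a failed commit: if two transactions both increment the same location, the second to commit sees a bumped version number and must abort and retry. STMlite's contribution, per the abstract, is moving this validation/commit work off the worker threads onto a separate commit phase.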
Program demultiplexing: Data-flow based speculative parallelization of methods in sequential programs
, 2007
Abstract - Cited by 23 (1 self)
Processor performance has steadily increased over the past several decades. The continuation of this trend, along with a falling price-to-performance ratio, is critical to accelerating the transformations that will result from applications that demand powerful computing. Unfortunately, the processor industry has faced problems with the incumbent processing model, the out-of-order superscalar processor: increasing design complexity and power consumption combined with diminishing performance gains. Processor manufacturers, en masse, have sidestepped this predicament by moving to multicore systems, the current generation of which places multiple processing cores on a chip. In the future, applications that demand higher and/or continuously increasing performance must rely on parallel execution in some form to use the processing cores in a multicore system. Multi-threaded applications, such as enterprise application servers and scientific programs, can benefit from multicore systems, as they are either easily parallelizable due to the nature of the application or have been written by expert programmers who have carefully orchestrated threads. However, such applications occupy only a small fraction of the end-user market. Parallelizing the large fraction of single-threaded programs would require many experienced multi-threaded programmers, as well as significant time to both develop and debug such applications. Novel approaches are ...
Loop Selection for Thread-Level Speculation
- In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, 2005
Abstract - Cited by 19 (9 self)
Thread-level speculation (TLS) allows potentially dependent threads to speculatively execute in parallel, thus making it easier for the compiler to extract parallel threads. However, the high cost associated with unbalanced load, failed speculation, and inter-thread value communication makes it difficult to obtain the desired performance unless the speculative threads are carefully chosen. In this paper, we focus on extracting parallel threads from loops in general-purpose applications because loops, with their regular structures and significant coverage of execution time, are ideal candidates for extracting parallel threads. General-purpose applications, however, usually contain a large number of nested loops with unpredictable parallel performance and dynamic behavior, making it difficult to decide which set of loops should be parallelized to improve overall program performance. Our proposed loop selection algorithm addresses all these difficulties. We have found that (i) with the aid of profiling information, compiler analyses can achieve a reasonably accurate estimation of the performance of parallel execution, and that (ii) different invocations of a loop may behave differently, and exploiting this dynamic behavior can further improve performance. With a judicious choice of loops, we can improve the overall program performance of SPEC2000 integer benchmarks by as much as 20%.
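The profile-guided estimation step can be sketched with a toy cost model (the formula and parameter names are illustrative, not the paper's actual model): estimate each loop's TLS speedup from profiled iteration time, misspeculation rate, and communication cost, then select only loops that clear a threshold.

```python
def estimate_speedup(iter_time, n_iters, n_cores, misspec_rate, comm_cost):
    """Rough profile-based estimate of a loop's TLS speedup.

    Illustrative model: parallel time is the ideal division of work
    across cores, plus per-iteration communication cost, plus the
    re-executed work for iterations whose speculation fails.
    """
    seq = iter_time * n_iters
    par = (seq / n_cores
           + n_iters * comm_cost
           + misspec_rate * n_iters * iter_time)
    return seq / par

def select_loops(loops, threshold=1.1):
    """Keep only loops whose estimated speedup clears the threshold.

    `loops` maps a loop name to the argument tuple for
    estimate_speedup, as gathered from a profiling run.
    """
    return [name for name, args in loops.items()
            if estimate_speedup(*args) > threshold]
```

For example, a loop with a 5% misspeculation rate easily clears the threshold on four cores, while an otherwise identical loop with a 90% misspeculation rate does not; the paper's point (ii), that different invocations of a loop behave differently, would correspond to re-evaluating such a model per invocation rather than once per loop.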
Supporting speculative parallelization in the presence of dynamic data structures
- In PLDI ’10
Abstract - Cited by 18 (3 self)
The availability of multicore processors has led to significant interest in compiler techniques for speculative parallelization of sequential programs. Isolation of speculative state from non-speculative state forms the basis of such speculative techniques, as this separation enables recovery from misspeculations. In our prior work on CorD [35, 36], we showed that for array- and scalar-variable-based programs, copying of data between speculative and non-speculative memory can be highly optimized to support state separation that yields significant speedups on multicore machines available today. However, we observe that in the context of heap-intensive programs that operate on linked dynamic data structures, state-separation-based speculative parallelization poses many challenges. The copying of data structures from non-speculative to speculative state (the copy-in operation) can be very expensive due to the large sizes of dynamic data structures. The copying of updated data structures from speculative state to non-speculative state (the copy-out operation) is made complex by changes in the shape and size of the dynamic data structure made by the speculative computation. In addition, we must contend with the need to translate pointers internal to dynamic data structures between their non-speculative and speculative memory addresses. In this paper, we develop an augmented design for the representation of dynamic data structures such that all of the above operations can be performed efficiently. Our experiments demonstrate significant speedups on a real machine for a set of programs that make extensive use of heap-based dynamic data structures.
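The copy-in/copy-out machinery the abstract describes can be sketched for a singly linked list (a simplified illustration with invented helpers, not the paper's design): copy-in builds the speculative copy while recording a speculative-to-original node mapping, and copy-out walks the speculative shape, committing values back into the original nodes and materializing any nodes the speculation created.

```python
class Node:
    def __init__(self, val, nxt=None):
        self.val = val
        self.next = nxt

def copy_in(head):
    """Deep-copy a list into speculative state, recording the mapping
    from speculative nodes back to their non-speculative originals so
    internal pointers can be translated at copy-out.

    (Keying the mapping by id() stands in for the address translation
    a real implementation would do.)
    """
    mapping = {}
    dummy = Node(None)
    tail = dummy
    cur = head
    while cur is not None:
        tail.next = Node(cur.val)
        tail = tail.next
        mapping[id(tail)] = cur   # speculative node -> original node
        cur = cur.next
    return dummy.next, mapping

def copy_out(spec_head, mapping):
    """Commit the speculative list: walk its (possibly reshaped)
    structure, copy each value into the original twin node, and create
    fresh non-speculative nodes for speculatively inserted ones."""
    head = None
    prev = None
    cur = spec_head
    while cur is not None:
        orig = mapping.get(id(cur))
        if orig is None:
            orig = Node(cur.val)  # node created by the speculation
        orig.val = cur.val
        orig.next = None
        if prev is None:
            head = orig
        else:
            prev.next = orig
        prev = orig
        cur = cur.next
    return head
```

The expensive parts the paper targets are visible even here: copy-in is proportional to the size of the whole structure, and copy-out must re-derive the non-speculative shape from the speculative one via the pointer mapping.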
Chip multi-processor scalability for single-threaded applications
- In Proceedings of the Workshop on Design, Architecture, and Simulation of Chip MultiProcessors, 2005
Abstract - Cited by 15 (2 self)
The exponential increase in uniprocessor performance has begun to slow. Designers have been unable to scale performance while managing thermal, power, and electrical effects. Furthermore, design complexity limits the size of monolithic processors that can be designed while keeping costs reasonable. Industry has responded by moving toward chip multi-processor (CMP) architectures. These architectures are composed of replicated processors utilizing the die area afforded by newer design processes. While this approach mitigates the issues with design complexity, power, and electrical effects, it does nothing to directly improve the performance of contemporary or future single-threaded applications. This paper examines the scalability potential for exploiting the parallelism in single-threaded applications on these CMP platforms. The paper explores the total available parallelism in unmodified sequential applications and then examines the viability of exploiting this parallelism on CMP machines. Using the results from this analysis, the paper forecasts that CMPs, using the "intrinsic" parallelism in a program, can sustain the performance improvement users have come to expect from new processors for only 6-8 years, provided many successful parallelization efforts emerge. Given this outlook, the paper advocates exploring methodologies which achieve parallelism beyond this "intrinsic" limit of programs.
Speculative Thread Decomposition Through Empirical Optimization
, 2007
Abstract - Cited by 14 (0 self)
Chip Multiprocessors (CMPs) ... a common way of reducing chip complexity and power consumption while maintaining high performance. Speculative CMPs use hardware to enforce dependence, allowing a parallelizing compiler to generate multithreaded code without needing to prove independence. In these systems, a sequential program is decomposed into threads to be executed in parallel; dependent threads cause performance degradation, but do not affect correctness. Thread decomposition attempts to reduce the run-time overheads of data dependence, thread misprediction, and load imbalance. Because these overheads depend on the run times of the threads that are being created by the decomposition, reducing the overheads while creating the threads is a circular problem. Static compile-time decomposition handles this problem by estimating the run times of the candidate threads, but is limited by the estimates' inaccuracy. Dynamic execution-time decomposition in hardware has better runtime ...