Results 1 - 10 of 33
The Jrpm System for Dynamically Parallelizing Java Programs
In Proceedings of the 30th International Symposium on Computer Architecture, 2003
"... We describe the Java runtime parallelizing machine (Jrpm), a complete system for parallelizing sequential programs automatically. Jrpm is based on a chip multiprocessor (CMP) with thread-level speculation (TLS) support. CMPs have low sharing and communication costs relative to traditional multt)roce ..."
Cited by 65 (4 self)
Abstract: We describe the Java runtime parallelizing machine (Jrpm), a complete system for parallelizing sequential programs automatically. Jrpm is based on a chip multiprocessor (CMP) with thread-level speculation (TLS) support. CMPs have low sharing and communication costs relative to traditional multiprocessors, and TLS simplifies program parallelization by allowing us to parallelize optimistically without violating correct sequential program behavior. Using a Java virtual machine with dynamic compilation support coupled with a hardware profiler, speculative buffer requirements and inter-thread dependencies of prospective speculative thread loops (STLs) are analyzed in real time to identify the best loops to parallelize. Once sufficient data has been collected to make a reasonable decision, selected loops are dynamically recompiled to run in parallel. Experimental results demonstrate that Jrpm can exploit thread-level parallelism with minimal effort from the programmer. On four processors, we achieved speedups of 3 to 4 for floating point applications, 2 to 3 on multimedia applications, and between 1.5 and 2.5 on integer applications. Performance was achieved by automatic selection of thread decompositions by the hardware profiler, intra-procedural optimizations on code compiled dynamically into speculative threads, and some minor programmer transformations for exposing parallelism that cannot be performed automatically.
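
The loop-selection step lends itself to a small illustration. The plain-Java sketch below shows the kind of profile-driven filtering a dynamic parallelizer might apply when choosing which speculative thread loops to recompile; the record fields, thresholds, and numbers are hypothetical placeholders, not Jrpm's actual data structures.

    import java.util.*;

    // Hypothetical profile record for a candidate speculative thread loop (STL).
    // Field names are illustrative, not Jrpm's actual structures.
    record LoopProfile(String loop, double coverage, double estSpeedup, int bufferBytes) {}

    public class StlSelector {
        // Pick loops worth recompiling: speculative state must fit the hardware
        // buffers, and the profiler's speedup estimate must beat a threshold.
        static List<LoopProfile> select(List<LoopProfile> candidates,
                                        int bufferCapacity, double minSpeedup) {
            return candidates.stream()
                    .filter(p -> p.bufferBytes() <= bufferCapacity)
                    .filter(p -> p.estSpeedup() >= minSpeedup)
                    .sorted(Comparator.comparingDouble(
                            (LoopProfile p) -> -(p.coverage() * p.estSpeedup())))
                    .toList();                        // best expected payoff first
        }

        public static void main(String[] args) {
            List<LoopProfile> profile = List.of(
                    new LoopProfile("fft:outer", 0.62, 3.4, 8_192),
                    new LoopProfile("parse:main", 0.30, 1.1, 2_048),
                    new LoopProfile("blur:rows", 0.25, 2.8, 65_536));
            select(profile, 16_384, 1.5).forEach(System.out::println);
        }
    }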
Optimizing Compiler for the CELL Processor
In PACT, 2005
"... Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured progra ..."
Cited by 53 (1 self)
Abstract: Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double-precision floating-point numbers up to 16 bytes per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high-quality code over the wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar codes on SIMD units, automatic generation of SIMD codes, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
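
The scalar-on-SIMD and auto-SIMDization techniques mentioned above boil down to a strip-mining transformation of the loop. The plain-Java sketch below only illustrates that loop shape; real CELL code would be C with SPE intrinsics, and the VLEN constant here merely stands in for one 16-byte SIMD register of 32-bit floats.

    import java.util.Arrays;

    public class SimdShape {
        static final int VLEN = 4; // one 16-byte vector of 32-bit floats

        static void add(float[] a, float[] b, float[] c) {
            int n = a.length, i = 0;
            int limit = n - n % VLEN;
            for (; i < limit; i += VLEN) {       // vectorized body: VLEN lanes,
                c[i]     = a[i]     + b[i];      // standing in for one SIMD
                c[i + 1] = a[i + 1] + b[i + 1];  // instruction
                c[i + 2] = a[i + 2] + b[i + 2];
                c[i + 3] = a[i + 3] + b[i + 3];
            }
            for (; i < n; i++)                   // scalar epilogue for the tail
                c[i] = a[i] + b[i];
        }

        public static void main(String[] args) {
            float[] a = {1, 2, 3, 4, 5, 6}, b = {6, 5, 4, 3, 2, 1}, c = new float[6];
            add(a, b, c);
            System.out.println(Arrays.toString(c)); // all 7.0
        }
    }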
Effective Automatic Parallelization of Stencil Computations
In ACM SIGPLAN PLDI, 2007
"... Abstract Performance optimization of stencil computations has beenwidely studied in the literature, since they occur in many computationally intensive scientific and engineering appli-cations. Compiler frameworks have also been developed that can transform sequential stencil codes for optimization o ..."
Cited by 45 (5 self)
Abstract: Performance optimization of stencil computations has been widely studied in the literature, since they occur in many computationally intensive scientific and engineering applications. Compiler frameworks have also been developed that can transform sequential stencil codes for optimization of data locality and parallelism. However, loop skewing is typically required in order to tile stencil codes along the time dimension, resulting in load imbalance in pipelined parallel execution of the tiles. In this paper, we develop an approach for automatic parallelization of stencil codes that explicitly addresses the issue of load-balanced execution of tiles. Experimental results are provided that demonstrate the effectiveness of the approach.
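
To make the skewing issue concrete, here is a minimal sequential sketch of a time-skewed, tiled 1-D Jacobi stencil in Java. Skewing the space loop by the time loop (j = i + t) turns every dependence forward, so rectangular (t, j) tiles are legal; a parallel runtime would run tiles on the same anti-diagonal concurrently, which is exactly the pipelined schedule whose load imbalance this paper targets. The tile sizes BT and BJ are arbitrary illustrative parameters, not the paper's.

    import java.util.Arrays;

    public class TimeSkewedStencil {
        // buf[0] holds the time-0 values; boundary cells buf[*][0] and
        // buf[*][n-1] are fixed and pre-set in both buffers.
        static void jacobi(double[][] buf, int T, int BT, int BJ) {
            int n = buf[0].length;
            for (int tt = 0; tt < T; tt += BT) {              // tile over time
                for (int jj = 1; jj < n - 1 + T; jj += BJ) {  // tile over skewed space
                    for (int t = tt; t < Math.min(tt + BT, T); t++) {
                        double[] cur = buf[t & 1], nxt = buf[(t + 1) & 1];
                        int lo = Math.max(jj, 1 + t);          // clamp to valid cells
                        int hi = Math.min(jj + BJ, n - 1 + t);
                        for (int j = lo; j < hi; j++) {
                            int i = j - t;                     // unskew: i = j - t
                            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0;
                        }
                    }
                }
            }
        }

        public static void main(String[] args) {
            int n = 10, T = 6;
            double[][] buf = new double[2][n];
            buf[0][0] = buf[1][0] = 1.0;          // fixed boundary values
            buf[0][n - 1] = buf[1][n - 1] = 1.0;
            jacobi(buf, T, 3, 4);
            System.out.println(Arrays.toString(buf[T & 1])); // time-T values
        }
    }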
Increasing temporal locality with skewing and recursive blocking
In Proc. SC2001, 2001
"... We present a strategy, called recursive prismatic time skewing, that increase temporal reuse at all memory hierarchy levels, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions iteration space of multiple loops into skewed prisms with both ..."
Cited by 26 (2 self)
Abstract: We present a strategy, called recursive prismatic time skewing, that increases temporal reuse at all memory hierarchy levels, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions the iteration space of multiple loops into skewed prisms with both spatial and temporal (or convergence) dimensions. Novel aspects of this work include: multi-dimensional loop skewing; handling carried data dependences in the skewed loops without additional storage; bi-directional skewing to accommodate periodic boundary conditions; and an analysis and transformation strategy that works inter-procedurally. We combine prismatic skewing with a recursive blocking strategy to boost reuse at all levels in a memory hierarchy. A preliminary evaluation of these techniques shows significant performance improvements compared both to original codes and to methods described previously in the literature. With an inter-procedural application of our techniques, we were able to reduce total primary cache misses of a large application code by 27% and secondary cache misses by 119%.
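
The recursive-blocking half of the strategy can be sketched generically: split the iteration space along its longer axis until a block is small enough to stay cache-resident, then run it directly, so the recursion blocks for every level of the hierarchy at once. This is a generic cache-oblivious-style sketch in Java, not the paper's prismatic algorithm; the LEAF threshold is an assumed tuning parameter.

    public class RecursiveBlock {
        interface Kernel { void run(int i0, int i1, int j0, int j1); }
        static final int LEAF = 64;  // assumed leaf size; tune to the cache

        static void recurse(int i0, int i1, int j0, int j1, Kernel k) {
            if (i1 - i0 <= LEAF && j1 - j0 <= LEAF) { k.run(i0, i1, j0, j1); return; }
            if (i1 - i0 >= j1 - j0) {                // halve the longer axis
                int m = (i0 + i1) >>> 1;
                recurse(i0, m, j0, j1, k); recurse(m, i1, j0, j1, k);
            } else {
                int m = (j0 + j1) >>> 1;
                recurse(i0, i1, j0, m, k); recurse(i0, i1, m, j1, k);
            }
        }

        public static void main(String[] args) {
            long[] count = {0};
            recurse(0, 500, 0, 300,
                    (i0, i1, j0, j1) -> count[0] += (long) (i1 - i0) * (j1 - j0));
            System.out.println(count[0]); // 150000: every point visited once
        }
    }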
TEST: A Tracer for Extracting Speculative Threads
In the International Symposium on Code Generation and Optimization, 2003
"... Thread-level speculation (TLS) allows sequential programs to be arbitrarily decomposed into threads that can be safely executed in parallel. A key challenge for TLS processors is choosing thread decompositions that speedup the program. Current techniques for identifying decompositions have practical ..."
Cited by 23 (2 self)
Abstract: Thread-level speculation (TLS) allows sequential programs to be arbitrarily decomposed into threads that can be safely executed in parallel. A key challenge for TLS processors is choosing thread decompositions that speed up the program. Current techniques for identifying decompositions have practical limitations in real systems. Traditional parallelizing compilers do not work effectively on most integer programs, and software profiling slows down program execution too much for real-time analysis. Tracer for Extracting Speculative Threads (TEST) is hardware support that analyzes sequential program execution to estimate the performance of possible thread decompositions. This hardware is used in a dynamic parallelization system that automatically transforms unmodified, sequential Java programs to run on TLS processors. In this system, the best thread decompositions found by TEST are dynamically recompiled to run speculatively. This paper describes the analysis performed by TEST and presents simulation results demonstrating its effectiveness on real programs. Estimates are also provided that show the tracer requires minimal hardware additions to our speculative chip multiprocessor (< 1% of the total transistor count) and causes only minor slowdowns to programs during analysis (3-25%).
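
TEST performs its analysis in hardware, but the core idea can be mimicked in software: replay a sequential trace and histogram the inter-iteration distances of read-after-write dependences, since short distances predict frequent speculation violations. The Java sketch below is only such a software analogy; the Access record and the one-iteration-per-thread assumption are illustrative simplifications, not TEST's mechanism.

    import java.util.*;

    public class DependenceDistance {
        record Access(int iter, long addr, boolean write) {}

        // Returns a histogram: RAW dependence distance -> occurrence count.
        static Map<Integer, Integer> histogram(List<Access> trace) {
            Map<Long, Integer> lastWriter = new HashMap<>();
            Map<Integer, Integer> hist = new TreeMap<>();
            for (Access a : trace) {
                if (a.write()) {
                    lastWriter.put(a.addr(), a.iter());
                } else {
                    Integer w = lastWriter.get(a.addr());
                    if (w != null && w < a.iter())        // cross-iteration RAW
                        hist.merge(a.iter() - w, 1, Integer::sum);
                }
            }
            return hist;
        }

        public static void main(String[] args) {
            List<Access> trace = List.of(
                    new Access(0, 0x10, true),
                    new Access(1, 0x10, false),   // distance-1 RAW: bad for TLS
                    new Access(1, 0x20, true),
                    new Access(5, 0x20, false));  // distance-4 RAW: tolerable
            System.out.println(histogram(trace)); // {1=1, 4=1}
        }
    }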
Generalized multipartitioning for multi-dimensional arrays
In Proceedings of the International Parallel and Distributed Processing Symposium, Fort Lauderdale, FL, 2002
"... Multipartitioning is a strategy for parallelizing computations that require solving 1D recurrences along each dimension of a multi-dimensional array. Previous techniques for multipartitioning yield efficient parallelizations over 3D domains only when the number of processors is a perfect square. Thi ..."
Cited by 18 (2 self)
Abstract: Multipartitioning is a strategy for parallelizing computations that require solving 1D recurrences along each dimension of a multi-dimensional array. Previous techniques for multipartitioning yield efficient parallelizations over 3D domains only when the number of processors is a perfect square. This paper considers the general problem of computing multipartitionings for d-dimensional data volumes on an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning onto all of the processors for this general case. Finally, we describe how we extended the Rice dHPF compiler for High Performance Fortran to generate code that exploits generalized multipartitioning and show that the compiler’s generated code for the NAS SP computational fluid dynamics benchmark achieves scalable high performance.
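
The idea is easiest to see in the perfect-square base case that earlier work handled: with p = q*q processors and a q x q x q tile grid, assign tiles along diagonals so that every processor owns exactly one tile in every slab of every dimension, keeping all processors busy during a line sweep in any direction. The Java sketch below shows one valid diagonal mapping; the paper's contribution is the harder general case where p need not be a perfect square.

    public class Multipartition3D {
        // Owner of tile (i, j, k) in a q x q x q tiling, for q*q processors.
        // One valid skewed-cyclic assignment: fix the diagonals (j-i) and (k-i).
        static int owner(int i, int j, int k, int q) {
            return q * Math.floorMod(j - i, q) + Math.floorMod(k - i, q);
        }

        public static void main(String[] args) {
            int q = 3;                     // 9 processors, 27 tiles
            for (int j = 0; j < q; j++) {  // slab i = 0: every owner appears once
                for (int k = 0; k < q; k++)
                    System.out.print(owner(0, j, k, q) + " ");
                System.out.println();
            }
        }
    }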
Compiler Synthesis of Task Graphs for Parallel Program Performance Prediction
In LCPC'00, Springer-Verlag LNCS, 2000
"... Syntax Tree (AST). STG nodes are created as appropriate statements are encountered in the AST. Thus, program statements, such as DO, IF, CALL, PROGRAM/FUNCTION/SUBROUTINE, STOP/RETURN, trigger the creation of a single node in the graph; encountering one of the rst two also leads to the creation of a ..."
Cited by 15 (10 self)
Abstract: ...Syntax Tree (AST). STG nodes are created as appropriate statements are encountered in the AST. Thus, program statements such as DO, IF, CALL, PROGRAM/FUNCTION/SUBROUTINE, and STOP/RETURN trigger the creation of a single node in the graph; encountering one of the first two also leads to the creation of an enddo-node or an endif-node, a then-node and an else-node, respectively. Any contiguous sequence of other computation statements that are executed by the same set of processors is grouped into a single computational task (contiguous implies that they are not interrupted by any of the above statements or by communication). Identifying statements that are computed by the same set of processors is a critical aspect of the above step. This information is derived from the computation partitioning phase of the compiler and is translated into a symbolic integer set [1] that is included with each task. By having a general representation of the set of processors associated with each task, our re...
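
The node-creation rule described above is mechanical enough to sketch. The Java fragment below walks a flat statement list, giving each control statement its own node and collapsing contiguous computation statements with the same processor set into a single task node; the Stmt and Node types and the string processor-set representation are simplified placeholders, not the compiler's symbolic integer sets.

    import java.util.*;

    public class StgBuilder {
        enum Kind { DO, ENDDO, IF, THEN, ELSE, ENDIF, CALL, COMPUTE }
        record Stmt(Kind kind, String procSet) {}
        record Node(Kind kind, String procSet, int width) {}  // width = stmts merged

        static List<Node> build(List<Stmt> stmts) {
            List<Node> graph = new ArrayList<>();
            int run = 0; String runProcs = null;
            for (Stmt s : stmts) {
                boolean control = s.kind() != Kind.COMPUTE;
                if (control || !s.procSet().equals(runProcs)) {
                    if (run > 0) graph.add(new Node(Kind.COMPUTE, runProcs, run));
                    run = 0; runProcs = null;
                }
                if (control) graph.add(new Node(s.kind(), s.procSet(), 1));
                else { run++; runProcs = s.procSet(); }
            }
            if (run > 0) graph.add(new Node(Kind.COMPUTE, runProcs, run));
            return graph;
        }

        public static void main(String[] args) {
            build(List.of(
                    new Stmt(Kind.DO, "all"),
                    new Stmt(Kind.COMPUTE, "all"),
                    new Stmt(Kind.COMPUTE, "all"),
                    new Stmt(Kind.COMPUTE, "evens"),   // new processor set: new task
                    new Stmt(Kind.ENDDO, "all")))
                .forEach(System.out::println); // DO, task(all,2), task(evens,1), ENDDO
        }
    }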
Toward Compiler Support for Scalable Parallelism using Multipartitioning
In Proceedings of the Fifth Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers, Lecture Notes in Computer Science, 2000
"... Strategies for partitioning an application's data play a fundamental role in determining the range of possible parallelizations that can be performed and ultimately their potential efficiency. This paper describes extensions to the Rice dHPF compiler for High Performance Fortran which enable it ..."
Cited by 11 (9 self)
Abstract: Strategies for partitioning an application's data play a fundamental role in determining the range of possible parallelizations that can be performed and ultimately their potential efficiency. This paper describes extensions to the Rice dHPF compiler for High Performance Fortran which enable it to support data distributions based on multipartitioning. Using these distributions can help close the substantial gap between the efficiency and scalability of compiler-parallelized codes for multi-directional line-sweep computations and their hand-coded counterparts. We describe the design and implementation of compiler support for multipartitioning and show preliminary results for a benchmark compiled using these techniques.
Data-Parallel Compiler Support for Multipartitioning
2001
"... . Multipartitioning is a skewed-cyclic block distribution that yields better parallel e#ciency and scalability for line-sweep computations than traditional block partitionings. This paper describes extensions to the Rice dHPF compiler for High Performance Fortran that enable it to support multip ..."
Cited by 10 (5 self)
Abstract: Multipartitioning is a skewed-cyclic block distribution that yields better parallel efficiency and scalability for line-sweep computations than traditional block partitionings. This paper describes extensions to the Rice dHPF compiler for High Performance Fortran that enable it to support multipartitioned data distributions and optimizations that enable dHPF to generate efficient multipartitioned code. We describe experiments applying these techniques to parallelize serial versions of the NAS SP and BT application benchmarks and show that the performance of the code generated by dHPF is approaching that of hand-coded parallelizations based on multipartitioning.
Semantic-Driven Parallelization of Loops Operating on User-Defined Containers
In Workshop on Languages and Compilers for Parallel Computing, 2003
"... We describe ROSE, a C++ infrastructure for source-to-source translation, that provides an interface for programmers to easily write their own translators for optimizing user-defined high-level abstractions. ..."
Cited by 7 (3 self)
Abstract: We describe ROSE, a C++ infrastructure for source-to-source translation that provides an interface for programmers to easily write their own translators for optimizing user-defined high-level abstractions.
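
ROSE itself is a C++ infrastructure, but the effect of a semantic-driven loop translator can be illustrated in a few lines of Java: because the container's author declares that element accesses are independent and side-effect free, a serial traversal can be mechanically rewritten into a parallel one. The Grid container and both methods below are hypothetical, chosen only to show the before/after shape of such a rewrite.

    import java.util.Arrays;
    import java.util.stream.IntStream;

    class Grid {                      // a user-defined container (hypothetical)
        final double[] cells;
        Grid(int n) { cells = new double[n]; }
    }

    public class SemanticRewrite {
        static void serial(Grid g) {            // what the user wrote
            for (int i = 0; i < g.cells.length; i++)
                g.cells[i] = Math.sqrt(g.cells[i]);
        }

        static void parallel(Grid g) {          // what the translator emits
            IntStream.range(0, g.cells.length).parallel()
                     .forEach(i -> g.cells[i] = Math.sqrt(g.cells[i]));
        }

        public static void main(String[] args) {
            Grid g = new Grid(4);
            Arrays.fill(g.cells, 16.0);
            parallel(g);
            System.out.println(Arrays.toString(g.cells)); // all 4.0
        }
    }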