The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization
In Proceedings of the 1994 International Conference on Supercomputing, 1994
Cited by 43 (16 self)
Current parallelizing compilers cannot extract a significant fraction of the available parallelism in a loop if it has a complex and/or statically insufficiently defined access pattern. This is an important issue because a large class of complex simulations used in industry today have irregular domains and/or dynamically changing interactions. To handle these types of problems, methods capable of automatically extracting parallelism at run-time are needed. For this reason, we have developed the Privatizing DOALL test, a technique for identifying fully parallel loops at run-time and dynamically privatizing scalars and arrays. The test is fully parallel, requires no synchronization, is easily automatable, and can be applied to any loop, regardless of its access pattern. We show that the expected speedup for fully parallel loops is significant, and the cost of a failed test (a not fully parallel loop) is minimal. We present experimental results on loops from the PERFECT Benchmarks whi...
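The abstract above does not spell out how the test works, so the following is only a minimal sketch of the general idea of a run-time DOALL check with privatization, under stated assumptions. The function name `classify_loop`, the trace format, and the three result strings are hypothetical and not taken from the paper. The sketch is sequential for clarity, whereas the paper describes its test as fully parallel. It assumes each iteration's accesses have been recorded in order, and treats an element as privatizable when every iteration that reads it has written it earlier in the same iteration.

```python
def classify_loop(trace):
    """Classify a loop from a recorded access trace (illustrative only).

    trace[i] is the ordered list of (kind, element) accesses made by
    iteration i, where kind is "r" (read) or "w" (write).
    """
    writers = {}   # element -> iterations that write it
    touched = {}   # element -> iterations that read or write it
    exposed = {}   # element -> iterations that read it before writing it
    for i, accesses in enumerate(trace):
        written_here = set()
        for kind, elem in accesses:
            touched.setdefault(elem, set()).add(i)
            if kind == "w":
                written_here.add(elem)
                writers.setdefault(elem, set()).add(i)
            elif elem not in written_here:
                exposed.setdefault(elem, set()).add(i)

    # Fully parallel as-is: every written element belongs to one iteration.
    if all(len(touched[e]) == 1 for e in writers):
        return "doall"

    # With private copies, only a read of a value produced by a different
    # iteration (or overwritten by one) remains a real conflict.
    for elem, readers in exposed.items():
        if any(writers.get(elem, set()) - {i} for i in readers):
            return "not parallel"
    return "doall after privatization"
```

For example, a loop in which every iteration writes and then reads the same temporary is reported as parallel only after privatization, while a loop in which one iteration reads what another wrote is reported as not parallel.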
Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
1997
Cited by 33 (9 self)
Run-time parallelization is often the only way to generate parallel code for multiprocessors when data dependence information is incomplete at compile time. This situation is common in many important applications, where arrays are accessed with subscripted subscripts. Unfortunately, known techniques for run-time parallelization are often computationally expensive or not general enough. To address this problem, we propose a new hardware support for efficient run-time parallelization in distributed shared-memory multiprocessors (DSMs). The idea is to execute the code in parallel speculatively and use the cache coherence protocol hardware to flag any cross-iteration data dependence. Often, such dependences naturally trigger a coherence transaction, like an invalidation when a processor writes to a variable that was read by another processor. However, with appropriate extensions to the cache coherence protocol, all such dependences are detected. If a dependence is detected, execu...
Techniques for Speculative Run-Time Parallelization of Loops
In Supercomputing ’98, 1998
Cited by 32 (1 self)
This paper presents a set of new run-time tests for speculative parallelization of loops that defy parallelization based on static analysis alone. It presents a novel method for speculative array privatization that is not only more efficient than previous methods when the speculation is correct, but also does not require rolling back the computation in case the variable is found not to be privatizable. We present another method for speculative parallelization which can overcome all loop-carried anti and output dependences, with even lower overhead than previous techniques which could not break such dependences. Again, in order to ameliorate the problem of paying a heavy penalty for speculatively parallelizing loops that turn out to be serial, we present a technique that enables early detection of loop-carried dependences. Our experimental results from a preliminary implementation of these tests on an IBM G30 SMP machine show a significant reduction in the penalty paid for mis-...
Run-Time Methods for Parallelizing Partially Parallel Loops
Proceedings of the 9th ACM International Conference on Supercomputing, 1995
Cited by 30 (6 self)
In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop’s access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.
A Scalable Method for Run-Time Loop Parallelization
IJPP, 1995
Cited by 20 (13 self)
Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well-behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and thereby potentially increases the overall parallelism of the loop. We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop. The schedule produced is a partition of the set of iterations into subsets called wavefronts so that there are n...
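To make the notion of a wavefront schedule concrete, here is a minimal sketch of an inspector that partitions iterations into wavefronts from their access pattern. It is an illustration under assumptions, not the authors' algorithm: the name `build_wavefronts` and the input format (one set of read elements and one set of written elements per iteration) are hypothetical, the sketch runs sequentially although the paper's inspector is described as fully parallel, and it applies no privatization or reduction parallelization, so anti and output dependences are respected rather than removed.

```python
def build_wavefronts(reads, writes):
    """Group iterations into wavefronts (illustrative only).

    reads[i] and writes[i] are the sets of array elements that
    iteration i reads and writes. Iterations in the same wavefront
    have no dependences among them; wavefronts run in order.
    """
    last_write = {}  # element -> wavefront of the latest writer so far
    last_read = {}   # element -> highest wavefront of any reader so far
    stage = []       # stage[i] = wavefront assigned to iteration i
    for r, w in zip(reads, writes):
        wf = 0
        for e in r:  # flow dependence: run after the latest writer
            wf = max(wf, last_write.get(e, -1) + 1)
        for e in w:  # output and anti dependences
            wf = max(wf, last_write.get(e, -1) + 1, last_read.get(e, -1) + 1)
        stage.append(wf)
        for e in r:
            last_read[e] = max(last_read.get(e, -1), wf)
        for e in w:
            last_write[e] = wf

    fronts = [[] for _ in range(max(stage, default=-1) + 1)]
    for i, wf in enumerate(stage):
        fronts[wf].append(i)
    return fronts


# Four iterations: 1 reads what 0 wrote; 3 reads what 1 wrote.
schedule = build_wavefronts(
    reads=[{0}, {1}, {0}, {2}],
    writes=[{1}, {2}, {3}, {1}],
)
```

In this example `schedule` is `[[0, 2], [1], [3]]`: iterations 0 and 2 can run together, then iteration 1, then iteration 3. A scheduler would execute each inner list in parallel with a barrier between lists.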
Coarse-Grained Speculative Execution in Shared-Memory Multiprocessors
1998
Cited by 19 (3 self)
This thesis presents a new parallelization model, called coarse-grained thread pipelining, for exploiting coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture [7], allows concurrent execution of loop iterations in a pipelined fashion with run-time data dependence checking and control speculation. The speculative execution combined with the run-time dependence analysis allows the parallelization of a variety of program constructs that cannot be parallelized with existing run-time parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of our coarse-grained thread pipelining model using some real applications and a synthetic benchmark. These experiments show that...
Run-time Parallelization: A Framework for Parallel Computation
1995
Cited by 16 (8 self)
The goal of parallelizing, or restructuring, compilers is to detect and exploit parallelism in sequential programs written in conventional languages. Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, statically analyzable access patterns. However, if the memory access pattern of the program is input data dependent, then static data dependence analysis and consequently parallelization is impossible. Moreover, in this case the compiler cannot apply privatization and reduction parallelization, the transformations that have been proven to be the most effective in removing data dependences and increasing the amount of exploitable parallelism in the program. Typical examples of irregular, dynamic applications are complex simulations such as SPICE for circuit simulation, DYNA-3D for structural mechanics modeling, DMOL for quantum mechanical simulation of molecules, and CHARMM for molecular dynamics simulation of organic systems. Therefore, since irregular programs represent a large and important fraction of applications, an automatable framework for run-time parallelization is needed to complement existing and future static compiler techniques. In this thesis,
JavaSpMT: A Speculative Thread Pipelining Parallelization Model for Java Programs
In Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS), 2000
Cited by 15 (5 self)
This paper presents a new approach to improve execution-time performance of Java programs by extending the superthreaded speculative execution model [16, 17, 18] to exploit coarse-grained parallelism on a shared-memory multiprocessor system. The parallelization model, called Java Speculative MultiThreading (JavaSpMT), combines control speculation with run-time dependence checking to parallelize a wide variety of loop constructs, including do-while loops, that cannot be parallelized using standard parallelization techniques. JavaSpMT is implemented using the standard Java multithreading mechanism and the parallelization is expressed using a Java source-to-source transformation. Thus, the transformed programs are still portable to any shared-memory multiprocessor system with a Java Virtual Machine implementation that supports native threads. The performance of the JavaSpMT model is evaluated on an 8-processor shared-memory node of an IBM SP system using the AIX 4.3.2 implementation of JDK-1.1.6 with JIT compilation and three Java application programs.
The Illinois Aggressive Coma Multiprocessor Project (I-ACOMA)
In Proc. of the 6th Symposium on the Frontiers of Massively Parallel Computing, 1996
Cited by 12 (0 self)
While scalable shared-memory multiprocessors with hardware-assisted cache coherence are relatively easy to program, if truly high performance is desired, they still require substantial programmer effort. For example, data must be allocated close to the processors that will use them and the application must be tuned so that the working set fits in the caches. This is unfortunate because the most important obstacle to widespread use of parallel computing is the hardship of programming parallel machines. The goal of the I-ACOMA project is to explore how to design a highly programmable high-performance multiprocessor. We focus on a flat-COMA scalable multiprocessor supported by a parallelizing compiler. The main issues that we are studying are advanced processor organizations, techniques to handle long memory access latencies, and support for important classes of workloads like databases and scientific applications with loops that cannot be compiler-analyzed. The project also involves build...
Run-Time Parallelization: Its Time Has Come
SPECIAL ISSUE ON LANGUAGES & COMPILERS FOR PARALLEL COMPUTERS, 1998
Cited by 8 (1 self)
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. This type of loop mostly occurs in irregular, dynamic applications which represent more than 50% of all applications [20]. Making parallel computing succeed has therefore become conditioned by the ability of compilers to analyze and extract the parallelism from irregular applications. In this paper we present a survey of techniques that can complement the current compiler capabilities by performing some form of data dependence analysis during program execution, when all information is available. After describing the problem of loop parallelization and its difficulties, a general overview of the need for techniques of run-time parallelization is given. A survey of the various approaches to parallelizing partially parallel loops and fully parallel loops is presented. Special emphasis is placed on two parallelism enabling transformations, privatization and reduction parallelization, because of their proven efficiency. The technique of speculatively parallelizing doall loops is presented in more detail. This survey limits itself to the domain of Fortran applications parallelized mostly in the shared memory paradigm. Related work from the field of parallel debugging and parallel simulation is also described.

