The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization
, 1995
Cited by 185 (36 self)
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall, and apply a fully parallel data dependence test to determine if it had any cross-iteration dependences; if the test fails, then the loop is re-executed serially. Since, from our experience, a significant amount of the available parallelism in Fortran programs can be exploited by loops transformed through privatization and reduction parallelization, our methods can speculatively apply these transformations and then check their validity at run-time. Another important contribution of this paper is a novel method for reduction recognition which goes beyond syntactic pattern matching: it detects at run-time if the values stored in an array participate in a reduction operation, even if they are transferred through private variables and/or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks which substantiate our claim that these techniques can yield significant speedups which are often superior to those obtainable by inspector/executor methods. The methods presented in this paper differ from and extend our previous work on several important points. First, instead of distributing the loop into inspector and executor loops (the approach taken in all previous work on run-time parallelization) we advocate the use of run-time tests to validate the execution of a loop that is speculatively executed in parallel. Second, in addition to array privatization, the new techniques are capa...
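The speculate-then-test idea in this abstract can be illustrated with a small sketch. It is a conservative simplification written for this listing, not the paper's algorithm: the function name `classify_loop`, the trace format, and the three result labels are all assumptions. It marks which array elements are written and which are read before being written in the same iteration, then decides after the fact whether the loop could have run as a doall.

```python
def classify_loop(trace, n_elems):
    """Classify a loop from its recorded array accesses.

    trace: one list per iteration of (kind, index) pairs, kind is 'r' or 'w'.
    Returns 'doall', 'doall_with_privatization', or 'serial'.
    Hypothetical helper; a conservative stand-in for a run-time dependence test.
    """
    written = [False] * n_elems        # element written by some iteration
    exposed_read = [False] * n_elems   # element read before any write in its own iteration
    total_writes = 0                   # number of distinct (iteration, element) writes

    for accesses in trace:
        written_here = set()
        for kind, idx in accesses:
            if kind == "w":
                if idx not in written_here:
                    written_here.add(idx)
                    total_writes += 1
                written[idx] = True
            elif idx not in written_here:
                exposed_read[idx] = True

    # An element that is both written and read-before-write may carry a value
    # across iterations, so the speculation is treated as failed.
    if any(w and r for w, r in zip(written, exposed_read)):
        return "serial"
    # Every written element was written by exactly one iteration.
    if total_writes == sum(written):
        return "doall"
    # Several iterations write the same element, but no iteration consumes
    # another iteration's value: per-iteration private copies would suffice.
    return "doall_with_privatization"
```

For example, a trace in which every iteration writes and then reads the same temporary element is classified as `doall_with_privatization`, while a trace in which one iteration reads what another wrote is classified as `serial`. The check is deliberately conservative: it also reports `serial` for an element that a single iteration reads and then writes, even when no other iteration touches it.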
Nonlinear Array Dependence Analysis
, 1991
Cited by 63 (5 self)
Standard array data dependence techniques can only reason about linear constraints. There has also been work on analyzing some dependences involving polynomial constraints. Analyzing array data dependences in real-world programs requires handling many "unanalyzable" terms: subscript arrays, run-time tests, function calls. The standard approach to analyzing such programs has been to omit and ignore any constraints that cannot be reasoned about. This is unsound when reasoning about value-based dependences and whether privatization is legal. Also, this prevents us from determining the conditions that must be true to disprove the dependence. These conditions could be checked by a run-time test or verified by a programmer or aggressive, demand-driven interprocedural analysis. We describe a solution to these problems. Our solution makes our system sound and more accurate for analyzing value-based dependences and derives conditions that can be used to disprove dependences. We also give some p...
High-Level Adaptive Program Optimization with ADAPT
, 2001
Cited by 50 (7 self)
Compile-time optimization is often limited by a lack of target machine and input data set knowledge. Without this information, compilers may be forced to make conservative assumptions to preserve correctness and to avoid performance degradation. In order to cope with this lack of information at compile-time, adaptive and dynamic systems can be used to perform optimization at runtime when complete knowledge of input and machine parameters is available. This paper presents a compiler-supported high-level adaptive optimization system. Users describe, in a domain specific language, optimizations performed by stand-alone optimization tools and backend compiler flags, as well as heuristics for applying these optimizations dynamically at runtime. The ADAPT compiler reads these descriptions and generates application-specific runtime systems to apply the heuristics. To facilitate the usage of existing tools and compilers, overheads are minimized by decoupling optimization from execution. Our system, ADAPT, supports a range of paradigms proposed recently, including dynamic compilation, parameterization and runtime sampling. We demonstrate our system by applying several optimization techniques to a suite of benchmarks on two target machines. ADAPT is shown to consistently outperform statically generated executables, improving performance by as much as 70%.
Polaris: The Next Generation in Parallelizing Compilers
- Proceedings of the Workshop on Languages and Compilers for Parallel Computing
, 1994
Cited by 38 (8 self)
It is the goal of the Polaris project to develop a new parallelizing compiler that will overcome limitations of current compilers. While current parallelizing compilers may succeed on small kernels, they often fail to extract any meaningful parallelism from large applications. After a study of application codes, it was concluded that by adding a few new techniques to current compilers, automatic parallelization becomes possible. The techniques needed are interprocedural analysis, scalar and array privatization, symbolic dependence analysis, and advanced induction and reduction recognition and elimination, along with run-time techniques to allow data dependent behavior.
Automatic Detection of Parallelism: A Grand Challenge for High-Performance Computing
- IEEE Parallel and Distributed Technology
, 1994
Cited by 35 (12 self)
The limited ability of compilers to find the parallelism in programs is a significant barrier to the use of high performance computers. It forces programmers to resort to parallelizing their programs by hand, adding another level of complexity to the programming task. We show evidence that compilers can be improved, through static and run-time techniques, to the extent that a significant group of scientific programs may be parallelized automatically. Symbolic dependence analysis and array privatization, plus run-time versions of those techniques, are shown to be important to the success of this effort. If we succeed in parallelizing programs automatically, the acceptance and use of large-scale parallel processors will be enhanced greatly.
Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
, 1997
Cited by 33 (9 self)
Run-time parallelization is often the only way to generate parallel code for multiprocessors when data dependence information is incomplete at compile time. This situation is common in many important applications, where arrays are accessed with subscripted subscripts. Unfortunately, known techniques for run-time parallelization are often computationally expensive or not general enough. To address this problem, we propose new hardware support for efficient run-time parallelization in distributed shared-memory multiprocessors (DSMs). The idea is to execute the code in parallel speculatively and use the cache coherence protocol hardware to flag any cross-iteration data dependence. Often, such dependences naturally trigger a coherence transaction, like an invalidation when a processor writes to a variable that was read by another processor. However, with appropriate extensions to the cache coherence protocol, all such dependences are detected. If a dependence is detected, execu...
Techniques for Speculative Run-Time Parallelization of Loops
- In Supercomputing ’98
, 1998
Cited by 32 (1 self)
This paper presents a set of new run-time tests for speculative parallelization of loops that defy parallelization based on static analysis alone. It presents a novel method for speculative array privatization that is not only more efficient than previous methods when the speculation is correct, but also does not require rolling back the computation in case the variable is found not to be privatizable. We present another method for speculative parallelization which can overcome all loop-carried anti and output dependences, with even lower overhead than previous techniques which could not break such dependences. Again, in order to ameliorate the problem of paying a heavy penalty for speculatively parallelizing loops that turn out to be serial, we present a technique that enables early detection of loop-carried dependences. Our experimental results from a preliminary implementation of these tests on an IBM G30 SMP machine show a significant reduction in the penalty paid for mis-...
An Efficient Algorithm for the Run-time Parallelization of Doacross Loops
- In Proceedings of Supercomputing 1994
, 1994
Cited by 32 (4 self)
While automatic parallelization of loops is generally based on compile-time analysis of data dependences, sometimes the data dependences cannot be determined at compile time. This often occurs, for example, when arrays are accessed via subscripted subscripts. In these cases, it is necessary to use run-time parallelization algorithms. In this paper, we present and evaluate a new run-time parallelization algorithm based on an inspector-executor pair. The scheme, called CYT, handles all types of data dependences in the loop without requiring any special architectural support. Furthermore, compared to an older scheme as general as the new algorithm, the latter speeds up execution by reducing the amount of interprocessor communication required. The new algorithm is applied to the Perfect Club codes and to parameterized loops, all running on the 32-CPU Cedar shared-memory multiprocessor. Although most loops with subscripted subscripts in the Perfect Club codes are highly parallel, their sma...
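A minimal sketch may help show why subscripted subscripts defeat compile-time analysis and what an inspector-executor pair does about it. This is not the CYT scheme from the abstract. It is a much simpler illustration for a loop of the assumed form `x[w[i]] += b[i]`, and the names `inspector_is_parallel` and `executor` are hypothetical.

```python
def inspector_is_parallel(w):
    """Inspector: scan the index array before running the loop.

    For a loop body x[w[i]] += b[i], two iterations conflict exactly when
    they use the same index, so the loop is fully parallel if and only if
    w contains no duplicates. The contents of w are only known at run time.
    """
    seen = set()
    for idx in w:
        if idx in seen:
            return False
        seen.add(idx)
    return True


def executor(x, w, b):
    """Executor: perform the loop itself.

    When the inspector returns True, the iterations could be distributed
    across processors in any order; here they simply run sequentially.
    """
    for i in range(len(w)):
        x[w[i]] += b[i]
    return x
```

With `w = [2, 0, 1]` the inspector reports that the loop is parallel; with `w = [2, 0, 2]` it does not, because iterations 0 and 2 both update `x[2]`.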
Run-Time Methods for Parallelizing Partially Parallel Loops
- Proceedings of the 9th ACM International Conference on Supercomputing
, 1995
Cited by 30 (6 self)
In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop’s access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.
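The notion of a parallel execution schedule for a partially parallel loop can be sketched as a wavefront assignment. The code below is an illustrative sequential inspector, not the fully parallel one the abstract describes, and it applies neither privatization nor reduction parallelization. It assumes each iteration `i` reads one element `reads[i]` and writes one element `writes[i]`; the function name `wavefronts` is hypothetical.

```python
def wavefronts(reads, writes, n_elems):
    """Assign each iteration a wavefront number (1-based).

    Iterations with the same number have no dependences among them and may
    run concurrently; wavefronts must execute in increasing order.
    """
    last_write = [0] * n_elems  # highest wavefront of an earlier writer of the element
    last_read = [0] * n_elems   # highest wavefront of an earlier reader of the element
    levels = []
    for r, w in zip(reads, writes):
        level = 1 + max(
            last_write[r],  # flow dependence: read after an earlier write
            last_write[w],  # output dependence: write after an earlier write
            last_read[w],   # anti dependence: write after an earlier read
        )
        levels.append(level)
        last_read[r] = max(last_read[r], level)
        last_write[w] = max(last_write[w], level)
    return levels
```

A chain such as `reads = [0, 1, 2]`, `writes = [1, 2, 3]` yields the levels `[1, 2, 3]`, so the loop is effectively serial. Independent iterations such as `reads = [0, 1]`, `writes = [2, 3]` all land in wavefront 1.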
Parallelizing While Loops for Multiprocessor Systems
- In Proceedings of the 9th International Parallel Processing Symposium
, 1995
Cited by 29 (13 self)
Current parallelizing compilers treat while loops and do loops with conditional exits as sequential constructs because their iteration space is unknown. Motivated by the fact that these types of loops arise frequently in practice, we have developed techniques that can be used to automatically transform them for parallel execution. We succeed in parallelizing loops involving linked list traversals, something that has not been done before. This is an important problem since linked list traversals arise frequently in loops with irregular access patterns, such as sparse matrix computations. The methods can even be applied to loops whose data dependence relations cannot be analyzed at compile-time. We outline a cost/performance analysis that can be used to decide when the methods should be applied. Since, as we show, the expected speedups are significant, our conclusion is that they should almost always be applied, provided there is sufficient parallelism available in the original loop. We present experimental results on loops from the PERFECT Benchmarks and sparse matrix packages which substantiate our conclusion that these techniques can yield significant speedups.
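One simple way to run a while loop over a linked list in parallel is to separate the sequential pointer chasing from the per-node work. The sketch below shows that split under the assumption that the work on each node is independent of the others. It is an illustration only and is not claimed to be one of the paper's methods; the `Node` class and `parallel_map_list` function are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Singly linked list node (hypothetical helper type)."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def parallel_map_list(head, work, max_workers=4):
    """Apply work() to every node value, returning results in list order."""
    # Phase 1: the traversal itself is inherently sequential, because each
    # node is only reachable through its predecessor.
    values = []
    node = head
    while node is not None:
        values.append(node.value)
        node = node.next

    # Phase 2: the iteration space is now known, so the remaining work can
    # be distributed. Assumes work() has no cross-node dependences.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, values))
```

For a list holding 1, 2, 3 and `work = lambda v: v * v`, the call returns `[1, 4, 9]`. Whether the split pays off depends on how expensive `work` is relative to the traversal, which is the kind of trade-off a cost/performance analysis would have to weigh.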

