Results 1 - 10
of
24
Optimistic parallelism requires abstractions
- In PLDI
, 2007
"... Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and runtime speculative execution have failed to uncover much parallelism in these applications, in spite o ..."
Abstract
-
Cited by 66 (8 self)
- Add to MetaCart
Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and runtime speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications. In this paper, we describe two real-world irregular applications: a Delaunay mesh refinement application and a graphics application that performs agglomerative clustering. By studying the algorithms and data structures used in these applications, we show that there is substantial coarse-grain, data parallelism in these applications, but that this parallelism is very dependent on the input data and therefore cannot be uncovered by compiler analysis. In principle, optimistic techniques such as thread-level speculation can be used to uncover this parallelism, but we argue that current implementations cannot accomplish this because they do not use the proper abstractions for the data structures in these programs. These insights have informed our design of the Galois system, an object-based optimistic parallelization system for irregular applications. There are three main aspects to Galois: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by an optimistic computation. We show that Delaunay mesh generation and agglomerative clustering can be parallelized in a straight-forward way using the Galois approach, and we present experimental measurements to show that this approach is practical. These results suggest that Galois is a practical approach to exploiting data parallelism in irregular programs.
Advanced Program Restructuring for High-Performance Computers with Polaris
- IEEE COMPUTER
, 1996
"... Multiprocessor computers are rapidly becoming the norm. Parallel workstations are widely available today, and it is likely that most PCs in the near future will contain multiple processors. To accommodate these changes, some classes of applications must be developed in explicitly parallel form. Ye ..."
Abstract
-
Cited by 46 (26 self)
- Add to MetaCart
Multiprocessor computers are rapidly becoming the norm. Parallel workstations are widely available today, and it is likely that most PCs in the near future will contain multiple processors. To accommodate these changes, some classes of applications must be developed in explicitly parallel form. Yet, in order to avoid a substantial increase in software development costs, compilers that translate conventional programs into efficient parallel form clearly will be necessary. In the ideal case, multiprocessor parallelism should be as transparent to programmers as instruction level parallelism is to programmers of today's superscalar machines. However, compiling for multiprocessors is substantially more complex than compiling for functional unit parallelism, in part because successful parallelization often requires a very accurate analysis of long sections of code. This article discusses recent experience at Illinois on the automatic parallelization of scientific codes using the Polaris restructurer. We also present several new analysis techniques that we have developed in recent years based on an extensive analysis of the characteristics of real Fortran codes. These techniques, based on both static and dynamic parallelization strategies, have been incorporated into the Polaris restructurer. Preliminary results on parallel workstations are encouraging, and once the implementation of the new techniques is complete, we expect that Polaris will be able to obtain good speedups for many real scientific codes on parallel workstations.
On the Automatic Parallelization of the Perfect Benchmarks
- IEEE Trans. on Parallel and Distributed Systems
, 1994
"... This paper presents the results of the Cedar Hand-Parallelization Experiment, conducted from 1989 through 1992 within the Center for Supercomputing Research and Development (CSRD) at the University of Illinois. In this experiment we manually transformed the Perfect Benchmarks R fl into parallel ..."
Abstract
-
Cited by 39 (7 self)
- Add to MetaCart
This paper presents the results of the Cedar Hand-Parallelization Experiment, conducted from 1989 through 1992 within the Center for Supercomputing Research and Development (CSRD) at the University of Illinois. In this experiment we manually transformed the Perfect Benchmarks R fl into parallel program versions. In doing so, we used techniques that may be automated in an optimizing compiler. We then ran these programs on the Cedar multiprocessor (built at CSRD during the 1980s) and measured the speed improvement due to each technique.
Run-Time Methods for Parallelizing Partially Parallel Loops
- Proceedings of the 9th ACM International Conference on Supercomputing
, 1995
"... In this paper we give a new run–time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generate ..."
Abstract
-
Cited by 30 (6 self)
- Add to MetaCart
In this paper we give a new run–time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run–time preprocessing of the loop’s access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run–time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element–wise). We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop. 1
The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops
- In IPDPS ’02: Proceedings of the 16th International Parallel and Distributed Processing Symposium
, 2002
"... Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We have previously proposed a framework for their identification. We speculatively executed a loop as a doall, and applied ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We have previously proposed a framework for their identification. We speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross--processor dependences; If the test failed, then the loop was re--executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. In this paper we propose a generalization of our speculative doall parallelization technique, called the Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the overhead of the run-time dependence test itself, i.e., removes the time lost due to incorrect parallel execution. The asymptotic time-complexity is, for fully serial loops, equal to the sequential execution time.
A Scalable Method for Run-Time Loop Parallelization
- IJPP
, 1995
"... Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficientl ..."
Abstract
-
Cited by 20 (13 self)
- Add to MetaCart
Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run--time. In this paper we give a new run--time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run--time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run--time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element--wise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and thereby potentially increases the overall parallelism of the loop. We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop. The schedule produced is a partition of the set of iterations into subsets called wavefronts so that there are n...
Principles of Speculative Run-time Parallelization
, 1998
"... . Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We advocate a novel framework for the identification of parallel loops. It speculatively executes a loop as a doall and ..."
Abstract
-
Cited by 19 (9 self)
- Add to MetaCart
. Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. We advocate a novel framework for the identification of parallel loops. It speculatively executes a loop as a doall and applies a fully parallel data dependence test to check for any unsatisfied data dependencies; if the test fails, then the loop is re--executed serially. We will present the principles of the design and implementation of a compiler that employs both run-time and static techniques to parallelize dynamic applications. Run-time optimizations always represent a tradeoff between a speculated potential benefit and a certain (sure) overhead that must be paid. We will introduce techniques that take advantage of classic compiler methods to reduce the cost of run-time optimization thus tilting the outcome of speculation in favor of significant performance gains. Experimental results from the ...
Run-time Parallelization: A Framework for Parallel Computation
, 1995
"... The goal of parallelizing, or restructuring, compilers is to detect and exploit parallelism in sequential programs written in conventional languages. Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, statically analyzable access patterns. Howev ..."
Abstract
-
Cited by 16 (8 self)
- Add to MetaCart
The goal of parallelizing, or restructuring, compilers is to detect and exploit parallelism in sequential programs written in conventional languages. Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, statically analyzable access patterns. However, if the memory access pattern of the program is input data dependent, then static data dependence analysis and consequently parallelization is impossible. Moreover, in this case the compiler cannot apply privatization and reduction parallelization, the transformations that have been proven to be the most effective in removing data dependences and increasing the amount of exploitable parallelism in the program. Typical examples of irregular, dynamic applications are complex simulations such as SPICE for circuit simulation, DYNA-3D for structural mechanics modeling, DMOL for quantum mechanical simulation of molecules, and CHARMM for molecular dynamics simulation of organic systems. Therefore, since irregular programs represent a large and important fraction of applications, an automatable framework for run-time parallelization is needed to complement existing and future static compiler techniques. In this thesis,
Architecture And Compiler Design Issues In Programmable Media Processors
, 2000
"... The processing demands for multimedia applications are rapidly escalating. Many current applications are pushing the limits of existing microprocessors, and the next generation of multimedia promises considerably greater demands. Adequate support for future multimedia requires the flexibility and co ..."
Abstract
-
Cited by 10 (6 self)
- Add to MetaCart
The processing demands for multimedia applications are rapidly escalating. Many current applications are pushing the limits of existing microprocessors, and the next generation of multimedia promises considerably greater demands. Adequate support for future multimedia requires the flexibility and computing power of high-level language (HLL) programmable media processors. This thesis examines the architecture and compiler design issues for programmable media processors. Design of the architecture requires an accurate understanding of multimedia characteristics. Using the MediaBench benchmark suite and the Impact compiler, workload and architecture evaluations were performed to define the essential architecture for programmable media processors. The workload evaluation examines various processing aspects, including functional necessities, data types and sizes, branch performance, loop characteristics, memory statistics, and instruction level parallelism. The architecture evaluation exam...

