Gang Scheduling Performance Benefits for Fine-Grain Synchronization
- Journal of Parallel and Distributed Computing
, 1992
Cited by 117 (12 self)
Multiprogrammed multiprocessors executing fine-grained parallel programs appear to require new scheduling policies. A promising new idea is gang scheduling, where a set of threads is scheduled to execute simultaneously on a set of processors. This has the intuitive appeal of supplying the threads with an environment that is very similar to a dedicated machine. It allows the threads to interact efficiently by using busy waiting, without the risk of waiting for a thread that currently is not running. Without gang scheduling, threads have to block in order to synchronize, thus suffering the overhead of a context switch. While this is tolerable in coarse-grain computations, and might even lead to performance benefits if the threads are highly unbalanced, it causes severe performance degradation in the fine-grain case. We have developed a model to evaluate the performance of different combinations of synchronization mechanisms and scheduling policies, and validated it by an implementation on the Makbilan multiprocessor. The model leads to the conclusion that gang scheduling is required for efficient fine-grain synchronization on multiprogrammed multiprocessors.
1 Introduction
Multiprocessors are often dedicated to running a single application at a time. The program is allowed full control over what happens on each processor, and in fact it might be required to include instructions that regulate the mapping and scheduling of parallel threads. Much experience relating to these issues has been accumulated over the years, and automatic parallelization and compilation techniques have been developed. These techniques allow dedicated processors to be used efficiently by a single application.
Automatic Program Parallelization
, 1993
Cited by 97 (8 self)
This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last section of the paper surveys several experimental studies on the effectiveness of parallelizing compilers.
Constructive Methods for Scheduling Uniform Loop Nests
- IEEE Transactions on Parallel and Distributed Systems
, 1994
Cited by 60 (3 self)
This paper surveys scheduling techniques for loop nests with uniform dependences. First we introduce the hyperplane method and related variants. Then we extend it by using a different affine scheduling for each statement within the nest. In both cases we present a new, constructive and efficient method to determine optimal solutions, i.e. schedules whose total execution time is minimum.
1 Introduction
Loop nests lie at the heart of supercompilers-parallelizers for supercomputers. On the one hand, their importance in terms of applications is evident: in many scientific programs, the time spent in the execution of a small number of loops represents a large fraction of the total execution time, while the potential parallelism of these loops is very important. On the other hand, the regular and repetitive structure of loop nests greatly facilitates the use of dependence analysis techniques and of scheduling and allocation strategies. The general problem of finding the optimal scheduling for a ...
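As a rough illustration of the hyperplane idea mentioned in this abstract, the sketch below groups the points of a small two-dimensional iteration space into wavefronts under a linear schedule. The dependence vectors, the schedule vector `pi`, and the function names are assumptions chosen for the example; they are not taken from the paper.

```python
from collections import defaultdict

def is_valid_schedule(pi, deps):
    """A linear schedule t(p) = pi . p is valid when pi . d >= 1 for every dependence vector d."""
    return all(sum(a * b for a, b in zip(pi, d)) >= 1 for d in deps)

def wavefronts(pi, n, m):
    """Group the points of an n x m iteration space by their time step pi . (i, j)."""
    fronts = defaultdict(list)
    for i in range(n):
        for j in range(m):
            fronts[pi[0] * i + pi[1] * j].append((i, j))
    # Points sharing a time step lie on one hyperplane and can run in parallel.
    return [fronts[t] for t in sorted(fronts)]

# Assumed example: a[i][j] depends on a[i-1][j] and a[i][j-1].
deps = [(1, 0), (0, 1)]
pi = (1, 1)
fronts = wavefronts(pi, 3, 3)
for step, points in enumerate(fronts):
    print(step, points)
```

With these assumed dependences, `pi = (1, 1)` is valid while `pi = (1, 0)` is not, because it would place both ends of the dependence `(0, 1)` in the same time step.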
The fuzzy barrier: A mechanism for high speed synchronization of processors
- In: ASPLOS
, 1989
Cited by 56 (3 self)
Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared variables has two major drawbacks. Firstly, the execution of the barrier may be slow, as it may not only require the execution of several instructions but also result in hot-spot accesses. Secondly, processors that are stalled waiting for other processors to reach the barrier are essentially idling and cannot do any useful work. In this paper, the notion of the fuzzy barrier is presented, which avoids the above drawbacks. The first problem is avoided by implementing the mechanism in hardware. The second problem is solved by extending the barrier concept to include a region of statements that can be executed by a processor while it awaits synchronization. The barrier regions are constructed by a compiler and consist of several instructions such that a processor is ready to synchronize upon reaching the first instruction in this region and must synchronize before exiting the region. When synchronization does occur, the processors could be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Preliminary investigations show that barrier regions can be large and the use of program transformations can significantly increase their size. Examples of situations where such a mechanism can result in improved performance are presented. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the synchronization overhead can be greatly reduced using the mechanism.
Keywords: multiprocessor systems, barrier synchronization, parallelizing compilers.
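The barrier-region idea can be approximated in software as a split-phase barrier: a thread announces that it is ready to synchronize, keeps doing independent work, and only blocks when it leaves the region. The sketch below is an assumption-laden illustration using Python threads; the class name `FuzzyBarrier`, its `arrive`/`wait` methods, and the demo workload are hypothetical and do not reproduce the hardware mechanism or the Encore implementation described in the abstract.

```python
import threading

class FuzzyBarrier:
    """Split-phase barrier: arrive() signals readiness, wait() blocks only if others are late."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def arrive(self):
        """Enter the barrier region; returns a token identifying this barrier episode."""
        with self.cond:
            token = self.generation
            self.count += 1
            if self.count == self.parties:
                # Last arrival completes the episode and releases any waiters.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            return token

    def wait(self, token):
        """Leave the barrier region; blocks until every party has arrived."""
        with self.cond:
            while self.generation == token:
                self.cond.wait()

def run_demo(parties=4):
    barrier = FuzzyBarrier(parties)
    arrived, seen_at_exit = [], []
    lock = threading.Lock()

    def worker(rank):
        with lock:
            arrived.append(rank)
        token = barrier.arrive()
        sum(range(1000 * (rank + 1)))  # independent work done inside the barrier region
        barrier.wait(token)
        with lock:
            seen_at_exit.append(len(arrived))  # every party must have arrived by now

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_at_exit
```

The longer the work between `arrive()` and `wait()`, the more likely it is that `wait()` returns immediately, which mirrors the abstract's remark about larger barrier regions.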
The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization
- In Proceedings of the 1994 International Conference on Supercomputing
, 1994
Cited by 43 (16 self)
Current parallelizing compilers cannot extract a significant fraction of the available parallelism in a loop if it has a complex and/or statically insufficiently defined access pattern. This is an important issue because a large class of complex simulations used in industry today have irregular domains and/or dynamically changing interactions. To handle these types of problems, methods capable of automatically extracting parallelism at run-time are needed. For this reason, we have developed the Privatizing DOALL test, a technique for identifying fully parallel loops at run-time and dynamically privatizing scalars and arrays. The test is fully parallel, requires no synchronization, is easily automatable, and can be applied to any loop, regardless of its access pattern. We show that the expected speedup for fully parallel loops is significant, and the cost of a failed test (a not fully parallel loop) is minimal. We present experimental results on loops from the PERFECT Benchmarks whi...
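To make the question behind such a run-time test concrete, the sketch below checks a recorded access trace for cross-iteration dependences and reports whether privatization would be needed. It is a sequential reference check written for illustration only, not the parallel test the authors describe; the trace format and the function `classify_loop` are assumptions.

```python
from collections import defaultdict

def classify_loop(trace):
    """trace[i] is the ordered list of ("r" | "w", index) accesses made by iteration i."""
    writers = defaultdict(set)          # element -> iterations that write it
    exposed_readers = defaultdict(set)  # element -> iterations that read it before writing it
    for it, accesses in enumerate(trace):
        written = set()
        for kind, idx in accesses:
            if kind == "w":
                written.add(idx)
                writers[idx].add(it)
            elif idx not in written:
                exposed_readers[idx].add(it)

    needs_privatization = False
    for idx, ws in writers.items():
        # A value read in one iteration but written in a different one is a
        # cross-iteration dependence that privatization cannot remove.
        if any(r != w for r in exposed_readers[idx] for w in ws):
            return "not parallel"
        # Several iterations writing the same element (each before reading it)
        # only conflict on storage, so a private copy per iteration suffices.
        if len(ws) > 1:
            needs_privatization = True
    return "parallel with privatization" if needs_privatization else "parallel"

# Assumed example traces:
independent = [[("r", i), ("w", i)] for i in range(3)]        # A[i] = A[i] + 1
temporary = [[("w", 9), ("r", 9)] for _ in range(3)]          # tmp = ...; use tmp
recurrence = [[("r", i - 1), ("w", i)] for i in range(1, 4)]  # A[i] = A[i-1]
```

In the `temporary` case a real implementation would also have to copy out the last value if the element is live after the loop; the sketch does not model that step.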
Linear Scheduling Is Nearly Optimal
, 1991
Cited by 35 (11 self)
This paper deals with the problem of finding optimal schedulings for uniform dependence algorithms. Given a convex domain, let $T_f$ be the total time needed to execute all computations using the free (greedy) schedule and let $T_l$ be the total time needed to execute all computations using the optimal linear schedule. Our main result is to bound $T_l / T_f$ and $T_l - T_f$ for sufficiently "fat" domains.
Keywords: Uniform dependence algorithms; Convex domain; Free schedule; Linear schedule; Optimal schedule; Path packing.
1. Introduction
The pioneering work of Karp, Miller and Winograd [2] has considered a special class of algorithms characterized by uniform data dependencies and unit-time computations. This special class of algorithms, termed uniform dependence algorithms by Shang and Fortes [6], has proven of paramount importance in various fields of applications, such as systolic array design and parallel compiler optimization. This paper deals with the problem of finding optimal s...
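The two quantities compared in this abstract, the free-schedule time and the best linear-schedule time, can be computed by brute force on a small rectangular domain. The sketch below is only an illustration under stated assumptions: the domain is an `n` by `m` rectangle, dependence vectors are lexicographically positive, and the linear schedule is restricted to integer vectors with entries in a small range (`bound` is a hypothetical search limit). It does not reproduce the paper's bounds.

```python
from itertools import product

def free_schedule_time(n, m, deps):
    """Length of the greedy schedule: every point runs as soon as its predecessors are done."""
    step = {}
    # Lexicographic order visits predecessors first when deps are lexicographically positive.
    for p in sorted(product(range(n), range(m))):
        preds = [(p[0] - d[0], p[1] - d[1]) for d in deps]
        step[p] = 1 + max((step[q] for q in preds if q in step), default=0)
    return max(step.values())

def best_linear_time(n, m, deps, bound=3):
    """Search small integer vectors pi with pi . d >= 1 and return (time, pi) of the shortest one."""
    best = None
    for pi in product(range(-bound, bound + 1), repeat=2):
        if all(pi[0] * d[0] + pi[1] * d[1] >= 1 for d in deps):
            times = [pi[0] * i + pi[1] * j for i, j in product(range(n), range(m))]
            span = max(times) - min(times) + 1
            if best is None or span < best[0]:
                best = (span, pi)
    return best

deps = [(1, 0), (0, 1)]  # assumed example dependences
t_free = free_schedule_time(4, 4, deps)
t_linear, pi = best_linear_time(4, 4, deps)
print(t_free, t_linear, pi)
```

For this particular example the two times coincide; a linear schedule can never be shorter than the free schedule, since it also has to respect every dependence.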
Matrix Multiplication on Heterogeneous Platforms
, 2001
Cited by 35 (19 self)
In this paper, we address the issue of implementing matrix multiplication on heterogeneous platforms. We target two different classes of heterogeneous computing resources: heterogeneous networks of workstations and collections of heterogeneous clusters. Intuitively, the problem is to load balance the work with different speed resources while minimizing the communication volume. We formally state this problem in a geometric framework and prove its NP-completeness. Next, we introduce a (polynomial) column-based heuristic, which turns out to be very satisfactory: We derive a theoretical performance guarantee for the heuristic and we assess its practical usefulness through MPI experiments.
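The load-balancing half of the problem can be illustrated with a much simpler one-dimensional allocation: give each processor a number of matrix columns proportional to its speed. The sketch below is not the column-based heuristic from the paper and ignores communication volume entirely; the function `allocate_columns` and the speed values are assumptions for the example.

```python
def allocate_columns(n_columns, speeds):
    """Split n_columns among processors in proportion to their speeds (largest-remainder rounding)."""
    total = sum(speeds)
    ideal = [n_columns * s / total for s in speeds]
    counts = [int(x) for x in ideal]          # round every share down first
    leftovers = n_columns - sum(counts)
    # Hand the remaining columns to the processors with the largest fractional parts.
    by_remainder = sorted(range(len(speeds)), key=lambda k: ideal[k] - counts[k], reverse=True)
    for k in by_remainder[:leftovers]:
        counts[k] += 1
    return counts

# Assumed relative speeds of three processors.
print(allocate_columns(10, [1, 2, 5]))
```

Faster processors receive more columns, so all processors finish at roughly the same time; the geometric formulation in the abstract additionally shapes the blocks to reduce the data each processor must receive.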
Run-Time Methods for Parallelizing Partially Parallel Loops
- Proceedings of the 9th ACM International Conference on Supercomputing
, 1995
Cited by 30 (6 self)
In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.
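The inspector and scheduler roles described in this abstract can be sketched for a loop of the assumed form `a[writes[i]] = a[reads[i]] + 1`. The inspector below is sequential and assigns each iteration to a wavefront; it is a simplified illustration, not the parallel inspector or the optimal scheduling scheme from the paper, and it performs no privatization or reduction handling. All names are hypothetical.

```python
def inspector(reads, writes):
    """Assign each iteration the earliest wavefront that respects flow, anti and output dependences."""
    last_write = {}  # element -> wavefront of the latest iteration writing it
    last_read = {}   # element -> highest wavefront of any iteration reading it
    wave = []
    for r, w in zip(reads, writes):
        level = 1 + max(last_write.get(r, 0),   # flow: read after an earlier write
                        last_write.get(w, 0),   # output: write after an earlier write
                        last_read.get(w, 0))    # anti: write after an earlier read
        wave.append(level)
        last_read[r] = max(last_read.get(r, 0), level)
        last_write[w] = level
    return wave

def executor(a, reads, writes, wave):
    """Run the loop wavefront by wavefront; iterations inside one wavefront are independent."""
    a = list(a)
    for level in range(1, max(wave, default=0) + 1):
        front = [i for i, lv in enumerate(wave) if lv == level]
        for i in reversed(front):  # any order (or a parallel map) gives the same result
            a[writes[i]] = a[reads[i]] + 1
    return a

def sequential(a, reads, writes):
    """Reference result: the original loop executed in program order."""
    a = list(a)
    for r, w in zip(reads, writes):
        a[w] = a[r] + 1
    return a

reads, writes = [0, 1, 2, 0], [1, 2, 3, 4]  # assumed access pattern
wave = inspector(reads, writes)
print(wave, executor([0] * 5, reads, writes, wave))
```

Here iterations 0 and 3 share the first wavefront because neither touches data the other writes, while iterations 1 and 2 form a chain behind iteration 0.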

