A Framework for Exploiting Task- and Data-Parallelism on Distributed Memory Multicomputers
- IEEE Transactions on Parallel and Distributed Systems, 1997
Cited by 30 (0 self)
... offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, using all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications: the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. This increases performance because data parallelism provides diminishing returns as the number of processors increases. By controlling the number of processors used for each data-parallel task in an application and by executing these tasks concurrently, we make program execution more efficient and, therefore, faster. A practical implementation of a task- and data-parallel scheme of execution on a distributed memory multicomputer also involves data redistribution, which incurs overhead. However, as our experimental results show, this overhead is not a problem; execution of a program using task and data parallelism together can be significantly faster than its execution using data parallelism alone. This makes our proposed optimization practical and extremely useful.
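The diminishing-returns argument in this abstract can be illustrated with a small sketch. The Amdahl-style cost model, the function names, and the numbers below are assumptions made for illustration only; they are not taken from the paper or from the PARADIGM compiler.

```python
def task_time(serial, parallel_work, procs):
    """Assumed cost model: a fixed serial part plus perfectly divisible work."""
    return serial + parallel_work / procs


def data_parallel_only(tasks, procs):
    """Run the tasks one after another, each using all processors."""
    return sum(task_time(s, w, procs) for s, w in tasks)


def mixed_task_and_data(tasks, procs, redistribution=0.0):
    """Run independent tasks concurrently on equal processor shares.

    `redistribution` is a hypothetical flat cost for moving data between
    the processor groups.
    """
    if len(tasks) > procs:
        raise ValueError("need at least one processor per task")
    share = procs // len(tasks)
    return max(task_time(s, w, share) for s, w in tasks) + redistribution


# Two independent tasks, each with 1 unit of serial time and 8 units of work.
tasks = [(1.0, 8.0), (1.0, 8.0)]
pure = data_parallel_only(tasks, procs=8)
mixed = mixed_task_and_data(tasks, procs=8, redistribution=0.5)
print(pure, mixed)
```

Under these assumed numbers, doubling the processors per task halves only the divisible part of its run time, so splitting the machine between the two tasks finishes sooner even after paying the assumed redistribution cost.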
CPR: Mixed Task and Data Parallel Scheduling for Distributed Systems
- In Proceedings of the 15th International Parallel and Distributed Symposium, 2001
Cited by 14 (5 self)
It is well known that mixing task and data parallelism to solve large computational applications often yields better speedups than applying pure task parallelism or pure data parallelism alone. Typically, the applications are modeled in terms of a dependence graph of coarse-grain data-parallel tasks, called a data-parallel task graph. In this paper we present a new compile-time heuristic, named Critical Path Reduction (CPR), for scheduling data-parallel task graphs. Experimental results based on graphs derived from real problems as well as synthetic graphs show that CPR achieves higher speedup than other well-known scheduling algorithms, at the expense of a somewhat higher cost. These results are also confirmed by performance measurements of two real applications (i.e., complex matrix multiplication and Strassen matrix multiplication) running on a cluster of workstations.
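As a rough illustration of the terms used in this abstract, the sketch below computes the critical path length of a small data-parallel task graph and shows how giving more processors to a task on that path shortens it. This is not the CPR heuristic itself: the graph, the `work / procs` run-time model, and all names are assumptions chosen for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def exec_times(work, procs):
    """Assumed model: a task's run time is its work divided by its processors."""
    return {task: work[task] / procs[task] for task in work}


def critical_path_length(preds, times):
    """Longest path through the graph, measured in execution time.

    `preds` maps each task to the list of tasks it depends on.
    """
    finish = {}
    for task in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds.get(task, ())), default=0.0)
        finish[task] = start + times[task]
    return max(finish.values())


# Hypothetical diamond-shaped graph: A feeds B and C, which both feed D.
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
work = {"A": 4.0, "B": 8.0, "C": 2.0, "D": 4.0}

one_each = {"A": 1, "B": 1, "C": 1, "D": 1}
before = critical_path_length(preds, exec_times(work, one_each))

# Give the heaviest critical-path task (B) four processors instead of one.
more_for_b = dict(one_each, B=4)
after = critical_path_length(preds, exec_times(work, more_for_b))
print(before, after)
```

A real scheduler would also have to respect the total processor count and communication costs, which this sketch leaves out.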
Software Support For Parallel Processing Of Irregular And Dynamic Computations
- 1996
Cited by 3 (0 self)
Many real-world scientific computations are irregular and dynamic, which poses a great challenge to parallelization. In this thesis we study the efficient mapping of a subclass of these problems, namely the "stepwise slowly changing" problems, onto distributed memory multiprocessors using the task graph scheduling approach. A large class of applications belongs to this category. Intuitively, the irregularity requires sophisticated mapping algorithms, and the "slowness" of the changes in the computational structures between steps allows the scheduling cost to be amortized, justifying the approach. We study three representative and widely used applications: the N-body simulation in astrophysics, and the Vortex-Sheet Roll-Up and the Contour Dynamics Computation from Computational Fluid Dynamics. We sta...
A Low-Cost Approach towards Mixed Task and Data Parallel Scheduling
- In Proc. of 2001 International Conference on Parallel Processing (30th ICPP’01), 2001
Cited by 2 (0 self)
A relatively new trend in parallel program scheduling is so-called mixed task and data scheduling. It has been shown that mixing task and data parallelism to solve large computational applications often yields better speedups than applying pure task parallelism or pure data parallelism alone. In this paper we present a new compile-time heuristic, named Critical Path and Allocation (CPA), for scheduling data-parallel task graphs. CPA is designed to have a very low cost: its complexity is lower than that of existing approaches such as TSAS, TwoL, or CPR by an order of magnitude or more. Experimental results based on graphs derived from real problems as well as synthetic graphs show that the performance loss of CPA relative to the above algorithms does not exceed 50%. These results are also confirmed by performance measurements of two real applications (i.e., complex matrix multiplication and Strassen matrix multiplication) running on a cluster of workstations.

