Results 1 - 4 of 4
A Framework for Exploiting Task and Data Parallelism on Distributed Memory Multicomputers
 IEEE Transactions on Parallel and Distributed Systems
, 1997
Abstract

Cited by 32 (0 self)
offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and runtime support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications – the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster. A practical implementation of a task and data parallel scheme of execution for an application on a distributed memory multicomputer also involves data redistribution. This data redistribution causes an overhead. However, as our experimental results show, this overhead is not a problem; execution of a program using task and data parallelism together can be significantly faster than its execution using data parallelism alone. This makes our proposed optimization practical and extremely useful.
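The abstract's core argument, that diminishing returns from data parallelism make it profitable to run tasks concurrently on processor subsets, can be illustrated with a toy cost model. This is a minimal sketch, not the PARADIGM implementation; the Amdahl-style `task_time` model and all constants are assumptions for illustration.

```python
# Illustrative sketch: with a simple Amdahl-style cost model, running two
# independent data-parallel tasks concurrently on half the processors each
# can beat running them back to back on all processors, because the serial
# fraction of each task does not speed up with more processors.

def task_time(work, serial_fraction, procs):
    """Execution time under an Amdahl-style model (assumed, not from the paper)."""
    return work * (serial_fraction + (1.0 - serial_fraction) / procs)

P = 16                       # hypothetical machine size
WORK, SERIAL = 100.0, 0.1    # hypothetical per-task cost and serial fraction

# Pure data parallelism: the two tasks run one after another on all P processors.
pure_data = 2 * task_time(WORK, SERIAL, P)

# Mixed task + data parallelism: the tasks run concurrently on P/2 processors each.
mixed = task_time(WORK, SERIAL, P // 2)

print(f"pure data-parallel: {pure_data:.2f}")   # 31.25
print(f"mixed task+data:    {mixed:.2f}")       # 21.25
```

Under this model the mixed schedule wins whenever the serial fraction is non-zero; the paper's experiments make the same point with real data-redistribution overheads included.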
CPR: Mixed Task and Data Parallel Scheduling for Distributed Systems
 In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS)
, 2001
Abstract

Cited by 17 (6 self)
It is well-known that mixing task and data parallelism to solve large computational applications often yields better speedups compared to applying either pure task parallelism or pure data parallelism. Typically, the applications are modeled in terms of a dependence graph of coarse-grain data-parallel tasks, called a data-parallel task graph. In this paper we present a new compile-time heuristic, named Critical Path Reduction (CPR), for scheduling data-parallel task graphs. Experimental results, based on graphs derived from real problems as well as synthetic graphs, show that CPR achieves higher speedup compared to other well-known existing scheduling algorithms, at the expense of a somewhat higher scheduling cost. These results are also confirmed by performance measurements of two real applications (i.e., complex matrix multiplication and Strassen matrix multiplication) running on a cluster of workstations.
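The notion behind CPR's name can be sketched in a few lines: the makespan of a data-parallel task graph is bounded below by its critical path, the longest time-weighted path through the DAG, and giving more processors to tasks on that path shortens it. The following is an illustrative sketch only; the graph, costs, and linear-speedup model are assumptions, and the real CPR heuristic in the paper also accounts for scheduling and communication costs.

```python
# Sketch of the critical-path idea behind CPR (not the paper's algorithm).
# Each task's execution time depends on how many processors it is allocated;
# the critical path is the longest time-weighted path through the DAG.

def critical_path(tasks, deps, time_of):
    """Length of the longest path in a DAG.
    tasks: iterable of task ids; deps: {task: [predecessor ids]}."""
    memo = {}
    def finish(t):
        if t not in memo:
            memo[t] = time_of(t) + max((finish(p) for p in deps.get(t, [])),
                                       default=0.0)
        return memo[t]
    return max(finish(t) for t in tasks)

# Hypothetical 4-task graph: A -> B, A -> C, {B, C} -> D.
deps = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
work = {"A": 4.0, "B": 8.0, "C": 2.0, "D": 4.0}
alloc = {t: 1 for t in work}                # processors allocated per task

time_of = lambda t: work[t] / alloc[t]      # ideal linear speedup, for illustration
print(critical_path(work, deps, time_of))   # A -> B -> D: 16.0

alloc["B"] = 2                              # grow the critical task's allocation...
print(critical_path(work, deps, time_of))   # ...and the critical path drops to 12.0
```

A CPR-style scheduler repeats this "find the critical path, widen a task on it" step until extra processors stop paying off.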
Software Support For Parallel Processing Of Irregular And Dynamic Computations
, 1996
Abstract

Cited by 3 (0 self)
Many real-world scientific computations are irregular and dynamic, which poses a great challenge to parallelization efforts. In this thesis we study the efficient mapping of a subclass of these problems, namely the "stepwise slowly changing" problems, onto distributed memory multiprocessors using the task graph scheduling approach. There exists a large class of applications which belong to this category. Intuitively, the irregularity requires sophisticated mapping algorithms, and the "slowness" of the changes in the computational structures between steps allows the scheduling cost to be amortized, justifying the approach. We study three representative and widely used applications: the N-body simulation from astrophysics, and the Vortex-Sheet Roll-Up and Contour Dynamics computations from Computational Fluid Dynamics. We sta...
A Low-Cost Approach towards Mixed Task and Data Parallel Scheduling
 In Proc. of 2001 International Conference on Parallel Processing (30th ICPP'01)
, 2001
Abstract

Cited by 3 (0 self)
A relatively new trend in parallel programming scheduling is so-called mixed task and data parallel scheduling. It has been shown that mixing task and data parallelism to solve large computational applications often yields better speedups compared to applying either pure task parallelism or pure data parallelism. In this paper we present a new compile-time heuristic, named Critical Path and Allocation (CPA), for scheduling data-parallel task graphs. Designed to have a very low cost, its complexity is lower than that of existing approaches, such as TSAS, TwoL, or CPR, by an order of magnitude or more. Experimental results, based on graphs derived from real problems as well as synthetic graphs, show that the performance loss of CPA relative to the above algorithms does not exceed 50%. These results are also confirmed by performance measurements of two real applications (i.e., complex matrix multiplication and Strassen matrix multiplication) running on a cluster of workstations.
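The low-cost flavor of a CPA-style allocation phase can be sketched as a greedy loop: repeatedly hand one more processor to the task whose execution time shrinks the most, until the budget runs out. This is an illustrative sketch under assumed simplifications (a chain of tasks where every task is critical, ideal linear speedup); the paper's CPA heuristic works on general data-parallel task graphs with realistic cost models and a separate scheduling phase.

```python
# Rough sketch of a CPA-style greedy allocation step (not the paper's algorithm).
# Each iteration adds one processor where the marginal reduction in task
# execution time is largest, which keeps the per-step cost low.

def allocate(work, total_procs):
    """Greedily distribute total_procs processors over tasks on a chain
    (all tasks critical). work: {task: sequential cost}. Assumes ideal speedup."""
    alloc = {t: 1 for t in work}                 # every task starts with one processor
    budget = total_procs - len(work)
    time = lambda t: work[t] / alloc[t]          # ideal linear speedup, for illustration
    for _ in range(budget):
        # Pick the task whose time drops the most from one extra processor.
        best = max(work, key=lambda t: time(t) - work[t] / (alloc[t] + 1))
        alloc[best] += 1
    return alloc

work = {"A": 9.0, "B": 4.0, "C": 2.0}            # hypothetical task costs
print(allocate(work, 8))                         # {'A': 4, 'B': 2, 'C': 2}
```

Each greedy step only rescans the task list, which reflects the abstract's claim that CPA's complexity is an order of magnitude below approaches like TSAS, TwoL, or CPR.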