A Framework for Exploiting Task- and Data-Parallelism on Distributed Memory Multicomputers
- IEEE Transactions on Parallel and Distributed Systems
, 1997
Cited by 30 (0 self)
... offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications: the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster. A practical implementation of a task and data parallel scheme of execution for an application on a distributed memory multicomputer also involves data redistribution. This data redistribution causes an overhead. However, as our experimental results show, this overhead is not a problem; execution of a program using task and data parallelism together can be significantly faster than its execution using data parallelism alone. This makes our proposed optimization practical and extremely useful.
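The diminishing-returns argument in this abstract can be illustrated with a toy cost model. The sketch below is not taken from the paper: it assumes each data-parallel task takes `serial + work / procs` time units, it ignores the data redistribution overhead the abstract mentions, and all function names are hypothetical.

```python
def task_time(work, serial, procs):
    """Toy cost model (an assumption): fixed per-task cost plus perfectly divisible work."""
    return serial + work / procs


def data_parallel_only(tasks, procs):
    """Run the tasks one after another, each using all processors."""
    return sum(task_time(work, serial, procs) for work, serial in tasks)


def task_and_data_parallel(tasks, procs):
    """Run all tasks concurrently, each on an equal share of the processors."""
    share = procs // len(tasks)
    return max(task_time(work, serial, share) for work, serial in tasks)


# Two independent tasks: 64 units of divisible work and 4 units of fixed cost each.
tasks = [(64.0, 4.0), (64.0, 4.0)]
print(data_parallel_only(tasks, 16))      # 2 * (4 + 64/16) = 16.0
print(task_and_data_parallel(tasks, 16))  # 4 + 64/8 = 12.0
```

In this model the fixed cost is paid once per concurrent group rather than once per task, which is why the mixed schedule comes out ahead; a real comparison would also have to charge for redistribution.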
On The Implementation And Effectiveness Of Autoscheduling For Shared-Memory Multiprocessors
, 1995
Cited by 16 (2 self)
[Figure 3.4: HPF approach to data partition and distribution.] ... states that iteration i is to be executed by the processor to which A(i) is assigned. Therefore processor p_1 executes iterations {1, 2, 3, 4}. The ON clause is a feature borrowed from the language Kali [25]. 3.1.3 HPF. The High Performance Fortran (HPF) [6, 26, 27] language was designed as a set of extensions and modifications to Fortran 90 to support data parallel programming. The ability to achieve top performance on MIMD and SIMD computers with nonuniform memory access was one of the main goals of the project. The design of HPF was influenced by Fortran D and Vienna Fortran [28, 29]. Just as Fortran D approaches the problem of data partitioning and distribution in two stages, HPF uses three. First, arrays are aligned to each other. Second, arrays are distributed across a user-defined rectilinear arrangement of abstract processo...
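The rule quoted in this excerpt (iteration i runs on the processor that owns A(i)) can be sketched as follows. The BLOCK distribution, the array size of 16, and the 4 processors are assumptions chosen so that the first processor receives iterations 1 to 4, as in the excerpt; the function names are hypothetical.

```python
def block_owner(i, n, procs):
    """1-based processor owning element i (1-based) of an n-element array, BLOCK distribution assumed."""
    block = -(-n // procs)  # ceiling division: elements per processor
    return (i - 1) // block + 1


def iterations_for(p, n, procs):
    """Iterations executed by processor p when each iteration i runs on the owner of A(i)."""
    return [i for i in range(1, n + 1) if block_owner(i, n, procs) == p]


print(iterations_for(1, 16, 4))  # [1, 2, 3, 4]
```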
A Framework for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers
, 1994
Cited by 9 (2 self)
Recent research efforts have shown the benefits of integrating functional and data parallelism over using either pure data parallelism or pure functional parallelism. The work in this paper presents a theoretical framework for deciding on a good execution strategy for a given program based on the available functional and data parallelism in the program. The framework is based on assumptions about the form of computation and communication cost functions for multicomputer systems. We present mathematical functions for these costs and show that these functions are realistic. The framework also requires specification of the available functional and data parallelism for a given problem. For this purpose, we have developed a graphical programming tool. Currently, we have tested our approach using three benchmark programs on the Thinking Machines CM-5 and Intel Paragon. Results presented show that the approach is very effective and can provide a two- to three-fold increase in speedups over ap...
Efficient Scheduling Of Parallel Tasks In A Multiprogramming Environment
, 1995
Cited by 2 (0 self)
Considerable research has produced a plethora of efficient methods of exploiting parallelism on dedicated machines. On typical real systems, however, some of the important assumptions that lead to efficiency on a dedicated machine either do not hold or cause other problems on a machine which is time or space shared. Foremost among these assumptions is that the number of processors available to a parallel job is constant throughout the execution of the job. Maintaining such consistency in a real multiprogramming system can lead to poor utilization of the machine. This thesis will address issues involving the efficient exploitation of parallelism in a multiprogramming environment including OS support for user level scheduling and dynamic granularity control. Implementation of some of these techniques in the nanoThreads thread library will be discussed, as well as other details of the implementation of nanoThreads. For my brother, Duane Schouten, who in his way encouraged my scientif...
The Performance Impact of Granularity Control and Functional Parallelism
- in Proc. of Workshop on Languages and Compilers for Parallel Computing
, 1995
Task granularity and functional parallelism are fundamental issues in the optimization of parallel programs. Appropriate granularity for exploitation of parallelism is affected by characteristics of both the program and the execution environment. In this paper we demonstrate the efficacy of dynamic granularity control. The scheme we propose uses dynamic runtime information to select the task size of exploited parallelism at various stages of the execution of a program. We also demonstrate that functional parallelism can be an important factor in improving the performance of parallel programs, both in the presence and absence of loop-level parallelism. Functional parallelism can increase the amount of large-grain parallelism as well as provide finer-grain parallelism that leads to better load balance. Analytical models and benchmark results quantify the impact of granularity control and functional parallelism. The underlying implementation for this research is a low-overhead threads m...
Turbulence Modulation By An Array Of Large-Scale Streamwise . . .
, 1999
Recently, it was shown (Schoppa and Hussain, Phys. Fluids, 10, 1049) that superimposing large-scale, synthetic, streamwise vortical flow structures onto a turbulent Poiseuille flow led to suppression of the low-speed streak instability mechanism, which, in the end, appears to be responsible for drag enhancement in turbulent flows. In this work, we use large-scale ElectroHydroDynamic flow structures to control turbulent transfer mechanisms. We consider a channel with two different flow control configurations: E-control, in which streamwise wire-electrodes are embedded into one of the walls, and C-control, in which streamwise wire-electrodes are placed in the central plane of the channel. In all cases, the wires are maintained at a potential sufficient to ensure ionic discharge. Ions are driven by the applied electrostatic field and generate plane, streamwise jets, which impinge on the opposite (grounded) wall and, by continuity, generate two-dimensional vortical flows. Control flows have...
Enhancing the Performance of Autoscheduling in Distributed Shared Memory Multiprocessors
- Proc. of the 4th International EuroPar Conference
Autoscheduling is a parallel program compilation and execution model that uniquely combines three features: automatic extraction of loop and functional parallelism at any level of granularity, dynamic scheduling of parallel tasks, and dynamic program adaptability on multiprogrammed shared memory multiprocessors. This paper presents a technique that enhances the performance of autoscheduling in Distributed Shared Memory (DSM) multiprocessors, targeting mainly medium- and large-scale systems, where poor data locality and excessive communication impose performance bottlenecks. Our technique partitions the application Hierarchical Task Graph and maps the derived partitions to clusters of processors in the DSM architecture. Autoscheduling is then applied separately for each partition to enhance data locality and reduce communication costs. Our experimental results show that partitioning achieves remarkable performance improvements compared to a standard autoscheduling environment and a...

