Space and Time Efficient Execution of Parallel Irregular Computations
- In Proceedings of ACM Symposium on Principles & Practice of Parallel Programming
, 1997
Cited by 8 (5 self)
Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. In this paper, issues of efficient execution of overhead-sensitive parallel irregular computation under memory constraints are addressed. The irregular parallelism is modeled by task dependence graphs with mixed granularities. The trade-off in achieving both time and space efficiency is investigated. The main difficulty of designing efficient run-time system support is caused by the use of fast communication primitives available on modern parallel architectures. A run-time active memory management scheme and new scheduling techniques are proposed to improve memory utilization while retaining good time efficiency, and a theoretical analysis on correctness and performance is provided. This work is implemented in the context of the RAPID system [5], which provides run-time support for parallelizing irregular code on distributed memory machines, and the effectiveness of the proposed...
Compiler and Run-Time Support for Irregular Computations
, 1995
Cited by 7 (1 self)
There are many important applications in computational fluid dynamics, circuit simulation and structural analysis that can be more accurately modeled using iterations on unstructured grids. In these problems, regular compiler analysis for Massively Parallel Processors (MPP) with distributed address space fails because communication can only be determined at run-time. However, in many of these applications the communication pattern repeats for every iteration. Therefore, equivalent optimizations to the regular case can be achieved with a combination of run-time support (RTS) and compiler analysis.
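The idea that a repeating communication pattern can be analyzed once at run time and then reused is commonly called the inspector/executor approach. The following is a minimal single-process sketch of that idea, not the run-time support described in the paper. The names `inspector`, `executor`, `owner`, and `remote_read` are hypothetical, and remote access is simulated with a plain Python function.

```python
from collections import defaultdict


def inspector(indirection, owner, my_rank):
    """Scan the indirection array once and build a communication schedule:
    for each remote rank, the list of element indices to fetch from it."""
    schedule = defaultdict(list)
    for idx in indirection:
        src = owner[idx]
        # Only off-processor elements need communication; fetch each once.
        if src != my_rank and idx not in schedule[src]:
            schedule[src].append(idx)
    return dict(schedule)


def executor(local, indirection, schedule, remote_read):
    """Run one iteration: gather remote values using the precomputed
    schedule, then perform the irregular computation (here, a sum)."""
    ghost = {}
    for src, indices in schedule.items():
        # remote_read is a stand-in for a real message exchange (assumption).
        for idx, value in zip(indices, remote_read(src, indices)):
            ghost[idx] = value
    return sum(ghost[i] if i in ghost else local[i] for i in indirection)


if __name__ == "__main__":
    data = [10 * i for i in range(8)]          # global array, 8 elements
    owner = [i // 4 for i in range(8)]         # block distribution on 2 ranks
    local = {i: data[i] for i in range(4)}     # elements owned by rank 0
    indirection = [0, 5, 2, 5, 7]              # known only at run time

    schedule = inspector(indirection, owner, my_rank=0)   # built once
    for _ in range(3):                                    # reused every iteration
        print(executor(local, indirection, schedule,
                       lambda src, idxs: [data[i] for i in idxs]))
```

The inspector cost is paid once, while every later iteration only replays the schedule, which is why the pattern pays off when the indirection array does not change between iterations.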
Partitioning and Scheduling for Parallel Image Processing Operations
- Seventh IEEE Symposium on Parallel and Distributed Processing
, 1995
Cited by 5 (1 self)
Many computer vision and image processing (CVIP) operations can be represented as a sequence of tasks with nested loops, specified by the visual programming language Khoros. This paper addresses the automatic partitioning and scheduling of such operations on distributed memory multiprocessors. The major difficulties in determining the optimal image data distribution for each task are that the number of processors available and the size of the input image may vary at run time, and the cost of some image processing operations may be data-dependent. This paper proposes a compile-time processor assignment and data partitioning scheme that optimizes the average run-time performance of task chains with nested loops by considering the data redistribution overheads and possible run-time parameter variations. This paper presents the theoretical analysis and experimental results on a Meiko CS-2 distributed memory machine.
Mathematical Programming Approach for Static Load Balancing of Parallel PDE Solver
- In Proceedings of the 16th IASTED International Conference on Applied Informatics, Acta Press
, 1998
Cited by 4 (3 self)
A static load-balancing scheme is discussed for the numerical simulation system NSL, which automatically generates a parallel solver for partial differential equations (PDEs) from a high-level description of the problem. NSL partitions the computational domain into multiple blocks and allocates processors optimally to each block in accordance with computation and communication cost. This allocation problem is formulated as a combinatorial optimization problem and solved by the branch-and-bound method. Because combinatorial explosion makes it impractical to solve large problems this way, this paper also describes an effective method to derive a suboptimal solution in practical time by limiting the search space. The error of this approximation is less than 15% under reasonable conditions. Elapsed time for combinatorial optimization is measured in numerical simulations to derive an estimation equation. The method presented here is widely applicable by adapting the evaluation function for each purpose.
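To make the branch-and-bound formulation concrete, here is a small sketch of allocating processors to blocks so that the slowest block finishes as early as possible. It is an illustration under assumptions, not NSL's actual evaluation function: the cost model `work / p + comm * p` and the names `block_time` and `allocate` are hypothetical.

```python
def block_time(work, procs, comm=0.0):
    """Hypothetical cost model: computation shrinks with more processors,
    communication grows with them."""
    return work / procs + comm * procs


def allocate(works, total_procs, comm=0.0):
    """Branch and bound over per-block processor counts.
    Minimizes the maximum block time; each block gets at least one processor."""
    n = len(works)
    best = {"cost": float("inf"), "alloc": None}

    def search(i, remaining, alloc, current_max):
        # Bound: the maximum can only grow as more blocks are assigned,
        # so a partial allocation that is already no better is pruned.
        if current_max >= best["cost"]:
            return
        if i == n:
            best["cost"], best["alloc"] = current_max, list(alloc)
            return
        # Branch: leave at least one processor for every later block.
        max_p = remaining - (n - i - 1)
        for p in range(1, max_p + 1):
            alloc.append(p)
            search(i + 1, remaining - p, alloc,
                   max(current_max, block_time(works[i], p, comm)))
            alloc.pop()

    search(0, total_procs, [], 0.0)
    return best["alloc"], best["cost"]


if __name__ == "__main__":
    print(allocate([8, 4, 4], total_procs=4))
```

The search is exhaustive apart from pruning, so it shows the combinatorial growth the abstract mentions; capping `max_p` or the recursion depth would be one simple way to limit the search space at the price of optimality.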
Global Optimization for Mapping Parallel Image Processing Tasks on Distributed Memory Machines
- Journal of Parallel and Distributed Computing
, 1997
Cited by 4 (0 self)
Many parallel algorithms and library routines are available for performing computer vision and image processing (CVIP) tasks on distributed-memory multiprocessors. The typical image distribution may use column, row, and block based mapping. Integrating a set of library routines for a CVIP application requires a global optimization for determining the data mapping of individual tasks by considering intertask communication. The main difficulty in deriving the optimal image data distribution for each task is that CVIP task computation may involve loops, and the number of processors available and the size of the input image may vary at run time. In this paper, a CVIP application is modeled using a task chain with nested loops, specified by conventional visual languages such as Khoros and Explorer. A mapping algorithm is proposed that optimizes the average run-time performance for CVIP applications with nested loops by considering the data redistribution overheads and possible run-time ...
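The trade-off between per-task computation cost and redistribution cost between tasks can be illustrated with a simple dynamic program over a loop-free task chain. This sketch is a simplification and not the paper's mapping algorithm: it ignores nested loops and run-time parameter variation, and the function `map_chain` and all cost values are hypothetical.

```python
def map_chain(compute_cost, redist_cost):
    """Choose one data distribution per task in a chain.

    compute_cost: list of dicts, one per task, {distribution: cost}.
    redist_cost:  function (prev_distribution, next_distribution) -> cost.
    Returns (plan, total_cost) minimizing computation plus redistribution.
    """
    best = dict(compute_cost[0])     # cheapest cost of ending in each distribution
    back = []                        # back-pointers for recovering the plan
    for costs in compute_cost[1:]:
        new_best, pointers = {}, {}
        for d, c in costs.items():
            prev = min(best, key=lambda q: best[q] + redist_cost(q, d))
            new_best[d] = best[prev] + redist_cost(prev, d) + c
            pointers[d] = prev
        best = new_best
        back.append(pointers)

    last = min(best, key=best.get)
    total = best[last]
    plan = [last]
    for pointers in reversed(back):
        plan.append(pointers[plan[-1]])
    plan.reverse()
    return plan, total


if __name__ == "__main__":
    costs = [
        {"row": 1, "col": 5},
        {"row": 4, "col": 1},
        {"row": 1, "col": 6},
    ]
    # Assumed flat redistribution cost of 1 whenever the mapping changes.
    print(map_chain(costs, lambda a, b: 0 if a == b else 1))
```

With a cheap redistribution cost the plan switches mapping for the middle task; raising that cost makes a single mapping for the whole chain preferable, which is the global effect that a per-task local choice would miss.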
Concatenated Parallelism: A Technique for Efficient Parallel Divide and Conquer
- In Proc. of the Symposium of Parallel and Distributed Computing (SPDP'96)
, 1996
Cited by 3 (0 self)
A number of problems have efficient algorithms that are based on the divide and conquer paradigm. Such problems can be solved in parallel by mapping the corresponding divide and conquer tree to the parallel computer under consideration. Two basic strategies are used in such parallelizations: Task Parallelism, in which different subproblems are assigned to different groups of processors and Data Parallelism, in which the tasks are solved one after the other using all the processors. Task parallelism involves significant data movement and data parallelism causes problems due to load imbalance. In this paper we propose a new strategy, which we call Concatenated Parallelism, for efficient parallel solution of problems resulting in divide and conquer trees. Our strategy is useful when the communication time due to data movement in distributing the subproblems using task parallelism is significant when compared to the time required to divide the subproblems. This happens to be the case for...
Space/Time-Efficient Scheduling and Execution of Parallel Irregular Computations
- ACM Trans. Prog. Lang. Syst
, 1998
Cited by 2 (0 self)
In this article, we investigate the trade-off between time and space efficiency in scheduling and executing parallel irregular computations on distributed-memory machines. We employ acyclic task dependence graphs to model irregular parallelism with mixed granularity, and we use direct remote memory access to support fast communication. We propose new scheduling techniques and a run-time active memory management scheme to improve memory utilization while retaining good time efficiency, and we provide a theoretical analysis on correctness and performance. This work is implemented in the context of the RAPID system which uses an inspector/executor approach to parallelize irregular computations at run-time. We demonstrate the effectiveness of the proposed techniques on several irregular applications such as sparse matrix code and the fast multipole method for particle simulation. Our experimental results on a Cray-T3E show that problems of large sizes can be solved under limited space capacity, and that the loss of execution efficiency caused by the extra memory management overhead is reasonable. Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); D.1.3 [Programming Techniques]: Parallel Programming; D.3.4 [Programming Languages]: Run-Time Environments; G.2.2 [Discrete Mathematics]:
Coordinating Foreign Modules with a Parallelizing Compiler
, 1997
Cited by 2 (2 self)
Integrating task and data parallelism in a language framework has attracted considerable attention. Both the Fx language at Carnegie Mellon University and the High Performance Fortran standard have adopted a simple model of task parallelism that allows dynamic assignment of processor subgroups to tasks in the application. However, the tasks must be written in the native language, i.e., Fx or HPF. Large scientific parallel applications often use a mix of languages for a variety of reasons, including the reuse of existing code and the suitability of different languages for different modules. In this paper we demonstrate how a "native" parallelizing compiler can be used to create parallel applications that combine native modules with "foreign" modules written in a different parallel language. We argue that virtually all the advantages of translating a foreign module to the native language can be achieved by using the native compiler to coordinate the interactions with the foreign module. In par...
Efficient Resource Scheduling in Multiprocessors
- UNIVERSITY OF CALIFORNIA, BERKELEY
, 1996
Cited by 1 (0 self)
As multiprocessing becomes increasingly successful in scientific and commercial computing, parallel systems will be subjected to increasingly complex and challenging workloads. To ensure good job response and high resource utilization, algorithms are needed to allocate resources to jobs and to schedule the jobs. This problem is of central importance, and pervades systems research at diverse places such as compilers, runtime, applications, and operating systems. Despite the attention this area has received, scheduling problems in practical parallel computing still lack satisfactory solutions. The focus of system builders is to provide functionality and features; the resulting systems get so complex that many models and theoretical results lack applicability. The focus of this thesis is in ...
Simultaneous Allocation And Scheduling Using Convex Programming Techniques
, 1995
Cited by 1 (1 self)
Simultaneous exploitation of task and data parallelism provides significant benefits for many applications. The basic approach for exploiting task and data parallelism is to use a task graph representation (Macro Dataflow Graph) for programs to decide on the degree of data parallelism to be used for each task (allocation) and an execution order for the tasks (scheduling). Previously, we presented a two step approach for allocation and scheduling by considering the two steps to be independent of each other. In this paper, we present a new simultaneous approach which uses constraints to model the scheduler during allocation. The new simultaneous approach provides significant benefits over our earlier approach for the benchmark task graphs that we have considered.

