CPR: Mixed Task and Data Parallel Scheduling for Distributed Systems
- In Proceedings of the 15th International Parallel and Distributed Symposium
, 2001
Cited by 14 (5 self)
It is well known that mixing task and data parallelism to solve large computational applications often yields better speedups than applying either pure task parallelism or pure data parallelism. Typically, the applications are modeled in terms of a dependence graph of coarse-grain data-parallel tasks, called a data-parallel task graph. In this paper we present a new compile-time heuristic, named Critical Path Reduction (CPR), for scheduling data-parallel task graphs. Experimental results based on graphs derived from real problems as well as synthetic graphs show that CPR achieves higher speedup than other well-known existing scheduling algorithms, at the expense of somewhat higher cost. These results are also confirmed by performance measurements of two real applications (i.e., complex matrix multiplication and Strassen matrix multiplication) running on a cluster of workstations.
Scheduling Strategies for Mixed Data and Task Parallelism on Heterogeneous Processor Grids
, 2002
Cited by 12 (7 self)
In this paper, we consider the execution of a complex application on a heterogeneous "grid" computing platform. The complex application consists of a suite of identical, independent problems to be solved. In turn, each problem consists of a set of tasks. There are dependences (precedence constraints) between these tasks. A typical example is the repeated execution of the same algorithm on several distinct data samples. We use a non-oriented graph to model...
A Data and Task Parallel Image Processing Environment
- Parallel Computing
, 2001
Cited by 7 (1 self)
The paper presents a data and task parallel environment for parallelizing low-level image processing applications on distributed memory systems. Image processing operators are parallelized by data decomposition using algorithmic skeletons. At the application level we use task decomposition, based on the Image Application Task Graph.
Distributed Bucket Processing: a Paradigm embedded in a framework for the parallel processing of pixel sets
- Delft University of Technology
Cited by 4 (1 self)
Large datasets, such as pixels and voxels in 2D and 3D images, can usually be reduced during their processing to smaller subsets with fewer datapoints. Such subsets can be the objects in the image, features (edges or corners) or, more generally, regions of interest. For instance, the transformation from a set of datapoints representing an image to one or more subsets of datapoints representing objects in the image is due to a segmentation algorithm and may involve both the selection of datapoints and a change in data structure. The massive number of pixels in the original image points to a data-parallel approach, whereas the processing of the various objects in the image is more suitable for task parallelism. In this paper we introduce a framework for parallel image processing, and we focus on an array of buckets that can be distributed over a number of processors and that contains pointers to the data from the dataset. The benefit of this approach is that the processor activity remains focussed on the datapoints that need processing and, moreover, that the load can be distributed over many processors, even in a heterogeneous computer architecture. Although the method is generally applicable in the processing of sets, in this paper we obtain our examples from the domain of image processing. As this method yields speed-ups that are data-dependent, we derived a run-time evaluation that is able to determine if the use of distributed buckets is beneficial.
Taxonomies of the Multi-criteria Grid Workflow Scheduling Problem
, 2007
Cited by 4 (0 self)
The workflow scheduling problem, which is considered difficult on the Grid, becomes even more challenging when multiple scheduling criteria are used for optimization. The existing approaches can address only certain variants of the multi-criteria workflow scheduling problem, usually considering up to two conflicting criteria in some specific Grid environments. A comprehensive description of the problem can be an important step towards more general scheduling approaches. Based on the related work and on our own experience, we propose several novel taxonomies of the multi-criteria workflow scheduling problem, considering five facets which may have a major impact on the selection of an appropriate scheduling strategy: scheduling process, scheduling criteria, resource model, task model, and workflow model. We analyze different existing workflow scheduling approaches for the Grid and classify them according to the proposed taxonomies, identifying the most common use cases and the areas which have not yet been sufficiently explored.
Exploiting Pipelined Executions in OpenMP
- Proc. of the Int’l Conference on Parallel Processing (ICPP 2003), 2003
Cited by 3 (1 self)
This paper proposes a set of extensions to the OpenMP programming model to express point-to-point synchronization schemes. This is accomplished by defining, in the form of directives, precedence relations among the tasks that originate from OpenMP work-sharing constructs. The proposal is based on the definition of a name space that identifies the work parceled out by these work-sharing constructs. The programmer then defines the precedence relations using this name space. This relieves the programmer of the burden of defining complex synchronization data structures and inserting explicit synchronization actions that make the program difficult to understand and maintain. The paper briefly describes the main aspects of the runtime implementation required to support precedence relations in OpenMP. The paper focuses on the evaluation of the proposal through its use in two benchmarks: NAS LU and ASCI Sweep3D.
A Low-Cost Approach towards Mixed Task and Data Parallel Scheduling
- In Proc. of 2001 International Conference on Parallel Processing (30th ICPP’01)
, 2001
Cited by 2 (0 self)
A relatively new trend in parallel program scheduling is so-called mixed task and data scheduling. It has been shown that mixing task and data parallelism to solve large computational applications often yields better speedups than applying either pure task parallelism or pure data parallelism. In this paper we present a new compile-time heuristic, named Critical Path and Allocation (CPA), for scheduling data-parallel task graphs. Designed to have a very low cost, its complexity is lower than that of existing approaches, such as TSAS, TwoL or CPR, by one order of magnitude or even more. Experimental results based on graphs derived from real problems as well as synthetic graphs show that the performance loss of CPA relative to the above algorithms does not exceed 50%. These results are also confirmed by performance measurements of two real applications (i.e., complex matrix multiplication and Strassen matrix multiplication) running on a cluster of workstations.
Complex Pipelined Executions in OpenMP Parallel Applications
- in: Proceedings of the International Conference on Parallel Processing
Cited by 2 (1 self)
This paper proposes a set of extensions to the OpenMP programming model to express complex pipelined computations. This is accomplished by defining, in the form of directives, precedence relations among the tasks originated from work-sharing constructs. The proposal is based on the definition of a name space that identifies the work parceled out by these work-sharing constructs. Then the programmer defines the precedence relations using this name space. This relieves the programmer from the burden of defining complex synchronization data structures and the insertion of explicit synchronization actions in the program that make the program difficult to understand and maintain. This work is transparently done by the compiler with the support of the OpenMP runtime library. The proposal is motivated and evaluated with a synthetic multi-block example. The paper also includes a description of the compiler and run-time support in the framework of the NanosCompiler for OpenMP.
Modeling the Scalability of Acyclic Stream Programs
, 2004
Cited by 1 (0 self)
Despite the fact that the streaming application domain is becoming increasingly widespread, few studies have focused specifically on the performance characteristics of stream programs. We introduce two models by which the scalability of stream programs can be predicted to some degree of accuracy. This is accomplished by testing a series of stream benchmarks on our numerical representations of the two models. These numbers are then compared to actual speedups obtained by running the benchmarks through the Raw machine and a Magic network. Using the metrics, we show that stateless acyclic stream programs benefit considerably from data parallelization. In particular, programs with low communication datarates experience up to a tenfold speedup increase when parallelized to a reasonable margin. Those with high communication datarates also experience approximately a twofold speedup. We find that the model that takes synchronization communication overhead into account, in addition to a cost proportional to the communication rate of the stream, provides the highest predictive accuracy.
Runtime Scheduling of Dynamic Parallelism on Accelerator-Based Multi-core Systems
, 2007
Cited by 1 (0 self)
We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. We investigate user-level schedulers that dynamically “rightsize” the dimensions and degrees of parallelism on the Cell Broadband Engine. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. We evaluate recently introduced schedulers for event-driven execution and utilization-driven dynamic multi-grain parallelization on Cell. We also present a new scheduling scheme for dynamic multi-grain parallelism, S-MGPS, which uses sampling of dominant execution phases to converge to the optimal scheduling algorithm. We evaluate S-MGPS on an IBM Cell BladeCenter with two realistic bioinformatics applications that infer large phylogenies. S-MGPS performs within 2%–10% of the optimal scheduling algorithm in these applications, while exhibiting low overhead and little sensitivity to application-dependent parameters.

