Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations
Cited by 38 (3 self)
Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations, and the efficiency of their implementation, is of great current interest. This paper evaluates the interfaces and implementations for user-defined aggregation in several state-of-the-art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that: the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; the choice of programming interface has a material effect on the performance of computations; some execution plans perform better than others on average; and that in order to get good performance on a variety of workloads a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.
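To make the idea of a user-defined aggregation concrete, here is a minimal Python sketch of a decomposable aggregation (a per-key average) that can be partially evaluated on each partition before the partial states are merged. The four function names and the in-memory "partitions" are assumptions chosen for illustration; they are not the interface of Hadoop, Oracle, or DryadLINQ.

```python
# Hypothetical four-function interface for a decomposable aggregation
# (average per key). The names are illustrative, not any system's API.

def initial(value):
    """Create a partial state (sum, count) from the first value of a group."""
    return (value, 1)

def accumulate(state, value):
    """Fold one more value into a partial state."""
    return (state[0] + value, state[1] + 1)

def merge(a, b):
    """Combine two partial states computed on different partitions."""
    return (a[0] + b[0], a[1] + b[1])

def final(state):
    """Turn the merged state into the user-visible result."""
    return state[0] / state[1]

def partial_aggregate(records):
    """Runs on one partition: fold (key, value) records into per-key states."""
    states = {}
    for key, value in records:
        states[key] = accumulate(states[key], value) if key in states else initial(value)
    return states

def grouped_average(partitions):
    """Merge the per-partition states, then finalize each group."""
    merged = {}
    for part in partitions:
        for key, state in partial_aggregate(part).items():
            merged[key] = merge(merged[key], state) if key in merged else state
    return {key: final(state) for key, state in merged.items()}

partitions = [[("a", 1.0), ("b", 4.0)], [("a", 3.0)], [("b", 6.0), ("a", 5.0)]]
result = grouped_average(partitions)
print(result)
```

Only the small (sum, count) states cross partition boundaries in this sketch, which is the property that partial aggregation relies on.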
Complexity Results for Throughput and Latency Optimization of Replicated and Data-parallel Workflow
ALGORITHMICA, 2007
Cited by 19 (16 self)
Mapping applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline or fork graphs. Several antagonistic criteria should be optimized for workflow applications, such as throughput and latency (or a combination). In this paper, we consider a simplified model with no communication cost, and we provide an exhaustive list of complexity results for different problem instances. Pipeline or fork stages can be replicated in order to increase the throughput by sending consecutive data sets onto different processors. In some cases, stages can also be data-parallelized, i.e., the computation of one single data set is shared between several processors. This leads to a decrease of the latency and an increase of the throughput. Some instances of this simple model are shown to be NP-hard, thereby exposing the inherent complexity of the mapping problem. We provide polynomial algorithms for other problem instances. Altogether, we provide solid theoretical foundations for the study of mono-criterion or bi-criteria mapping optimization problems.
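The difference between replicating a stage and data-parallelizing it can be illustrated with a small calculation. The sketch below assumes one stage, k identical processors, no communication cost, and perfectly divisible work. These are simplifying assumptions made for this example, and the function names are hypothetical.

```python
def replicated(work, speed, k):
    """Stage replicated on k identical processors.

    Consecutive data sets go to different processors, so a new result
    completes every latency/k time units, while each individual data set
    still takes work/speed.
    """
    latency = work / speed
    period = latency / k
    return period, latency

def data_parallel(work, speed, k):
    """Stage data-parallelized on k identical processors.

    Each data set is split k ways (assumes perfectly divisible work and
    no overhead), so latency and period both shrink.
    """
    latency = work / (k * speed)
    period = latency
    return period, latency

rep_period, rep_latency = replicated(work=12, speed=2, k=3)
dp_period, dp_latency = data_parallel(work=12, speed=2, k=3)
print(rep_period, rep_latency)
print(dp_period, dp_latency)
```

Throughput is the inverse of the period, so under these assumptions both options triple the throughput, but only data-parallelism reduces the latency.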
Auto-tuning SkePU: a multi-backend skeleton programming framework for multi-GPU systems
In Proceedings of the 4th International Workshop on Multicore Software Engineering, IWMSE ’11, 2011
Cited by 14 (3 self)
We present SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU systems. Copying data between the host and the GPU device memory can be a performance bottleneck. A key technique in SkePU is the implementation of lazy memory copying in the container type used to represent skeleton operands, which avoids unnecessary memory transfers. We evaluate SkePU with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that a skeleton approach to GPU programming is viable, especially when the computation burden is large compared to memory I/O (the lazy memory copying can help to achieve this). They also show that utilizing several GPUs has a potential for performance gains. We see that SkePU offers good performance with a more complex and realistic task such as ODE solving, with up to 10 times faster run times when using SkePU with a GPU backend compared to a sequential solver running on a fast CPU.
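The lazy memory copying idea can be sketched in a few lines of Python. This is a toy simulation only: the "device" is a second Python list, and the class and method names are hypothetical rather than SkePU's C++ API. The point is that the container tracks which copy is current and transfers data only when a stale copy is actually needed.

```python
class LazyVector:
    """Toy container that tracks which copy (host or device) is up to date.

    The device memory is simulated by a second Python list. All names here
    are hypothetical and do not correspond to SkePU's API.
    """

    def __init__(self, data):
        self.host = list(data)
        self.device = None
        self.host_valid = True
        self.device_valid = False
        self.transfers = 0  # number of simulated host<->device copies

    def _to_device(self):
        # Copy host -> device only if the device copy is stale.
        if not self.device_valid:
            self.device = list(self.host)
            self.device_valid = True
            self.transfers += 1

    def _to_host(self):
        # Copy device -> host only if the host copy is stale.
        if not self.host_valid:
            self.host = list(self.device)
            self.host_valid = True
            self.transfers += 1

    def map_on_device(self, fn):
        """Apply fn elementwise 'on the device'; the result stays there."""
        self._to_device()
        self.device = [fn(x) for x in self.device]
        self.host_valid = False

    def read(self):
        """Host access copies data back only if the host copy is stale."""
        self._to_host()
        return list(self.host)

v = LazyVector([1, 2, 3])
v.map_on_device(lambda x: x + 1)
v.map_on_device(lambda x: x * 2)   # no transfer: data is already on the device
out = v.read()                     # single copy back to the host
print(out, v.transfers)
```

An eager container would copy the data to the device and back for each of the two map calls; the lazy version in this sketch needs one upload and one download in total.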
Evaluating the Performance of Skeleton-Based High-Level Parallel Programs
THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS 2004), PART III, LNCS, 2004
Cited by 13 (7 self)
We show in this paper how to evaluate the performance of skeleton-based high-level parallel programs. Since many applications follow some commonly used algorithmic skeletons, we identify such skeletons and model them with process algebra in order to get relevant information about the performance of the application, and to be able to make "good" scheduling decisions. This concept is illustrated through the case study of the Pipeline skeleton, and a tool which automatically generates a set of models and solves them is presented. Some numerical results are provided, demonstrating the efficiency of this approach.
Multicriteria scheduling of pipeline workflows
In HeteroPar’07, the 6th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks, 2007
Cited by 12 (12 self)
INRIA research report (rapport de recherche), ISSN 0249-6399, ISRN INRIA/RR--6232--FR+ENG: Multicriteria scheduling of pipeline workflows
Parallel Skyline Computation on Multicore Architectures
Cited by 9 (1 self)
With the advent of multicore processors, it has become imperative to write parallel programs if one wishes to exploit the next generation of processors. This paper deals with skyline computation as a case study of parallelizing database operations on multicore architectures. We compare two parallel skyline algorithms: a parallel version of the branch-and-bound algorithm (BBS) and a new parallel algorithm based on skeletal parallel programming. Experimental results show that, despite its simple design, the new parallel algorithm is comparable to parallel BBS in speed. For sequential skyline computation, the new algorithm far outperforms sequential BBS when the density of skyline tuples is low.
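For readers unfamiliar with skyline queries, the following Python sketch shows the dominance test and a simple split-then-merge evaluation whose per-chunk step could run in parallel. It is a naive quadratic illustration that assumes smaller values are better in every dimension. It is neither BBS nor the skeleton-based algorithm evaluated in the paper, and the function names are hypothetical.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and better in one
    (assumption for this sketch: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def skyline_by_chunks(points, chunks):
    """Compute a local skyline per chunk (independent, hence parallelizable),
    then take the skyline of the union of the local results."""
    size = max(1, -(-len(points) // chunks))  # ceiling division
    local = [skyline(points[i:i + size]) for i in range(0, len(points), size)]
    return skyline([p for part in local for p in part])

# (price, distance) pairs; lower is better in both dimensions.
hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 2)]
direct = skyline(hotels)
chunked = skyline_by_chunks(hotels, chunks=2)
print(direct)
print(chunked)
```

The merge step is correct because a point dominated inside its own chunk is also dominated in the full data set, so discarding it locally never removes a true skyline point.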
Performance and energy optimization of concurrent pipelined applications
2009
Cited by 9 (9 self)
In this paper, we study the problem of finding optimal mappings for several independent but concurrent workflow applications, in order to optimize performance-related criteria together with energy consumption. Each application consists of a linear chain with several stages, and processes successive data sets in pipeline mode, from the first to the last stage. We study the problem complexity on different target execution platforms, ranging from fully homogeneous platforms to fully heterogeneous ones. The goal is to select an execution speed for each processor, and then to assign stages to processors, with the aim of optimizing several concurrent optimization criteria. There is a clear trade-off to reach, since running processors faster and/or enrolling more of them leads to better performance, but the energy consumption is then very high. Energy savings can be achieved at the price of lower performance, by reducing processor speeds or enrolling fewer resources. We consider two mapping strategies: in one-to-one mappings, a processor is assigned a single stage, while in interval mappings, a processor may process an interval of consecutive stages of the same application. For both mapping strategies and all platform types, we establish the complexity of several
On the complexity of mapping linear chain applications onto heterogeneous platforms
Parallel Processing Letters (PPL), 2009
Cited by 5 (5 self)
In this paper, we explore the problem of mapping simple application patterns onto large-scale heterogeneous platforms. An important optimization criterion that should be considered in such a framework is the latency, or makespan, which measures the response time of the system in order to process one single data set entirely. We focus in this work on linear chain applications, which are representative of a broad class of real-life applications. For such applications, we can consider one-to-one mappings, in which each stage is mapped onto a single processor. However, in order to reduce the communication cost, it seems natural to group stages into intervals. The interval mapping problem can be solved in a straightforward way if the platform has homogeneous communications: the whole chain is grouped into a single interval, which in turn is mapped onto the fastest processor. But the problem becomes harder when considering a fully heterogeneous platform. Indeed, we prove the NP-completeness of this problem. Furthermore, we prove that neither the interval mapping problem nor the similar one-to-one mapping problem can be approximated by any constant factor (unless P=NP).
Systematic Derivation of Tree Contraction Algorithms
In Proceedings of INFOCOM '90, 2005
Cited by 3 (3 self)
While tree contraction algorithms play an important role in efficient tree computation in parallel, it is difficult to develop such algorithms due to the strict conditions imposed on contracting operators. In this paper, we propose a systematic method of deriving efficient tree contraction algorithms from recursive functions on trees of any shape. We identify a general recursive form that can be parallelized to obtain efficient tree contraction algorithms, and present a derivation strategy for transforming general recursive functions to parallelizable form. We illustrate our approach by deriving a novel parallel algorithm for the maximum connected-set sum problem on arbitrary trees, the tree version of the famous maximum segment sum problem.
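As a point of reference for the maximum connected-set sum problem, here is a sequential Python sketch of the kind of recursive function on trees that such a derivation would start from. It is a plain bottom-up dynamic program, not the parallel tree contraction algorithm derived in the paper. The tree representation (child lists indexed by node id) and the requirement that the chosen set be non-empty are assumptions made for this example.

```python
def max_connected_set_sum(weights, children, root=0):
    """Maximum total weight of a non-empty connected set of nodes in a rooted tree.

    weights[v]  -- weight of node v
    children[v] -- list of child ids of node v (hypothetical representation)
    """
    # Visit parents before children, then process in reverse (children first).
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    rooted = {}   # rooted[v]: best sum of a connected set whose top node is v
    best = None
    for v in reversed(order):
        # Attach a child's best set only if it contributes a positive sum.
        rooted[v] = weights[v] + sum(max(0, rooted[c]) for c in children[v])
        if best is None or rooted[v] > best:
            best = rooted[v]
    return best

# Node 0 has children 1 and 2; node 1 -> 3 -> 4.
tree_best = max_connected_set_sum([-1, 3, 4, -1, 2], [[1, 2], [3], [], [4], []])

# On a path, the problem reduces to the maximum segment sum.
path_best = max_connected_set_sum([2, -3, 4, -1, 3], [[1], [2], [3], [4], []])
print(tree_best, path_best)
```

On the path example the result corresponds to the segment 4, -1, 3, matching the list version of the problem.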