Runtime Support for Multicore Haskell
Cited by 18 (5 self)
Purely functional programs should run well on parallel hardware because of the absence of side effects, but it has proved hard to realise this potential in practice. Plenty of papers describe promising ideas, but vastly fewer describe real implementations with good wall-clock performance. We describe just such an implementation, and quantitatively explore some of the complex design tradeoffs that make such implementations hard to build. Our measurements are necessarily detailed and specific, but they are reproducible, and we believe that they offer some general insights.
Language Virtualization for Heterogeneous Parallel Computing
Cited by 12 (6 self)
As heterogeneous parallel systems become dominant, application developers are being forced to turn to an incompatible mix of low-level programming models (e.g. OpenMP, MPI, CUDA, OpenCL). However, these models do little to shield developers from the difficult problems of parallelization, data decomposition and machine-specific details. Most programmers are having a difficult time using these programming models effectively. To provide a programming model that addresses the productivity and performance requirements for the average programmer, we explore a domain-specific approach to heterogeneous parallel programming. We propose language virtualization as a new principle that enables the construction of highly efficient parallel domain-specific languages that are embedded in a common host language. We define criteria for language virtualization and present techniques to achieve them. We present two concrete case studies of domain-specific languages that are implemented using our virtualization approach.
Space Profiling for Parallel Functional Programs
Cited by 9 (2 self)
This paper presents a semantic space profiler for parallel functional programs. Building on previous work in sequential profiling, our tools help programmers to relate runtime resource use back to program source code. Unlike many profiling tools, our profiler is based on a cost semantics. This provides a means to reason about performance without requiring a detailed understanding of the compiler or runtime system. It also provides a specification for language implementers. This is critical in that it enables us to separate cleanly the performance of the application from that of the language implementation. Some aspects of the implementation can have significant effects on performance. Our cost semantics enables programmers to understand the impact of different scheduling policies yet abstracts away from many of the details of their implementations. We show applications where the choice of scheduling policy has asymptotic effects on space use. We explain these use patterns through a demonstration of our tools. We also validate our methodology by observing similar performance in our implementation of a parallel extension of Standard ML.
Lazy Tree Splitting
Cited by 4 (0 self)
Nested data-parallelism (NDP) is a declarative style for programming irregular parallel applications. NDP languages provide language features favoring the NDP style, efficient compilation of NDP programs, and various common NDP operations like parallel maps, filters, and sum-like reductions. In this paper, we describe the implementation of NDP in Parallel ML (PML), part of the Manticore project. Managing the parallel decomposition of work is one of the main challenges of implementing NDP. If the decomposition creates too many small chunks of work, performance will be eroded by too much parallel overhead. If, on the other hand, there are too few large chunks of work, there will be too much sequential processing and processors will sit idle. Recently the technique of Lazy Binary Splitting was proposed for dynamic parallel decomposition of work on flat arrays, with promising results. We adapt Lazy Binary Splitting to parallel processing of binary trees, which we use to represent parallel arrays in PML. We call our technique Lazy Tree Splitting (LTS). One of its main advantages is its performance robustness: per-program tuning is not required to achieve good performance across varying platforms. We describe LTS-based implementations of standard NDP operations, and we present experimental data demonstrating the scalability of LTS across a range of benchmarks.
Regular, shape-polymorphic, parallel arrays in Haskell
In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2010
Cited by 2 (1 self)
We present a novel approach to regular, multi-dimensional arrays in Haskell. The main highlights of our approach are that it (1) is purely functional, (2) supports reuse through shape polymorphism, (3) avoids unnecessary intermediate structures rather than relying on subsequent loop fusion, and (4) supports transparent parallelisation. We show how to embed two forms of shape polymorphism into Haskell’s type system using type classes and type families. In particular, we discuss the generalisation of regular array transformations to arrays of higher rank, and introduce a type-safe specification of array slices. We discuss the runtime performance of our approach for three standard array algorithms. We achieve absolute performance comparable to handwritten C code. At the same time, our implementation scales well up to 8 processor cores. Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures; Polymorphism; Abstract data types
Parallel Performance Tuning for Haskell
Cited by 2 (1 self)
Parallel Haskell programming has entered the mainstream with support now included in GHC for multiple parallel programming models, along with multicore execution support in the runtime. However, tuning programs for parallelism is still something of a black art. Without much in the way of feedback provided by the runtime system, it is a matter of trial and error combined with experience to achieve good parallel speedups. This paper describes an early prototype of a parallel profiling system for multicore programming with GHC. The system comprises three parts: fast event tracing in the runtime, a Haskell library for reading the resulting trace files, and a number of tools built on this library for presenting the information to the programmer. We focus on one tool in particular, a graphical timeline browser called ThreadScope. The paper illustrates the use of ThreadScope through a number of case studies, and describes some useful methodologies for parallelizing Haskell programs.
Ein Ferry-basiertes Query-Backend für die Programmiersprache Links (A Ferry-Based Query Backend for the Links Programming Language)
This thesis describes the implementation of a FERRY-based query backend for the LINKS programming language. In LINKS, queries are seamlessly embedded into the language: queries formulated in a subset of the language are translated into single SQL queries. LINKS uses static checks to ensure that a type-correct query expression can be translated into an equivalent SQL query, and it allows abstraction over parts of a query. The queryizable subset of LINKS is, however, severely limited in terms of supported functions and the data type of queries (limited to bags of flat records). The thesis begins with a description of the query facility and criticizes the limited functionality of the queryizable LINKS subset. The FERRY framework deals with the compilation of pure, declarative languages based on list comprehensions into SQL queries. It provides features that LINKS queries are lacking: query results that involve nested lists and are computed by a small, statically bounded number of SQL queries, ordered list semantics, and a larger number of supported functions. The thesis first reviews the compilation technique of the FERRY framework and adapts it to the specifics of LINKS. The queryizable subset of LINKS is higher-order and allows functions to be treated as first-class values. To keep this property in the new query backend, the
A Framework for Distributed Proof Search
In automated theorem proving, the nondeterminism of proof rule application naturally gives rise to nested and/or branching of independent subgoals. This problem structure should be amenable to parallel exploration, but there are at present few theorem prover implementations which actually exploit this potential. We propose a new programming model to make it easier to write such parallel provers. In it, details such as communication and work distribution among processors are hidden from the programmer. Banyan, our implementation of this model, only needs problem-specific input such as how a tree of proof goals gets constructed for the proof search algorithm and what should be the relative amount of processor time spent at each subgoal. This allows the programmer to focus on high-level proof search and finely tune a search strategy, for example by giving more time to branches that are more promising. The Banyan runtime system automatically distributes and dynamically redistributes proof goals among distributed processors and performs weighted round-robin scheduling among active goals. We have used Banyan to write a theorem prover for hybrid systems, DLBanyan. DLBanyan outperforms our existing sequential prover for hybrid systems and achieves a near linear speedup in the number of processors used.
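The weighted round-robin idea mentioned in this abstract can be illustrated with a small sketch. This is not Banyan's scheduler or its API: the function name `weighted_round_robin`, the goal names, and the use of integer weights as "time slices per round" are all assumptions made for illustration.

```python
from collections import deque


def weighted_round_robin(goals, steps):
    """Return the order in which goals receive time slices.

    `goals` maps a (hypothetical) goal name to a positive integer weight;
    in each round a goal receives `weight` consecutive slices.
    `steps` is the total number of slices to hand out.
    """
    if any(w <= 0 for w in goals.values()):
        raise ValueError("weights must be positive integers")

    queue = deque(goals.items())
    schedule = []
    while queue and len(schedule) < steps:
        name, weight = queue.popleft()
        # Hand out up to `weight` slices, stopping early if the budget is spent.
        for _ in range(weight):
            if len(schedule) == steps:
                break
            schedule.append(name)
        # The goal stays active, so it rejoins the back of the queue.
        queue.append((name, weight))
    return schedule


if __name__ == "__main__":
    # A more promising branch gets twice the processor time of the other.
    print(weighted_round_robin({"promising": 2, "other": 1}, 6))
```

A real distributed scheduler would also remove finished goals and rebalance them across processors; the sketch only shows how relative weights translate into relative shares of time.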
Compress-and-Conquer for Optimal Multicore Computing
We propose a programming paradigm called compress-and-conquer (CC) that leads to optimal performance on multicore platforms. Given a multicore system of p cores and a problem of size n, the problem is first reduced to p smaller problems, each of which can be solved independently of the others (the compression phase). From the solutions to the p problems, a compressed version of the same problem of size O(p) is deduced and solved (the global phase). The solution to the original problem is then derived from the solution to the compressed problem together with the solutions of the smaller problems (the expansion phase). The CC paradigm reduces the complexity of multicore programming by allowing the best-known sequential algorithm for a problem to be used in each of the three phases. In this paper we apply the CC paradigm to a range of problems including scan, nested scan, difference equations, banded linear systems, and linear tridiagonal systems. The performance of CC programs is analyzed, and their optimality and linear speedup are proven. Characteristics of the problem space subject to CC are formally examined, and we show that its computational power subsumes that of scan, nested scan, and mapReduce. The CC paradigm has been implemented in Haskell as a modular, higher-order function, whose constituent functions can be shared by seemingly unrelated problems. This function is compiled into low-level Haskell threads that run on a multicore machine, and performance benchmarks confirm the theoretical analysis. D.1.3 [Parallel Program-
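The three phases described in this abstract can be sketched for the simplest listed problem, an inclusive prefix scan. This is an illustrative reading of the abstract, not the paper's Haskell implementation: the function name `cc_scan`, the contiguous chunking, and the sequential loops (which stand in for work that would run on separate cores) are assumptions.

```python
from itertools import accumulate
from operator import add


def cc_scan(xs, p, op=add):
    """Inclusive prefix scan of `xs` in compress / global / expansion phases.

    `op` must be associative. `p` is the intended number of cores; the
    sketch runs sequentially and only mirrors the phase structure.
    """
    n = len(xs)
    if n == 0:
        return []
    p = max(1, min(p, n))
    size = -(-n // p)  # ceiling division: chunk length
    chunks = [xs[i:i + size] for i in range(0, n, size)]

    # Compression phase: each chunk is scanned independently of the others.
    local = [list(accumulate(c, op)) for c in chunks]

    # Global phase: scan the per-chunk totals, a problem of size O(p).
    totals = list(accumulate((scanned[-1] for scanned in local), op))

    # Expansion phase: combine each local result with the total of all
    # preceding chunks, keeping operand order for non-commutative `op`.
    out = list(local[0])
    for k in range(1, len(local)):
        offset = totals[k - 1]
        out.extend(op(offset, v) for v in local[k])
    return out


if __name__ == "__main__":
    print(cc_scan(list(range(1, 11)), p=3))
```

Each phase uses an ordinary sequential scan (`accumulate`), which reflects the abstract's point that the best-known sequential algorithm can be reused in every phase.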
GPU Kernels as Data-Parallel Array Computations
We present a novel high-level parallel programming model for graphics processing units (GPUs). We embed GPU kernels as data-parallel array computations in the purely functional language Haskell. GPU and CPU computations can be freely interleaved with the type system tracking the two different modes of computation. The embedded language of array computations is sufficiently limited that our system can automatically isolate and extract these computations and compile them to efficient GPU code. In this paper, we outline our approach and present the results of a few preliminary benchmarks.

