Optimising Purely Functional GPU Programs
Citations: 12 (3 self)
Citations
4675 | A computational approach to edge detection
- Canny
- 1986
Citation Context ...pa version running on the host CPUs is described in [20]; it suffers from a lack of memory bandwidth compared with the GPU version. 5.7 Canny edge detection Edge detection applies the Canny algorithm [7] to square images of various sizes. The algorithm consists of seven phases, the first six are naturally data parallel and performed on the GPU. The last phase uses a recursive algorithm to “connect” p...
568 | Stable Fluids
- STAM
- 1999
Citation Context ...aster because it includes a custom software thread block scheduler to manage the unbalanced workload inherent to this benchmark. 5.6 Fluid Flow Fluid flow implements Jos Stam’s stable fluid algorithm [32], a fast approximate algorithm intended for animation and games, rather than accurate engineering simulation. The core of the algorithm is a finite time step simulation on a grid, implemented as a mat...
329 | Functional programming with bananas, lenses, envelopes and barbed wire
- Meijer, Fokkinga, et al.
- 1991
Citation Context ...erforms map/map fusion but cannot fuse maps into reduction combinators. Sato and Iwasaki [30] describe a C++ library for GPGPU programming that includes a fusion mechanism based on list homomorphisms [25]. The fusion transformation itself is implemented as a source to source translation. SkeTo [24] is a C++ library that provides parallel skeletons for CPUs. SkeTo’s use of C++ templates provides a fusi...
229 | A short cut to deforestation.
- Gill, Launchbury, et al.
- 1993
Citation Context ...fusion eliminates the intermediate values and additional GPU kernels that would otherwise be needed when successive bulk operators are applied to an array. Existing methods such as foldr/build fusion [15] and stream fusion [12] are not applicable to our setting as they produce tail-recursive loops, rather than the GPU kernels we need for Accelerate. The NDP2GPU system of [4] does produce fused GPU ker...
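The tail-recursive-loop character of shortcut fusion is easy to see in miniature. Below is a self-contained sketch of the foldr/build rule in plain Haskell (standard GHC technique, not Accelerate code; `upto` is a made-up producer for illustration):

```haskell
{-# LANGUAGE RankNTypes #-}

-- build abstracts a list producer over the list constructors.
build :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build g = g (:) []

-- A producer written in build form: the integers 1..n.
upto :: Int -> [Int]
upto n = build (\cons nil ->
  let go i = if i > n then nil else i `cons` go (i + 1)
  in  go 1)

-- The foldr/build rule says: foldr k z (build g) ==> g k z.
-- GHC applies it as a rewrite rule; after rewriting, summing `upto`
-- becomes a plain recursive loop with no intermediate list -- which is
-- exactly the loop shape that does not map onto a GPU kernel.
main :: IO ()
main = print (foldr (+) 0 (upto 100))   -- 5050
```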
173 | Scan primitives for gpu computing
- Sengupta, Harris, et al.
- 2007
Citation Context ...h other to implement them efficiently. For example, a parallel fold (with an associative operator) can be implemented efficiently as a tree reduction, but a parallel scan requires two separate phases [9, 31]. Unfortunately, this sort of information is obfuscated by most fusion techniques. To support the different properties of producers and consumers, our fusion transform is split into two distinct phase...
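The two-phase structure of parallel scan can be modelled sequentially. The sketch below (plain Haskell lists standing in for GPU thread blocks; `blockScanl` is an illustrative name, not from the paper) reduces each block to a partial sum, scans the block sums, then re-scans each block from its offset:

```haskell
-- Split a list into blocks of size n (a stand-in for thread blocks).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Phase 1: each block is reduced to a partial sum (one kernel launch).
-- Phase 2: the block sums are scanned, and each block re-scans its
-- elements starting from its offset (a second kernel launch).
blockScanl :: Num a => Int -> [a] -> [a]
blockScanl n xs =
  let blocks  = chunksOf n xs
      sums    = map sum blocks               -- phase 1: per-block reductions
      offsets = init (scanl (+) 0 sums)      -- scan of the block sums
  in  concat [ init (scanl (+) o b)          -- phase 2: local scans
             | (o, b) <- zip offsets blocks ]

main :: IO ()
main = print (blockScanl 4 [1..10 :: Int])
-- [0,1,3,6,10,15,21,28,36,45], the exclusive prefix sums
```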
153 | Optimization of sparse matrix-vector multiplication on emerging multicore platforms.
- Williams, Oliker, et al.
- 2009
Citation Context ...e 3 compares Accelerate to the CUSP library [3], which is a special purpose library for sparse matrix operations. For test data we use a 14 matrix corpus derived from a variety of application domains [33]. Compared to our previous work [8] the fusion transformation converts the program to a single segmented reduction. The corresponding reduction in memory bandwidth puts Accelerate on par with CUSP for...
151 | NESL: A nested data-parallel language
- Blelloch
- 1995
Citation Context ...stream fusion [12], running on the host CPU. The Repa version runs in parallel on all eight cores of the host CPU, using the fusion method reported in [17]. The NDP2GPU version [4] compiles NESL code [6] to CUDA. The performance of this version suffers because the NDP2GPU compiler uses the legacy NESL compiler for the frontend, which introduces redundant administrative operations that are not strictl...
142 | Implementing Sparse Matrix-vector Multiplication on Throughput-oriented Processors
- Bell, Garland
- 2009
Citation Context ...ch problem [21]. 5.9 Sparse-matrix vector multiplication (SMVM) SMVM multiplies a sparse matrix in compressed row format (CSR) [9] with a dense vector. Table 3 compares Accelerate to the CUSP library [3], which is a special purpose library for sparse matrix operations. For test data we use a 14 matrix corpus derived from a variety of application domains [33]. Compared to our previous work [8] the fus...
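For reference, here is a minimal sequential rendering of CSR sparse matrix-vector multiplication (nested lists standing in for the flat segment-descriptor arrays a real CSR implementation uses; not the Accelerate code):

```haskell
-- A sparse matrix in compressed row (CSR) format: per row, a list of
-- (column index, value) pairs for the non-zero entries.
type CSRMatrix a = [[(Int, a)]]

-- Multiply a CSR matrix with a dense vector: each row contributes the
-- dot product of its non-zeros with the matching vector elements.
smvm :: Num a => CSRMatrix a -> [a] -> [a]
smvm rows vec = [ sum [ v * (vec !! c) | (c, v) <- row ] | row <- rows ]

main :: IO ()
main = print (smvm [[(0, 2), (2, 1)], [(1, 3)]] [10, 20, 30])
-- [2*10 + 1*30, 3*20] == [50,60]
```

The per-row sums are exactly the segmented reduction the fusion transformation produces in the paper's version.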
131 | Prefix sums and their applications.
- Blelloch
- 1990
Citation Context ...er region which requires boundary checks, from the main internal region which does not [19], but we leave this to future work. 5.8 Radix sort Radix sort implements the algorithm described in Blelloch [5] to sort an array of signed 32-bit integers. We compare our implementation against a Nikola [22] version. The Accelerate version is faster than Nikola because Nikola is limited to single kernel prog...
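A least-significant-bit-first radix sort in the style Blelloch describes can be sketched with a stable per-bit split (shown sequentially on unsigned words for simplicity; the parallel version computes element destinations with prefix sums, and signed inputs need a sign-bit adjustment):

```haskell
import Data.Bits (testBit)
import Data.Word (Word32)

-- One radix pass: stably move elements whose bit b is 0 before those
-- whose bit b is 1.
splitBit :: Int -> [Word32] -> [Word32]
splitBit b xs = [x | x <- xs, not (testBit x b)]
             ++ [x | x <- xs, testBit x b]

-- Full sort: one stable split per bit, least significant bit first.
radixSort :: [Word32] -> [Word32]
radixSort xs = foldl (flip splitBit) xs [0 .. 31]

main :: IO ()
main = print (radixSort [170, 45, 75, 90, 2, 24, 802, 66])
```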
130 | Designing Efficient Sorting Algorithms for Manycore GPUs
- Satish, Harris, et al.
- 2009
Citation Context ...rate version is faster than Nikola because Nikola is limited to single kernel programs and must transfer intermediate results back to the host. Hand-written CUDA implementations such as in the Thrust [29] library make use of on-chip shared memory and are approximately 10× faster. As mentioned earlier, automatically making use of GPU shared memory remains an open research problem [21]. 5.9 Sparse-matri...
68 | Stream fusion: from lists to streams to nothing at all
- Coutts, Leshchinskiy, et al.
- 2007
Citation Context ...tributed, and parallel languages Keywords Arrays; Data parallelism; Embedded language; Dynamic compilation; GPGPU; Haskell; Sharing recovery; Array fusion 1. Introduction Recent work on stream fusion [12], the vector package [23], and the parallel array library Repa [17, 19, 20] has demonstrated that (1) the performance of purely functional array code in Haskell can be competitive with that of imperat...
57 | Secrets of the Glasgow Haskell Compiler inliner.
- Jones, Marlow
- 2002
Citation Context ...handle this situation by only inlining the definitions of let-bound variables that have a single use site, or by relying on some heuristic about the size of the resulting code to decide what to inline [26]. However, in typical Accelerate programs, each array is used at least twice: once to access the shape information and once to access the array data; so, we must handle at least this case separately. ...
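The single-use-site heuristic is easy to demonstrate on a toy AST (hypothetical types, not Accelerate's or GHC's; binder names are assumed unique, so substitution cannot capture):

```haskell
-- A toy expression language with named let bindings.
data Exp = Var String | Lit Int | Add Exp Exp | Let String Exp Exp
  deriving Show

-- Count the use sites of a variable.
uses :: String -> Exp -> Int
uses x (Var y)     = fromEnum (x == y)
uses _ (Lit _)     = 0
uses x (Add a b)   = uses x a + uses x b
uses x (Let _ e b) = uses x e + uses x b

-- Substitute e for x; safe because names are unique.
subst :: String -> Exp -> Exp -> Exp
subst x e (Var y)     = if x == y then e else Var y
subst _ _ (Lit n)     = Lit n
subst x e (Add a b)   = Add (subst x e a) (subst x e b)
subst x e (Let y r b) = Let y (subst x e r) (subst x e b)

-- The heuristic from the text: inline a let binding only when its
-- variable has exactly one use site.
inlineSingleUse :: Exp -> Exp
inlineSingleUse (Let x e b)
  | uses x b == 1 = inlineSingleUse (subst x e b)
  | otherwise     = Let x (inlineSingleUse e) (inlineSingleUse b)
inlineSingleUse (Add a b) = Add (inlineSingleUse a) (inlineSingleUse b)
inlineSingleUse e         = e

main :: IO ()
main = do
  print (inlineSingleUse (Let "x" (Lit 1) (Add (Var "x") (Lit 2))))   -- inlined
  print (inlineSingleUse (Let "x" (Lit 1) (Add (Var "x") (Var "x")))) -- kept
```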
49 | Accelerating Haskell array codes with multicore GPUs
- Chakravarty, Keller, et al.
- 2011
Citation Context ...iate data structures must be shuffled back and forth across the CPU-GPU bus. We recently presented Accelerate, an EDSL and skeleton-based code generator targeting the CUDA GPU development environment [8]. In the present paper, we present novel methods for optimising the code using sharing recovery and array fusion. Sharing recovery for embedded languages recovers the sharing of let-bound expressions ...
43 | Scan primitives for vector computers
- Chatterjee, Blelloch, et al.
- 1990
Citation Context ...h other to implement them efficiently. For example, a parallel fold (with an associative operator) can be implemented efficiently as a tree reduction, but a parallel scan requires two separate phases [9, 31]. Unfortunately, this sort of information is obfuscated by most fusion techniques. To support the different properties of producers and consumers, our fusion transform is split into two distinct phase...
40 | Nikola: embedding compiled GPU functions in Haskell. In Haskell.
- Mainland, Morrisett
- 2010
Citation Context ...such as GPUs (Graphical Processing Units) has been less successful. Vertigo [13] was an early Haskell EDSL producing DirectX 9 shader code, though no runtime performance figures were reported. Nikola [22] produces code competitive with CUDA, but without supporting generative functions like replicate where the result size is not statically fixed. Obsidian [10] is additionally restricted to only process...
34 | Stretching the storage manager: Weak pointers and stable names in Haskell. In IFL.
- Jones, Marlow, et al.
- 2000
Citation Context ... existing techniques are adequate for a type-preserving embedded language compiler targeting massively parallel SIMD hardware, such as GPUs. 3. Sharing recovery Gill [14] proposed to use stable names [27] to recover the sharing of source terms in a deeply embedded language. The stable names...
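A minimal demonstration of the stable-names machinery that Gill's method builds on, using a toy `Exp` type (not Accelerate's AST); note that stable names live in IO, and aggressive optimisation can in principle disturb the sharing being observed:

```haskell
import System.Mem.StableName (makeStableName)

-- A toy deeply embedded expression type.
data Exp = Lit Int | Add Exp Exp

main :: IO ()
main = do
  let e    = Add (Lit 1) (Lit 2)
      tree = Add e e            -- both fields point at one heap object
  case tree of
    Add l r -> do
      sl <- makeStableName l    -- equal stable names <=> same object
      sr <- makeStableName r
      print (sl == sr)          -- True: the sharing is observed
    Lit _ -> pure ()
```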
32 | Programming graphics processors functionally. In:
- Elliott
- 2004
Citation Context ...control-parallel multicore CPUs. So far, the use of purely functional languages for programming data parallel SIMD hardware such as GPUs (Graphical Processing Units) has been less successful. Vertigo [13] was an early Haskell EDSL producing DirectX 9 shader code, though no runtime performance figures were reported. Nikola [22] produces code competitive with CUDA, but without supporting generative func...
30 | Type-safe observable sharing in Haskell
- Gill
- 2009
Citation Context ...pressions that would otherwise be lost due to the embedding. Without sharing recovery, the value of a let-bound expression is recomputed for every use of the bound variable. In contrast to prior work [14] that decomposes expression trees into graphs and fails to be type preserving, our novel algorithm preserves both the tree structure and typing of a deeply embedded language. This enables our runtime ...
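The loss of sharing described here can be seen with a toy deep embedding (hypothetical `Exp` type, not Accelerate's AST): the host-language `let` shares one heap node, but the tree a compiler traverses contains two syntactic copies of it:

```haskell
-- A toy deep embedding of arithmetic expressions.
data Exp = Lit Int | Add Exp Exp deriving Show

-- Haskell's 'let' shares one heap node, but the binding is invisible
-- in the embedded tree: without sharing recovery, a compiler walking
-- this AST would emit (and evaluate) the subterm twice.
example :: Exp
example = let e = Add (Lit 1) (Lit 2) in Add e e

main :: IO ()
main = print example
-- Add (Add (Lit 1) (Lit 2)) (Add (Lit 1) (Lit 2))
```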
24 | Expressive Array Constructs in an Embedded GPU Kernel Programming Language
- Claessen, Sheeran, et al.
- 2012
Citation Context ...by its shape and a function mapping indices to their corresponding values. We previously used it successfully to optimise purely functional array programs in Repa [17], but it was also used by others [11]. However, there are at least two reasons why it is not always beneficial to represent all array terms uniformly as functions. One is sharing: we must be able to represent some terms as manifest array...
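The shape-plus-index-function representation can be sketched in a few lines for one-dimensional arrays (a simplification of Repa-style delayed arrays; `dmap` and `force` are illustrative names):

```haskell
-- A delayed array is just its extent plus an index function.
data Delayed e = Delayed Int (Int -> e)

-- map fuses for free: it composes index functions, allocating nothing.
dmap :: (a -> b) -> Delayed a -> Delayed b
dmap f (Delayed n ix) = Delayed n (f . ix)

-- Forcing produces a manifest array (here, a plain list).
force :: Delayed e -> [e]
force (Delayed n ix) = [ ix i | i <- [0 .. n - 1] ]

main :: IO ()
main = print (force (dmap (*2) (dmap (+1) (Delayed 5 id))))
-- [2,4,6,8,10]: both maps fused into one traversal
```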
22 | Optimizing data structures in high-level programs.
- Rompf, Sujeeth, et al.
- 2013
Citation Context ... discussed in detail. Barracuda steps around the sharing problem by requiring let-bindings to be written using the AST node constructor, rather than using Haskell’s native let-expressions. Delite/LMS [28] is a parallelisation framework for DSLs in Scala that uses library-based multi-pass staging to specify complex optimisations in a modular manner. Delite supports loop fusion for DSLs targeting GPUs u...
19 | Unembedding domain-specific languages.
- Atkey, Lindley, et al.
- 2009
Citation Context ...sed on higher-order abstract syntax (HOAS) to a type-safe internal representation based on de Bruijn indices. Although developed independently, this conversion is like the unembedding of Atkey et al. [1]. Unembedding and sharing recovery necessarily need to go hand in hand for the following reasons. Sharing recovery must be performed on the source representation; otherwise, sharing will already have ...
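A minimal, untyped sketch of unembedding HOAS into de Bruijn form, feeding binders fresh level-tagged variables (Atkey et al.'s construction is type preserving; this sketch deliberately drops the types):

```haskell
-- HOAS terms: binders are host-language functions; HLvl marks the
-- fresh variables we feed in during conversion.
data HOAS = HLam (HOAS -> HOAS) | HApp HOAS HOAS | HLvl Int

data DB = DLam DB | DApp DB DB | DVar Int deriving Show

-- Convert at binding depth lvl: each lambda is applied to a variable
-- tagged with the current level, later turned into a de Bruijn index.
toDB :: Int -> HOAS -> DB
toDB lvl (HLam f)   = DLam (toDB (lvl + 1) (f (HLvl lvl)))
toDB lvl (HApp a b) = DApp (toDB lvl a) (toDB lvl b)
toDB lvl (HLvl l)   = DVar (lvl - l - 1)   -- level to de Bruijn index

main :: IO ()
main = print (toDB 0 (HLam (\x -> HLam (\_y -> x))))
-- DLam (DLam (DVar 1)): the K combinator
```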
15 | On the distributed implementation of aggregate data structures by program transformation
- Keller, Chakravarty
- 1999
Citation Context ...d fusion [15] is not applicable, because it produces sequential tail-recursive loops rather than massively parallel GPU kernels. Similarly, the split/join approach used in Data Parallel Haskell (DPH) [16] is not helpful, although fused operations are split into sequential and parallel subcomputations, as they assume an explicit parallel scheduler, which in DPH is written directly in Haskell. Accelerat...
14 | A generic abstract syntax model for embedded languages.
- Axelsson
- 2012
Citation Context ... type safety, we do not want any further operations that are not type preserving when adding sharing recovery. Hence, we cannot use Gill’s algorithm. The variant of Gill’s algorithm used in Syntactic [2] does not apply either: it (1) also generates a graph and (2) discards static type information in the process. In contrast, our novel algorithm performs simultaneous sharing recovery and conversion fr...
13 | Nested Data-Parallelism on the GPU.
- Bergstrom, Reppy
- 2012
Citation Context ...uch as foldr/build fusion [15] and stream fusion [12] are not applicable to our setting as they produce tail-recursive loops, rather than the GPU kernels we need for Accelerate. The NDP2GPU system of [4] does produce fused GPU kernels, but is limited to simple map/map fusion. We present a fusion method partly inspired by Repa’s delayed arrays [17] that fuses more general producers and consumers, whil...
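Map/map fusion is the single rewrite `map f . map g ==> map (f . g)`; the sketch below (illustrative only) shows both what it buys and what it misses:

```haskell
-- Two traversals become one.
fused :: [Int] -> [Int]
fused = map ((*2) . (+1))            -- == map (*2) . map (+1)

-- What map/map fusion alone cannot do: fuse a map into a reduction,
-- so the doubled list here would still be materialised.
sumDoubled :: [Int] -> Int
sumDoubled = foldr (+) 0 . map (*2)

main :: IO ()
main = print (fused [1, 2, 3], sumDoubled [1, 2, 3])   -- ([4,6,8],12)
```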
12 | Efficient parallel stencil convolution in Haskell
- Lippmeier, Keller
- 2011
Citation Context ... Embedded language; Dynamic compilation; GPGPU; Haskell; Sharing recovery; Array fusion 1. Introduction Recent work on stream fusion [12], the vector package [23], and the parallel array library Repa [17, 19, 20] has demonstrated that (1) the performance of purely functional array code in Haskell can be competitive with that of imperative programs and that (2) purely functional array code lends itself to an e...
9 | Exploiting vector instructions with generalized stream fusion.
- Mainland, Leshchinskiy, et al.
- 2013
Citation Context ...nguages Keywords Arrays; Data parallelism; Embedded language; Dynamic compilation; GPGPU; Haskell; Sharing recovery; Array fusion 1. Introduction Recent work on stream fusion [12], the vector package [23], and the parallel array library Repa [17, 19, 20] has demonstrated that (1) the performance of purely functional array code in Haskell can be competitive with that of imperative programs and that (2)...
9 | A Skeletal Parallel Framework with Fusion Optimizer for GPGPU Programming
- Sato, Iwasaki
- 2009
Citation Context ...NESL code down to CUDA. As the source language is not embedded there is no need for sharing recovery. NDP2GPU performs map/map fusion but cannot fuse maps into reduction combinators. Sato and Iwasaki [30] describe a C++ library for GPGPU programming that includes a fusion mechanism based on list homomorphisms [25]. The fusion transformation itself is implemented as a source to source translation. SkeT...
8 | Guiding parallel array fusion with indexed types
- Lippmeier, Chakravarty, et al.
- 2012
Citation Context ... Embedded language; Dynamic compilation; GPGPU; Haskell; Sharing recovery; Array fusion 1. Introduction Recent work on stream fusion [12], the vector package [23], and the parallel array library Repa [17, 19, 20] has demonstrated that (1) the performance of purely functional array code in Haskell can be competitive with that of imperative programs and that (2) purely functional array code lends itself to an e...
7 | Obsidian: GPU Programming in Haskell
- Claessen, Sheeran, et al.
Citation Context ... performance figures were reported. Nikola [22] produces code competitive with CUDA, but without supporting generative functions like replicate where the result size is not statically fixed. Obsidian [10] is additionally restricted to only processing arrays of a fixed, implementation dependent size. Additionally, both Nikola and Obsidian can only generate single GPU kernels at a time, so that in Permi...
6 | Simple optimizations for an applicative array language for graphics processors
- Larsen
- 2011
Citation Context ...ush array represents a general consumer. Push arrays allow the intermediate program to be written in continuation passing style (CPS), and helps to compile (and fuse) append-like operations. Baracuda [18] is another Haskell EDSL that produces CUDA GPU kernels, though is intended to be used offline, with the kernels being called directly from C++. The paper [18] mentions a fusion system that appears to...
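Push arrays can be sketched as producers in continuation-passing style (a toy model of the Obsidian idea, not its actual types; here the first array's length is passed to `append` explicitly):

```haskell
-- A push array is told how to write each (index, element) pair.
newtype Push a = Push { runPush :: (Int -> a -> IO ()) -> IO () }

pushList :: [a] -> Push a
pushList xs = Push (\write -> mapM_ (uncurry write) (zip [0 ..] xs))

-- Append is trivial for push arrays: run the first producer, then run
-- the second with its indices offset. With index functions ("pull"
-- arrays) the same operation needs a conditional at every index.
append :: Int -> Push a -> Push a -> Push a
append len1 (Push p) (Push q) =
  Push (\write -> p write >> q (\i x -> write (len1 + i) x))

main :: IO ()
main = runPush (append 3 (pushList "abc") (pushList "de"))
               (\i x -> putStrLn (show i ++ ": " ++ [x]))
```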
6 | An integer programming framework for optimizing shared memory use on GPUs
- Ma, Agrawal
Citation Context ...ip shared memory to reduce the memory bandwidth requirements of the program. The shared memory is essentially a software managed cache, and making automatic use of it remains an open research problem [21]. 5.5 Mandelbrot fractal The Mandelbrot set is generated by sampling values c in the complex plane, and determining whether under iteration of the complex quadratic polynomial z_{n+1} = z_n^2 + c that |z...
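A direct reading of that iteration as code (a sequential sketch, not the Accelerate version; the usual escape radius 2 and a fixed iteration limit):

```haskell
import Data.Complex (Complex (..), magnitude)

-- Iteration count for one sample point c: iterate z_{n+1} = z_n^2 + c
-- from z_0 = 0 until |z| exceeds 2 (divergence is then certain) or the
-- iteration limit is reached.
mandel :: Int -> Complex Double -> Int
mandel limit c =
  length . take limit . takeWhile (\z -> magnitude z <= 2)
    $ iterate (\z -> z * z + c) 0

main :: IO ()
main = print (map (mandel 100) [0 :+ 0, 1 :+ 0, (-1) :+ 0])
-- [100,3,100]: 0 and -1 stay bounded; 1 escapes after a few steps
```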
5 | Implementing Fusion-Equipped Parallel Skeletons by Expression Templates
- Matsuzaki, Emoto
- 2010
Citation Context ...escribe a C++ library for GPGPU programming that includes a fusion mechanism based on list homomorphisms [25]. The fusion transformation itself is implemented as a source to source translation. SkeTo [24] is a C++ library that provides parallel skeletons for CPUs. SkeTo’s use of C++ templates provides a fusion system similar to delayed arrays, which could be equivalently implemented using CUDA templat...
2 | Haskell beats C using generalized stream fusion
- Mainland, Leshchinskiy, et al.
- 2012
Citation Context ...ormance Keywords Arrays, Data parallelism, Embedded language, Dynamic compilation, GPGPU, Haskell, Sharing recovery, Array fusion 1. Introduction Recent work on stream fusion [12], the vector package [23], and the parallel array library Repa [17, 19, 20] has demonstrated that (1) the performance of purely functional array code in Haskell can be competitive with that of imperative programs and that (2)...