Sketching Stencils
Cited by 13 (2 self)
Performance of stencil computations can be significantly improved through smart implementations that improve memory locality, computation reuse, or parallelize the computation. Unfortunately, efficient implementations are hard to obtain because they often involve non-traditional transformations, which means that they cannot be produced by optimizing the reference stencil with a compiler. In fact, many stencils are produced by code generators that were tediously handcrafted. In this paper, we show how stencil implementations can be produced with sketching. Sketching is a software synthesis approach where the programmer develops a partial implementation (a sketch) and a separate specification of the desired functionality given by a reference (unoptimized) stencil. The synthesizer then completes the sketch to behave like the specification, filling in code fragments that are difficult to develop manually. Existing sketching systems work only for small finite programs, i.e., programs that can be represented as small Boolean circuits. In this paper, we develop a sketching synthesizer that works for stencil computations, a large class of programs that, unlike circuits, have unbounded inputs and outputs, as well as an unbounded number of computations. The key contribution is a reduction algorithm that turns a stencil into a circuit, allowing us to synthesize stencils using an existing sketching synthesizer.
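The sketching workflow described in this abstract can be illustrated with a toy example. The sketch below is only an assumption-laden illustration: the 3-point stencil, the hole names `start` and `back`, and the brute-force search over bounded inputs are all hypothetical, and the search stands in for a real synthesizer. It does not implement the paper's stencil-to-circuit reduction.

```python
from itertools import product

def reference(a):
    """Specification: out[i] = a[i-1] + a[i+1] on interior points, 0 elsewhere."""
    out = [0] * len(a)
    for i in range(1, len(a) - 1):
        out[i] = a[i - 1] + a[i + 1]
    return out

def sketch(a, start, back):
    """Partial implementation with a shifted loop; `start` and `back` are holes."""
    out = [0] * len(a)
    for j in range(start, len(a)):
        out[j - 1] = a[j - back] + a[j]
    return out

def synthesize(max_hole=3, lengths=(3, 4, 5), values=(0, 1)):
    """Return the first hole assignment matching the reference on all bounded inputs."""
    # Bounded test set: every 0/1 array of the given lengths (hypothetical choice).
    tests = [list(t) for n in lengths for t in product(values, repeat=n)]
    for start, back in product(range(max_hole + 1), repeat=2):
        if all(sketch(a, start, back) == reference(a) for a in tests):
            return start, back
    return None
```

Calling `synthesize()` fills both holes so that the shifted loop agrees with the reference on every tested input. Agreement on bounded inputs is weaker than the equivalence for unbounded inputs that the abstract targets.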
Time Skewing: A Value-Based Approach to Optimizing for Memory Locality
- In http://www.haverford.edu/cmsc/davew/cache-opt/cache-opt.html
, 1999
Cited by 12 (2 self)
As the gap between processor and main memory speed continues to grow, higher cache hit rates are required for efficient processor use. Recent work on compile-time transformations to improve locality in scientific programs has focused on loop fusion, tiling, and distribution
14.9 TFLOPS Three-dimensional Fluid Simulation for Fusion Science with HPF on the Earth Simulator
- In Proc. SC2002, CD-ROM
, 2002
Cited by 11 (0 self)
We succeeded in getting 14.9 TFLOPS performance when running a plasma simulation code IMPACT-3D parallelized with High Performance Fortran on 512 nodes of the Earth Simulator. The theoretical peak performance of the 512 nodes is 32 TFLOPS, which means 45% of the peak performance was obtained with HPF. IMPACT-3D is an implosion analysis code using a TVD scheme, which performs three-dimensional compressible and inviscid Eulerian fluid computation with the explicit 5-point stencil scheme for spatial differentiation and the fractional time step for time integration. The mesh size is 2048x2048x4096, and the third dimension was distributed for the parallelization. The HPF system used in the evaluation is HPF/ES, developed for the Earth Simulator by enhancing NEC HPF/SX V2 mainly in communication scalability. Shift communications were manually tuned to get the best performance by using HPF/JA extensions, which were designed to give users more control over sophisticated parallelization and communication optimizations.
Loop Fusion in High Performance Fortran
- In Proceedings of the 1998 ACM International Conference on Supercomputing
, 1998
Cited by 8 (2 self)
In this paper we investigate a unique problem associated with fusing loops within a High Performance Fortran (HPF) program. In particular, we discuss the issue of performing loop fusion in an HPF compiler when compiling Fortran90 array assignment statements for execution on a distributed-memory machine. During compilation of an HPF program, Fortran90 array assignment statements must be scalarized into loop nests. We show how a certain class of these loop nests, when fused, can cause problems for the compiler's distributed-memory code generator. We then present an algorithm which not only prevents the fusion of these loops, but also increases the amount of useful fusion that can be performed.
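Scalarization and fusion, as mentioned in this abstract, can be illustrated on two array statements of the form B = A + 1 and C = B * 2. The sketch below is a minimal illustration with hypothetical function names. It shows only the sequential transformation, not the distributed-memory code generation problem the paper addresses.

```python
def scalarized(a):
    """Each array statement becomes its own loop nest."""
    n = len(a)
    b = [0] * n
    c = [0] * n
    for i in range(n):        # loop for B = A + 1
        b[i] = a[i] + 1
    for i in range(n):        # loop for C = B * 2
        c[i] = b[i] * 2
    return b, c

def fused(a):
    """The two loops merged into one; legal because c[i] only reads b[i]."""
    n = len(a)
    b = [0] * n
    c = [0] * n
    for i in range(n):
        b[i] = a[i] + 1
        c[i] = b[i] * 2       # b[i] was written earlier in the same iteration
    return b, c
```

Fusion is not always legal. If the second statement read `b[i + 1]`, the fused loop would read an element that has not been written yet.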
Towards optimal multi-level tiling for stencil computations
- 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS)
, 2007
Cited by 6 (0 self)
Stencil computations form the performance-critical core of many applications. Tiling and parallelization are two important optimizations to speed up stencil computations. Many tiling and parallelization strategies are applicable to a given stencil computation. The best strategy depends not only on the combination of the two techniques, but also on many parameters: tile and loop sizes in each dimension; computation-communication balance of the code; processor architecture; message startup costs; etc. The best choices can only be determined through design-space exploration, which is extremely tedious and error-prone to do via exhaustive experimentation. We characterize the space of multi-level tilings and parallelizations for 2D/3D Gauss-Seidel stencil computation. A systematic exploration of a part of this space enabled us to derive a design which is up to a factor of two faster than the standard implementation.
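For readers unfamiliar with tiling, here is a minimal single-level sketch for one in-place 2D Gauss-Seidel sweep. The function names and the tile sizes `ti` and `tj` are hypothetical, and this is not the multi-level design from the paper. Rectangular tiles are legal for a single sweep because each point reads updated values only from its upper and left neighbors, both of which are finished before it in tile order. Tiling across several sweeps would need additional skewing.

```python
def gauss_seidel_sweep(grid):
    """One in-place 5-point Gauss-Seidel sweep over the interior points."""
    n, m = len(grid), len(grid[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1])

def tiled_sweep(grid, ti, tj):
    """The same sweep, visiting the interior in ti-by-tj tiles."""
    n, m = len(grid), len(grid[0])
    for ii in range(1, n - 1, ti):              # tile origin, row direction
        for jj in range(1, m - 1, tj):          # tile origin, column direction
            for i in range(ii, min(ii + ti, n - 1)):
                for j in range(jj, min(jj + tj, m - 1)):
                    grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                         + grid[i][j - 1] + grid[i][j + 1])
```

Both versions evaluate the same expression on the same operand values, so their results match exactly. Only the traversal order, and with it the cache behavior, differs.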
Eliminating Redundancies in Sum-of-Product Array Computations
, 2001
Cited by 4 (2 self)
Array programming languages such as Fortran 90, High Performance Fortran and ZPL are well-suited to scientific computing because they free the scientist from the responsibility of managing burdensome low-level details that complicate programming in languages like C and Fortran 77. However, these burdensome details are critical to performance, thus necessitating aggressive compilation techniques for their optimization. In this paper, we present a new compiler optimization called Array Subexpression Elimination (ASE) that lets a programmer take advantage of the expressibility afforded by array languages and achieve enviable portability and performance. We design a set of micro-benchmarks that model an important class of computations known as stencils and we report on our implementation of this optimization in the context of this micro-benchmark suite. Our results include a 125% improvement on one of these benchmarks and a 50% average speedup across the suite. We also show a 32% speedup on the ZPL port of the NAS MG Parallel Benchmark and a 29% speedup over the hand-optimized Fortran version. Further, the compilation time is only negligibly affected.
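The kind of redundancy this abstract refers to can be seen in a small 1D stencil where neighboring output points share a partial sum. The sketch below is a hand-written illustration with hypothetical names. It does not implement the ASE algorithm itself and only shows the effect such an optimization aims for.

```python
def naive(a):
    """out[i] = (a[i-1] + a[i]) + (a[i] + a[i+1]); three additions per point."""
    n = len(a)
    out = [0] * n
    for i in range(1, n - 1):
        out[i] = (a[i - 1] + a[i]) + (a[i] + a[i + 1])
    return out

def with_reuse(a):
    """Compute each pair sum once and share it between neighboring points."""
    n = len(a)
    pair = [a[i] + a[i + 1] for i in range(n - 1)]   # pair[i] = a[i] + a[i+1]
    out = [0] * n
    for i in range(1, n - 1):
        out[i] = pair[i - 1] + pair[i]               # two additions per point overall
    return out
```

The shared `pair` array costs extra storage, which is one of the trade-offs a compiler has to weigh when applying this kind of transformation.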
A General Algorithm for Time Skewing
- Dept. of Computer Science, Rutgers University
, 2001
Cited by 1 (0 self)
Microprocessor speed has been growing exponentially faster than memory speed in the recent past. In this paper, we consider the long term implications of a continuation of this trend. We define a program property known as scalable locality, which measures our ability to apply ever faster uniprocessors to increasingly large problems (just as scalable parallelism measures our ability to apply more numerous processors to larger problems). We then show that exploiting scalable locality in scientific programs often requires advanced compile-time reordering of loop iterations. We provide an algorithm that derives an execution order, storage mapping, and cache requirement for any desired degree of locality, for certain programs that can be made to exhibit scalable locality. Our approach is unusual in that it derives the program transformation and cache requirement from the dataflow of the calculation (a fundamental characteristic of the algorithm), instead of searching a space of possible transformations of the execution order and array layout used by the programmer (which are artifacts of the expression of the algorithm). We include empirical results showing the effectiveness of our transformation on two small kernels and a non-trivial benchmark program. Our transformation can produce speedups for data sets residing in L2 cache, main memory, or virtual memory. This report unifies the major results of DCS-TR-379 and DCS-TR-378, presents them in terms of a general algorithm rather than a collection of heuristics, and gives a clearer presentation of the empirical data.
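The iteration reordering behind time skewing can be sketched on a 1D three-point stencil. The example below is a simplified illustration with hypothetical names and a hypothetical `tile` parameter. It keeps every time level in memory for clarity, so it shows only the reordered execution, not the storage mapping or cache requirement that the paper derives. With the skewed coordinate s = i + t, every value a point reads lies at an equal or smaller s and at the previous time step, so tiles of s can be run one after another.

```python
def run_reference(u0, steps):
    """Plain time-step-major order; boundary cells keep their initial values."""
    n = len(u0)
    u = [list(u0) for _ in range(steps + 1)]
    for t in range(1, steps + 1):
        for i in range(1, n - 1):
            u[t][i] = u[t - 1][i - 1] + u[t - 1][i] + u[t - 1][i + 1]
    return u[steps]

def run_time_skewed(u0, steps, tile):
    """Same computation, visited in tiles of the skewed coordinate s = i + t."""
    n = len(u0)
    u = [list(u0) for _ in range(steps + 1)]
    s_min, s_max = 2, (n - 2) + steps            # range of s over interior points
    for s0 in range(s_min, s_max + 1, tile):     # one tile of skewed space
        for t in range(1, steps + 1):
            for s in range(s0, min(s0 + tile, s_max + 1)):
                i = s - t
                if 1 <= i <= n - 2:              # skip points outside the interior
                    u[t][i] = u[t - 1][i - 1] + u[t - 1][i] + u[t - 1][i + 1]
    return u[steps]
```

Each tile touches a narrow diagonal band of the space-time grid across all time steps, which is what lets a cache-resident working set be reused over several time steps.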
On Eliminating Redundant Computation from High-Level Array Statements
, 2000
Cited by 1 (1 self)
High-level array programming languages are well-suited for scientific computing. Such languages free the programmer from the responsibility of managing burdensome low-level details that complicate programming in languages like C and Fortran. But these details do not vanish. We argue that the compiler should relieve the programmer of this burden. In this paper, we present a compiler optimization called partial array redundancy elimination that removes redundant computation from certain high-level array statements, resulting in cleaner, more portable and more concise code without any performance loss. This optimization is critical to certain codes and achieves a speedup of 36% on the NAS MG parallel benchmark.
Generation and Optimisation of Code using Coxeter Lattice Paths
, 2007
Supercomputing applications usually involve the repeated parallel application of discretized differential operators. Difficulties arise with higher-order discretizations of operators on parallel computers because their communications can overlap processors in complex ways. Their correct and efficient implementation requires careful choreography of computation and communication, taking into account the symmetries of the problem and of the computer’s communication network. This paper shows how these symmetries can be used to automate the construction of the code for optimized operator computation. This is done with considerable generality by making the symmetries both of the problem and the computer explicit using the language of finitely presented reflection (Coxeter) groups, and using coset enumeration to generate and optimize the required code.

