Results 1–10 of 16
SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms
 International Journal of High Performance Computing Applications
, 2004
"... SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be reoptimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical fra ..."
Abstract

Cited by 71 (20 self)
SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be reoptimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical framework that concisely describes signal transforms and their fast algorithms; the formula generator that captures at the algorithmic level the degrees of freedom in expressing a particular signal processing transform; the formula translator that encapsulates the compilation degrees of freedom when translating a specific algorithm into an actual code implementation; and, finally, an intelligent search engine that finds within the large space of alternative formulas and implementations ...
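The algorithmic degree of freedom the abstract mentions can be illustrated with a toy sketch. This is not SPIRAL's formula language; the `radix` parameter is our own illustrative stand-in for one of the choices such a generator searches over in the Cooley–Tukey recursion:

```python
# A minimal sketch (not SPIRAL's actual formula framework) of one algorithmic
# degree of freedom a transform generator can tune: the radix split in the
# Cooley-Tukey FFT recursion.
import cmath

def dft_naive(x):
    """Direct O(n^2) DFT, used as the recursion base case."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft(x, radix=2):
    """Cooley-Tukey FFT; 'radix' is one choice a generator can search over."""
    n = len(x)
    if n % radix != 0 or n <= radix:
        return dft_naive(x)
    m = n // radix
    # Decimation in time: split into 'radix' interleaved subproblems, recurse.
    subs = [fft(x[r::radix], radix) for r in range(radix)]
    out = [0j] * n
    for k in range(n):
        # Recombine subproblem outputs with twiddle factors.
        out[k] = sum(subs[r][k % m] * cmath.exp(-2j * cmath.pi * r * k / n)
                     for r in range(radix))
    return out
```

Different radix choices produce mathematically equivalent but differently structured computations, which is exactly the space a platform-adapting search engine explores.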
Tiling Optimizations for 3D Scientific Computations
, 2000
"... Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cann ..."
Abstract

Cited by 54 (4 self)
Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17–121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
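As a rough illustration of the kind of loop nest the paper targets, here is a minimal sketch (hypothetical tile sizes and pure-Python arrays; the paper's contribution is the tile/pad selection cost model, not the tiling itself) of tiling the j/k loops of a 3-D 7-point stencil sweep so the data touched while marching along the third dimension stays cache-resident:

```python
# A minimal sketch (toy sizes, no cost model) of j/k loop tiling for a
# 3-D 7-point Jacobi-style stencil sweep over the interior of a cube.

def stencil_tiled(a, tj=4, tk=4):
    """One out-of-place stencil sweep with the j and k loops tiled tj x tk."""
    n = len(a)
    # Copy input so boundary values carry over unchanged (Jacobi-style update).
    b = [[[a[i][j][k] for k in range(n)] for j in range(n)] for i in range(n)]
    for jj in range(1, n - 1, tj):            # tile controlling loops
        for kk in range(1, n - 1, tk):
            for i in range(1, n - 1):         # untiled third dimension
                for j in range(jj, min(jj + tj, n - 1)):
                    for k in range(kk, min(kk + tk, n - 1)):
                        b[i][j][k] = (a[i-1][j][k] + a[i+1][j][k] +
                                      a[i][j-1][k] + a[i][j+1][k] +
                                      a[i][j][k-1] + a[i][j][k+1]) / 6.0
    return b
```

Because the update reads only the old array, tile execution order does not change the result; choosing `tj`/`tk` so each i-march's working set avoids cache conflicts is where the paper's algorithms come in.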
Transforming Loops to Recursion for Multi-Level Memory Hierarchies
 In Proceedings of the SIGPLAN ’00 Conference on Programming Language Design and Implementation
, 2000
"... Recently, there have been several experimental and theoretical results showing significant performance benefits of recursive algorithms on both multilevel memory hierarchies and on sharedmemory systems. In particular, such algorithms have the data reuse characteristics of a blocked algorithm that ..."
Abstract

Cited by 32 (4 self)
Recently, there have been several experimental and theoretical results showing significant performance benefits of recursive algorithms on both multi-level memory hierarchies and on shared-memory systems. In particular, such algorithms have the data reuse characteristics of a blocked algorithm that is simultaneously blocked at many different levels. Most existing applications, however, are written using ordinary loops. We present a new compiler transformation that can be used to convert loop nests into recursive form automatically. We show that the algorithm is fast and effective, handling loop nests with arbitrary nesting and control flow. The transformation achieves substantial performance improvements for several linear algebra codes even on a current system with a two-level cache hierarchy. As a side-effect of this work, we also develop an improved algorithm for transitive dependence analysis (a powerful technique used in the recursion transformation and other loop transformations) that ...
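The target form of such a transformation can be sketched by hand on a matrix-multiply loop nest (a toy written manually, not the paper's automatic compiler algorithm): the triply nested loop becomes a divide-and-conquer over index ranges, which blocks the computation at every level of the hierarchy at once.

```python
# A minimal hand-written sketch of the loop-to-recursion idea:
# C[i0:i1][j0:j1] += A[i0:i1][k0:k1] @ B[k0:k1][j0:j1], recursively.

def matmul_rec(A, B, C, i0, i1, j0, j1, k0, k1, cutoff=8):
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0
    if max(di, dj, dk) <= cutoff:
        for i in range(i0, i1):          # base case: the original loop nest
            for k in range(k0, k1):
                aik = A[i][k]
                for j in range(j0, j1):
                    C[i][j] += aik * B[k][j]
        return
    # Split the largest index range in half and recurse on both halves;
    # every recursion level acts as a block size, so the code is implicitly
    # blocked for every cache level simultaneously.
    if di >= dj and di >= dk:
        m = (i0 + i1) // 2
        matmul_rec(A, B, C, i0, m, j0, j1, k0, k1, cutoff)
        matmul_rec(A, B, C, m, i1, j0, j1, k0, k1, cutoff)
    elif dj >= dk:
        m = (j0 + j1) // 2
        matmul_rec(A, B, C, i0, i1, j0, m, k0, k1, cutoff)
        matmul_rec(A, B, C, i0, i1, m, j1, k0, k1, cutoff)
    else:
        m = (k0 + k1) // 2
        matmul_rec(A, B, C, i0, i1, j0, j1, k0, m, cutoff)
        matmul_rec(A, B, C, i0, i1, j0, j1, m, k1, cutoff)
```

The compiler's job, which this sketch elides, is to derive such a recursive form automatically from the loops and to verify via dependence analysis that the reordering is legal.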
FFT program generation for shared memory: SMP and multicore
 In Proc. Supercomputing
, 2006
"... The chip maker’s response to the approaching end of CPU frequency scaling are multicore systems, which offer the same programming paradigm as traditional shared memory platforms but different performance characteristics. This situation considerably increases the burden on library developers and stre ..."
Abstract

Cited by 15 (10 self)
Chip makers' response to the approaching end of CPU frequency scaling is multicore systems, which offer the same programming paradigm as traditional shared-memory platforms but different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks such as Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared-memory extension of Spiral. The extension consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoid false sharing, and of a backend to generate multithreaded code. Application to the DFT produces a novel class of algorithms suitable for multicore systems, as validated by experimental results: we demonstrate parallelization speedup for sizes that fit into L1 cache and compare favorably to other DFT libraries across all small and mid-size DFTs and considered platforms.
Sparse Tiling for Stationary Iterative Methods
 INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS
, 2004
"... In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a runtime reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applicati ..."
Abstract

Cited by 14 (5 self)
In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a runtime reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applications such as finite element analysis, these iterative methods dominate the execution time. Full sparse tiling chooses a permutation of the rows and columns of the sparse matrix, and then an order of execution that achieves better data locality. We prove that full sparse-tiled Gauss–Seidel generates a solution that is bitwise identical to traditional Gauss–Seidel on the permuted matrix. We also present measurements of the performance improvements and the overheads of full sparse tiling and of cache blocking for irregular grids, a related technique developed by Douglas et al.
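The bitwise-identity claim rests on a reordering identity that can be sketched with toy data structures (a dict-of-rows matrix, not the paper's full sparse tiling schedule): sweeping the rows of Gauss–Seidel in a permuted order performs the same floating-point operations, and hence produces bitwise-identical values, as a plain sweep over the symmetrically permuted matrix.

```python
# A minimal sketch (toy storage format, not the paper's implementation) of a
# Gauss-Seidel sweep whose row visit order is an explicit parameter.

def gauss_seidel_sweep(rows, diag, b, x, order):
    """One in-place Gauss-Seidel sweep.
    rows[i]: dict mapping column index -> off-diagonal value of row i.
    diag[i]: diagonal entry of row i; order: sequence of row indices."""
    for i in order:
        s = sum(v * x[j] for j, v in rows[i].items())
        x[i] = (b[i] - s) / diag[i]   # uses already-updated neighbors
    return x
```

Full sparse tiling picks the permutation (and an execution schedule across iterations) to improve locality; the identity above is why the numerical result is unchanged.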
Satin: Efficient Parallel Divide-and-Conquer
 in Java, in: Euro-Par 2000, no. 1900 in Lecture Notes in Computer Science
, 2000
"... Satin is a system for running divide and conquer programs on distributed memory systems (and ultimately on widearea metacomputing systems). Satin extends Java with three simple Cilklike primitives for divide and conquer programming. The Satin compiler and runtime system cooperate to implement th ..."
Abstract

Cited by 13 (3 self)
Satin is a system for running divide-and-conquer programs on distributed memory systems (and ultimately on wide-area metacomputing systems). Satin extends Java with three simple Cilk-like primitives for divide-and-conquer programming. The Satin compiler and runtime system cooperate to implement these primitives efficiently on a distributed system, using work stealing to distribute the jobs. Satin optimizes the overhead of local jobs using on-demand serialization, which avoids copying and serialization of parameters for jobs that are not stolen. This optimization is implemented using explicit invocation records. We have implemented Satin by extending the Manta compiler. We discuss the performance of ten applications on a Myrinet-based cluster.
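The spawn/sync programming model Satin implements can be sketched in Python (a hypothetical analogue built on a thread pool; it is not Satin's Java primitives, and it forks only near the top of the tree as a crude stand-in for a runtime that keeps non-stolen jobs local and cheap):

```python
# A minimal sketch (hypothetical API, not Satin's) of Cilk-style
# divide-and-conquer: "spawn" forks the left recursive call as a task,
# "sync" joins it; deep subtrees run as ordinary local calls.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def fib(n, depth=0):
    if n < 2:
        return n
    if depth < 3:                                   # fork only near the root
        left = pool.submit(fib, n - 1, depth + 1)   # "spawn"
        right = fib(n - 2, depth + 1)
        return left.result() + right                # "sync"
    # Below the fork threshold, plain recursion: no task objects, no copying,
    # mimicking Satin's cheap path for jobs that are never stolen.
    return fib(n - 1, depth + 1) + fib(n - 2, depth + 1)
```

In Satin the analogous decision is made dynamically: parameter serialization and invocation-record overhead are paid only when a job is actually stolen by another machine.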
A Modal Model of Memory
 In V. N. Alexandrov, J. J. Dongarra, Computer Science
, 2001
"... . We consider the problem of automatically guiding program transformations for locality, despite incomplete information due to complicated program structures, changing target architectures, and lack of knowledge of the properties of the input data. Our system, the modal model of memory, uses limited ..."
Abstract

Cited by 9 (1 self)
We consider the problem of automatically guiding program transformations for locality, despite incomplete information due to complicated program structures, changing target architectures, and lack of knowledge of the properties of the input data. Our system, the modal model of memory, uses limited static analysis and bounded runtime experimentation to produce performance formulas that can be used to make runtime locality transformation decisions. Static analysis is performed once per program to determine its memory reference properties, using modes, a small set of parameterized kernel reference patterns. Once per architectural system, our system automatically performs a set of experiments to determine a family of kernel performance formulas. The system can use these kernel formulas to synthesize a performance formula for any program's mode tree. Finally, with program transformations represented as mappings between mode trees, the generated performance formulas can be used to guide transformation decisions.
Portable High Performance Programming via Architecture-Cognizant Divide-and-Conquer Algorithms
, 2000
"... ...................................................... xiii 1 Introduction .................................................. 1 1. DivideandConquer and the Memory Hierarchy . . . . . . . . . . . 2 2. Overview of ArchitectureCognizant Divideand Conquer . . . . . . 4 3. Overview of Napoleon . . . ..."
Abstract

Cited by 5 (0 self)
[Thesis front matter: table-of-contents excerpt.] 1 Introduction: Divide-and-Conquer and the Memory Hierarchy; Overview of Architecture-Cognizant Divide-and-Conquer; Overview of Napoleon; What You Can Expect; Contributions (Divide-and-Conquer Algorithms for Performance Programming; The Importance of Architecture-Cognizance; Complexity of Determining Variant Policy; A Framework and System for Divide-and-Conquer Implementations; The Fastest Portable FFT Algorithm); Outline of Thesis ...
Faster FFTs via Architecture-Cognizance
 In Proceedings of PACT 2000
, 2000
"... algorithms in computational science, accounting for large amounts of computing time. One major problem with modem FFT implementations is that they poorly scale to large problem. As the problem size increases, stride and associativity effects play a larger role. The result is a severe dropoff in per ..."
Abstract

Cited by 5 (0 self)
... algorithms in computational science, accounting for large amounts of computing time. One major problem with modern FFT implementations is that they scale poorly to large problems. As the problem size increases, stride and associativity effects play a larger role. The result is a severe drop-off in performance.