SPIRAL: Code Generation for DSP Transforms
 PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION
Cited by 212 (39 self)
Fast-changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL, which considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high-performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem, and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, SPIRAL “intelligently” generates and explores algorithmic and implementation choices to find the best match to the computer’s microarchitecture. The “intelligence” is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high-performance code for a broad set of DSP transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human-tuned transform library code.
Brook for GPUs: Stream Computing on Graphics Hardware
 ACM TRANSACTIONS ON GRAPHICS
, 2004
Cited by 204 (9 self)
In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming coprocessor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications: the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.
Sequoia: Programming the Memory Hierarchy
, 2006
Cited by 151 (9 self)
We present Sequoia, a programming language designed to facilitate the development of memory-hierarchy-aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed-memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.
OSKI: A library of automatically tuned sparse matrix kernels
 Institute of Physics Publishing
, 2005
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
, 2004
Cited by 95 (1 self)
Utilizing graphics hardware for general-purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high-bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.
Parallel tiled QR factorization for multicore architectures
, 2007
Cited by 84 (43 self)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can only be exploited at the level of the BLAS operations.
Is Search Really Necessary to Generate High-Performance BLAS?
, 2005
Cited by 67 (12 self)
A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values by generating programs with many different combinations of parameter values, and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine, and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective, and can generate code with performance comparable to that of code generated by ATLAS using global search. Index Terms: program optimization, empirical optimization, model-driven optimization, compilers, library generators, BLAS, high-performance computing
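The contrast between the two parameter-selection strategies in this abstract can be illustrated with a toy sketch. This is not ATLAS or the authors' model: the function names (`tiled_matmul`, `model_tile`, `search_tile`), the "three tiles must fit in cache" rule, and the cache size of 1024 words are all assumptions chosen for illustration, and pure-Python timings say little about real hardware.

```python
import math
import time

def tiled_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_c, row_a = c[i], a[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c

def model_tile(cache_words, candidates):
    """Model-driven choice (assumed rule): the largest tile size for which
    one tile each of A, B, and C fits in a cache of cache_words words."""
    fitting = [t for t in candidates if 3 * t * t <= cache_words]
    return max(fitting) if fitting else min(candidates)

def search_tile(a, b, n, candidates, repeats=3):
    """Empirical choice: run every candidate and keep the fastest."""
    timings = {}
    for t in candidates:
        best = math.inf
        for _ in range(repeats):
            start = time.perf_counter()
            tiled_matmul(a, b, n, t)
            best = min(best, time.perf_counter() - start)
        timings[t] = best
    return min(timings, key=timings.get), timings

n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
candidates = [4, 8, 16, 32]
print("model pick:", model_tile(cache_words=1024, candidates=candidates))
print("search pick:", search_tile(a, b, n, candidates)[0])
```

Both strategies feed the same code generator here (`tiled_matmul`), which mirrors the experimental setup described above: only the chosen parameter value differs.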
Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy
 In International Symposium on Code Generation and Optimization
, 2005
Cited by 57 (10 self)
This paper describes an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. We have developed an initial implementation and applied this approach to two case studies, Matrix Multiply and Jacobi Relaxation. For Matrix Multiply, our results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve performance comparable to the ATLAS self-tuning library and the hand-tuned vendor BLAS library. Jacobi results also substantially outperform the native compilers.
Least-squares meshes
 In Shape Modeling International (SMI)
, 2004
Cited by 48 (5 self)
Figure 1: LS-mesh: a mesh constructed from a given connectivity graph and a sparse set of control points with geometry. In this example the connectivity is taken from the camel mesh. In (a) the LS-mesh is constructed with 100 control points and in (c) with 2000 control points. The connectivity graph contains 39074 vertices (without any geometric information). (b) and (d) show close-ups on the head; the control points are marked by red balls.
In this paper we introduce Least-squares Meshes: meshes with a prescribed connectivity that approximate a set of control points in a least-squares sense. The given mesh consists of a planar graph with arbitrary connectivity and a sparse set of control points with geometry. The geometry of the mesh is reconstructed by solving a sparse linear system. The linear system not only defines a surface that approximates the given control points, but it also distributes the vertices over the surface in a fair way. That is, each vertex lies as close as possible to the center of gravity of its immediate neighbors. The Least-squares Meshes (LS-meshes) are a visually smooth and fair approximation of the given control points. We show that the connectivity of the mesh contains geometric information that affects the shape of the reconstructed surface. Finally, we discuss the applicability of LS-meshes to approximation of given surfaces, smooth completion, mesh editing and progressive transmission.
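The linear system described in this abstract can be sketched on a tiny example. The following is a minimal illustration under stated assumptions, not the authors' implementation: it handles one coordinate at a time, uses a dense solver on the normal equations instead of a sparse solver, weights all rows equally, and the names `solve_dense` and `ls_mesh_coordinate` are hypothetical. Vertices are assumed to be numbered 0 to n-1.

```python
def solve_dense(m, rhs):
    """Solve m x = rhs by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def ls_mesh_coordinate(neighbors, controls):
    """Least-squares fit of one coordinate of every vertex.

    neighbors: dict vertex -> list of adjacent vertices (connectivity only).
    controls:  dict vertex -> target value of this coordinate.
    """
    n = len(neighbors)
    rows, rhs = [], []
    # Smoothness rows: each vertex minus the average of its neighbors = 0.
    for v, adj in neighbors.items():
        row = [0.0] * n
        row[v] = 1.0
        for u in adj:
            row[u] -= 1.0 / len(adj)
        rows.append(row)
        rhs.append(0.0)
    # Control rows: a control vertex should sit near its target value.
    for v, target in controls.items():
        row = [0.0] * n
        row[v] = 1.0
        rows.append(row)
        rhs.append(target)
    # Normal equations (A^T A) x = A^T b of the overdetermined system.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(n)]
    return solve_dense(ata, atb)

# Three vertices in a path, with the two end vertices as control points.
path = {0: [1], 1: [0, 2], 2: [1]}
x = ls_mesh_coordinate(path, {0: 0.0, 2: 2.0})
print(x)
```

For this path the result is approximately 0.5, 1.0, 1.5: the control values 0 and 2 are approximated and not interpolated, because the smoothness rows pull the end vertices toward their neighbor.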
Closing the gap: CPU and FPGA Trends in sustainable floating-point BLAS performance
Cited by 46 (4 self)
Field programmable gate arrays (FPGAs) have long been an attractive alternative to microprocessors for computing tasks, as long as floating-point arithmetic is not required. Fueled by the advance of Moore's Law, FPGAs are rapidly reaching sufficient densities to enhance peak floating-point performance as well. The question, however, is how much of this peak performance can be sustained. This paper examines three of the basic linear algebra subroutine (BLAS) functions: vector dot product, matrix-vector multiply, and matrix multiply. A comparison of microprocessors, FPGAs, and Reconfigurable Computing platforms is performed for each operation. The analysis highlights the amount of memory bandwidth and internal storage needed to sustain peak performance with FPGAs. This analysis considers the historical context of the last six years and is extrapolated for the next six years.
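A back-of-the-envelope sketch of why the three operations differ in the bandwidth they need is given below. It is not taken from the paper: it uses the standard operation counts, and it assumes that every input word is fetched from memory exactly once and every output word written once, which for matrix multiply presumes enough internal storage for full reuse. The names `blas_traffic` and `required_bandwidth` are hypothetical.

```python
def blas_traffic(n):
    """Return {operation: (flops, words_moved)} for operands of size n,
    assuming each input is read once and each output is written once."""
    return {
        "dot": (2 * n, 2 * n + 1),              # two vectors in, one scalar out
        "matvec": (2 * n * n, n * n + 2 * n),   # matrix and vector in, vector out
        "matmul": (2 * n ** 3, 3 * n * n),      # two matrices in, one out
    }

def required_bandwidth(peak_flops, n, bytes_per_word=8):
    """Bytes per second of memory bandwidth needed to keep a unit with
    the given peak flop rate busy, under the assumptions above."""
    return {
        op: peak_flops * words / flops * bytes_per_word
        for op, (flops, words) in blas_traffic(n).items()
    }

for op, bw in required_bandwidth(peak_flops=1e9, n=1000).items():
    print(f"{op}: {bw:.3e} bytes/s")
```

Under these assumptions the dot product and matrix-vector multiply need bandwidth proportional to the flop rate regardless of n, while the matrix multiply requirement shrinks as n grows, provided the internal storage can hold the data being reused.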