Results 1  10
of
101
The design and implementation of FFTW3
 PROCEEDINGS OF THE IEEE
, 2005
"... FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with handoptimized libraries, and describes the software structure that makes our cu ..."
Abstract

Cited by 717 (3 self)
 Add to MetaCart
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with handoptimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for realdata DFTs of prime size, a new way of implementing DFTs by means of machinespecific singleinstruction, multipledata (SIMD) instructions, and how a specialpurpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.
SPIRAL: Code Generation for DSP Transforms
 PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION
"... Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performancecritical domain of linear digital signal proces ..."
Abstract

Cited by 212 (39 self)
 Add to MetaCart
Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performancecritical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem, and exploits the domainspecific mathematical structure of transform algorithms to implement a feedbackdriven optimizer. Similar to a human expert, for a specified transform, SPIRAL “intelligently ” generates and explores algorithmic and implementation choices to find the best match to the computer’s microarchitecture. The “intelligence” is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high performance code for a broad set of DSP transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human tuned transform library code.
A comparison of empirical and modeldriven optimization
 In ACM Symp. on Programming Language Design and Implementation (PLDI’03
, 2003
"... Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose thes ..."
Abstract

Cited by 104 (12 self)
 Add to MetaCart
(Show Context)
Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose these parameters. It is widely believed that empirical optimization is more effective than modeldriven optimization, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the empirical optimization engine in ATLAS (a system for generating dense numerical linear algebra libraries) with a modelbased optimization engine that used detailed models to estimate values for optimization parameters, and then measured the relative performance of the two systems on three different hardware platforms. Our experiments show that although modelbased optimization can be surprisingly effective, useful models may have to consider not only hardware parameters but also the ability of backend compilers to exploit hardware resources. 1.
SPIRAL: A Generator for PlatformAdapted Libraries of Signal Processing Algorithms
 Journal of High Performance Computing and Applications
, 2004
"... SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be reoptimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical fra ..."
Abstract

Cited by 82 (20 self)
 Add to MetaCart
(Show Context)
SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be reoptimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical framework that concisely describes signal transforms and their fast algorithms; the formula generator that captures at the algorithmic level the degrees of freedom in expressing a particular signal processing transform; the formula translator that encapsulates the compilation degrees of freedom when translating a specific algorithm into an actual code implementation; and, finally, an intelligent search engine that finds within the large space of alternative formulas and implementations
PetaBricks: a language and compiler for algorithmic choice, volume 44
, 2009
"... It is often impossible to obtain a onesizefitsall solution for high performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to different architectures, problem sizes, ..."
Abstract

Cited by 79 (10 self)
 Add to MetaCart
(Show Context)
It is often impossible to obtain a onesizefitsall solution for high performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to different architectures, problem sizes, data, and available system resources. In some cases, completely different algorithms may provide the best performance. Current compiler and programming language techniques are able to change some of these parameters, but today there is no simple way for the programmer to express or the compiler to choose different algorithms to handle different parts of the data. Existing solutions normally can handle only coarsegrained, library level selections or hand coded cutoffs between base cases and recursive cases. We present PetaBricks, a new implicitly parallel language
Managing performance vs. accuracy tradeoffs with loop perforation
 In SIGSOFT FSE
, 2011
"... Many modern computations (such as video and audio encoders, Monte Carlo simulations, and machine learning algorithms) are designed to trade off accuracy in return for increased performance. To date, such computations typically use adhoc, domainspecific techniques developed specifically for the com ..."
Abstract

Cited by 66 (18 self)
 Add to MetaCart
(Show Context)
Many modern computations (such as video and audio encoders, Monte Carlo simulations, and machine learning algorithms) are designed to trade off accuracy in return for increased performance. To date, such computations typically use adhoc, domainspecific techniques developed specifically for the computation at hand. Loop perforation provides a general technique to trade accuracy for performance by transforming loops to execute a subset of their iterations. A criticality testing phase filters out critical loops (whose perforation produces unacceptable behavior) to identify tunable loops (whose perforation produces more efficient and still acceptably accurate computations). A perforation space exploration algorithm perforates combinations of tunable loops to find Paretooptimal perforation policies. Our results indicate that, for a range of applications, this approach typically delivers performance increases of over a factor of two (and up to a factor of seven) while changing the result that the application produces by less than 10%.
Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy
 In International Symposium on Code Generation and Optimization
, 2005
"... This paper describes an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for densematrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heurist ..."
Abstract

Cited by 57 (10 self)
 Add to MetaCart
(Show Context)
This paper describes an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for densematrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. We have developed an initial implementation and applied this approach to two case studies, Matrix Multiply and Jacobi Relaxation. For Matrix Multiply, our results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve comparable performance as the ATLAS selftuning library and the handtuned vendor BLAS library. Jacobi results also substantially outperform the native compilers. 1
Dynamic knobs for responsive poweraware computing
 In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’11
, 2011
"... We present PowerDial, a system for dynamically adapting application behavior to execute successfully in the face of load and power fluctuations. PowerDial transforms static configuration parameters into dynamic knobs that the PowerDial control system can manipulate to dynamically trade off the accur ..."
Abstract

Cited by 51 (16 self)
 Add to MetaCart
(Show Context)
We present PowerDial, a system for dynamically adapting application behavior to execute successfully in the face of load and power fluctuations. PowerDial transforms static configuration parameters into dynamic knobs that the PowerDial control system can manipulate to dynamically trade off the accuracy of the computation in return for reductions in the computational resources that the application requires to produce its results. These reductions translate directly into performance improvements and power savings. Our experimental results show that PowerDial can enable our benchmark applications to execute responsively in the face of power caps that would otherwise significantly impair responsiveness. They also show that PowerDial can significantly reduce the number of machines required to service intermittent load spikes, enabling reductions in power and capital costs.
A Dynamically Tuned Sorting Library
, 2004
"... Empirical search is a strategy used during the installation of library generators such as ATLAS, FFTW, and SPIRAL to identify the algorithm or the version of an algorithm that delivers the best performance. In the past, empirical search has been applied almost exclusively to scientific problems. In ..."
Abstract

Cited by 51 (7 self)
 Add to MetaCart
Empirical search is a strategy used during the installation of library generators such as ATLAS, FFTW, and SPIRAL to identify the algorithm or the version of an algorithm that delivers the best performance. In the past, empirical search has been applied almost exclusively to scientific problems. In this paper, we discuss the application of empirical search to sorting, which is one of the best understood symbolic computing problems. When contrasted with the dense numerical computations of ATLAS, FFTW, and SPIRAL, sorting presents a new challenge, namely that the relative performance of the algorithms depend not only on the characteristics of the target machine and the size of the input data but also on the distribution of values in the input data set. Empirical search is applied in the study reported here as part of a sorting library generator. The resulting routines dynamically adapt to the characteristics of the input data by selecting the best sorting algorithm from a small set of alternatives. To generate the run time selection mechanism our generator makes use of machine learning to predict the best algorithm as a function of the characteristics of the input data set and the performance of the different algorithms on the target machine. This prediction is based on the data obtained through empirical search at installation time. Our results show that our approach is quite effective. When sorting data inputs of 12M keys with various standard deviations, our adaptive approach selected the best algorithm for all the input data sets and all platforms that we tried in our experiments. The wrong decision could have introduced a performance degradation of up to 133%, with an average value of 44%.
A scalable autotuning framework for compiler optimization
 In Proceedings of the 23rd IEEE International Parallel And Distributed Computing Symposium (IPDPS
"... We describe a scalable and generalpurpose framework for autotuning compilergenerated code. We combine Active Harmony’s parallel search backend with the CHiLL compiler transformation framework to generate in parallel a set of alternative implementations of computation kernels and automatically sel ..."
Abstract

Cited by 50 (10 self)
 Add to MetaCart
(Show Context)
We describe a scalable and generalpurpose framework for autotuning compilergenerated code. We combine Active Harmony’s parallel search backend with the CHiLL compiler transformation framework to generate in parallel a set of alternative implementations of computation kernels and automatically select the one with the bestperforming implementation. The resulting system achieves performance of compilergenerated code comparable to the fully automated version of the ATLAS library for the tested kernels. Performance for various kernels is 1.4 to 3.6 times faster than the native Intel compiler without search. Our search algorithm simultaneously evaluates different combinations of compiler optimizations and converges to solutions in only a few tens of searchsteps. 1