Results 1-10 of 67
Automated empirical optimizations of software and the ATLAS project
 PARALLEL COMPUTING
, 2001
Abstract

Cited by 301 (36 self)
This paper describes the automatically tuned linear algebra software (ATLAS) project, as well as the fundamental principles that underlie it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term automated empirical optimization of software (AEOS); this style of library management has been created in order to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore's Law. ATLAS is the application of this new paradigm to linear algebra software, with the present emphasis on the basic linear algebra subprograms (BLAS), a widely used, performance-critical,
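The core AEOS idea — generate many functionally equivalent code variants, time each on the target machine, and keep the fastest — can be sketched in a few lines. The toy below tunes the tile size of a blocked matrix multiply; the names `matmul_blocked` and `autotune` are illustrative, not from ATLAS, and a real tuner would sweep many more parameters (unrolling, register blocking, instruction scheduling).

```python
import time

def matmul_blocked(A, B, bs):
    """Toy blocked matrix multiply; bs is the candidate tile size."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik, row, Ci = A[i][k], B[k], C[i]
                        for j in range(jj, min(jj + bs, n)):
                            Ci[j] += aik * row[j]
    return C

def autotune(n=64, candidates=(4, 8, 16, 32)):
    """AEOS-style search: time each candidate variant, keep the fastest."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:
        t0 = time.perf_counter()
        matmul_blocked(A, B, bs)
        timings[bs] = time.perf_counter() - t0
    return min(timings, key=timings.get), timings
```

Because the search is empirical rather than model-driven, re-running it on a new machine re-specializes the library automatically — which is what lets such libraries track hardware advances.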
A Fast Fourier Transform Compiler
, 1999
Abstract

Cited by 158 (5 self)
The FFTW library for computing the discrete Fourier transform (DFT) has gained wide acceptance in both academia and industry, because it provides excellent performance on a variety of machines (even competitive with or faster than equivalent libraries supplied by vendors). In FFTW, most of the performance-critical code was generated automatically by a special-purpose compiler, called genfft, that outputs C code. Written in Objective Caml, genfft can produce DFT programs for any input length, and it can specialize the DFT program for the common case where the input data are real instead of complex. Unexpectedly, genfft “discovered” algorithms that were previously unknown, and it was able to reduce the arithmetic complexity of some other existing algorithms. This paper describes the internals of this special-purpose compiler in some detail, and it argues that a specialized compiler is a valuable tool.
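The basic move of such a generator — a small program that emits fully unrolled transform code with the twiddle factors folded in as constants — can be sketched without reproducing genfft's internals. The toy below emits Python rather than C, performs none of genfft's algebraic simplification or scheduling, and uses the hypothetical names `gen_dft` and `compile_dft`; it only illustrates the "compiler that writes the codelet" idea.

```python
import cmath

def gen_dft(n):
    """Emit source for a fully unrolled size-n naive DFT, with twiddle
    factors precomputed and embedded as literal constants."""
    lines = [f"def dft{n}(x):"]
    outs = []
    for k in range(n):
        terms = []
        for j in range(n):
            w = cmath.exp(-2j * cmath.pi * j * k / n)
            terms.append(f"({w.real!r}+{w.imag!r}j)*x[{j}]")
        lines.append(f"    y{k} = " + " + ".join(terms))
        outs.append(f"y{k}")
    lines.append("    return [" + ", ".join(outs) + "]")
    return "\n".join(lines)

def compile_dft(n):
    """Turn the generated source into a callable transform."""
    ns = {}
    exec(gen_dft(n), ns)
    return ns[f"dft{n}"]
```

A real generator would next run common-subexpression elimination and constant folding over the emitted expressions — the stage where genfft found its arithmetic savings.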
Similarity search over time series data using wavelets
 In ICDE
, 2002
Abstract

Cited by 61 (0 self)
We consider the use of wavelet transformations as a dimensionality reduction technique to permit efficient similarity search over high-dimensional time-series data. While numerous transformations have been proposed and studied, the only wavelet that has been shown to be effective for this application is the Haar wavelet. In this work, we observe that a large class of wavelet transformations (not only orthonormal wavelets but also biorthonormal wavelets) can be used to support similarity search. This class includes the most popular and most effective wavelets being used in image compression. We present a detailed performance study of the effects of using different wavelets on the performance of similarity search for time-series data. We include several wavelets that outperform both the Haar wavelet and the best known non-wavelet transformations for this application. To ensure our results are usable by an application engineer, we also show how to configure an indexing strategy for the best performing transformations. Finally, we identify classes of data that can be indexed efficiently using these wavelet transformations.
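The dimensionality-reduction step can be made concrete with the Haar wavelet, the baseline the paper improves on: transform each series, then index only the first few (coarsest) coefficients. With the orthonormal normalization below, the transform preserves Euclidean distance (Parseval), which is what makes the reduced index admissible for similarity search. This is a minimal sketch, not the paper's biorthogonal construction; `reduce_dim` is an illustrative name.

```python
import math

def haar(series):
    """Full Haar wavelet decomposition of a length-2^k series
    (orthonormal normalization: pairwise sums/differences over sqrt(2))."""
    coeffs = []
    cur = list(series)
    while len(cur) > 1:
        avg = [(cur[i] + cur[i + 1]) / math.sqrt(2) for i in range(0, len(cur), 2)]
        det = [(cur[i] - cur[i + 1]) / math.sqrt(2) for i in range(0, len(cur), 2)]
        coeffs = det + coeffs   # finest details end up last
        cur = avg
    return cur + coeffs

def reduce_dim(series, k):
    """Keep only the k coarsest coefficients as the index key."""
    return haar(series)[:k]
```

Because energy is preserved, the distance between two k-coefficient keys lower-bounds the true distance, so the index never misses a true match (it can only admit false candidates that are filtered afterwards).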
Virtual Radios
, 1998
Abstract

Cited by 45 (3 self)
Conventional software radios take advantage of vastly improved A/D converters and DSP hardware. Our approach, which we refer to as virtual radios, also depends upon high performance A/D converters. However, rather than use DSPs, we have chosen to ride the curve of rapidly improving workstation hardware. We use wideband digitization and then perform all of the digital signal processing in user space on a general purpose workstation. This approach allows us to experiment with new approaches to signal processing that exploit the hardware and software resources of the workstation. Furthermore, it allows us to experiment with different ways of structuring systems in which the radio component of communication devices is integrated with higher-level applications. This paper describes the design and performance of an environment we have constructed that facilitates building virtual radios and of two applications built using that environment. The environment consists of an I/O subsystem that p...
Architecture-Cognizant Divide and Conquer Algorithms
, 1999
Abstract

Cited by 27 (5 self)
Divide and conquer programs can achieve good performance on parallel computers and computers with deep memory hierarchies. We introduce architecture-cognizant divide and conquer algorithms, and explore how they can achieve even better performance. An architecture-cognizant algorithm has functionally equivalent variants of the divide and/or combine functions, and a variant policy that specifies which variant to use at each level of recursion. An optimal variant policy is chosen for each target computer via experimentation. With h levels of recursion, an exhaustive search requires O(v^h) experiments (where v is the number of variants). We present a method based on dynamic programming that reduces this to O(h^c) (where c is typically a small constant) experiments for a class of architecture-cognizant programs. We verify our technique on two kernels (matrix multiply and 2D Point Jacobi) using three architectures. Our technique improves performance by up to a factor of two, compared...
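The way dynamic programming collapses the exponential policy search can be sketched under one simplifying assumption (not necessarily the paper's exact formulation): a level's cost depends only on its own variant and the variant chosen one level below. Then the best policy rooted at each (level, variant) pair can be built bottom-up, replacing the v^h exhaustive search with O(h·v²) lookups. The function names and cost interface below are hypothetical.

```python
def best_policy(pair_cost, leaf_cost, depth, variants):
    """DP sketch for per-level variant selection.
    pair_cost(level, v, w): cost of variant v at `level` above w at `level+1`.
    leaf_cost(v): cost of variant v at the deepest level.
    Returns (total_cost, policy) where policy[i] is the variant for level i."""
    # best[v] = (cost, policy) of the best policy whose current level uses v
    best = {v: (leaf_cost(v), [v]) for v in variants}
    for level in range(depth - 2, -1, -1):          # walk from leaves upward
        nxt = {}
        for v in variants:
            cands = [(pair_cost(level, v, w) + best[w][0], [v] + best[w][1])
                     for w in variants]
            nxt[v] = min(cands, key=lambda c: c[0])
        best = nxt
    return min(best.values(), key=lambda c: c[0])
```

In practice each `pair_cost` entry would be a measured running time on the target machine, so the experiment count, not just the arithmetic, drops from exponential to polynomial.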
Memory Characteristics of Iterative Methods
, 1999
Abstract

Cited by 22 (9 self)
Conventional implementations of iterative numerical algorithms, especially multigrid methods, merely reach a disappointingly small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that the current DRAM technology cannot provide the data fast enough to keep the CPU busy. Although the fundamentals of cache optimizations are quite simple, current compilers cannot optimize even elementary iterative schemes. In this paper, we analyze the memory and cache behavior of iterative methods with extensive profiling and describe program transformation techniques to improve the cache performance of two- and three-dimensional multigrid algorithms.
1 Introduction
Multigrid methods [11, 5] are among the most attractive algorithms for the solution of large sparse systems of equations that arise in the solution of elliptic partial differential equations (PDEs). However, even simple multi...
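One of the simplest transformations of the kind described — one a compiler could in principle apply but typically does not — is tiling the sweep so each tile's working set stays in cache. The sketch below shows a naive 2-D Jacobi sweep and a tiled traversal of the same stencil; the two are bit-for-bit equivalent, since Jacobi reads only the old grid, so only the memory access order changes. (Plain Python lists won't show the cache effect; the point is the transformation itself.)

```python
def jacobi_sweep(grid):
    """One naive 2-D Jacobi sweep (4-point stencil) on an n x n grid."""
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                + grid[i][j-1] + grid[i][j+1])
    return new

def jacobi_sweep_blocked(grid, bs=2):
    """Same sweep traversed in bs x bs tiles: identical results, but each
    tile reuses the grid rows it touches while they are still in cache."""
    n = len(grid)
    new = [row[:] for row in grid]
    for ii in range(1, n - 1, bs):
        for jj in range(1, n - 1, bs):
            for i in range(ii, min(ii + bs, n - 1)):
                for j in range(jj, min(jj + bs, n - 1)):
                    new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                        + grid[i][j-1] + grid[i][j+1])
    return new
```

Temporal blocking — fusing several sweeps over one tile before moving on — goes further and is where most of the multigrid gains reported in such studies come from, but it requires more care with data dependences than this spatial tiling.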
Scheduling Threads for Low Space Requirement and Good Locality
 In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA
, 1999
Abstract

Cited by 20 (1 self)
The running time and memory requirement of a parallel program with dynamic, lightweight threads depends heavily on the underlying thread scheduler. In this paper, we present a simple, asynchronous, space-efficient scheduling algorithm for shared memory machines that combines the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. For a nested-parallel program with depth D and serial space requirement S1, we show that the expected space requirement is S1 + O(K·p·D) on p processors. Here, K is a user-adjustable runtime parameter, which provides a tradeoff between running time and space requirement. Our algorithm achieves good locality and low scheduling overheads by automatically increasing the granularity of the work scheduled on each processor. We have implemented the new scheduling algorithm in the context of a native, user-level implementation of POSIX standard threads (Pthreads), and evaluated its p...
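The depth-first priority at the heart of such schedulers can be illustrated with a toy: precompute each task's serial depth-first index, then at every step run the p ready tasks with the earliest indices. With p = 1 this reproduces the serial execution order exactly, which is why the space bound is stated relative to the serial space S1. This sketch omits the work-stealing deques and the granularity parameter K of the actual algorithm; the names below are illustrative.

```python
import heapq

def preorder_index(tree):
    """Assign each node its serial depth-first (1DF) execution index."""
    order, stack, i = {}, [tree], 0
    while stack:
        node = stack.pop()
        order[id(node)] = i
        i += 1
        stack.extend(reversed(node.get("children", [])))
    return order

def schedule(tree, p):
    """Greedy scheduler sketch: each step, run the p ready tasks with the
    smallest depth-first indices. Returns tasks in execution order."""
    prio = preorder_index(tree)
    ready = [(prio[id(tree)], id(tree), tree)]   # id() breaks priority ties
    executed = []
    while ready:
        step = [heapq.heappop(ready) for _ in range(min(p, len(ready)))]
        for _, _, node in step:
            executed.append(prio[id(node)])
            for child in node.get("children", []):
                heapq.heappush(ready, (prio[id(child)], id(child), child))
    return executed
```

Staying close to the serial order bounds how many tasks can be "prematurely" live at once, which is the intuition behind the S1 + O(K·p·D) bound.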
An Adaptive Software Library for Fast Fourier Transforms
 In Proceedings of the International Conference on Supercomputing
, 2000
Abstract

Cited by 18 (2 self)
In this paper we present an adaptive and portable software library for the fast Fourier transform (FFT). The library consists of a number of composable blocks of code called codelets, each computing a part of the transform. The actual FFT algorithm used by the code is determined at runtime by selecting the fastest strategy among all possible strategies, given available codelets, for a given transform size. We also present an efficient automatic method of generating the library modules by using a special-purpose compiler. The code generator is written in C and it generates a library of C codelets. The code generator is shown to be flexible and extensible, and the entire library can be generated in a matter of seconds. We have evaluated the library for performance on the IBM SP2, SGI 2000, HP Exemplar, and Intel Pentium systems. We use the results from these evaluations to build performance models for the FFT library on different platforms. The library is shown to be portable, adaptive, and efficient.
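The runtime strategy selection amounts to a memoized search over Cooley-Tukey factorizations of the transform size into available codelet sizes. The sketch below uses a made-up additive cost model (an adaptive library would substitute measured timings or a calibrated performance model); `CODELETS` and the cost constants are assumptions for illustration only.

```python
from functools import lru_cache

CODELETS = {2, 3, 5, 7}                    # sizes with hand-optimized codelets
CODELET_COST = {n: n for n in CODELETS}    # toy cost model (stand-in for timings)

@lru_cache(maxsize=None)
def plan(n):
    """Cheapest recursive Cooley-Tukey factorization of a size-n FFT over the
    available codelets. Returns (cost, plan) where plan is either a codelet
    size or a pair (d, subplan) meaning: split n into d * (n // d)."""
    if n in CODELETS:
        return CODELET_COST[n], n
    best = None
    for d in sorted(CODELETS):
        if n % d == 0:
            sub_cost, sub = plan(n // d)
            # n//d codelets of size d, d sub-transforms, plus n twiddle updates
            cost = (n // d) * CODELET_COST[d] + d * sub_cost + n
            if best is None or cost < best[0]:
                best = (cost, (d, sub))
    if best is None:
        raise ValueError(f"{n} has prime factors outside the codelet set")
    return best
```

Memoization makes the search cheap enough to run at transform-setup time, and the chosen plan can then be cached and reused for every transform of that size.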
Adaptive Use of Iterative Methods in Predictor-Corrector Interior Point Methods for Linear Programming
 NUMERICAL ALGORITHMS
, 1999
"... ..."
Uniform Frequency Images: Adding Geometry to Images to Produce Space-Efficient Textures
 IEEE Visualization
, 2000
Abstract

Cited by 14 (0 self)
We discuss the concept of uniform frequency images, which exhibit uniform local frequency properties. Such images make optimal use of space when sampled close to their Nyquist limit. A warping function may be applied to an arbitrary image to redistribute its local frequency content, reducing its highest frequencies and increasing its lowest frequencies in order to approach this uniform frequency ideal. The warped image may then be downsampled according to its new, reduced Nyquist limit, thereby reducing its storage requirements. To reconstruct the original image, the inverse warp is applied. We present a general, top-down algorithm to automatically generate a piecewise-linear warping function with this frequency balancing property for a given input image. The image size is reduced by applying the warp and then downsampling. We store this warped, downsampled image plus a small number of polygons with texture coordinates to describe the inverse warp. The original image is later recon...
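A 1-D sketch conveys the frequency-balancing idea, though it is not the paper's 2-D piecewise-linear construction: given per-segment local frequency estimates, place sample positions so that each sample covers an equal share of the total frequency content. High-frequency regions then receive proportionally more samples, which is the effect the warp achieves before downsampling. The function name and interface are hypothetical.

```python
from bisect import bisect_right

def warp_positions(local_freq, m):
    """Place m sample positions over len(local_freq) unit-width segments so
    that each sample covers an equal share of total frequency content
    (a 1-D sketch of a frequency-balancing warp)."""
    n = len(local_freq)
    cum = [0.0]
    for f in local_freq:
        cum.append(cum[-1] + f)          # cumulative frequency content
    total = cum[-1]
    positions = []
    for s in range(m):
        target = total * (s + 0.5) / m   # equal-share targets
        seg = min(bisect_right(cum, target) - 1, n - 1)
        frac = (target - cum[seg]) / (cum[seg + 1] - cum[seg])
        positions.append(seg + frac)     # linear position within the segment
    return positions
```

Inverting the same cumulative map recovers the original coordinates, which is the 1-D analogue of storing the inverse warp alongside the downsampled image.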