## Automated empirical optimizations of software and the ATLAS project (2001)

### Download Links

- [www.netlib.org]
- [www.cs.utsa.edu]
- [icl.cs.utk.edu]
- DBLP

Venue: PARALLEL COMPUTING

Citations: 320 (37 self)

### BibTeX

@ARTICLE{Whaley01automatedempirical,
  author  = {R. Clint Whaley and Antoine Petitet and Jack J. Dongarra},
  title   = {Automated empirical optimizations of software and the ATLAS project},
  journal = {PARALLEL COMPUTING},
  year    = {2001}
}

### Abstract

This paper describes the automatically tuned linear algebra software (ATLAS) project, as well as the fundamental principles that underlie it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term automated empirical optimization of software (AEOS); this style of library management has been created in order to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore's Law. ATLAS is the application of this new paradigm to linear algebra software, with the present emphasis on the basic linear algebra subprograms (BLAS), a widely used, performance-critical, ...

### Citations

767 | A set of Level 3 Basic Linear Algebra Subprograms: Model implementation and test programs - Dongarra, DuCroz, et al. - 1990 |
Citation Context: ...matrix-matrix multiplication (GEMM) described above, the Level 3 BLAS API [5] specifies routines performing triangular matrix-matrix multiply (TRMM), triangular system solve (TRSM), symmetric or Hermitian matrix-matrix multiply (SYMM, HEMM), and symmetric or Hermitian rank-k an... |

689 | High Performance Compilers for Parallel Computing - Wolfe - 1996 |

567 | Basic linear algebra subprograms for fortran usage - Lawson, Hanson, et al. - 1979 |

483 | FFTW: an adaptive software architecture for the FFT - Frigo, Johnson - 1998 |

465 | An extended set of Fortran Basic Linear Algebra Subprograms - Dongarra, DuCroz, et al. - 1988 |

449 | Optimizing Supercompilers for Supercomputers - Wolfe - 1989 |

392 | Automatically tuned linear algebra software - Whaley, Dongarra - 1998 |

234 | Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology - Bilmes, Asanović, et al. - 1997 |
Citation Context: ...g. ATLAS was not, however, the first project to harness AEOS-like techniques for library production and maintenance. As far as we know, the first such successful project was FFTW [9-11], and the PHiPAC [3] project was the first to attempt to apply them to matrix multiply. Other projects with AEOS-like designs include [18-20]. The philosophies, approach and application success of these projects vary wid... |

165 | A fast fourier transform compiler - Frigo - 1998 |

89 | GEMM-based level 3 BLAS : high-performance model implementations and performance evaluation benchmark - Kågström, Ling, et al. - 1995 |

68 | The fastest Fourier transform in the west - Frigo, Johnson - 1997 |

42 | Algorithm 656: An extended set of basic linear algebra subprograms: Model implementation and test programs - Dongarra, Croz, et al. - 1988 |

36 | Recursive blocked data formats and BLASs for dense linear algebra algorithms - Gustavson, Henriksson, et al. - 1998 |
Citation Context: ...past and continuing research has been done on this problem. For instance, partitioning schemes may utilize fixed and machine-specific blocking as in [4,15], or more generalized recursive schemes as in [12,13]. ATLAS implements a relatively simple recursive GEMM-based BLAS design. The row and column dimensions of the triangular, symmetric or Hermitian matrix and only the appropriate dimension of the general... |

32 | The spectral decomposition of nonsymmetric matrices on distributed memory parallel computers - Bai, Demmel, et al. - 1997 |

20 | A Proposal for Standard Linear Algebra Subprograms - Hanson, Krogh, et al. - 1973 |

18 | Superscalar GEMM-based level 3 BLAS - the on-going evolution of a portable and high-performance library - Gustavson, Henriksson, et al. - 1998 |
Citation Context: ...past and continuing research has been done on this problem. For instance, partitioning schemes may utilize fixed and machine-specific blocking as in [4,15], or more generalized recursive schemes as in [12,13]. ATLAS implements a relatively simple recursive GEMM-based BLAS design. The row and column dimensions of the triangular, symmetric or Hermitian matrix and only the appropriate dimension of the general... |

14 | A Parallel Block Implementation of Level 3 BLAS for MIMD Vector Processors - Dayde, Duff, et al. - 1994 |
Citation Context: ...ossible partitioning algorithms, and a great deal of past and continuing research has been done on this problem. For instance, partitioning schemes may utilize fixed and machine-specific blocking as in [4,15], or more generalized recursive schemes as in [12,13]. ATLAS implements a relatively simple recursive GEMM-based BLAS design. The row and column dimensions of the triangular, symmetric or Hermitian matr... |

9 | A Parallel Block Implementation of Level 3 - Dayde, Duff, et al. - 1994 |

8 | The IBM RISC System/6000 and linear algebra operations - Dongarra, Mayes, Radicati di Brozolo |
Citation Context: ...d algorithm for matrix multiply, it is still possible to arrange for the operations to be performed with data for the most part in cache by dividing the matrix into blocks. For additional details see [8]. Using this BLAS routine, the rest of the Level 3 BLAS can be efficiently supported, so GEMM is the Level 3 BLAS computational kernel. ATLAS supports this kernel using both parameterized adaptation and... |

6 | The IBM RISC System 6000 and linear algebra operations - Dongarra, Mayes, Radicati di Brozolo - 1991 |

4 | OptimQR - a software package to create near-optimal solvers for sparse systems of linear equations. http://ostenfeld.dk/jakob/OptimQR - Ostergaard |

3 | See homepage for a complete list of the people involved |

2 | See homepage for a complete list of the people involved. Signal Processing algorithms Implementation Research for Adaptable Libraries. http://www.ece.cmu.edu/spiral |

1 | Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology - Bilmes, Asanovic, et al. - 1996 |

1 | A Parallel Block Implementation of Level 3 BLAS for MIMD Vector Processors - Dayde, Petitet - 1994 |