
## Autotuning and specialization: Speeding up matrix multiply for small matrices with compiler technology (2009)

Venue: Fourth International Workshop on Automatic Performance Tuning

Citations: 3 (2 self)

### Citations

466 | Automatically Tuned Linear Algebra Software
- Whaley, Dongarra
- 1998
Citation Context: ...to auto-tuning software that employs empirical techniques to evaluate a set of alternative mappings of computation kernels to an architecture and select the mapping that obtains the best performance [4, 18, 8, 15, 16]. In this paper, we consider collaborative autotuning tools, which work with application programmers or library developers to automate their performance tuning tasks and permit them to express their ...

261 | Optimizing matrix multiply using PHiPAC: A portable high-performance ANSI C methodology
- Bilmes, Asanovic, et al.
- 1997
Citation Context: ...to auto-tuning software that employs empirical techniques to evaluate a set of alternative mappings of computation kernels to an architecture and select the mapping that obtains the best performance [4, 18, 8, 15, 16]. In this paper, we consider collaborative autotuning tools, which work with application programmers or library developers to automate their performance tuning tasks and permit them to express their ...

209 | SPIRAL: Code generation for DSP transforms
- Puschel, Moura, et al.
- 2005
Citation Context: ...to auto-tuning software that employs empirical techniques to evaluate a set of alternative mappings of computation kernels to an architecture and select the mapping that obtains the best performance [4, 18, 8, 15, 16]. In this paper, we consider collaborative autotuning tools, which work with application programmers or library developers to automate their performance tuning tasks and permit them to express their ...

150 | OSKI: A library of automatically tuned sparse matrix kernels
- Vuduc, Demmel, et al.
- 2005
Citation Context: ...stem for digital signal processing transforms [15]. OSKI combines install-time evaluations with run-time models to tune sparse matrix-vector multiplication and other solvers such as triangular solve [17]. Compiler-assisted autotuning and tools facilitating code transformations have also been extensively studied. Knijnenburg et al. compared various search algorithms in the space of tiling two dimensi...

108 | Combined selection of tile sizes and unroll factors using iterative compilation
- Kisuki, Knijnenburg, et al.
- 2000
Citation Context: ...so been extensively studied. Knijnenburg et al. compared various search algorithms in the space of tiling two dimensions and unrolling one dimension for multiple loop orders of matrix multiplication [14]. Chen et al. combine compiler models and heuristics with guided empirical evaluations to take advantage of their complementary strengths [6]. Tiwari et al. combine Active Harmony and CHiLL to gener...
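The iterative-compilation idea described above can be sketched in a few lines: enumerate candidate (tile size, unroll factor) pairs, evaluate each variant, and keep the best. This is a minimal illustrative sketch, not the cited tools' actual algorithms; the function names and the stand-in cost model are my own.

```python
import itertools

def evaluate(tile, unroll):
    # Stand-in cost model for illustration only. A real autotuner would
    # compile the code variant with these parameters and time it on the
    # target machine instead of computing an analytic score.
    return abs(tile - 48) + abs(unroll - 4)

def search(tiles, unrolls):
    # Exhaustive sweep of the (tile size, unroll factor) space.
    # The cited work compares smarter search algorithms (e.g. random,
    # simulated annealing) over exactly this kind of space.
    return min(itertools.product(tiles, unrolls),
               key=lambda p: evaluate(*p))

best_tile, best_unroll = search([16, 32, 48, 64], [1, 2, 4, 8])
```

Replacing `evaluate` with an actual compile-and-time step turns this exhaustive sweep into the empirical search the cited papers study.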

73 | The Fastest Fourier Transform in the West
- Frigo, Johnson

64 | Is search really necessary to generate high-performance
- Yotov, Li, et al.
- 2005
Citation Context: ...ency. Additional code transformations that can improve instruction-level parallelism (ILP) are performed to optimize the computation. Several examples in the literature describe this general approach [4, 18, 6, 9, 20]. However, if we look closer at matrices of size 10 or smaller, those same BLAS libraries perform below 25% of peak performance, as seen in Figure 2(b). This is because the optimization strategy for sma...

56 | Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy
- Chen, Chame, et al.
- 2005
Citation Context: ...ency. Additional code transformations that can improve instruction-level parallelism (ILP) are performed to optimize the computation. Several examples in the literature describe this general approach [4, 18, 6, 9, 20]. However, if we look closer at matrices of size 10 or smaller, those same BLAS libraries perform below 25% of peak performance, as seen in Figure 2(b). This is because the optimization strategy for sma...

48 | A scalable auto-tuning framework for compiler optimization
- Tiwari, Chen, et al.

38 | CHiLL: A framework for composing high-level loop transformations
- Chen, Chame, et al.
- 2008
Citation Context: ...rror-prone process. CHiLL allows the programmer to apply complex transformation strategies to a loop nest by specifying a series of composable loop transformations using a high-level script interface [5, 7, 16]. CHiLL is a state-of-the-art polyhedral loop transformation framework designed with the autotuning environment in mind. Its high-level script interface can be used by application programmers or compi...

24 | Annotation-based empirical performance tuning using Orio
- Hartono, Norris, et al.
- 2009
Citation Context: ...luate code variants. They use a search strategy similar to the Nelder-Mead algorithm [16]. Hartono et al. use annotations in the code to describe performance-improving transformations for C programs [10]. POET is a scripting language for parameterizing complex code transformations [19], which can be used in an autotuning process as well. For tuning matrix multiply for small matrices, the work most cl...

21 | Poet: Parameterized optimizations for empirical tuning
- Yi, Seymour, et al.
- 2007
Citation Context: ...m [16]. Hartono et al. use annotations in the code to describe performance-improving transformations for C programs [10]. POET is a scripting language for parameterizing complex code transformations [19], which can be used in an autotuning process as well. For tuning matrix multiply for small matrices, the work most closely related to ours is Herrero and Navarro's, which focuses on specializing matrix ...

19 | FLAME: Formal Linear Algebra Methods Environment
- Gunnels, Gustavson, et al.
- 2001
Citation Context: ...ency. Additional code transformations that can improve instruction-level parallelism (ILP) are performed to optimize the computation. Several examples in the literature describe this general approach [4, 18, 6, 9, 20]. However, if we look closer at matrices of size 10 or smaller, those same BLAS libraries perform below 25% of peak performance, as seen in Figure 2(b). This is because the optimization strategy for sma...

18 | Parameterized optimizations for empirical tuning
- Yi, Vuduc, et al.
Citation Context: ...m [16]. Hartono et al. use annotations in the code to describe performance-improving transformations for C programs [10]. POET is a scripting language for parameterizing complex code transformations [19], which can be used in an autotuning process as well. For tuning matrix multiply for small matrices, the work most closely related to ours is Herrero and Navarro's, which focuses on specializing matrix ...

9 | Model-guided empirical optimization for memory hierarchy
- Chen
- 2007
Citation Context: ...rror-prone process. CHiLL allows the programmer to apply complex transformation strategies to a loop nest by specifying a series of composable loop transformations using a high-level script interface [5, 7, 16]. CHiLL is a state-of-the-art polyhedral loop transformation framework designed with the autotuning environment in mind. Its high-level script interface can be used by application programmers or compi...

7 | Improving Performance of Hypermatrix Cholesky Factorization
- Herrero, Navarro
- 2003
Citation Context: ...ng process as well. For tuning matrix multiply for small matrices, the work most closely related to ours is Herrero and Navarro's, which focuses on specializing matrix multiplication for small matrices [11]. However, their code variants were generated manually, and it is not clear how many code variants in the parameter space were evaluated. In contrast, our heuristic-based parameter space pruning is auto...
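Specializing matrix multiply for a small, known size typically means generating a fully unrolled kernel rather than writing variants by hand. A minimal sketch of such a code generator, with function names of my own choosing (not the cited authors' code), might look like this:

```python
def make_specialized_matmul(n):
    # Generate a fully unrolled multiply for fixed n x n matrices stored
    # as flat row-major lists. Each output element becomes one straight-line
    # statement, in the spirit of specializing for a known small size
    # instead of running a generic loop nest.
    lines = [f"def matmul_{n}(a, b, c):"]
    for i in range(n):
        for j in range(n):
            terms = " + ".join(f"a[{i*n+k}]*b[{k*n+j}]" for k in range(n))
            lines.append(f"    c[{i*n+j}] = {terms}")
    env = {}
    exec("\n".join(lines), env)  # compile the generated source
    return env[f"matmul_{n}"]

# One specialized kernel for 3x3 matrices.
mm3 = make_specialized_matmul(3)
```

An autotuner would emit many such variants (differing in unrolling, ordering, or vectorization) and select one empirically; manual generation is what limits how much of the parameter space gets explored.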

6 | Loop optimization using hierarchical compilation and kernel decomposition
- Barthou, Donadio, et al.
- 2007
Citation Context: ...nd their parameter space for larger specialized matrix sizes. [Figure residue: the original Fortran ijk matrix-multiply loop nest (c(i,j) = c(i,j) + a(i,k)*b(k,j)) alongside CHiLL scripts composing permute, known(M=N=K=10), and unroll transformations] ...
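The loop-permutation transformation referenced above (CHiLL's permute on the ijk loop nest) can be illustrated with a reference multiply whose nesting order is a run-time parameter. This is an illustrative sketch in Python, not CHiLL's API; the function name is my own.

```python
import itertools

def matmul_order(a, b, n, order="ijk"):
    # Triple-loop multiply on flat row-major n x n matrices. The loop
    # nesting order is chosen by the `order` string, mimicking what
    # permute([2,3,1]) etc. do to the classic ijk loop nest.
    c = [0.0] * (n * n)
    for idx in itertools.product(range(n), repeat=3):
        v = dict(zip(order, idx))       # map loop levels to i, j, k
        i, j, k = v["i"], v["j"], v["k"]
        c[i * n + j] += a[i * n + k] * b[k * n + j]
    return c
```

All six orders compute the same result (exactly so for integer-valued inputs); they differ only in memory-access pattern, which is why an autotuner evaluates each permutation empirically.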

3 | Improving the Performance of Tensor Matrix Vector Multiplication in Cumulative Reaction Probability Based Quantum Chemistry Codes
- Kaushik, Gropp, et al.
- 2008
Citation Context: ...optimized tensor matrix-vector multiplication routine with mxm in nek5000 and the dgemm of Intel's MKL, and showed the hand-optimized routine performed the best for small, highly rectangular matrices [13]. Barthou et al. reduce the search space by separating optimizations for in-cache computation kernels from those for memory hierarchy [3]. 5 Conclusion This paper describes an autotuning approach for...

2 | Intel Fortran Compiler User and Reference Guides
- Intel
- 2008
Citation Context: ...KB L2 cache, 2 MB L3 cache, and 4 GB of memory. Since it runs 64-bit Linux (Ubuntu 8.04 x86-64), all 16 XMM registers are available for use. CHiLL version 0.1.5 [7] and the Intel compiler version 10.1 [12] are used to transform and compile the code variants. All performance measurements are on a single core unless mentioned otherwise. 3.2 Generating the Library Figure 5 shows the block diagram of the exp...