## Advanced Code Generation for High Performance Fortran

Venue: | Languages, Compilation Techniques and Run Time Systems for Scalable Parallel Systems, Lecture Notes in Computer Science Series |

Citations: | 13 - 2 self |

### BibTeX

@INPROCEEDINGS{Adve_advancedcode,

author = {Vikram Adve and John Mellor-Crummey},

title = {Advanced Code Generation for High Performance Fortran},

booktitle = {Languages, Compilation Techniques and Run Time Systems for Scalable Parallel Systems, Lecture Notes in Computer Science Series},

year = {},

pages = {553--596},

publisher = {Springer-Verlag}

}

### Abstract

In this paper, we describe techniques developed in the Rice dHPF compiler to address key code generation challenges that arise in achieving high performance for regular applications on message-passing systems. We focus on techniques required to implement advanced optimizations and to achieve consistently high performance with existing optimizations. Many of the core communication analysis and code generation algorithms in dHPF are expressed in terms of abstract equations manipulating integer sets. This approach enables general and yet simple implementations of sophisticated optimizations, making it more practical to include a comprehensive set of optimizations in data-parallel compilers. It also enables the compiler to support much more aggressive computation partitioning algorithms than in previous compilers. We therefore believe this approach can provide higher and more consistent levels of performance than are available today.
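
To make the idea of "abstract equations manipulating integer sets" concrete, here is a minimal sketch in Python. It is not taken from the paper or from dHPF: the loop, the array names, the BLOCK distribution parameters, and every function name are assumptions chosen for illustration. Small explicit sets stand in for the symbolic integer sets a compiler would manipulate.

```python
# Illustrative sketch only. Explicit Python sets stand in for symbolic
# integer sets; N, P, BLOCK and all function names are hypothetical.

N = 16              # array extent (assumed)
P = 4               # number of processors (assumed)
BLOCK = N // P      # block size of an assumed BLOCK distribution


def owned(p):
    """1-based indices of arrays A and C owned by processor p."""
    return set(range(p * BLOCK + 1, (p + 1) * BLOCK + 1))


def iterations(p):
    """Iteration set of p for the assumed loop
         DO i = 2, N-1 :  A(i) = C(i-1) + C(i+1)
    under the owner-computes rule: p runs iteration i iff it owns A(i)."""
    loop = set(range(2, N))  # i = 2 .. N-1
    return loop & owned(p)


def nonlocal_reads(p):
    """Elements of C read by p's iterations, minus the elements p owns."""
    its = iterations(p)
    read = {i - 1 for i in its} | {i + 1 for i in its}
    return read - owned(p)


if __name__ == "__main__":
    for p in range(P):
        print(p, sorted(iterations(p)), sorted(nonlocal_reads(p)))
```

Both quantities are written as plain set equations (an intersection and a difference) rather than as separate cases for interior and boundary processors, so the boundary behaviour falls out of the equations. A real compiler would need symbolic sets, since `N` and `P` are generally unknown at compile time.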

### Citations

1457 |
Theory of linear and integer programming
- Schrijver
- 1986
Citation Context ...5]. In this approach, each code generation or communication optimization problem is described by a collection of linear inequalities representing integer sets or mappings. Fourier-Motzkin elimination [41] is used to simplify the resulting inequalities, and to compute a range of values for individual index variables that together enumerate the integer points described by these inequalities. Code genera... |

946 |
High Performance Fortran language specification, version 2.0
- High Performance Fortran Forum
- 1997
Citation Context ...-executor approach. To permit a symbolic number of processors or cyclic(k) distribution with symbolic k, we use a virtual processor (VP) model that naturally matches the semantics of templates in HPF [22]. The VP model uses a virtual processor array for each physical processor array, using template indices (i.e., ignoring the distribute directive) in dimensions where the block size or number of proces... |

839 | Efficiently Computing Static Single Assignment Form and the Control Dependence Graph
- Cytron, Ferrante, et al.
- 1991
Citation Context ...alue numbering. A value number in dHPF is a handle for a symbolic expression tree. Value numbers are constructed from dataflow analysis of the program based on its Static Single Assignment (SSA) form [13], such that any two subexpressions that are known to have identical runtime values are assigned the same value number [21]. Their construction subsumes expression simplification, constant propagation,... |

393 |
The high performance Fortran handbook
- Koelbel, Loveman, et al.
- 1994
Citation Context ...e therefore believe this approach can provide higher and more consistent levels of performance than are available today. 1. Introduction Data-parallel languages such as High-Performance Fortran (HPF) [29, 31] aim to make parallel scientific computing accessible to a much wider audience by providing a simple, portable, abstract programming model applicable to a wide variety of parallel computing systems. F... |

195 | Scanning polyhedra with DO loops
- Ancourt, Irigoin
- 1991
Citation Context ...cellent performance in cases where they apply. Three groups have used a more abstract and general approach based on linear inequalities to support code generation for communication and iteration sets [2, 3, 4, 5]. In this approach, each code generation or communication optimization problem is described by a collection of linear inequalities representing integer sets or mappings. Fourier-Motzkin elimination [4... |

182 |
A practical algorithm for exact array dependence analysis
- Pugh
- 1992
Citation Context ...niversity of Maryland for this purpose [27]. The library operations use powerful algorithms based on Fourier-Motzkin elimination for manipulating integer tuple sets represented by Presburger formulae [37]. In particular, the library provides two key capabilities: it supports a general class of integer set operations including set union, and it provides an algorithm to generate efficient code that enum... |

169 |
Superb: A tool for semi-automatic MIMD/SIMD parallelization
- Zima, Bast, et al.
- 1986
Citation Context ...ages out of loops, reducing the number of data copies, and exploiting collective communication. Furthermore, even for these optimizations, most research and commercial data-parallel compilers to date [7, 10, 15, 16, 17, 19, 24, 32, 33, 35, 42, 45, 46] (including the Rice Fortran 77D compiler [24]) perform communication analysis and code generation for specific combinations of the form of references, data layouts and computation partitionings. Whil... |

156 |
Compiling Global Name-Space Parallel Loops for Distributed Execution
- Koelbel, Mehrotra
- 1991
Citation Context ...ages out of loops, reducing the number of data copies, and exploiting collective communication. Furthermore, even for these optimizations, most research and commercial data-parallel compilers to date [7, 10, 15, 16, 17, 19, 24, 32, 33, 35, 42, 45, 46] (including the Rice Fortran 77D compiler [24]) perform communication analysis and code generation for specific combinations of the form of references, data layouts and computation partitionings. Whil... |

155 |
Communication optimization and code generation for distributed memory machines
- Amarasinghe, Lam
- 1993
Citation Context ...e non-local references to the same or different variables, in order to reduce the total number of messages and to eliminate redundant communication. Previous implementations in Fortran 77D [24], SUIF [2], Paradigm [5], and IBM's pHPF [11, 16] have some significant limitations. In particular, coalescing can produce fairly complex data sets from the union of data sets for individual references. The pre... |

151 |
Process Decomposition Through Locality of Reference
- Rogers, Pingali
- 1989
Citation Context ...earchers, current compilers implement only a small fraction of these optimizations, generally focusing on the most fundamental ones such as static loop partitioning based on the "owner-computes" rule [39], moving messages out of loops, reducing the number of data copies, and exploiting collective communication. Furthermore, even for these optimizations, most research and commercial data-parallel compi... |

131 |
An Optimizing Fortran D Compiler for MIMD Distributed-Memory Machines
- Tseng
- 1993
Citation Context ...uations manipulating integer sets rather than as a collection of strategies for different cases. Optimizations we have formulated in this manner include message vectorization [14], message coalescing [43], recognizing in-place communication [1], code generation for our general CP model [1], non-local index set splitting [32], control-flow simplification [34], and generalized loop-peeling for improving... |

106 |
Compiling communication-efficient programs for massively parallel machines
- Li, Chen
- 1991
Citation Context ...e data sets to be communicated [16, 24, 32]. -- Exploiting collective communication is essential for achieving good speedup in important cases such as reductions, broadcasts, and array redistribution [33]. On certain systems, collective communication primitives may also provide significant benefits for other patterns such as shift communication. The important patterns (particularly reductions and broa... |

101 | The Paradigm Compiler for Distributed-Memory Multicomputers
- Banerjee, Chandy, et al.
Citation Context ...cellent performance in cases where they apply. Three groups have used a more abstract and general approach based on linear inequalities to support code generation for communication and iteration sets [2, 3, 4, 5]. In this approach, each code generation or communication optimization problem is described by a collection of linear inequalities representing integer sets or mappings. Fourier-Motzkin elimination [4... |

93 |
Abstract debugging of higher-order imperative languages
- Bourdoncle
- 1993
Citation Context ...riables imposed by loops, conditional branches, assertions, and integer computations. Several previous systems have supported strategies for computing and exploiting range information about variables [20, 8, 9, 25, 44]. Two key differences that distinguish our work are that we handle more general logical combinations of constraints on variables (not just ranges) and we use these constraints to simplify control flow... |

82 |
Compiler analysis of the value ranges for variables
- Harrison
- 1977
Citation Context ...riables imposed by loops, conditional branches, assertions, and integer computations. Several previous systems have supported strategies for computing and exploiting range information about variables [20, 8, 9, 25, 44]. Two key differences that distinguish our work are that we handle more general logical combinations of constraints on variables (not just ranges) and we use these constraints to simplify control flow... |

76 |
The Omega Library Interface Guide
- Kelly, Maslov, et al.
- 1996
Citation Context ... map or set, and projection to eliminate a variable from a map or set. We use the Omega library developed by Pugh et al. at the University of Maryland for this purpose [27]. The library operations use powerful algorithms based on Fourier-Motzkin elimination for manipulating integer tuple sets represented by Presburger formulae [37]. In particular, the library provides t... |

75 | Compiler support for machine-independent parallel programming in Fortran D
- Hiranandani, Kennedy, et al.
- 1991
Citation Context ...d special-case expressions for the iteration sets for "interior" and "boundary" processors [24]. (In fact, both these groups have described basic compilation steps in terms of abstract set operations [23, 32]; however, this was used only as a pedagogical abstraction and the corresponding compilers were implemented using case-based analysis.) Li and Chen describe algorithms to classify communication cause... |

75 | Code generation for multiple mappings
- Kelly, Pugh, et al.
- 1995
Citation Context ...erations including set union, and it provides an algorithm to generate efficient code that enumerates points in a given sequence of iteration spaces associated with a sequence of statements in a loop [26]. (Appendix A describes this code generation capability.) These capabilities are an invaluable asset for implementing set-based versions of the core HPF compiler optimizations as well as enabling a v... |

74 | Gated SSA-Based Demand-Driven Symbolic Analysis for Parallelizing Compilers
- Tu, Padua
- 1995
Citation Context ...riables imposed by loops, conditional branches, assertions, and integer computations. Several previous systems have supported strategies for computing and exploiting range information about variables [20, 8, 9, 25, 44]. Two key differences that distinguish our work are that we handle more general logical combinations of constraints on variables (not just ranges) and we use these constraints to simplify control flow... |

73 | "A Linear Algebra Framework for Static HPF Code Distribution," Technical Report A-278-CRI, CRI-Ecole des Mines
- Ancourt, Coelho, et al.
- 1995
Citation Context ...cellent performance in cases where they apply. Three groups have used a more abstract and general approach based on linear inequalities to support code generation for communication and iteration sets [2, 3, 4, 5]. In this approach, each code generation or communication optimization problem is described by a collection of linear inequalities representing integer sets or mappings. Fourier-Motzkin elimination [4... |

70 | On compiling array expressions for efficient execution on distributed-memory machines - Gupta, Kaushik, et al. - 1993 |

68 |
Updating distributed variables in local computations
- Gerndt
- 1990
Citation Context ...d in terms of abstract equations manipulating integer sets rather than as a collection of strategies for different cases. Optimizations we have formulated in this manner include message vectorization [14], message coalescing [43], recognizing in-place communication [1], code generation for our general CP model [1], non-local index set splitting [32], control-flow simplification [34], and generalized l... |

66 | Generating communication for array statements: Design, implementation, and evaluation
- Stichnoth, O'Hallaron, et al.
- 1994
Citation Context ...ages out of loops, reducing the number of data copies, and exploiting collective communication. Furthermore, even for these optimizations, most research and commercial data-parallel compilers to date [7, 10, 15, 16, 17, 19, 24, 32, 33, 35, 42, 45, 46] (including the Rice Fortran 77D compiler [24]) perform communication analysis and code generation for specific combinations of the form of references, data layouts and computation partitionings. Whil... |

56 | Symbolic range propagation
- Blume, Eigenmann
- 1995

56 | Preliminary experiences with the Fortran D compiler
- Hiranandani, Kennedy, et al.
- 1993
Citation Context ...ted symbolically at compile time. Even within this class of applications, state-of-the-art commercial and research compilers do not consistently achieve performance competitive with hand-written code [16, 24]. Although many important optimizations for such systems have been proposed by previous researchers, current compilers implement only a small fraction of these optimizations, generally focusing on the... |

50 | Global Communication Analysis and Optimization
- Chakrabarti, Gupta, et al.
- 1996
Citation Context ... or different variables, in order to reduce the total number of messages and to eliminate redundant communication. Previous implementations in Fortran 77D [24], SUIF [2], Paradigm [5], and IBM's pHPF [11, 16] have some significant limitations. In particular, coalescing can produce fairly complex data sets from the union of data sets for individual references. The previous implementations are limited to ca... |

49 | Optimal Evaluation of Array Expressions on Massively Parallel Machines
- Chatterjee, Gilbert, et al.
- 1995
Citation Context ... set operations on integer sets, as shown in [1]. There is also a large body of work on techniques to enumerate communication sets and iteration sets in the presence of cyclic(k) distributions (e.g., [12, 17, 28, 35, 45]). Compared to more general approaches based on integer sets or linear inequalities, these techniques likely provide more efficient support for cyclic(k) distributions, particularly when k > 1, but wo... |

48 | Interprocedural Symbolic Analysis
- Havlak
- 1994
Citation Context ...flow analysis of the program based on its Static Single Assignment (SSA) form [13], such that any two subexpressions that are known to have identical runtime values are assigned the same value number [21]. Their construction subsumes expression simplification, constant propagation, auxiliary induction variable recognition, and computing range information for expressions of loop index variables. A valu... |

45 |
A $2^{2^{2^{pn}}}$ upper bound on the complexity of Presburger arithmetic
- Oppen
- 1978
Citation Context ...ng such a general representation is the compile-time cost of the algorithms used in Omega. In particular, simplification of formulae in Presburger arithmetic can be extremely costly in the worst-case [36]. Pugh has shown, however, that when the underlying algorithms in Omega (for Fourier-Motzkin elimination) are applied to dependence analysis, the execution time is quite small even for complex constra... |

39 | A Linear-Time Algorithm for Computing the Memory Access Sequence in Data Parallel Programs
- Kennedy, Nedeljković, et al.
- 1995
Citation Context ... set operations on integer sets, as shown in [1]. There is also a large body of work on techniques to enumerate communication sets and iteration sets in the presence of cyclic(k) distributions (e.g., [12, 17, 28, 35, 45]). Compared to more general approaches based on integer sets or linear inequalities, these techniques likely provide more efficient support for cyclic(k) distributions, particularly when k > 1, but wo... |

36 | An HPF compiler for the IBM SP2
- Gupta, Midkiff, et al.
- 1995
Citation Context ...ted symbolically at compile time. Even within this class of applications, state-of-the-art commercial and research compilers do not consistently achieve performance competitive with hand-written code [16, 24]. Although many important optimizations for such systems have been proposed by previous researchers, current compilers implement only a small fraction of these optimizations, generally focusing on the... |

35 |
Vienna Fortran 90
- Benkner, Chapman, et al.
- 1992

32 | A methodology for high-level synthesis of communication on multicomputers
- Gupta, Banerjee
- 1992

29 | An Implementation Framework for HPF Distributed Arrays on Message-Passing Parallel Computer Systems
- Reeuwijk, Denissen, et al.
- 1996

17 | Local iteration set computation for block-cyclic distributions - Midkiff - 1995 |

15 | Communication-Minimal Partitioning of Parallel Loops and Data Arrays for Cache-Coherent Distributed-Memory Multiprocessors - Barua, Kranz, et al. - 1996 |

12 |
Compiling High Performance Fortran
- Bozkus, Meadows, et al.
- 1995

12 |
Compiler Support for Machine-Independent Parallelization of Irregular Problems
- Hanxleden
- 1994
Citation Context ...) is that static code generation techniques will not be useful for a code with irregular or complex partitionings. Such cases require runtime strategies such as the inspector-executor approach (e.g., [32, 18, 40]). However, regular and irregular partitionings may coexist in the same program, and perhaps even in a single loop nest. This raises the need for a flexible code generation framework that allows each ... |

11 | Resource-Based Communication Placement Analysis
- Kennedy, Sethi
- 1996
Citation Context ...the placement of communication so as to determine how far each message can be vectorized out of enclosing loops, and to optionally move communication calls early or late to hide communication latency [30]. The third step uses the algorithms of Li and Chen [33] to determine if specialized collective communication primitives such as a broadcast could be exploited. (Reductions are recognized using sep... |

10 | Compiling High Performance Fortran for Distributed-Memory Systems
- Bircsak, Bolduc, et al.
- 1995

4 |
Data flow analysis for ’intractable’ imbedded system software
- Johnson
- 1986

4 | Simplifying control flow in compiler-generated parallel code
- Mellor-Crummey, Adve
- 1997
Citation Context ...sage vectorization [14], message coalescing [43], recognizing in-place communication [1], code generation for our general CP model [1], non-local index set splitting [32], control-flow simplification [34], and generalized loop-peeling for improving parallelism. By formulating these algorithms in terms of operations on integer sets, we are able to abstract away the details of the CPs, references, and d... |

4 |
Integer lattice based methods for local address generation for block-cyclic distributions
- Ramanujam
- 1998
Citation Context ...de run-time resolution [39], the inspector-executor approach, and run-time techniques for handling cyclic(k) partitionings. The latter two approaches are described in other chapters within this volume [40, 38]. It is relatively straightforward to partition a simple loop nest that contains a single statement or a sequence of statements with the same CP. The loop bounds can be reduced so that each processor ... |

3 | HPF analysis and code generation using integer sets
- Adve, Mellor-Crummey, et al.
- 1997
Citation Context ...than as a collection of strategies for different cases. Optimizations we have formulated in this manner include message vectorization [14], message coalescing [43], recognizing in-place communication [1], code generation for our general CP model [1], non-local index set splitting [32], control-flow simplification [34], and generalized loop-peeling for improving parallelism. By formulating these algor... |

1 |
The High Performance Fortran 2.0 Language, chapter 1
- Kennedy, Koelbel
Citation Context ...e therefore believe this approach can provide higher and more consistent levels of performance than are available today. 1. Introduction Data-parallel languages such as High-Performance Fortran (HPF) [29, 31] aim to make parallel scientific computing accessible to a much wider audience by providing a simple, portable, abstract programming model applicable to a wide variety of parallel computing systems. F... |

1 |
Runtime Support for Irregular Problems, chapter 17
- Saltz
- 1997
Citation Context ...) is that static code generation techniques will not be useful for a code with irregular or complex partitionings. Such cases require runtime strategies such as the inspector-executor approach (e.g., [32, 18, 40]). However, regular and irregular partitionings may coexist in the same program, and perhaps even in a single loop nest. This raises the need for a flexible code generation framework that allows each ... |