## The Static Parallelization of Loops and Recursions (1997)

Venue: | Proc. 11th Int. Symp. on High Performance Computing Systems (HPCS'97) |

Citations: | 3 - 2 self |

### BibTeX

```bibtex
@INPROCEEDINGS{Lengauer97thestatic,
  author    = {Christian Lengauer and Sergei Gorlatch and Christoph A. Herrmann},
  title     = {The Static Parallelization of Loops and Recursions},
  booktitle = {Proc. 11th Int. Symp. on High Performance Computing Systems (HPCS'97)},
  year      = {1997}
}
```


### Abstract

We demonstrate approaches to the static parallelization of loops and recursions on the example of the polynomial product. Phrased as a loop nest, the polynomial product can be parallelized automatically by applying a space-time mapping technique based on linear algebra and linear programming. One can choose a parallel program that is optimal with respect to some objective function like the number of execution steps, processors, channels, etc. However, at best, linear execution time complexity can be attained. Through phrasing the polynomial product as a divide-and-conquer recursion, one can obtain a parallel program with sublinear execution time. In this case, the target program is not derived by an automatic search but given as a program skeleton, which can be deduced by a sequence of equational program transformations. We discuss the use of such skeletons, compare and assess the models in which loops and divide-and-conquer recursions are parallelized and comment on the performance pr...
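
The loop-nest phrasing of the polynomial product that the abstract refers to can be sketched as a coefficient convolution (a minimal Python sketch; the function and variable names are ours, not the paper's):

```python
def poly_product(a, b):
    """Multiply two polynomials given as coefficient lists.

    c[k] is the sum of a[i] * b[j] over all i + j == k (a convolution),
    phrased as the doubly nested loop that the paper parallelizes.
    """
    n, m = len(a), len(b)
    c = [0] * (n + m - 1)
    for i in range(n):
        for j in range(m):
            c[i + j] += a[i] * b[j]
    return c

# (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2
print(poly_product([1, 2], [3, 4]))  # [3, 10, 8]
```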

### Citations

2438 | The Design and Analysis of Computer Algorithms - Aho, Hopcroft, et al. - 1974 |

1309 | Monads for Functional Programming - Wadler - 1995 |

Citation Context: ...ing cost-optimality in this case are very restrictive. 4.2 Homomorphisms A very simple D&C skeleton is the homomorphism. It does not capture all D&C situations, and it is defined most often for lists [7, 37], although it can also be defined for other data structures, e.g., trees [17] and arrays [33]. 4.2.1 From the source program to the target program Unary function h is a list homomorphism [7] iff its v... |

415 | Algorithmic Skeletons: Structured Management of Parallel Computation - Cole - 1989 |

Citation Context: ...llel implementations by hand, albeit formally, with equational reasoning. However, most of the parallelization process is problem-independent. The starting point is a program schema called a skeleton [9]. We discuss two D&C skeletons, instantiated to the polynomial product, and their parallelizations: Subsect. 4.1. The first is a skeleton for call-balanced fixed-degree D&C, which we parallelize with ... |

113 | Loop Transformations for Restructuring Compilers: The Foundations - Banerjee - 1993 |

Citation Context: ...ope model. Links to some of them are provided in the Web pages cited here. A good, unhurried introduction to loop parallelization with an emphasis on the polytope model is the book series of Banerjee [4, 5, 6]. 5.2 Divide-and-conquer parallelization For the parallelization of D&C, there is not yet a unified model, in which different choices of parallelization can be evaluated with a common yardstick and co... |

88 | Array Expansion - Feautrier - 1988 |

Citation Context: ... Another change we make is that we convert the updates of c to single-assignment form, which gives rise to a doubly indexed variable ĉ; there are also automatic techniques for this kind of conversion [15]. This leads to the following program, in which the elements ĉ[·; 0] and ĉ[n; ·] contain the final values of the coefficients of the product polynomial: for i := 0 to n do for j := n downto 0 do i... |
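
The single-assignment conversion described in this excerpt can be illustrated as follows (our own minimal sketch of the idea, not the paper's exact program; `chat` plays the role of the doubly indexed variable ĉ, and the loop order mirrors the quoted `for i := 0 to n do for j := n downto 0` nest):

```python
def poly_product_sa(a, b):
    """Single-assignment form of the polynomial product (both of degree n).

    Every element chat[i][j] is written exactly once.  A partial sum of
    coefficient c[k] travels along the anti-diagonal i + j == k, so the
    final values end up in column chat[.][0] and row chat[n][.], which
    is what exposes the data dependences for space-time mapping.
    """
    n = len(a) - 1
    assert len(b) == n + 1, "sketch assumes equal degrees"
    chat = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n, -1, -1):
            carried = chat[i - 1][j + 1] if (i > 0 and j < n) else 0
            chat[i][j] = carried + a[i] * b[j]  # single assignment
    # c[k] = chat[k][0] for k <= n, and chat[n][k - n] for k > n
    return ([chat[k][0] for k in range(n + 1)]
            + [chat[n][k - n] for k in range(n + 1, 2 * n + 1)])
```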

48 | Automatic Parallelization in the Polytope Model - Feautrier |

Citation Context: ...pe model has been extended significantly recently: 1. Dependences and space-time mappings may be piecewise affine (the number of affine pieces must be constant, i.e., independent of the problem size) [16]. 2. Loop nests may be imperfect (i.e., not all computations must be in the innermost loop) [16]. 3. Upper loop bounds may be arbitrary and, indeed, unknown at compile time [23]. A consequence of (3) i... |

42 | Parallel programming with list homomorphisms - Cole - 1995 |

Citation Context: ... form of a problem may exist but not be immediately clear. An example is scan [20]. Other algorithms can be turned into a D&C and, further, into a homomorphic form with the aid of auxiliary functions [10, 38]. 4. The application of the promotion property gives us a parametrized granularity of parallelism which is controlled by the size of the chunks in which distribution function dist splits the list. Dep... |

27 | Upwards and downwards accumulations on trees - Gibbons - 1993 |

Citation Context: ...simple D&C skeleton is the homomorphism. It does not capture all D&C situations, and it is defined most often for lists [7, 37], although it can also be defined for other data structures, e.g., trees [17] and arrays [33]. 4.2.1 From the source program to the target program Unary function h is a list homomorphism [7] iff its value on a concatenation of two lists can be computed by combining the values ... |
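
The defining property quoted here — h on a concatenation of two lists is obtained by combining h on the parts — can be checked on a toy instance (our own Python sketch; `sum` plays the homomorphism h and `+` the combining operation, neither is taken from the paper):

```python
from functools import reduce

def h(xs):
    """Example list homomorphism: summation (h([]) = 0, h([x]) = x)."""
    return sum(xs)

def combine(u, v):
    """Combining operation: h(xs ++ ys) == combine(h(xs), h(ys))."""
    return u + v

def parallel_h(xs, chunks=4):
    """D&C evaluation scheme: split the list into chunks, apply h to
    each chunk independently (hence potentially in parallel), then
    combine the partial results.  Chunk size sets the granularity."""
    size = max(1, len(xs) // chunks)
    parts = [xs[i:i + size] for i in range(0, len(xs), size)]
    return reduce(combine, (h(p) for p in parts))

xs = list(range(10))
assert parallel_h(xs) == h(xs)  # the homomorphism property in action
```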

27 | Systematic efficient parallelization of scan and other list homomorphisms - Gorlatch - 1996 |

Citation Context: ... the form of operation ⊛. E.g., there is a specialization of the homomorphism skeleton, called DH (for distributable homomorphism), for which a family of practical, efficient implementations exists [19, 21]. The similarity of (3) and (5) is obvious: h should be the polynomial product ⊛, and operation ⊛ should be polynomial addition ⊕. However, there are two mismatches: 1. Operations ⊛ and ⊕ are... |

25 | Regular partitioning for synthesizing fixed-size systolic arrays - Darte - 1991 |

Citation Context: ...e of an increased execution time complexity. In particular, one can make the number of processors independent of the problem size by partitioning the resulting processor array into fixed-size "tiles" [13, 39]. 4 A divide-and-conquer algorithm Rather than enforcing a total order on the cumulative summation in specification (2) of the coefficients of the product polynomial, we can accumulate the summands wi... |
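
The partitioning idea in this excerpt — a fixed number of processors, each executing its tile of the iteration space sequentially — can be sketched as a schedule (our own illustration; the function name and its parameters are ours):

```python
def tile_schedule(n_iters, n_procs):
    """Block-distribute n_iters loop iterations over a fixed number of
    processors.  The processor count no longer grows with the problem
    size; the price is that each processor runs through its own tile
    sequentially, increasing execution time accordingly."""
    tile = -(-n_iters // n_procs)  # ceiling division: tile size
    return {p: list(range(p * tile, min((p + 1) * tile, n_iters)))
            for p in range(n_procs)}

sched = tile_schedule(10, 3)
# processor 0 -> [0, 1, 2, 3], 1 -> [4, 5, 6, 7], 2 -> [8, 9]
```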

17 | Automatic parallelization of while-loops using speculative execution - Collard - 1995 |

Citation Context: ...i.e., not all computations must be in the innermost loop) [16]. 3. Upper loop bounds may be arbitrary and, indeed, unknown at compile time [23]. A consequence of (3) is that while loops can be handled [12, 22, 30]. This entails a serious departure from the polytope model. The space-time mapping of loops is becoming a viable component of parallelizing compilation [31]. Loop parallelizers that are based on the p... |

5 | Reference manual of the Bouclettes parallelizer - Boulet, Dijon, et al. - 1994 |

Citation Context: ...e from the polytope model. The space-time mapping of loops is becoming a viable component of parallelizing compilation [31]. Loop parallelizers that are based on the polytope model include Bouclettes [8], LooPo [24], OPERA [32], Feautrier's PAF, and PIPS [2]. However, recent sophisticated techniques of space-time mapping have not yet filtered through to commercial compilers. In particular, automatic ... |

4 | PIPS: A Framework for Building Interprocedural Compilers, Parallelizers and Optimizers - Keryell, Ancourt, et al. - 1996 |

Citation Context: ...ops is becoming a viable component of parallelizing compilation [31]. Loop parallelizers that are based on the polytope model include Bouclettes [8], LooPo [24], OPERA [32], Feautrier's PAF, and PIPS [2]. However, recent sophisticated techniques of space-time mapping have not yet filtered through to commercial compilers. In particular, automatic methods for partitioning and projecting (i.e., trading ... |

4 | Efficient exploration of nonuniform space-time transformations for optimal systolic array synthesis - Baltus, Allen - 1993 |

Citation Context: ...nfluence the potential for parallelism, so one has to be careful. We choose to count the subscript of A up and that of B down; automatic methods can help in exploring the search space for this choice [3]. Another change we make is that we convert the updates of c to single-assignment form, which gives rise to a doubly indexed variable ĉ; there are also automatic techniques for this kind of conversion... |

4 | From Transformations to Methodology in Parallel Program Development: A Case Study - Gorlatch - 1996 |

Citation Context: ...achine vendor takes over. Alternatively, the user can him/herself program broadcasts, reductions and exchanges with neighbours and define a suitable physical processor topology, e.g., a mesh of trees [18]. 4.2.2 Complexity considerations We consider multiplying two polynomials of degree n on a virtual square of p² processors. The time complexity with pipelined broadcasting and reduction is [18]: t = ... |

1 | Loop Parallelization. Series on Loop Transformations for Restructuring Compilers - Banerjee - 1994 |

Citation Context: ...ne can achieve is linearity in the problem size: at least one loop must enumerate time, i.e., be sequential. In the pure version of the model, one can usually get away with just one sequential loop [5]. The remaining loops enumerate space, i.e., are parallel. This requires a polynomial number of processors since the loop bounds are affine expressions. The cost is not affected by the parallelizatio... |
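
For the polynomial-product loop nest, one legal affine space-time mapping takes time t = j and space p = i: exactly one sequential loop enumerates time, and the other enumerates space (our own illustrative choice of mapping, written out sequentially in Python; the names are ours):

```python
def poly_product_spacetime(a, b):
    """Space-time mapped polynomial product for two degree-n inputs.

    Outer loop: time t = j, necessarily sequential.  Inner loop:
    space p = i; at any fixed time step, each processor p writes a
    distinct cell c[p + t], so these iterations are independent and
    could run in parallel.  Execution time is linear in n, matching
    the bound quoted above for the pure polytope model.
    """
    n = len(a) - 1
    assert len(b) == n + 1, "sketch assumes equal degrees"
    c = [0] * (2 * n + 1)
    for t in range(n + 1):        # sequential: time steps
        for p in range(n + 1):    # parallel: one iteration per processor
            c[p + t] += a[p] * b[t]
    return c
```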

1 | Dependence Analysis. Series on Loop Transformations for Restructuring Compilers - Banerjee - 1997 |

Citation Context: ...ope model. Links to some of them are provided in the Web pages cited here. A good, unhurried introduction to loop parallelization with an emphasis on the polytope model is the book series of Banerjee [4, 5, 6]. 5.2 Divide-and-conquer parallelization For the parallelization of D&C, there is not yet a unified model, in which different choices of parallelization can be evaluated with a common yardstick and co... |

1 | Theory and Practice of Higher-Order Parallel Programming - Cole, Gorlatch, et al. (editors) - 1997 |

Citation Context: ... adapt his/her application to this schema. In the last couple of years, the development and study of skeletons has received an increasing amount of attention and a research community has been forming [11]. The skeleton approach can become a viable paradigm for parallel programming if 1. the parallel programming community manages to agree on a set of algorithmic skeletons which capture a large number o... |