## Acceleration of First and Higher Order Recurrences on Processors with Instruction Level Parallelism (1993)

Venue: Sixth International Workshop on Languages and Compilers for Parallel Computing

Citations: 11 (2 self)

### BibTeX

@INPROCEEDINGS{Schlansker93accelerationof,
    author = {Michael Schlansker and Vinod Kathail},
    title = {Acceleration of First and Higher Order Recurrences on Processors with Instruction Level Parallelism},
    booktitle = {Sixth International Workshop on Languages and Compilers for Parallel Computing},
    year = {1993},
    pages = {406--429}
}

### Abstract

This report describes parallelization techniques for accelerating a broad class of recurrences on processors with instruction level parallelism. We introduce a new technique, called blocked back-substitution, which has lower operation count and higher performance than previous methods. The blocked back-substitution technique requires unrolling and non-symmetric optimization of innermost loop iterations. We present metrics to characterize the performance of software-pipelined loops and compare these metrics for a range of height reduction techniques and processor architectures.
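As a rough illustration of the general idea behind back-substitution (not the paper's blocked variant, whose details this abstract does not give), consider a first-order linear recurrence x[i] = a[i]*x[i-1] + b[i]. Substituting the recurrence into itself once gives x[i] = (a[i]*a[i-1])*x[i-2] + (a[i]*b[i-1] + b[i]), so each result depends on the value from two iterations back, while the coefficient terms do not depend on x at all. The function names, the seed argument `x0`, and the substitution depth of 2 are assumptions made for this sketch.

```python
def recurrence_direct(a, b, x0):
    """Evaluate x[i] = a[i] * x[i-1] + b[i] one step at a time (x[-1] = x0)."""
    x = x0
    out = []
    for ai, bi in zip(a, b):
        x = ai * x + bi
        out.append(x)
    return out


def recurrence_back_substituted(a, b, x0):
    """Same recurrence after one back-substitution step (hypothetical depth 2).

    For i >= 1, x[i] is computed from x[i-2]; the coefficient and constant
    terms depend only on a and b, so they lie off the recurrence chain.
    """
    out = []
    for i in range(len(a)):
        if i == 0:
            # No x[i-2] exists yet, so the first element uses the original form.
            out.append(a[0] * x0 + b[0])
            continue
        coeff = a[i] * a[i - 1]           # product of two coefficients
        const = a[i] * b[i - 1] + b[i]    # folded constant term
        prev2 = x0 if i == 1 else out[i - 2]
        out.append(coeff * prev2 + const)
    return out


if __name__ == "__main__":
    a = [2, 3, 1, 4]
    b = [1, 0, 5, 2]
    print(recurrence_direct(a, b, 1))
    print(recurrence_back_substituted(a, b, 1))
```

Both functions produce the same sequence for exact arithmetic; with floating-point inputs the reassociated form may differ in the last bits. The sketch also shows the cost side of the trade-off: the back-substituted form performs more multiplies and adds per element than the direct form.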

### Citations

621 | Trace scheduling: A technique for global microcode compaction - Fisher - 1981 |

259 | The superblock: an effective technique for VLIW and superscalar compilation - Hwu, Mahlke, et al. - 1993 |

237 | Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing - Rau, Glaeser - 1981 |

Citation context: ... maximal height required by dependence over all pairs of operations in the trace. Algebraic height reductions must accommodate the distance between all operations in the trace. In software pipelining [14, 15], successive loop iterations are identical. Height reduction must minimize the relative height between each iteration and identical prior iterations. In software pipelines, successive iterations overla...

172 | The Multiflow Trace Scheduling Compiler - Lowney, Freudenberger, et al. - 1993 |

Citation context: ... associative reduction of a number of terms to a scalar. It has been implemented within the Cydra 5 compiler (see [18], in which the method is called "riffled" reduction) and within the Trace compiler [12]. Consider the pseudo code of Table 1 which shows the original code for a reduction to scalar loop side by side with the interleaved reduction code. Table 1: (a) Original reduction, (b) Interleaved ...

142 | Compiling for the Cydra 5 - Dehnert, Towle - 1993 |

Citation context: ...ecific processor and corresponding code for a software pipeline has been selected, we can use Ops/iter to calculate ResMII. ResMII and the following two metrics have been described in Dehnert et al. [18]. ResMII establishes a resource bound on the rate of execution of loop iterations within software pipelines. In the traditional definition of ResMII, if a processor has u identical function units each...

109 | The Structure of Computers and Computation - Kuck - 1978 |

86 | A systolic array optimizing compiler - Lam - 1987 |

Citation context: ... maximal height required by dependence over all pairs of operations in the trace. Algebraic height reductions must accommodate the distance between all operations in the trace. In software pipelining [14, 15], successive loop iterations are identical. Height reduction must minimize the relative height between each iteration and identical prior iterations. In software pipelines, successive iterations overla...

37 | A parallel method for tridiagonal equations - Wang - 1981 |

Citation context: ...s applied, the operation count is order n log n, leading to excessive operation count for large problem sizes. The partition method was introduced by Chen et al. [6] and studied further by H. H. Wang [7] and H. A. Van Der Vorst et al. [8]. The partition method is amenable to parallelization and results in lower operation count (approximately 5n) when solving a first order linear recurrence. In this ...

33 | Parallel tridiagonal equation solvers - Stone - 1975 |

Citation context: ...ard" so that each value s[i] prior to the execution of a remap is renamed as s[i+1] after the remap. Thus, the value s (identically equal to s[0]) after the execution of a single remap is renamed as s[1]. This allows multiple values from a sequence of scalar assignments to remain alive without the automatic overwrite of a value when the next member of the sequence is computed. This notation is partic...

24 | Recognizing and parallelizing bounded recurrences - Callahan - 1991 |

Citation context: ... height reduction techniques and use these techniques to efficiently process algebraic recurrences of first and higher order without the introduction of any specialized instructions. Work by Callahan [10] introduces a fairly general class of recurrences called "bounded recurrences". A primary objective of Callahan is to establish a notational framework providing for the general description of recurre...

19 | Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems - Heller - 1976 |

Citation context: ...oop side by side with the interleaved reduction code. Table 1: (a) Original reduction, (b) Interleaved reduction. (a) s[1] = s_in; do i=1,n; s = s[1] + a_i; remap(s); enddo; s_out = s[1]. (b) s[1] = s_in, s[2] = 0, s[3] = 0, ..., s[b] = 0; do i=1,n; s = s[b] + a_i; remap(s); enddo; s_out = s[1] + s[2] + ... + s[b]. In this example, we assume that an initial scalar value live-in to the body of the loop, s_in, is copied into s[1]...

18 | Practical parallel band triangular system solvers - Chen, Kuck, et al. - 1978 |

Citation context: ...lic reduction. When cyclic reduction is applied, the operation count is order n log n, leading to excessive operation count for large problem sizes. The partition method was introduced by Chen et al. [6] and studied further by H. H. Wang [7] and H. A. Van Der Vorst et al. [8]. The partition method is amenable to parallelization and results in lower operation count (approximately 5n) when solving a f...

17 | Data flow and dependence analysis for instruction level parallelism - Rau - 1992 |

Citation context: ...redundant operation count which is bounded by a constant multiplier. 1.2 Loop Recurrence Notation. The notation used to describe the flow of values between loop iterations is taken from a paper by Rau [16]. In this paper, we use the concept of the expanded virtual register (EVR), which is a linearly ordered set of virtual registers with a special remap operation. Note that the use of the EVR is not rest...

16 | Solving triangular systems on a parallel computer - Sameh, Brent - 1977 |

14 | Time and parallel processor bounds for linear recurrence systems - Chen, Kuck - 1988 |

Citation context: ...s / iter = 1 and RecMII = l/b. Note that for associative operations which have a well behaved inverse operation (e.g. additive inverse), the first loop can be eliminated and proper values for s[1], s[2], ..., s[b] can be pre-calculated before entering a back-substituted loop with trip count n. 2.3 Blocked Back-substitution. In this section, we present a new technique which unrolls loop bodies to achie...

9 | Compiling techniques for first-order linear recurrences on a vector computer - Tanaka, Iwasawa, et al. - 1988 |

Citation context: ... impossible. The dismantling of a single loop into multiple nested loops also increases vector startup penalties and requires more load/store instructions than optimal methods. Work by Tanaka et al. [9] introduces a height reduction technique similar to blocked back-substitution. The technique is used to accelerate first order linear recurrences on a vector computer that has been augmented with a ha...

9 | Code generation schemas for modulo scheduled DO-loop and WHILE-loops - Rau, Schlansker, et al. |

Citation context: ...registers. Compilation techniques using virtual rotating registers are suitable for conventional register files but require the replication of loop program text as described in a paper by Rau et al. [17]. A scalar value s (usually calculated repeatedly within the body of a loop) can be referenced using the notation s[i]. Each time a remap(s) is executed, all values in the sequence shift "upward" so t...

6 | Vectorization of linear recurrence relations - Vorst, Dekker - 1989 |

Citation context: ...der n log n, leading to excessive operation count for large problem sizes. The partition method was introduced by Chen et al. [6] and studied further by H. H. Wang [7] and H. A. Van Der Vorst et al. [8]. The partition method is amenable to parallelization and results in lower operation count (approximately 5n) when solving a first order linear recurrence. In this technique, a simple recurrence loop ...

5 | Acceleration of Algebraic Recurrences on Processors with Instruction Level Parallelism - Schlansker, Kathail - 1993 |

Citation context: ... all coefficients have non-zero loop-variant values. 2. Recurrences in which all coefficients have non-zero but loop-invariant values. An extended version of this paper, published as technical report [19], contains performance formulas for recurrences in which c_i = 0. 4.1 Formulas for Coefficients in Back-substituted Expressions. This subsection derives formulas for coefficients in the expression that ...
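The Table 1 excerpt quoted in the citation contexts above describes an interleaved reduction in which each iteration adds its term to the value produced b iterations earlier, with `remap(s)` shifting the sequence s[1], s[2], ..., s[b]. A minimal sketch of that scheme follows, assuming the expanded virtual register can be modeled as a fixed-length `collections.deque`; the function names and the default `b=4` are hypothetical and chosen only for this illustration.

```python
from collections import deque


def reduce_sequential(values, s_in=0):
    """Table 1(a) style: every iteration depends on the previous one."""
    s = s_in
    for v in values:
        s = s + v
    return s


def reduce_interleaved(values, s_in=0, b=4):
    """Table 1(b) style: b independent partial sums, combined at the end.

    evr[k] models s[k+1]; the deque is an assumed stand-in for the EVR.
    """
    if b < 1:
        raise ValueError("b must be at least 1")
    evr = deque([s_in] + [0] * (b - 1))   # s[1] = s_in, s[2..b] = 0
    for v in values:
        s = evr.pop() + v                 # s = s[b] + a_i (old s[b] is no longer needed)
        evr.appendleft(s)                 # remap(s): new value becomes s[1], others shift up
    return sum(evr)                       # s_out = s[1] + s[2] + ... + s[b]


if __name__ == "__main__":
    data = list(range(1, 11))
    print(reduce_sequential(data, s_in=5))
    print(reduce_interleaved(data, s_in=5, b=3))
```

With integer inputs the two functions agree exactly. For floating-point addition the interleaved form reassociates the sum, so results may differ slightly, which is why the technique relies on the operation being treated as associative.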