## On The Implementation And Effectiveness Of Autoscheduling For Shared-Memory Multiprocessors (1995)

Citations: 16 (2 self)

### BibTeX

@TECHREPORT{Moreira95onthe,

author = {José Eduardo Moreira and Constantine D. Polychronopoulos},

title = {On The Implementation And Effectiveness Of Autoscheduling For Shared-Memory Multiprocessors},

institution = {},

year = {1995}

}

### Abstract

[Figure 3.4: HPF approach to data partition and distribution.] ... states that iteration i is to be executed by the processor to which A(i) is assigned. Therefore processor p1 executes iterations {1, 2, 3, 4}. The ON clause is a feature borrowed from the language Kali [25]. 3.1.3 HPF The High Performance Fortran (HPF) [6, 26, 27] language was designed as a set of extensions and modifications to Fortran 90 to support data-parallel programming. The ability to achieve top performance on MIMD and SIMD computers with nonuniform memory access was one of the main goals of the project. The design of HPF was influenced by Fortran D and Vienna Fortran [28, 29]. Just as Fortran D approaches the problem of data partitioning and distribution in two stages, HPF uses three. First, arrays are aligned to each other. Second, arrays are distributed across a user-defined rectilinear arrangement of abstract processo...

### Citations

2197
The art of computer programming
- Knuth
- 1973
Citation Context: ...[Figure 7.21: Maximum speedup of Strassen's algorithm.] 7.4.2.5 Quicksort (QUICK) Quicksort [117, 118] is a recursive "divide and conquer" algorithm that sorts a vector of elements (integers). It does so through the following steps: 1. It chooses one of the elements of the array as the pivot. In our i...
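The quicksort steps quoted in this context can be sketched as a minimal recursive sort. The choice of the first element as pivot and the list-based partition are illustrative assumptions; the snippet is truncated before it says which element the thesis's implementation picks:

```python
def quicksort(v):
    """Recursive divide-and-conquer sort of a list of integers.

    Sketch only: pivot = first element (an assumption, not a detail
    from the thesis), partition via list comprehensions, then recurse
    on both halves.
    """
    if len(v) <= 1:
        return v
    pivot = v[0]                                  # step 1: choose a pivot
    left = [x for x in v[1:] if x < pivot]        # step 2: partition around it
    right = [x for x in v[1:] if x >= pivot]
    return quicksort(left) + [pivot] + quicksort(right)  # step 3: recurse
```

The two recursive calls are independent, which is what makes this benchmark a source of functional (task) parallelism.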

1958
Matrix computations
- Golub, Loan
- 1996
Citation Context: ...[Figure 7.19: CMM benchmark.] 7.4.2.4 Strassen's Matrix Multiply (SMM) This benchmark is an implementation of Strassen's recursive matrix multiply algorithm as described in [116]. Strassen's algorithm performs the multiplication of two n × n matrices A and B through the following steps: 1. divide each of the input matrices into four n/2 × n/2 quadrants. 2. perform t...
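The steps listed in the context (quadrant split, then seven recursive products) can be sketched with the standard Strassen recurrence, assuming n is a power of two and using plain Python lists in place of the benchmark's matrices:

```python
def madd(X, Y):  # elementwise matrix add
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def msub(X, Y):  # elementwise matrix subtract
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def strassen(A, B):
    """Strassen's recursive multiply of two n x n matrices, n a power
    of two. A sketch: the base-case cutoff of n == 1 is for
    illustration; a real implementation switches to the classical
    algorithm below some threshold."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, i, j):  # step 1: extract an n/2 x n/2 quadrant
        return [row[j * h:(j + 1) * h] for row in M[i * h:(i + 1) * h]]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, 1), quad(A, 1, 0), quad(A, 1, 1)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, 1), quad(B, 1, 0), quad(B, 1, 1)
    # step 2: seven recursive products instead of eight
    M1 = strassen(madd(A11, A22), madd(B11, B22))
    M2 = strassen(madd(A21, A22), B11)
    M3 = strassen(A11, msub(B12, B22))
    M4 = strassen(A22, msub(B21, B11))
    M5 = strassen(madd(A11, A12), B22)
    M6 = strassen(msub(A21, A11), madd(B11, B12))
    M7 = strassen(msub(A12, A22), madd(B21, B22))
    # combine the products into the quadrants of the result
    C11 = madd(msub(madd(M1, M4), M5), M7)
    C12 = madd(M3, M5)
    C21 = madd(M2, M4)
    C22 = madd(msub(madd(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The seven products M1..M7 are mutually independent, which is the functional parallelism the SMM benchmark exploits.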

1258
Graph Theory with Applications
- Bondy, Murty
- 1976
Citation Context: ...graph G* = (X, A*) where A* = {x → y | x, y ∈ X, ∃ path from x to y in G}. Given the adjacency matrix representation F of an ATG G, the adjacency matrix representation F* of the closure G* can be computed by ([107]): F* = Σ_{i=1}^{n} F^i (6.13), where F^i is defined as F^0 = I, F^i = F × F^{i−1}, i > 0 (6.14). The algorithm in Figure 6.17 is a naive way to compute the transitive reduction of a graph. For each ...
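The closure formula in this context (Eqs. 6.13 and 6.14) can be evaluated directly with Boolean matrix arithmetic. This is a literal sketch of the formula, deliberately O(n⁴), not an efficient closure algorithm:

```python
def closure(F):
    """Transitive closure of adjacency matrix F via F* = F^1 + ... + F^n
    (Eqs. 6.13-6.14), with Boolean "addition" (or) and "multiplication"
    (and over a row-column pair). A direct sketch of the formula."""
    n = len(F)
    def bool_mult(X, Y):  # Boolean matrix product X x Y
        return [[any(X[i][k] and Y[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    Fi = [row[:] for row in F]      # current power F^i, starting at F^1
    Fs = [row[:] for row in F]      # running sum F*
    for _ in range(n - 1):          # accumulate F^2 .. F^n
        Fi = bool_mult(Fi, F)
        Fs = [[bool(Fs[i][j] or Fi[i][j]) for j in range(n)]
              for i in range(n)]
    return Fs
```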

513
Software pipelining: an effective scheduling technique for VLIW machines
- Lam
- 1988
Citation Context: ...We would like to investigate how to use the HTG data structure to perform aggressive code optimizations, such as global register allocation, superscalar instruction scheduling, and software pipelining [123]. We believe the HTG is a self-sufficient data structure that supersedes the more traditional flow and dependence graphs, but we still have to assess the efficiency of optimized machine code generated...

211
Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers
- Polychronopoulos, Kuck
- 1987
Citation Context: ...scheduling algorithms are supported: • Self-scheduling (SS): in this case, the chunk size is always 1. • Guided self-scheduling (GSS): in this case, the chunk size C is computed by the expression [102] C = ⌈R/P⌉, where R is the number of remaining iterations in the loop (those not yet scheduled) and P is the number of processors in the partition executing the loop. After the chunk size and the startin...
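The GSS chunk-size expression in this context can be sketched as the sequence of chunks a loop's iterations are handed out in:

```python
import math

def gss_chunks(total_iters, P):
    """Chunk sizes produced by guided self-scheduling: at each
    scheduling step a processor grabs C = ceil(R / P) iterations,
    where R is the number of remaining (unscheduled) iterations and
    P the number of processors in the partition [102]."""
    R, chunks = total_iters, []
    while R > 0:
        C = math.ceil(R / P)   # the GSS chunk-size expression
        chunks.append(C)
        R -= C
    return chunks
```

For example, `gss_chunks(100, 4)` starts 25, 19, 14, 11, ... and tapers to chunks of size 1: large chunks early keep scheduling overhead low, and the fine-grained tail balances load across processors near the end of the loop.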

203
The PERFECT Club Benchmarks: Effective Performance Evaluation of Supercomputers
- Berry, Chen, et al.
- 1989
Citation Context: ...llenge. The pseudocode for this benchmark is shown in Figure 7.17. The matrices are of size n × n and the parameter n was set to (32, 64, 128, 256). 7.4.2.2 TRFD TRFD is one of the Perfect Codes [114]. It is an example of a scientific application with very little directly exploitable functional parallelism. For the measurements in this thesis we do not attempt to extract or exploit functional para...

158
The transitive reduction of a directed graph
- Aho, Garey, et al.
- 1972
Citation Context: ...of Figure 6.17 has complexity O(mn^4), where m and n are the number of arcs and nodes in graph G, respectively. More efficient algorithms for computing the transitive reduction in O(n^2.8) time exist [108]. procedure Reduction(G) { (* G is a directed graph, G = (X, A) *) H = G*; foreach (x → y) ∈ A { G = G − (x → y); if G* ≠ H { G = G + (x → y) } } } Figure 6.17: Algorithm to compute the transiti...
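The Figure 6.17 procedure quoted in this context can be sketched in runnable form. The fixed-point `clos` helper is an illustrative stand-in for any transitive-closure routine (the thesis uses the matrix sum of Eq. 6.13); the arc-set representation is also an assumption:

```python
def transitive_reduction(arcs, n):
    """Naive transitive reduction in the style of Figure 6.17:
    tentatively delete each arc and restore it if the closure of the
    graph changes. `arcs` is a set of (x, y) pairs over nodes 0..n-1."""
    def clos(A):  # fixed-point reachability closure (illustrative helper)
        reach = {x: {y for (a, y) in A if a == x} for x in range(n)}
        changed = True
        while changed:
            changed = False
            for x in range(n):
                new = set().union(*(reach[y] for y in reach[x]))
                if not new <= reach[x]:
                    reach[x] |= new
                    changed = True
        return {(x, y) for x in range(n) for y in reach[x]}
    H = clos(arcs)               # H = G*, the closure of the input graph
    G = set(arcs)
    for arc in sorted(G):
        G.discard(arc)           # G = G - (x -> y)
        if clos(G) != H:         # closure changed: the arc is essential
            G.add(arc)           # G = G + (x -> y)
    return G
```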

148
Computer "experiments" on classical fluids. I. Thermodynamical properties of Lennard–Jones molecules
- Verlet
- 1967
Citation Context: ...t of the size of the problem and depends only on the structure of the code. 7.4.2.7 Molecular dynamics (MDJ) This benchmark is extracted from an industrial molecular dynamics code from Asahi Chemical [120, 121]. The benchmark is a functional parallel section of the most heavily used subroutine in the code. This section consists of a loop that iterates over all the cells in the grid (total of 1000 cells). Ea...

105
Symbolic analysis for parallelizing compilers
- Haghighat
- 1996
Citation Context: ...s of tasks. While this may be a simple matter for a sequence of statements, it becomes more complicated as tasks include subroutine calls and loops. Parafrase-2's extensive symbolic analysis capability [109] can be used to derive precise timing information for tasks, but the results from [91] show that even simple estimates, such as assuming a large constant time for tasks with subroutine calls or loops,...

104
Performance analysis of parallelizing compilers on the Perfect benchmarks programs
- Blume, Eigenmann
- 1992
Citation Context: ...necessary to support reentrant code, recursion, and, most importantly in our case, concurrency. Variable privatization has been demonstrated to be an important feature for parallelization of programs [97]. To implement variable privatization with static allocation, it is necessary to expand the loop-local variables by adding one dimension to their structure and distributing them across the processors ...

81
Algorithms + Data Structures = Programs
- Wirth
- 1978
Citation Context: ...[Figure 7.21: Maximum speedup of Strassen's algorithm.] 7.4.2.5 Quicksort (QUICK) Quicksort [117, 118] is a recursive "divide and conquer" algorithm that sorts a vector of elements (integers). It does so through the following steps: 1. It chooses one of the elements of the array as the pivot. In our i...

65
Simulation of Multiprocessors: Accuracy and Performance
- Goldschmidt
- 1993
Citation Context: ...r with earliest time; if (CurrentProcessor == new) { return; } else { old = CurrentProcessor; unblock(new); block(old); } } [Figure 7.3: The function SelectProcessor, the processor scheduler.] TangoLite [112] and EPGsim [113]. An illustration of the correspondence of what happens in real time and simulated time is shown in Figure 7.4. A synchronization event is a point in the program where the actions of ...

43
Polaris: The next generation in parallelizing compilers
- Blume, Eigenmann, et al.
- 1994
Citation Context: ...statement list. The syntax rule is shown in Figure 6.11, and the class declaration in Figure 6.14. Figure 6.15 shows examples of the use of the StmtList class for parsing and code generation. Polaris [104, 105] is an example of a compiler that uses object-oriented technology to a much larger extent. With this organization, it is relatively straightforward to apply a function to all nodes of a certain type o...

37
Low-overhead scheduling of nested parallelism
- Hummel, Schonberg
- 1991
Citation Context: ...is limited by the total number of stacks available. Techniques for implementing the cactus stack with one stack per processor, and the associated restrictions in task scheduling, are discussed in [91, 99]. Only the generic heap scheme was implemented in autoscheduling. An efficient technique for memory allocation used in the implementation of APL, another system with high requirements for dynamic memo...

34
A Fortran-to-C converter
- Feldman, Gay, et al.
- 1993
Citation Context: ...aph Processing. 4. HTG Optimizations. 5. Code Generation. Each of these passes is explained in more detail below. Some of the operations performed by the autoscheduling compiler were based on f2c [106]. 6.6.1 Parsing The parser is generated automatically from the language grammar with bison and flex. As it parses the program, it builds the AST and the auxiliary lists discussed in Section 6.3. The p...

28
Memory allocation costs in large C and C++ programs
- Detlefs, Dosser, et al.
- 1994
Citation Context: ...ts, one for each processor, can be used in shared-memory multiprocessors to reduce contention. A discussion of memory allocation costs, comparing the performance of different systems, can be found in [101]. The management of activation frames is always a source of overhead. It is large or small depending on how fast activation frames can be allocated and deallocated and how many of those operations are...

24
The Polaris internal representation
- Faigin, Weatherford, et al.
- 1994
Citation Context: ...statement list. The syntax rule is shown in Figure 6.11, and the class declaration in Figure 6.14. Figure 6.15 shows examples of the use of the StmtList class for parsing and code generation. Polaris [104, 105] is an example of a compiler that uses object-oriented technology to a much larger extent. With this organization, it is relatively straightforward to apply a function to all nodes of a certain type o...

17
A hierarchical task queue organization for shared-memory multiprocessor systems
- Dandamudi, Cheng
- 1995
Citation Context: ...n, the sensitivity of the efficiency parameter to N and k diminishes. Hierarchical queue organizations serve as an alternative to fully centralized or fully distributed organizations, as discussed in [96]. 5.3 Granularity Control Task granularity is one of the fundamental optimization problems in parallel processing. The granularity, or grain size, of a task is informally used to indicate the size of ...

10
Switch-stacks: A scheme for microtasking nested parallel loops
- Chow, Harrison III
- 1990
Citation Context: ...Detailed analysis of such algorithms, however, is beyond the scope of this thesis. Another alternative to processor-independent activation frame allocation and deallocation is the switch-stacks scheme [98]. The latter scheme is efficient in performing allocation and deallocation, but the depth of exploited parallelism is limited by the total number of stacks available. Techniques for implementing the c...

9
Memory Latency Reduction via Data Prefetching and Data Forwarding in Shared-Memory Multiprocessors
- Poulsen
- 1994
Citation Context: ...ime; if (CurrentProcessor == new) { return; } else { old = CurrentProcessor; unblock(new); block(old); } } [Figure 7.3: The function SelectProcessor, the processor scheduler.] TangoLite [112] and EPGsim [113]. An illustration of the correspondence of what happens in real time and simulated time is shown in Figure 7.4. A synchronization event is a point in the program where the actions of one processor may...

9
Large-Scale Computer Simulation of Fully Developed Turbulent Channel Flow with Heat Transfer, Int
- Lyons, Hanratty, et al.
- 1991
Citation Context: ...f n. Again, we observe scalable functional parallelism. 7.4.2.6 Computational fluid dynamics (CFD) This benchmark is the kernel of a Fourier–Chebyshev spectral Computational Fluid Dynamics (CFD) code [119]. The task graph is shown in Figure 7.24. It basically consists of four stages that operate on matrices of size n × n. The first stage involves six two-dimensional ...

6
New Method for Searching for Neighbors in Molecular Dynamics Computations
- Quentrec, Brot
- 1973
Citation Context: ...t of the size of the problem and depends only on the structure of the code. 7.4.2.7 Molecular dynamics (MDJ) This benchmark is extracted from an industrial molecular dynamics code from Asahi Chemical [120, 121]. The benchmark is a functional parallel section of the most heavily used subroutine in the code. This section consists of a loop that iterates over all the cells in the grid (total of 1000 cells). Ea...

3
MaTrix++: An Object-Oriented Approach to the Hierarchical Matrix Algebra
- Collins, Browne
- 1994
Citation Context: ...t tasks in the code. The parallelism of the independent tasks can then be exploited with autoscheduling. The specification of computation for matrices with problem-specific structures is discussed in [125]. APPENDIX A HTGIL REFERENCE This appendix describes the syntax for HTGIL using yacc notation. Symbols enclosed by double quotes ("") correspond to language terminals represented by the correspond...

1
Storage management in IBM APL systems
- Trimble
- 1991
Citation Context: ...was implemented in autoscheduling. An efficient technique for memory allocation used in the implementation of APL, another system with high requirements for dynamic memory allocation, is described in [100]. The scheme is based on a set of n lists, numbered 0, 1, ..., n − 1, of free cells. Let the minimum size of a cell be K = 2^k bytes. List i is composed of blocks of size 2^(k+i). Larger blocks...
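The size-class scheme described in this context (list i holds blocks of 2^(k+i) bytes) can be sketched as the mapping from a request size to the free list that serves it. The value of k and the function names are illustrative assumptions; the APL system in [100] is not specified this far in the snippet:

```python
import math

K_EXP = 4  # assumed minimum cell size K = 2**k with k = 4 (16 bytes)

def free_list_index(size):
    """Index i of the free list whose blocks of 2**(K_EXP + i) bytes
    are the smallest that can hold `size` bytes. Rounds the request up
    to a whole number of minimum cells, then up to a power of two."""
    cells = max(1, math.ceil(size / 2 ** K_EXP))  # request in minimum cells
    return (cells - 1).bit_length()               # ceil(log2(cells))
```

For example, with k = 4 a 16-byte request is served from list 0, a 17-byte request from list 1 (32-byte blocks), and a 40-byte request from list 2 (64-byte blocks); allocation then reduces to popping the head of one list.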

1
Parallelization of TRFD
- Andrews
- 1991
Citation Context: ...extract or exploit functional parallelism in TRFD, and we rather concentrate on the exploitation of the plentiful loop parallelism in this code. We utilize a parallel version of TRFD as described in [115]. Most of the work in TRFD is performed by subroutine OLDA, which is called for different values of a parameter n = (10, 15, 20, 25, 30, 35, 40). OLDA consists of three parallel loops in... benchmark...

1
Land avoidance and load balancing in ocean simulation
- DeRose, Gallivan, et al.
- 1993
Citation Context: ...could perform static load balancing, such as described in [109]. However, there are real scientific applications in which the optimal partitioning of the grid for load balance is input-set dependent [122]. In this case, the ideal partitioning cannot be computed at compile time. It is for this class of applications that dynamic scheduling is most helpful. Each of the five benchmarks operates on three n...

1
LIDEX: A system for description, simulation and analysis of computer architecture and organization
- Moreira
- 1990
Citation Context: ...llelism. One example is digital system simulation. Instead of having a generic simulator that operates on a data structure describing the system, a specific simulator for that system can be generated [124]. The structure of this simulator will be isomorphic to the structure of the digital system and independent subsystems will generate independent tasks in the code. The parallelism of the independent t...