## Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories (2008)

### BibTeX

```bibtex
@MISC{Manik08automaticdata,
  author = {Muthu Manik},
  title  = {Automatic Data Movement and Computation Mapping for Multi-level
            Parallel Architectures with Explicitly Managed Memories},
  year   = {2008}
}
```

### Abstract

Several parallel architectures such as GPUs and the Cell processor have fast explicitly managed on-chip memories, in addition to slow off-chip memory. They also have very high computational power with multiple levels of parallelism. A significant challenge in programming these architectures is to effectively exploit the parallelism available in the architecture and manage the fast memories to maximize performance. In this paper we develop an approach to effective automatic data management for on-chip memories, including creation of buffers in on-chip (local) memories for holding portions of data accessed in a computational block, automatic determination of array access functions of local buffer references, and generation of code that moves data between slow off-chip memory and fast local memories. We also address the problem of mapping computation in regular programs to multi-level parallel architectures using a multi-level tiling approach, and study the impact of on-chip memory availability on the selection of tile sizes at various levels. Experimental results on a GPU demonstrate the effectiveness of the proposed approach.
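The core idea of the abstract — staging portions of off-chip data into fast on-chip buffers, computing on the buffers, and copying results back — can be illustrated with a minimal sketch. This is a hypothetical illustration of the general pattern, not the paper's generated code; the function name, tile size, and the doubling computation are all assumptions.

```python
def stage_and_compute(A, T):
    """Hypothetical sketch: process a 'global' array A tile by tile,
    explicitly copying each tile into a small local buffer, computing on
    the buffer only, and copying the result back -- mimicking explicit
    data movement between slow off-chip and fast on-chip memory."""
    out = [0.0] * len(A)
    for t0 in range(0, len(A), T):       # one tile of size <= T per iteration
        hi = min(t0 + T, len(A))
        buf = A[t0:hi]                   # "DMA-in": global -> local buffer
        buf = [2.0 * x for x in buf]     # compute only on the local copy
        out[t0:hi] = buf                 # "DMA-out": local buffer -> global
    return out

print(stage_and_compute([1.0, 2.0, 3.0, 4.0, 5.0], 2))  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

On a real GPU or Cell target the two copies would become shared-memory loads/stores or DMA transfers, and the tile loop would be distributed across parallel units.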

### Citations

448 | The omega test: a fast and practical integer programming algorithm for dependence analysis
- Pugh
- 1993
Citation Context: ...e accessed by the affine reference. In the subsequent discussion, we denote the image of Is by Fras as FrasIs. There has been a significant body of work on dependence analysis in the polyhedral model [15, 34, 39]. We now discuss briefly the representation of dependences in the polyhedral model. An instance of statement s (denoted by iteration vector ⃗is) depends on an instance of statement t (denoted by itera...

239 | Strategies for cache and local memory management by global program optimizations
- Gannon, Gallivan
- 1988
Citation Context: ...rmly generated affine references. The idea of estimation of the number of references to an array in order to predict cache effectiveness has been discussed by Ferrante et al. [19] and Gallivan et al. [20]. The idea of finding image of the iteration space onto the array space to optimize global transfers has been discussed in [20]; but only a framework for estimating bounds for the number of elements a...

216 | Some efficient solutions to the affine scheduling problem: I. One-dimensional time
- Feautrier
- 1992
Citation Context: ...nce polyhedron). The technique of employing the polyhedral model to find (affine) program transformations has been widely used for improvement of sequential programs (source-to-source transformation) [16, 17] as well as automatic parallelization of programs [28, 21, 18, 11]. An affine transform of a statement s is defined as an affine mapping that maps an instance of s in the original program to an instan...

209 | Dataflow analysis of array and scalar references
- Feautrier
- 1991
Citation Context: ...e accessed by the affine reference. In the subsequent discussion, we denote the image of Is by Fras as FrasIs. There has been a significant body of work on dependence analysis in the polyhedral model [15, 34, 39]. We now discuss briefly the representation of dependences in the polyhedral model. An instance of statement s (denoted by iteration vector ⃗is) depends on an instance of statement t (denoted by itera...

165 | Parametric integer programming
- Feautrier
- 1988
Citation Context: ...r bounds of each dimension of this convex hull. The bounds are determined in the form of an affine function of parameters of the program block, using the Parametric Integer Programming (PIP) software [14]. These bounds determine the size of the local memory array created for the partition. Let the dimensionality of the convex hull be n. Let i1,i2,...,in represent the variables denoting each dimension ...
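The context above describes deriving per-dimension bounds of the accessed data region to size a local-memory array. A simplified sketch of that idea — enumerating a tile's iteration points and taking per-dimension extents of the accessed array indices — is shown below. This is a bounding-box approximation, not the exact parametric bounds PIP computes; the function and its arguments are hypothetical.

```python
def local_buffer_extents(tile_points, access):
    """For each array dimension, find the min/max index touched by an
    affine access function over a tile of iteration points. The extents
    (hi - lo + 1) give a bounding-box size for the local-memory array."""
    images = [access(p) for p in tile_points]          # accessed array indices
    ndim = len(images[0])
    lo = [min(v[d] for v in images) for d in range(ndim)]
    hi = [max(v[d] for v in images) for d in range(ndim)]
    return lo, [h - l + 1 for l, h in zip(lo, hi)]

# Example: reference A[i+j][j] over the tile 0 <= i < 4, 0 <= j < 2.
tile = [(i, j) for i in range(4) for j in range(2)]
lo, size = local_buffer_extents(tile, lambda p: (p[0] + p[1], p[1]))
print(lo, size)  # per-dimension origin offsets and local array sizes
```

In the paper's setting the bounds are affine functions of symbolic tile parameters rather than concrete minima/maxima, which is what makes parametric integer programming necessary.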

129 | Maximizing parallelism and minimizing synchronization with affine transforms
- Lim, Lam
- 1997
Citation Context: ...ion 8. 2. Overview of Polyhedral Model This section provides some background information on the polytope/polyhedral model, a powerful algebraic framework for representing programs and transformations [29, 33]. The polyhedral model is used by our framework to perform automatic allocation and data movement in scratchpad memories (discussed in detail in Section 3). A hyperplane is an n − 1 dimensional affine...
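The context breaks off mid-definition. The standard polyhedral-model definition it is quoting can be written as follows (notation assumed, not taken verbatim from the paper):

```latex
% A hyperplane is the (n-1)-dimensional affine subspace of an
% n-dimensional space defined by a single affine equality:
\[
  H \;=\; \{\, \vec{x} \in \mathbb{Z}^n \;:\; \vec{h}\cdot\vec{x} = k \,\},
  \qquad \vec{h} \in \mathbb{Z}^n \setminus \{\vec{0}\},\; k \in \mathbb{Z}.
\]
% An affine transform of a statement s maps each iteration instance to a
% point in the transformed space:
\[
  \phi_s(\vec{i}_s) \;=\; C_s\,\vec{i}_s + \vec{c}_s .
\]
```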

98 | On estimating and enhancing cache effectiveness
- Ferrante, Sarkar, et al.
- 1991
Citation Context: ...ut they handle only uniformly generated affine references. The idea of estimation of the number of references to an array in order to predict cache effectiveness has been discussed by Ferrante et al. [19] and Gallivan et al. [20]. The idea of finding image of the iteration space onto the array space to optimize global transfers has been discussed in [20]; but only a framework for estimating bounds for...

97 | Sequoia: Programming the memory hierarchy
- Fatahalian, Knight, et al.
- 2006
Citation Context: ...works have addressed scratchpad memory management [22, 24, 25] (to name a few). Multi-level tiling approach has been employed in various contexts such as tiling for various levels of memory hierarchy [2, 6, 13], and tiling for parallelism and locality [37, 5]. Multi-level tiling has become a key technique for high-performance computation. There has been work on generating efficient multi-level tiled code fo...

97 | Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs
- Clauss
- 1996
Citation Context: ...consider multiple references. Also, for non-uniformly generated references, arbitrary correction factors were given for arriving at lower and upper bounds for the number of distinct references. Clauss [9] and Pugh [35] have presented more expensive but exact techniques to count the number of distinct accesses. There has been a significant amount of research on memory reduction and optimization of data...

83 | Counting solutions to Presburger formulas: how and why
- Pugh
- 1994
Citation Context: ...iple references. Also, for non-uniformly generated references, arbitrary correction factors were given for arriving at lower and upper bounds for the number of distinct references. Clauss [9] and Pugh [35] have presented more expensive but exact techniques to count the number of distinct accesses. There has been a significant amount of research on memory reduction and optimization of data locality for ...

72 | Data and memory optimization techniques for embedded systems
- Panda, Catthoor, et al.
Citation Context: ...referred to as shared memory) that resides in the multiprocessor. Programming GPUs for general-purpose applications is enabled through the Compute Unified Device Architecture (CUDA) programming model [30]. The CUDA programming model abstracts the multiprocessors as a grid of virtual processors called thread blocks, and abstracts the SIMD units within a multiprocessor as a grid of virtual processors ca...

51 | Exact Memory Size Estimation for Array Computation without Loop Unrolling
- Zhao, Malik
- 1999
Citation Context: ...r-on-chip (SOC) systems. In the case of memory optimizations, Panda et al., Balasa et al., and the IMEC group have derived several transformations for improving memory performance on embedded systems [3, 8, 31, 36, 40]. Their work is a collection of techniques that form a custom memory management methodology referred to as data transfer and storage exploration (DTSE). There is a large body of work on estimating the...

48 | Automatic Parallelization in the Polytope Model
- Feautrier
Citation Context: ... model to find (affine) program transformations has been widely used for improvement of sequential programs (source-to-source transformation) [16, 17] as well as automatic parallelization of programs [28, 21, 18, 11]. An affine transform of a statement s is defined as an affine mapping that maps an instance of s in the original program to an instance in the transformed program. The affine mapping function of a st...

43 | Automatic Parallelization of Loop Programs for Distributed Memory Architectures
- Griebl
Citation Context: ... model to find (affine) program transformations has been widely used for improvement of sequential programs (source-to-source transformation) [16, 17] as well as automatic parallelization of programs [28, 21, 18, 11]. An affine transform of a statement s is defined as an affine mapping that maps an instance of s in the original program to an instance in the transformed program. The affine mapping function of a st...

42 | Compiler-directed scratch pad memory hierarchy design and management
- Kandemir, Choudhary
- 2002
Citation Context: ...3, 8, 36, 40] (and references therein). Most of these works assume the given sequential execution order and find the memory requirements. A number of works have addressed scratchpad memory management [22, 24, 25] (to name a few). Multi-level tiling approach has been employed in various contexts such as tiling for various levels of memory hierarchy [2, 6, 13], and tiling for parallelism and locality [37, 5]. M...

34 | A strategy for array management in local memory
- Eisenbeis, Jalby, et al.
- 1991
Citation Context: ... on arrays into portions to be kept in local memory and global memory. They compute a bounding box for each equivalent group of uniformly generated references as in the case of [38]. Eisenbeis et al. [12] consider elements to move to local memory from a view of individual iteration of a loop nest instead of an atomic unit of computation of the program. Kandemir et al. [25] propose an approach for dyna...

31 | Programming for parallelism and locality with hierarchically tiled arrays
- Bikshandi, Guo, et al.
- 2006
Citation Context: ...2, 24, 25] (to name a few). Multi-level tiling approach has been employed in various contexts such as tiling for various levels of memory hierarchy [2, 6, 13], and tiling for parallelism and locality [37, 5]. Multi-level tiling has become a key technique for high-performance computation. There has been work on generating efficient multi-level tiled code for polyhedral iteration spaces that handle tile si...

28 | Data access and storage management for embedded programmable processors
- Catthoor, Danckaert, et al.
- 2002
Citation Context: ...r-on-chip (SOC) systems. In the case of memory optimizations, Panda et al., Balasa et al., and the IMEC group have derived several transformations for improving memory performance on embedded systems [3, 8, 31, 36, 40]. Their work is a collection of techniques that form a custom memory management methodology referred to as data transfer and storage exploration (DTSE). There is a large body of work on estimating the...

28 | Optimal fine and medium grain parallelism detection in polyhedral reduced dependence graphs
- Darte, Vivien
- 1996
Citation Context: ... model to find (affine) program transformations has been widely used for improvement of sequential programs (source-to-source transformation) [16, 17] as well as automatic parallelization of programs [28, 21, 18, 11]. An affine transform of a statement s is defined as an affine mapping that maps an instance of s in the original program to an instance in the transformed program. The affine mapping function of a st...

26 | Optimizing Matrix Multiply using PHiPAC: A...
- Bilmes, Asanovic, et al.
- 1996
Citation Context: ...works have addressed scratchpad memory management [22, 24, 25] (to name a few). Multi-level tiling approach has been employed in various contexts such as tiling for various levels of memory hierarchy [2, 6, 13], and tiling for parallelism and locality [37, 5]. Multi-level tiling has become a key technique for high-performance computation. There has been work on generating efficient multi-level tiled code fo...

23 | Effective automatic parallelization of stencil computations
- Krishnamoorthy, Baskaran, et al.
- 2007
Citation Context: ...gave better performance than other tile sizes for various problem sizes. For the implementation of Jacobi-1D kernel that has a space loop surrounded by a time loop, we used the framework discussed in [27] to modify the tiled code to enable concurrent start of execution in all processes, and performed multi-level tiling over the modified code. We ran experiments on the kernel for various problem sizes ...

21 | Violated dependence analysis
- Vasilache, Bastoul, et al.
- 2006
Citation Context: ...e accessed by the affine reference. In the subsequent discussion, we denote the image of Is by Fras as FrasIs. There has been a significant body of work on dependence analysis in the polyhedral model [15, 34, 39]. We now discuss briefly the representation of dependences in the polyhedral model. An instance of statement s (denoted by iteration vector ⃗is) depends on an instance of statement t (denoted by itera...

19 | Multi-level Tiling: M for the Price of One
- Kim, Renganarayanan, et al.
- 2007
Citation Context: ...ation. There has been work on generating efficient multi-level tiled code for polyhedral iteration spaces that handle tile sizes at compile time [23] and that handle tile sizes as symbolic parameters [26]. 8. Conclusions In this paper, we have developed approaches to address two main challenges in modern high-performance multilevel parallel architectures with explicitly managed scratchpad memories, na...
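Multi-level tiling, recurring in several of the contexts above, can be sketched concretely. The fragment below shows a hypothetical two-level tiling of a matrix-multiply loop nest in plain Python (the tile sizes T1 and T2, the loop structure, and the mapping hints in the comments are illustrative assumptions, not the paper's generated code):

```python
def matmul_two_level(A, B, T1, T2):
    """Hypothetical two-level tiling sketch: level-1 tiles (size T1) could
    map to thread blocks sized for on-chip memory capacity; level-2 tiles
    (size T2) could map to threads within a block."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T1):                          # level-1 tile loops
        for jj in range(0, n, T1):
            for i0 in range(ii, min(ii + T1, n), T2):   # level-2 tile loops
                for j0 in range(jj, min(jj + T1, n), T2):
                    for i in range(i0, min(i0 + T2, n)):   # intra-tile loops
                        for j in range(j0, min(j0 + T2, n)):
                            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0 if r == c else 0.0 for c in range(3)] for r in range(3)]
print(matmul_two_level(A, identity, 2, 1))  # equals A
```

The distinction the context draws is whether T1 and T2 are fixed constants at code-generation time [23] or remain symbolic parameters in the generated code [26].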

19 | Reducing memory requirements of nested loops for embedded systems
- Ramanujam, Hong, et al.
- 2001
Citation Context: ...r-on-chip (SOC) systems. In the case of memory optimizations, Panda et al., Balasa et al., and the IMEC group have derived several transformations for improving memory performance on embedded systems [3, 8, 31, 36, 40]. Their work is a collection of techniques that form a custom memory management methodology referred to as data transfer and storage exploration (DTSE). There is a large body of work on estimating the...

13 | Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies
- Issenin, Brockmeyer, et al.
Citation Context: ...3, 8, 36, 40] (and references therein). Most of these works assume the given sequential execution order and find the memory requirements. A number of works have addressed scratchpad memory management [22, 24, 25] (to name a few). Multi-level tiling approach has been employed in various contexts such as tiling for various levels of memory hierarchy [2, 6, 13], and tiling for parallelism and locality [37, 5]. M...

11 | Improving Parallelism and Data Locality with Affine Partitioning
- Lim
- 2001

10 | Compiler optimizations for real time execution of loops on limited memory embedded systems
- Anantharaman, Pande
- 1998
Citation Context: ... accessed due to two references, belonging to different classes, are overlapping, then two different local arrays would be created to hold the overlapping accessed data spaces. Anantharaman and Pande [1] perform data partitioning on arrays into portions to be kept in local memory and global memory. They compute a bounding box for each equivalent group of uniformly generated references as in the case ...

10 | Towards optimal multi-level tiling for stencil computations
- Renganarayanan, Harthikote-Matha, et al.
- 2007
Citation Context: ...2, 24, 25] (to name a few). Multi-level tiling approach has been employed in various contexts such as tiling for various levels of memory hierarchy [2, 6, 13], and tiling for parallelism and locality [37, 5]. Multi-level tiling has become a key technique for high-performance computation. There has been work on generating efficient multi-level tiled code for polyhedral iteration spaces that handle tile si...

9 | Affine transformations for communication minimal parallelization and locality optimization of arbitrarily nested loop sequences
- Bondhugula, Baskaran, et al.
- 2007
Citation Context: ...es share this portion among themselves. Given any input program, the first step is to find the parallelism available in the computation. Our approach uses the framework developed by Bondhugula et al. [7] for this purpose. Given any affine input program, the framework finds an optimal set of affine transformations (or, equivalently, tiling hyperplanes) for each statement to minimize the volume of comm...

5 | Loop transformation methodologies for array-oriented memory management
- Balasa, Kjeldsberg, et al.
- 2006

5 | Near-optimal allocation of local memory arrays
- Schreiber, Cronquist
- 2004
Citation Context: ...k In this section we discuss prior work that has addressed compiler issues in multi-level parallel architectures and architectures with explicitly managed scratchpad memories. Schreiber and Cronquist [38] have proposed an approach to do near-optimal allocation of scratchpad memories and near-optimal reindexing of array elements in scratchpad memories. Their approach generates separate storage efficien...

2 | A compiler based approach for dynamically managing scratch-pad memories in embedded systems
- Kandemir, Ramanujam, et al.
Citation Context: ...se of [38]. Eisenbeis et al. [12] consider elements to move to local memory from a view of individual iteration of a loop nest instead of an atomic unit of computation of the program. Kandemir et al. [25] propose an approach for dynamically managing scratchpad memories, but they handle only uniformly generated affine references. The idea of estimation of the number of references to an array in order t...

1 | A cost-effective implementation of multilevel tiling
- Jiménez, Llabería, et al.
Citation Context: ... has become a key technique for high-performance computation. There has been work on generating efficient multi-level tiled code for polyhedral iteration spaces that handle tile sizes at compile time [23] and that handle tile sizes as symbolic parameters [26]. 8. Conclusions In this paper, we have developed approaches to address two main challenges in modern high-performance multilevel parallel architectures with explicitly managed scratchpad memories, na...