## Minimizing Average Schedule Length under Memory Constraints by Optimal Partitioning and Prefetching (2000)

Venue: | Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology |

Citations: | 4 - 4 self |

### BibTeX

@INPROCEEDINGS{Wang00minimizingaverage,

author = {Zhong Wang and Timothy W. O'Neil and Edwin H.-M. Sha},

title = {Minimizing Average Schedule Length under Memory Constraints by Optimal Partitioning and Prefetching},

booktitle = {Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology},

year = {2000},

pages = {27--215}

}

### Abstract

Over the last 20 years, the performance gap between CPU and memory has steadily widened. As a result, a variety of techniques have been devised to hide that gap, from intermediate fast memories (caches) to various prefetching and memory management techniques for manipulating the data held in these caches. In this paper we propose a new memory management technique that exploits access pattern information available at compile time: it prefetches certain data elements before the CPU explicitly requests them and keeps certain data in the local memory over a number of iterations. To take better advantage of the locality of reference present in loop structures, our technique also partitions the memory and restricts execution to one partition at a time, so that information is reused at much smaller time intervals than if execution followed the usual pattern. Together, these approaches (a new set of memory instructions and memory partitioning) lead to improvements in total execution time of approximately 25% over existing methods.
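The abstract's claim that partitioned execution shortens the interval between producing a value and reusing it can be illustrated with a small sketch. This is not the paper's algorithm: the function names, the 8 x 64 iteration space, the 4 x 4 partition shape, the dependence vector `(1, 0)`, and the rule that a value survives in local memory for at most `capacity` steps are all assumptions chosen for illustration.

```python
def row_major_order(n1, n2):
    """Visit the iteration space one full row at a time (the usual pattern)."""
    return [(i, j) for i in range(n1) for j in range(n2)]


def partitioned_order(n1, n2, p1, p2):
    """Visit the iteration space one p1-by-p2 partition at a time."""
    order = []
    for bi in range(0, n1, p1):
        for bj in range(0, n2, p2):
            for i in range(bi, min(bi + p1, n1)):
                for j in range(bj, min(bj + p2, n2)):
                    order.append((i, j))
    return order


def reuse_intervals(order, dep=(1, 0)):
    """Steps between computing (i, j) and its consumer (i + dep[0], j + dep[1])."""
    step = {point: t for t, point in enumerate(order)}
    intervals = []
    for (i, j), produced in step.items():
        consumed = step.get((i + dep[0], j + dep[1]))
        if consumed is not None:
            intervals.append(consumed - produced)
    return intervals


def hit_fraction(order, capacity, dep=(1, 0)):
    """Fraction of reuses served locally, assuming (hypothetically) that a
    value stays in local memory for at most `capacity` steps after it is
    produced."""
    intervals = reuse_intervals(order, dep)
    hits = sum(1 for gap in intervals if 0 < gap <= capacity)
    return hits / len(intervals)


if __name__ == "__main__":
    n1, n2, capacity = 8, 64, 16
    print("row-major  :", hit_fraction(row_major_order(n1, n2), capacity))
    print("partitioned:", hit_fraction(partitioned_order(n1, n2, 4, 4), capacity))
```

Under these assumptions every row-major reuse is 64 steps away, so none fits in a 16-step window, while the partitioned order reuses most values after only 4 steps; the remaining misses are the reuses that cross a partition boundary, which is the kind of access a prefetch would have to cover.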

### Citations

384 | A loop transformation theory and an algorithm to maximize parallelism
- Wolf, Lam
- 1991
Citation Context ...l computation points so as to increase computation granularity and thereby reduce communication time. Wolf and Lam proposed a loop transformation technique for maximizing parallelism or data locality [22]. Boulet et al introduced a criterion for defining optimal tiling in a scalable environment. In his method, an optimal tile shape can be determined by these criteria, and the tile size is obtained from... |

288 | Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops
- Rau
- 1994
Citation Context ...and Eichenberger have done research on optimum modulo schedules, taking into consideration the minimum register requirement. They consider not only the dataflow, but also the control flow of the program [8, 18]. None of the above research efforts, however, includes the prefetching idea or considers the data fetching latency in their algorithms. DO 10 n1 = 1 , N1 DO 20 n2 = 1, N2 y ( n1 , n2 ) = x ( n1 , n2... |

139 | Combining loop transformations considering caches and scheduling
- Wolf, Maydan, et al.
- 1998
Citation Context ...niques, (such as fission, fusion, tiling, interchanging, etc) and presented a model for estimating total machine cycle time, taking into account software pipelining, register pressure and loop overhead [23]. Passos and Sha proved that in the multi-dimensional case (e.g., nested loops), full-parallelism can always be achieved by using multi-dimensional retiming [14]. Modulo scheduling by Ramanujam [17] i... |

137 | Stride directed prefetching in scalar processors
- Fu, Patel, et al.
- 1992
Citation Context ...hat an affine loop nest can always be transformed into an uniform loop nest. This transformation (uniformization [7]) greatly reduces the complexity of the problem. Prefetching schemes based on hardware [6, 9, 19], software [11, 12], or both [5, 13, 24] have been extensively studied. In hardware prefetching schemes, the prefetching activities are controlled solely by the hardware. In contrast, software prefe... |

135 | A Performance Study of Software and Hardware Data Prefetching Schemes
- Chen, Baer
- 1994
Citation Context ...sformed into an uniform loop nest. This transformation (uniformization [7]) greatly reduces the complexity of the problem. Prefetching schemes based on hardware [6, 9, 19], software [11, 12], or both [5, 13, 24] have been extensively studied. In hardware prefetching schemes, the prefetching activities are controlled solely by the hardware. In contrast, software prefetching schemes rely on compiler technolo... |

69 | Sequential Hardware Prefetching in Shared-Memory Multiprocessors
- Dahlgren, Dubois, et al.
- 1995
Citation Context ...hat an affine loop nest can always be transformed into an uniform loop nest. This transformation (uniformization [7]) greatly reduces the complexity of the problem. Prefetching schemes based on hardware [6, 9, 19], software [11, 12], or both [5, 13, 24] have been extensively studied. In hardware prefetching schemes, the prefetching activities are controlled solely by the hardware. In contrast, software prefe... |

49 | Register Requirements of Pipelined Processors
- Mangione-Smith, Abraham, et al.
- 1992
Citation Context ...ing [14]. Modulo scheduling by Ramanujam [17] is a technique for exploiting instruction level parallelism (ILP) in the loop. It can result in high performance code but increased register requirements [10]. Rau and Eichenberger have done research on optimum modulo schedules, taking into consideration the minimum register requirement. They consider not only the dataflow, but also the control flow of the pr... |

38 | Data Prefetching for High-Performance Processors
- Chen
- 1993
Citation Context ...model as that in our algorithm, but the ALU part uses the traditional list scheduling algorithm and the memory is not partitioned. In hardware prefetching scheduling, we use the model presented in [4]. In this model, to take advantage of the data locality, the next block in the remote memory is also loaded whenever a block is loaded from the remote memory to local memory. The same architecture mod... |

34 | Minimum register requirements for a modulo schedule
- Eichenberger, Abraham
- 1994
Citation Context ...and Eichenberger have done research on optimum modulo schedules, taking into consideration the minimum register requirement. They consider not only the dataflow, but also the control flow of the program [8, 18]. None of the above research efforts, however, includes the prefetching idea or considers the data fetching latency in their algorithms. DO 10 n1 = 1 , N1 DO 20 n2 = 1, N2 y ( n1 , n2 ) = x ( n1 , n2... |

34 | Data Relocation and Prefetching in Programs with Large Data Sets
- Yamada
- 1995
Citation Context ...sformed into an uniform loop nest. This transformation (uniformization [7]) greatly reduces the complexity of the problem. Prefetching schemes based on hardware [6, 9, 19], software [11, 12], or both [5, 13, 24] have been extensively studied. In hardware prefetching schemes, the prefetching activities are controlled solely by the hardware. In contrast, software prefetching schemes rely on compiler technolo... |

29 | Cache Miss Heuristics and Preloading Techniques for General-Purpose Programs
- Ozawa, Kimura, et al.
- 1995
Citation Context ...can always be transformed into an uniform loop nest. This transformation (uniformization [7]) greatly reduces the complexity of the problem. Prefetching schemes based on hardware [6, 9, 19], software [11, 12], or both [5, 13, 24] have been extensively studied. In hardware prefetching schemes, the prefetching activities are controlled solely by the hardware. In contrast, software prefetching schemes rely... |

28 | Data Prefetching for Software DSMs
- Bianchini, Pinto, et al.
- 1998
Citation Context ...oop transformation approaches can be used to improve the performance of prefetching. Bianchini et al developed a runtime data prefetching strategy for software-based distributed shared-memory systems [1]. Wallace and Bagherzadeh proposed a mathematical model and a new prefetching mechanism. A simulation on the SPEC95 benchmarks showed an improvement in the instruction fetching rate [20]. In their wor... |

27 | A tile selection algorithm for data locality and cache interference
- Chame, Moon
- 1999
Citation Context ...iminating self interference and simultaneously minimizing capacity and cross-interference misses. Their experimental results show that the algorithm consistently finds tiles that yield lower miss rates [2]. Nevertheless, the traditional tiling techniques only concentrate on reducing communication cost. They do not consider how best to balance the computation and communication. There is no detailed sche... |

26 | Achieving full parallelism using multi-dimensional retiming
- Passos, Sha
- 1996
Citation Context ...g, register pressure and loop overhead [23]. Passos and Sha proved that in the multi-dimensional case (e.g., nested loops), full-parallelism can always be achieved by using multi-dimensional retiming [14]. Modulo scheduling by Ramanujam [17] is a technique for exploiting instruction level parallelism (ILP) in the loop. It can result in high performance code but increased register requirements [10]. Ra... |

17 | Optimal software pipelining of nested loops
- Ramanujam
- 1994
Citation Context ...d [23]. Passos and Sha proved that in the multi-dimensional case (e.g., nested loops), full-parallelism can always be achieved by using multi-dimensional retiming [14]. Modulo scheduling by Ramanujam [17] is a technique for exploiting instruction level parallelism (ILP) in the loop. It can result in high performance code but increased register requirements [10]. Rau and Eichenberger have done research... |

13 | Uniformization of linear recurrence equations: a step towards the automatic synthesis of systolic arrays
- Dongen, Quinton
- 1988
Citation Context ...oop nests have affine dependencies, the study of uniform loop nests is justified by the fact that an affine loop nest can always be transformed into an uniform loop nest. This transformation (uniformization [7]) greatly reduces the complexity of the problem. Prefetching schemes based on hardware [6, 9, 19], software [11, 12], or both [5, 13, 24] have been extensively studied. In hardware prefetching schem... |

13 | An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors
- Tcheun, Yoon, et al.
- 1997
Citation Context ...hat an affine loop nest can always be transformed into an uniform loop nest. This transformation (uniformization [7]) greatly reduces the complexity of the problem. Prefetching schemes based on hardware [6, 9, 19], software [11, 12], or both [5, 13, 24] have been extensively studied. In hardware prefetching schemes, the prefetching activities are controlled solely by the hardware. In contrast, software prefe... |

9 | Combining loop fusion with prefetching on shared-memory multiprocessors
- Manjikian
- 1997
Citation Context ...can always be transformed into an uniform loop nest. This transformation (uniformization [7]) greatly reduces the complexity of the problem. Prefetching schemes based on hardware [6, 9, 19], software [11, 12], or both [5, 13, 24] have been extensively studied. In hardware prefetching schemes, the prefetching activities are controlled solely by the hardware. In contrast, software prefetching schemes rely... |

8 | Modeled and measured instruction fetching performance for superscalar microprocessors
- Wallace, Bagherzadeh
- 1998
Citation Context ...memory systems [1]. Wallace and Bagherzadeh proposed a mathematical model and a new prefetching mechanism. A simulation on the SPEC95 benchmarks showed an improvement in the instruction fetching rate [20]. In their work, the ALU part of the schedule is not considered. Nevertheless, solely considering the prefetching is not enough for optimizing the overall system performance. As we point out in this p... |

5 | Resource-constrained loop list scheduler for DSP algorithms
- Wang, Parhi
- 1995
Citation Context ...d Parhi presented an algorithm for resource-constrained scheduling of DSP applications when the number of processors is fixed and the objective is to obtain a schedule with the minimum iteration period [21]. Wolf et al studied combinations of various loop transformation techniques, (such as fission, fusion, tiling, interchanging, etc) and presented a model for estimating total machine cycle time, taking i... |

3 | (Pen)-ultimate tiling
- Boulet, Risset, Robert
- 1994
Citation Context ... criterion for defining optimal tiling in a scalable environment. In his method, an optimal tile shape can be determined by these criteria, and the tile size is obtained from the resources constraints [16]. Another interesting result was produced by Chame and Moon. They propose a new tile selection algorithm for eliminating self interference and simultaneously minimizing capacity and cross-interference... |

2 | Scheduling of uniform multi-dimensional systems under resource constraints
- Passos, Sha
- 1998
Citation Context ... be scheduled by any loop scheduling algorithm. We optimize the ALU schedule by using the multidimensional rotational scheduling algorithm because it has been shown to achieve an optimal ALU schedule [15]. This paper presents a method of deciding the best partition which achieves a balanced schedule, as well as deriving the theory to calculate the total memory requirement for a certain partition. Expe... |

1 | Loop scheduling optimization with data prefetching based on multi-dimensional retiming
- Chen, Tongsima, et al.
- 1998
Citation Context ...re instruction level parallelism, the schedule length is 21 CPU clock cycles because of the long memory prefetch time which dominates the execution time. Finally, using the PBS algorithm presented in [3] which takes into account the balance between ALU computation and memory access time, a much better performance can be obtained, but the average schedule length is still 7 CPU clock cycles. Therefore,... |