## Global communication optimization for tensor contraction expressions under memory constraints (2003)

Venue: | In IPDPS ’03: Proceedings of the 17th International Parallel & Distributed Processing Symposium |

Citations: | 19 - 13 self |

### Citations

331 | Improving Data Locality with Loop Transformations
- McKinley, Carr, et al.
- 1996
Citation Context ... Callahan et al. [1] present a technique to convert array references to scalar accesses in innermost loops. As mentioned earlier, loop fusion has also been used as a means of improving data locality [11, 24, 22, 21]. There has been much less work investigating the use of loop fusion as a means of reducing memory requirements [8, 23]. Another significant way in which our approach differs from other work that we a... |

225 | Improving register allocation for subscripted variables.
- Callahan, Carr, et al.
- 1990
Citation Context ...reordering; and their work is not aimed at minimizing array sizes. Lewis et al. [18] discusses the application of fusion directly to array statements in languages such as F90 and ZPL. Callahan et al. [1] present a technique to convert array references to scalar accesses in innermost loops. As mentioned earlier, loop fusion has also been used as a means of improving data locality [11, 24, 22, 21]. The... |

148 | Maximizing loop parallelism and improving data locality via loop fusion and distribution.
- Kennedy, McKinley
- 1993
Citation Context ...tly arises from the rotation of the array T1 for each iteration of the f loop. 5 Related work. Much work has been done on improving locality and parallelism by using loop fusion. Kennedy and McKinley [10] presented an algorithm for fusing a collection of loops to minimize the parallel loop synchronization overhead and maximize parallelism. They proved that finding loop fusions that maximize locality i... |

84 | Improving effective bandwidth through compiler enhancement of global cache reuse
- Ding, Kennedy
- 2004
Citation Context ...ec. S(a,b,i,j) S(a,b,i,j) 〈a,b〉 N/A 230.4MB 0 N/A. Kennedy [11] developed a fast algorithm that allows accurate modeling of data sharing as well as the use of fusion-enabling transformations. Ding [6] illustrates the use of loop fusion in reducing storage requirements through an example, but does not provide a general solution. Singhai and McKinley [24] examined the effects of loop fusion on data ... |

69 | Collective Loop Fusion for Array Contraction.
- Gao, Olsen, et al.
- 1992
Citation Context ...thesized code containing nested loop structures. Traditional compiler research does not address this use of loop fusion because this problem does not arise with manually-produced programs. Gao et al. [8] studied the contraction of arrays into scalars through loop fusion as a means to reduce array access overhead. They partitioned a collection of loop nests into fusible clusters using a max-flow min-c... |

63 | Fusion of loops for parallelism and locality
- Manjikian, Abdelrahman
- 1997

56 | On the Complexity of Loop Fusion
- Darte
- 1999
Citation Context ...oop synchronization overhead and maximize parallelism. They proved that finding loop fusions that maximize locality is NP-hard. Two polynomial-time algorithms for improving locality were given. Darte [5] discusses the complexity of maximal fusion of parallel loops. Recently, 0-7695-1926-1/03/$17.00 (C) 2003 IEEE Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’03)... |

44 | The implementation and evaluation of fusion and contraction in array languages
- Lewis, Lin, et al.
- 1998
Citation Context ...ions in compiling APL programs has been discussed by Guibas and Wyatt [7]. They considered loop fusion without any loop reordering; and their work is not aimed at minimizing array sizes. Lewis et al. [18] discusses the application of fusion directly to array statements in languages such as F90 and ZPL. Callahan et al. [1] present a technique to convert array references to scalar accesses in innermost ... |

43 | NWChem, A Computational Chemistry Package for Parallel Computers, Version 4.6 (Pacific Northwest National Laboratory)
- High Performance Computational Chemistry Group
- 2004
Citation Context ...n, and describes our algorithm for the memory-constrained communication minimization problem. Section 4 presents results from the application of the new algorithm to an example abstracted from NWChem [9]. We discuss related work in Section 5. Conclusions are provided in Section 6. 2 Elaboration of problem addressed. In the class of computations considered, the final result to be computed can be expres... |

38 | Memory-Optimal Evaluation of Expression Trees Involving Large Objects
- Lam, Cociorva, et al.
- 1999
Citation Context ...on transformations is important. So we addressed the problem of finding the choice of loop fusions for a given operator tree that minimizes the space required for all intermediate arrays after fusion [15, 14]. In this paper we address the optimization of a parallel implementation of this class of computations. If memory were abundant, the issue would be that of determining the optimal distributions/re-dis... |

37 | Optimization of array accesses by collective loop transformations
- Sarkar, Gao
- 1991
Citation Context ...r, loop fusion has also been used as a means of improving data locality [11, 24, 22, 21]. There has been much less work investigating the use of loop fusion as a means of reducing memory requirements [8, 23]. Another significant way in which our approach differs from other work that we are aware of, is that we attempt global optimization across a collection of loop nests using empirically derived cost mo... |

34 | Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
- Cociorva, Wilkins, et al.
- 2001
Citation Context ...ude algebraic transformations to minimize the number of arithmetic operations [13, 16], loop fusion and array contraction for memory space minimization [15, 16], tiling and data locality optimization [2], and space-time trade-off optimization [3]. Since the problem of determining the set of algebraic transformations to minimize operation count was found to be NP-complete, we developed a pruning search... |

34 | Compilation and delayed evaluation in APL
- Guibas, Wyatt
- 1978
Citation Context ...er-processor memory constraints in a distributed memory machine. Loop fusion in the context of delayed evaluation of array expressions in compiling APL programs has been discussed by Guibas and Wyatt [7]. They considered loop fusion without any loop reordering; and their work is not aimed at minimizing array sizes. Lewis et al. [18] discusses the application of fusion directly to array statements in ... |

33 | Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations
- Cociorva, Baumgartner, et al.
- 2002
Citation Context ...he number of arithmetic operations [13, 16], loop fusion and array contraction for memory space minimization [15, 16], tiling and data locality optimization [2], and space-time trade-off optimization [3]. Since the problem of determining the set of algebraic transformations to minimize operation count was found to be NP-complete, we developed a pruning search procedure [13] that is very efficient in p... |

32 | Data locality enhancement by memory reduction.
- Song, Xu, et al.
- 2001
Citation Context ...motivated by data locality enhancement and not memory reduction. Also, they only considered fusions of conformable loop nests, i.e., loop nests that contain exactly the same set of loops. Song et al. [25] have explored the use of loop fusion for memory reduction for sequential execution. They do not consider trading off memory for recomputation or the impact of data distribution on communication costs... |

31 | On Optimizing a Class of Multi-Dimensional Loops with Reductions for Parallel Execution
- Lam, Sadayappan, et al.
- 1997
Citation Context ... architecture. A number of compile-time optimizations are being incorporated into the program synthesis system. These include algebraic transformations to minimize the number of arithmetic operations [13, 16], loop fusion and array contraction for memory space minimization [15, 16], tiling and data locality optimization [2], and space-time trade-off optimization [3]. Since the problem of determining the s... |

30 | Performance Optimization of a Class of Loops Implementing Multi-Dimensional Integrals
- Lam
- 1999
Citation Context ... architecture. A number of compile-time optimizations are being incorporated into the program synthesis system. These include algebraic transformations to minimize the number of arithmetic operations [13, 16], loop fusion and array contraction for memory space minimization [15, 16], tiling and data locality optimization [2], and space-time trade-off optimization [3]. Since the problem of determining the s... |

29 | A Parameterized Loop Fusion Algorithm for Improving Parallelism and Cache Locality.
- Singhai, McKinley
- 1997
Citation Context ... use of fusion-enabling transformations. Ding [6] illustrates the use of loop fusion in reducing storage requirements through an example, but does not provide a general solution. Singhai and McKinley [24] examined the effects of loop fusion on data locality and parallelism together. They viewed the optimization problem as one of partitioning a weighted directed acyclic graph in which the nodes represe... |

26 | Achieving chemical accuracy with coupled cluster theory
- Lee, Scuseria
- 1997
Citation Context ...ian, NWChem, PSI, and MOLPRO. In particular, they comprise the bulk of the computation with the coupled cluster approach to the accurate description of the electronic structure of atoms and molecules [17, 20]. Computational approaches to modeling the structure and interactions of molecules, the electronic and optical properties of molecules, the heats and rates of chemical reactions, etc., are very import... |

23 | Fast greedy weighted fusion.
- Kennedy
- 2000
Citation Context ...N/A 34.6 sec. T1(b,c,d,f) T1(b,c,d) 〈d,b〉 〈d,b〉 108.0MB 902.0 sec. 888.5 sec. T2(b,c,j,k) T2(b,c,j,k) 〈k,b〉 〈k,b〉 230.4MB 0 36.2 sec. S(a,b,i,j) S(a,b,i,j) 〈a,b〉 N/A 230.4MB 0 N/A. Kennedy [11] developed a fast algorithm that allows accurate modeling of data sharing as well as the use of fusion-enabling transformations. Ding [6] illustrates the use of loop fusion in reducing storage require... |

20 | A compiler optimization algorithm for shared-memory multiprocessors
- McKinley
- 1998
Citation Context ... Callahan et al. [1] present a technique to convert array references to scalar accesses in innermost loops. As mentioned earlier, loop fusion has also been used as a means of improving data locality [11, 24, 22, 21]. There has been much less work investigating the use of loop fusion as a means of reducing memory requirements [8, 23]. Another significant way in which our approach differs from other work that we a... |

18 | An Introduction to Coupled Cluster Theory for Computational Chemists
- Crawford, Schaefer III
- 2000
Citation Context ...ed in part by the National Science Foundation through the Information Technology Research program (CHE-0121676 and CHE-0121706), and NSF grants CCR-0073800 and EIA-9986052. ... by coupled cluster methods [4], in which many computationally intensive components are expressible as a set of tensor contractions (explained later with an example). We are developing a program synthesis system that will transform... |

7 | Optimization of memory usage requirement for a class of loops implementing multi-dimensional integrals
- Lam, Cociorva, et al.
- 1999
Citation Context ...ed into the program synthesis system. These include algebraic transformations to minimize the number of arithmetic operations [13, 16], loop fusion and array contraction for memory space minimization [15, 16], tiling and data locality optimization [2], and space-time trade-off optimization [3]. Since the problem of determining the set of algebraic transformations to minimize operation count was found to b... |

3 | For a brief description, see
- Kumar, Grama, et al.
- 1994
Citation Context ...tensor contractions, which are essentially generalized matrix products on higher dimensional arrays (an example is provided later). We use a generalization of Cannon’s matrix multiplication algorithm [12] as the basis for the individual contractions. For many problems of prac... |
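
The last citation context describes tensor contractions as generalized matrix products on higher dimensional arrays. A minimal Python sketch of that idea follows. The index names, array shapes, and the function names `contract` and `contract_as_matmul` are illustrative assumptions, not taken from the paper, and the code is sequential, so it does not show the Cannon-style parallel algorithm the paper builds on.

```python
def contract(A, B):
    """Direct contraction: C[i][j] = sum over k, l of A[i][k][l] * B[k][l][j]."""
    I, K, L = len(A), len(A[0]), len(A[0][0])
    J = len(B[0][0])
    C = [[0] * J for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                for l in range(L):
                    C[i][j] += A[i][k][l] * B[k][l][j]
    return C


def contract_as_matmul(A, B):
    """Same contraction, written as an ordinary matrix product.

    The summation indices (k, l) are merged into one index m = k * L + l,
    so A becomes an I x (K*L) matrix and B a (K*L) x J matrix.
    """
    I, K, L = len(A), len(A[0]), len(A[0][0])
    J = len(B[0][0])
    A2 = [[A[i][k][l] for k in range(K) for l in range(L)] for i in range(I)]
    B2 = [B[k][l] for k in range(K) for l in range(L)]  # each B[k][l] is a row of length J
    return [
        [sum(A2[i][m] * B2[m][j] for m in range(K * L)) for j in range(J)]
        for i in range(I)
    ]
```

Both functions return the same result for any conforming inputs; merging the summation indices is what lets a matrix-multiplication algorithm serve as the basis for an individual contraction.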