## Efficient layout transformation for disk-based multidimensional arrays (2004)

### Download Links

- [www.cse.ohio-state.edu]
- [www.csc.lsu.edu]
- [csc.lsu.edu]
- DBLP

### Other Repositories/Bibliography

Venue: HiPC, pp. 386–398

Citations: 1 (0 self)

### BibTeX

@INPROCEEDINGS{Krishnamoorthy04efficientlayout,

author = {Sriram Krishnamoorthy and Gerald Baumgartner and Chi-chung Lam and Jarek Nieplocha},

title = {Efficient layout transformation for disk-based multidimensional arrays},

booktitle = {HiPC},

pages = {386--398},

year = {2004}

}

### Abstract

I/O libraries such as PANDA and DRA use blocked layouts for efficient access to disk-resident multi-dimensional arrays, with the shape of the blocks being chosen to match the expected access pattern of the array. Sometimes, different applications, or different phases of the same application, have very different access patterns for an array. In such situations, an array’s blocked layout representation must be transformed for efficient access. In this paper, we describe a new approach to solve the layout transformation problem and demonstrate its effectiveness in the context of the Disk Resident Arrays (DRA) library. The approach handles re-blocking and permutation of dimensions. Results are provided that demonstrate the performance benefit as compared to currently available mechanisms.
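To make the terms "blocked layout", "re-blocking", and "permutation of dimensions" concrete, the following is a minimal in-memory sketch. It is not the algorithm from the paper, which targets disk-resident data and I/O cost; it only shows what the transformation computes. The function names (`blocked_offset`, `make_blocked`, `transform_layout`) are hypothetical, and the sketch assumes every array dimension is an exact multiple of the block dimension, with blocks stored in row-major order and elements stored row-major inside each block.

```python
from itertools import product
from math import prod


def blocked_offset(index, shape, block):
    """Linear offset of `index` in a blocked layout (assumed convention:
    blocks in row-major order, row-major order inside each block)."""
    assert all(s % b == 0 for s, b in zip(shape, block)), "sketch assumes exact tiling"
    block_id = 0  # row-major position of the enclosing block in the block grid
    inner = 0     # row-major position of the element inside its block
    for i, s, b in zip(index, shape, block):
        block_id = block_id * (s // b) + i // b
        inner = inner * b + i % b
    return block_id * prod(block) + inner


def make_blocked(shape, block):
    """Build a flat blocked array whose elements are their own logical indices."""
    flat = [None] * prod(shape)
    for index in product(*(range(n) for n in shape)):
        flat[blocked_offset(index, shape, block)] = index
    return flat


def transform_layout(src, shape, src_block, perm, dst_block):
    """Permute dimensions and re-block: destination dimension k is source
    dimension perm[k]. Naive element-by-element copy, for illustration only."""
    dst_shape = [shape[p] for p in perm]
    dst = [None] * len(src)
    for dst_index in product(*(range(n) for n in dst_shape)):
        src_index = [0] * len(shape)
        for k, p in enumerate(perm):
            src_index[p] = dst_index[k]
        dst[blocked_offset(dst_index, dst_shape, dst_block)] = \
            src[blocked_offset(src_index, shape, src_block)]
    return dst


if __name__ == "__main__":
    shape = (4, 6)
    src = make_blocked(shape, (2, 3))                      # 2x3 blocks
    dst = transform_layout(src, shape, (2, 3), (1, 0), (3, 2))  # transpose, 3x2 blocks
    print(src[:6])
    print(dst[:6])
```

Each element here moves individually, which on disk would mean many small, non-contiguous accesses. That cost is the reason a dedicated transformation approach is needed for disk-resident arrays.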

### Citations

129 | FFTs in external or hierarchical memory - Bailey - 1990 |
Citation Context: ...will allow for efficient access. An example is the out-of-core 2D Fast Fourier Transform (FFT), where the array is accessed by columns in one phase and by rows in the other. The multi-dimensional FFT [5, 6] can be implemented as a series of one-dimensional FFTs, one along each dimension. Another example illustrating very different access patterns is with image data in three and four (including time) dim...

70 | Global arrays: a nonuniform memory access programming model for high-performance computers - Nieplocha, Harrison, et al. - 1996 |
Citation Context: ... the proposed approach for efficient layout transformation. In Section 5, experimental results are presented. Section 6 concludes the paper. 2 Disk Resident Arrays The Global Arrays (GA) library [16] [17] provides a shared-memory programming model in which data locality is explicitly managed by the programmer. Explicit function calls are used to transfer data between global address space and local sto...

33 | A Fast Computer Method for Matrix Transposing - Eklundh - 1972 |
Citation Context: ... into memory, transposed and written to disk. Since the different row segments of a 2D tile are not contiguous on disk, this could be extremely inefficient unless the tile size is very large. Eklundh [19] proposed a multi-pass algorithm, in which the minimum unit of I/O is a row. The number of passes in the algorithm is proportional to the array dimensions. Kaushik et al. [20] reduced the number of re...

31 | Disk Resident Arrays: An Array-Oriented I/O Library for Out-Of-Core Computations - Nieplocha, Foster - 1996 |
Citation Context: ...sk resident data. To optimize performance in collective I/O operations between arrays located on disk and in distributed main memory of parallel computers [1], I/O libraries like PANDA [2, 3] and DRA [4] use a blocked layout representation for the disk-based multidimensional arrays instead of the dimension-ordered representation used typically for the representation of multidimensional arrays in main...

31 | Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization - Cociorva, Wilkins, et al. - 2001 |

29 | NWChem, A Computational Chemistry Package for Parallel Computers, Version 4.6. Pacific Northwest National Laboratory - High Performance Computational Chemistry Group - 2004 |
Citation Context: ...ften the tensors (essentially multi-dimensional arrays) are too large to fit in memory and must be disk-based. The input tensors are often generated by other quantum chemistry packages such as NWChem [15], with a layout quite different from that needed for efficient processing by the TCE-generated code. This paper describes an approach to efficient transformation of data between disk-based multidimensi...

28 | A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry - Baumgartner, Bernholdt, et al. - 2002 |

27 | Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations - Cociorva, Baumgartner, et al. - 2002 |

19 | Multidimensional array I/O in Panda 1.0 - Seamons, Winslett - 1996 |
Citation Context: ...us blocks of disk resident data. To optimize performance in collective I/O operations between arrays located on disk and in distributed main memory of parallel computers [1], I/O libraries like PANDA [2, 3] and DRA [4] use a blocked layout representation for the disk-based multidimensional arrays instead of the dimension-ordered representation used typically for the representation of multidimensional ar...

19 | Efficient transposition algorithms for large matrices - Kaushik, Huang, et al. - 1993 |
Citation Context: ...e is very large. Eklundh [19] proposed a multi-pass algorithm, in which the minimum unit of I/O is a row. The number of passes in the algorithm is proportional to the array dimensions. Kaushik et al. [20] reduced the number of read operations and increased the read block size compared to Eklundh’s algorithm. Suh and Prasanna [21] proposed an algorithm that minimized the total number of I/O operations,...

16 | Global Communication Optimization for Tensor Contraction Expressions under Memory Constraints - Cociorva, Gao, et al. - 2001 |

13 | Data locality optimization for synthesis of efficient out-of-core algorithms - Krishnan, Krishnamoorthy, et al. - 2003 |

12 | Optimizing collective I/O performance on parallel computers: A multisystem study - Chen, Foster, et al. - 1997 |
Citation Context: ...hat I/O be done using contiguous blocks of disk resident data. To optimize performance in collective I/O operations between arrays located on disk and in distributed main memory of parallel computers [1], I/O libraries like PANDA [2, 3] and DRA [4] use a blocked layout representation for the disk-based multidimensional arrays instead of the dimension-ordered representation used typically for the repr...

9 | An efficient algorithm for out-of-core matrix transposition - Suh, Prasanna |
Citation Context: ...in the algorithm is proportional to the array dimensions. Kaushik et al. [20] reduced the number of read operations and increased the read block size compared to Eklundh’s algorithm. Suh and Prasanna [21] proposed an algorithm that minimized the total number of I/O operations, while potentially increasing the total volume of I/O. Krishnamoorthy et al. [22] formulated these algorithms in a tensor produ...

8 | Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver - Krishnan, Krishnamoorthy, et al. - 2004 |

8 | Global arrays: a portable programming model for distributed memory computers - Nieplocha, Harrison, et al. - 1994 |
Citation Context: ...sents the proposed approach for efficient layout transformation. In Section 5, experimental results are presented. Section 6 concludes the paper. 2 Disk Resident Arrays The Global Arrays (GA) library [16] [17] provides a shared-memory programming model in which data locality is explicitly managed by the programmer. Explicit function calls are used to transfer data between global address space and loca...

6 | A stepwise approach to computing the multidimensional Fast Fourier Transform of large arrays - Anderson - 1980 |
Citation Context: ...will allow for efficient access. An example is the out-of-core 2D Fast Fourier Transform (FFT), where the array is accessed by columns in one phase and by rows in the other. The multi-dimensional FFT [5, 6] can be implemented as a series of one-dimensional FFTs, one along each dimension. Another example illustrating very different access patterns is with image data in three and four (including time) dim...

5 | On efficient out-of-core matrix transposition - Krishnamoorthy, Baumgartner, et al. - 2003 |
Citation Context: ...ared to Eklundh’s algorithm. Suh and Prasanna [21] proposed an algorithm that minimized the total number of I/O operations, while potentially increasing the total volume of I/O. Krishnamoorthy et al. [22] formulated these algorithms in a tensor product notation and derived a generic algorithm that attempts to minimize the total execution time by taking into consideration the I/O characteristics of th...

3 | Efficient parallel out-of-core matrix transposition - Krishnamoorthy, Baumgartner, et al. - 2003 |
Citation Context: ...nimizes the total execution time by taking into consideration the I/O characteristics of the system, and subsequently extended it to a multi-processor system, in which each processor has a local disk [23]. Most of the above approaches assume the array dimensions and the memory size to be powers-of-2. This assumption, coupled with the fact that the required transformation is a transposition, allows di...

2 | Adaptive resolution isosurface construction in three and four dimensions - Kazhiyur-Mannar, Wenger, et al. - 2003 |
Citation Context: ...n writes to disk, and hence limits the blocking possible. To efficiently perform computations on the stored data in a parallel system, the data might have to be transformed into a different blocked form [7]. Thus there are situations where performance can be greatly improved by transforming the layout of a multidimensional array on disk to match the application’s access pattern. Our primary motivation f...