Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
Cited by 67 (4 self)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2-5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
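To make the idea of a nonlinear layout function concrete, here is a minimal sketch of a Morton (Z-order) index next to the usual row-major index. It is not taken from the paper: the function names, the `bits` parameter, and the choice to put row bits in the odd positions are assumptions made for illustration.

```python
def row_major_index(row: int, col: int, n_cols: int) -> int:
    """Conventional linear layout: rows are stored one after another."""
    return row * n_cols + col


def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Morton (Z-order) layout: interleave the bits of row and col.

    Assumed convention (hypothetical, for illustration): column bits go
    to even bit positions and row bits to odd bit positions.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)       # column bit b -> position 2b
        index |= ((row >> b) & 1) << (2 * b + 1)   # row bit b    -> position 2b+1
    return index


if __name__ == "__main__":
    # Print the Morton offset of every element of a 4 x 4 array.
    for r in range(4):
        print([morton_index(r, c) for c in range(4)])
```

Under this layout each aligned 2 x 2 block occupies four consecutive offsets, and the same holds recursively for 4 x 4 blocks and larger, which is the property that improves locality across several cache levels.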
Parallel Programmability and the Chapel Language
- Int. J. High Perform. Comput. Appl
Cited by 55 (5 self)
It is an increasingly common belief that the programmability of parallel machines is lacking, and that the high-end computing (HEC) community is suffering as a result of it. The population of users who can effectively program parallel machines comprises only a small fraction of those who can effectively program traditional sequential computers, and this gap seems only to be widening as time passes. The parallel computing community’s inability to tap the skills
Recursive Array Layouts and Fast Parallel Matrix Multiplication
- In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures
, 1999
Cited by 44 (3 self)
Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms;...
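As a rough illustration of the recursive control structure behind the standard algorithm (not Strassen's or Winograd's), the sketch below splits each operand into quadrants and recurses on the eight sub-products. It is an assumption-laden toy, not the paper's implementation: it is sequential, it stores matrices as ordinary row-major nested lists rather than in a recursive layout, it assumes square matrices whose size is a power of two, and the names `matmul_recursive`, `multiply`, and `cutoff` are hypothetical.

```python
def matmul_recursive(a, b, c, ar, ac, br, bc, cr, cc, n, cutoff=2):
    """Accumulate the product of two n x n sub-blocks into a sub-block of c.

    (ar, ac), (br, bc), (cr, cc) are the top-left corners of the sub-blocks
    in a, b, and c. Assumes n is a power of two (illustrative restriction).
    """
    if n <= cutoff:
        # Base case: plain triple loop on a small block.
        for i in range(n):
            for k in range(n):
                aik = a[ar + i][ac + k]
                for j in range(n):
                    c[cr + i][cc + j] += aik * b[br + k][bc + j]
        return
    h = n // 2
    # C[i][j] += A[i][k] * B[k][j] over the 2 x 2 grid of quadrants.
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                matmul_recursive(a, b, c,
                                 ar + i, ac + k,
                                 br + k, bc + j,
                                 cr + i, cc + j,
                                 h, cutoff)


def multiply(a, b):
    """Return a * b for square matrices whose size is a power of two."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    matmul_recursive(a, b, c, 0, 0, 0, 0, 0, 0, n)
    return c


if __name__ == "__main__":
    print(multiply([[1, 2], [3, 4]], [[1, 2], [3, 4]]))
```

The quadrant decomposition is what pairs naturally with a recursive layout: with such a layout, each quadrant touched by a recursive call would be a contiguous range of memory.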
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
Cited by 31 (0 self)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
NanosCompiler: A Research Platform for OpenMP Extensions
- In First European Workshop on OpenMP
, 1999
Cited by 27 (5 self)
This paper describes the main functionalities of the OpenMP NanosCompiler. It is a source-to-source parallelizing compiler implemented around a hierarchical internal program representation that captures the parallelism expressed by the user (through OpenMP directives and extensions) and the parallelism automatically discovered by the compiler through a detailed analysis of data and control dependences. The compiler is finally responsible for encapsulating work into threads, establishing their execution precedences and selecting the mechanisms to execute them in parallel. One of the main features of the NanosCompiler is the ability to exploit multiple levels of parallelism and generate work from multiple simultaneously executing threads.
Efficient Support of Parallel Sparse Computation for Array Intrinsic Functions of Fortran 90
, 1998
Cited by 6 (4 self)
Fortran 90 provides a rich set of array intrinsic functions. They form a rich source of parallelism and play an increasingly important role in automatic support of data parallel programming. However, there is no such support if these intrinsic functions are applied to sparse data sets. We address this gap by presenting an efficient library for parallel sparse computations with Fortran 90 array intrinsic operations. Our method provides both compression schemes and distribution schemes on distributed memory environments applicable to higher-dimensional sparse arrays. Sparse programs can be expressed concisely using array expressions, and parallelized with the help of our library. Preliminary experimental results on an IBM SP2 workstation cluster show that our approach is promising in supporting efficient sparse matrix computations on both sequential and distributed memory environments.
Program optimization in the domain of high-performance parallelism
- In this volume
, 2004
Cited by 5 (0 self)
I consider the problem of the domain-specific optimization of programs. I review different approaches, discuss their potential, and sketch instances of them from the practice of high-performance parallelism. Readers need not be familiar with high-performance computing.
Volume Driven Data Distribution for NUMA-Machines
- In Proceedings from the 6th International Euro-Par Conference on Parallel Processing
, 2000
Cited by 5 (0 self)
Highly scalable parallel computers, e.g. SCI-coupled workstation clusters, are NUMA architectures. Thus good static locality is essential for high performance and scalability of parallel programs on these machines. This paper describes novel techniques to optimize static locality at compilation time by application of data transformations and data distributions. The metric which guides the optimizations employs Ehrhart polynomials and allows the amount of static locality to be calculated precisely. The effectiveness of our novel techniques has been confirmed by experiments conducted on the SCI-coupled workstation cluster of the PC² at the University of Paderborn.
Expressing Irregular Computations in Modern Fortran Dialects
- In Fourth Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, Lecture Notes in Computer Science
, 1998
Cited by 4 (1 self)
Modern dialects of Fortran enjoy wide use and good support on high-performance computers as performance-oriented programming languages. By providing the ability to express nested data parallelism in Fortran, we enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application.
PARADIGM (version 2.0): A New HPF Compilation System
- In Proc. 1999 International Parallel Processing Symposium (IPPS'99)
, 1999
Cited by 3 (1 self)
In this paper, we present sample performance figures for a new linear algebra-based compilation framework implemented in a research HPF compiler called PARADIGM. The metrics considered include compilation times, execution times, and communication costs. We compare all of these metrics against commercial, industrial-strength compilers such as pghpf (v 2.2) and xlhpf (v 1.01) and show the superior benefits of PARADIGM (v 2.0) in all of the metrics used. We also demonstrate how robustly our framework performs in the presence of arbitrary alignments and distributions. The framework's symbolic manipulation capability is derived from an off-the-shelf commercial symbolic analysis package called Mathematica. Measured metrics for a few popular benchmarks such as Automatic Differentiation and Integration (ADI), Euler Fluxes, TOMCATV and 2-D Explicit Hydrodynamics (EXPL) have been presented.

