Results 1–10 of 36
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
Abstract

Cited by 76 (5 self)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, nonprogrammable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by reordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2–5% of total running time) and high performance benefits (reducing execution time by factors of 1.1–2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
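The Morton layout named in this abstract stores an array in Z-order by interleaving the bits of the row and column indices, so every aligned power-of-two quadrant occupies contiguous memory. A minimal index-computation sketch (illustrative only; the paper's 4D layout and its hybrid variants differ in detail):

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) to get the Z-order (Morton) offset.

    Element (row, col) of a 2**bits x 2**bits array is stored at this
    offset, so the four quadrants of any aligned power-of-two block are
    contiguous in memory -- the property that improves locality.
    """
    index = 0
    for b in range(bits):
        index |= ((row >> b) & 1) << (2 * b + 1)  # row bit -> odd position
        index |= ((col >> b) & 1) << (2 * b)      # col bit -> even position
    return index
```

For example, the four elements of the top-left 2 × 2 block map to offsets 0–3, one cache-friendly run of memory.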
Recursive Array Layouts and Fast Parallel Matrix Multiplication
 In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures
, 1999
Abstract

Cited by 54 (5 self)
Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2–2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms;...
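The standard recursive algorithm this line of work builds on splits each matrix into quadrants and recurses on the eight quadrant products (Strassen and Winograd replace the eight products with seven at the cost of extra additions). A compact sketch on plain nested lists, assuming n is a power of two and ignoring the layout question entirely:

```python
def rec_matmul(A, B):
    """Divide-and-conquer multiply of two n x n matrices (n a power of two):
    split into quadrants, recurse on the eight quadrant products, stitch."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2

    def quad(M, i, j):  # extract the (i, j) quadrant as a new h x h matrix
        return [row[j * h:(j + 1) * h] for row in M[i * h:(i + 1) * h]]

    def add(X, Y):      # elementwise sum of two h x h matrices
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    # C_ij = A_i0 * B_0j + A_i1 * B_1j, each product computed recursively
    C = [[None, None], [None, None]]
    for i in (0, 1):
        for j in (0, 1):
            C[i][j] = add(rec_matmul(quad(A, i, 0), quad(B, 0, j)),
                          rec_matmul(quad(A, i, 1), quad(B, 1, j)))
    # reassemble the four result quadrants row by row
    top = [C[0][0][r] + C[0][1][r] for r in range(h)]
    bot = [C[1][0][r] + C[1][1][r] for r in range(h)]
    return top + bot
```

The recursion's access pattern is exactly what recursive (quadrant-based) layouts serve well: each recursive call touches one contiguous quadrant instead of strided rows.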
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
Abstract

Cited by 39 (0 self)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2–2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
Load Sharing in Heterogeneous Systems via Weighted Factoring
 in Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1997
Abstract

Cited by 38 (0 self)
We consider the problem of scheduling a parallel loop with independent iterations on a network of heterogeneous workstations, and demonstrate the effectiveness of a variant of factoring, a scheduling policy originating in the context of shared address-space homogeneous multiprocessors. In the new scheme, weighted factoring, processors are dynamically assigned decreasing size chunks of iterations in proportion to their processing speeds. Through experiments on a network of SUN Sparc workstations we show that weighted factoring significantly outperforms variants of a work-stealing load-balancing algorithm and on certain applications dramatically outperforms factoring as well. We then study weighted work assignment analytically, giving upper and lower bounds on its performance under the assumption that the processor iteration execution times can be modeled as weighted random variables. Department of Computer Science, Polytechnic University, Brooklyn, NY, 11201. Research supported by AR...
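One plausible reading of weighted factoring as the abstract describes it: each scheduling round hands out roughly half of the remaining iterations, divided among processors in proportion to their relative speeds, so chunk sizes shrink geometrically round by round. A sketch of the chunk-size computation only (the rounding and minimum-chunk policy here are assumptions, not the paper's exact rule):

```python
def weighted_factoring(n_iters, weights):
    """Compute a weighted-factoring schedule: a list of rounds, each round
    holding one chunk size per processor.  Chunks in a round are sized in
    proportion to the processor weights (relative speeds); each round
    distributes about half of the work remaining at its start."""
    total_w = sum(weights)
    schedule = []
    remaining = n_iters
    while remaining > 0:
        batch = max(1, remaining // 2)           # half the remaining work
        round_chunks = []
        for w in weights:
            chunk = max(1, round(batch * w / total_w))  # speed-proportional
            chunk = min(chunk, remaining)
            round_chunks.append(chunk)
            remaining -= chunk
            if remaining == 0:
                break
        schedule.append(round_chunks)
    return schedule
```

For two equal-speed processors and 100 iterations this yields chunks of 25 each in the first round, then 12, 6, and so on, preserving factoring's decreasing-chunk property while skewing work toward faster processors when weights differ.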
Application Level Scheduling of Gene Sequence Comparison on Metacomputers
 In Proceedings of the 12th ACM International Conference on Supercomputing
, 1998
Abstract

Cited by 34 (5 self)
This paper investigates the efficacy of Application-Level Scheduling (AppLeS) [3] for a parallel gene sequence library comparison application in production metacomputing settings. We compare an AppLeS-enhanced version of the application to an original implementation designed and tuned to use the native scheduling mechanisms of Mentat [6], a metacomputing software infrastructure. The experimental data shows that the AppLeS versions outperform the best Mentat versions over a range of problem sizes and computational settings. The structure of the AppLeS we have defined for this application does not depend on the scheduling algorithms that it uses. Instead, the AppLeS scheduler considers the uncertainty associated with the information it uses in its scheduling decisions to choose between the static placement of computation, and the dynamic assignment of computation during execution. We propose that this framework is general enough to represent the class of metacomputing applications that a...
Cache-Efficient Matrix Transposition
Abstract

Cited by 26 (0 self)
We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low-level performance tuning “hacks”, such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.
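The tiling effect this abstract measures can be illustrated by a blocked in-place transpose, which walks the matrix tile by tile so that a tile and its mirror stay cache-resident while their elements are swapped. A sketch only, under the simplifying assumptions of a square list-of-lists matrix; it ignores the TLB and associativity effects the paper analyzes:

```python
def transpose_inplace(A, block=4):
    """Blocked in-place transpose of an n x n matrix (list of lists).
    Visits only tiles on or above the diagonal; each swap also fixes
    the mirrored element, so every off-diagonal pair is swapped once."""
    n = len(A)
    for bi in range(0, n, block):
        for bj in range(bi, n, block):          # upper-triangular tiles only
            for i in range(bi, min(bi + block, n)):
                # on a diagonal tile, start past the diagonal to avoid
                # swapping a pair twice (which would undo the transpose)
                j0 = bj if bi != bj else i + 1
                for j in range(j0, min(bj + block, n)):
                    A[i][j], A[j][i] = A[j][i], A[i][j]
```

With canonical row-major storage, the mirror tile is strided in memory; with the hierarchical layouts discussed above, both tiles are contiguous, which is the advantage the paper's conclusion points to.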
Data Exploration of Turbulence Simulations using a Database Cluster
, 2007
Abstract

Cited by 21 (4 self)
We describe a new environment for the exploration of turbulent flows that uses a cluster of databases to store complete histories of Direct Numerical Simulation (DNS) results. This allows for spatial and temporal exploration of high-resolution data that were traditionally too large to store and too computationally expensive to produce on demand. We perform analysis of these data directly on the database nodes, which minimizes the volume of network traffic. The low network demands enable us to provide public access to this experimental platform and its datasets through Web services. This paper details the system design and implementation. Specifically, we focus on hierarchical spatial indexing, cache-sensitive spatial scheduling of batch workloads, localizing computation through data partitioning, and load balancing techniques that minimize data movement. We provide real examples of how scientists use the system to perform high-resolution turbulence research from standard desktop computing environments.
A Comparison of Task Pools for Dynamic Load Balancing of Irregular Algorithms
, 2004
Abstract

Cited by 18 (3 self)
Since a static work distribution does not allow for satisfactory speedups of parallel irregular algorithms, there is a need for a dynamic distribution of work and data that can be adapted to the runtime behavior of the algorithm. Task pools are data structures which can distribute tasks dynamically to different processors where each task specifies computations to be performed and provides the data for these computations. This paper discusses the characteristics of task-based algorithms and describes the implementation of selected types of task pools for shared-memory multiprocessors. Several task pools have been implemented in C with POSIX threads and in Java. The task pools differ in the data structures to store the tasks, the mechanism to achieve load balance, and the memory manager used to store the tasks. Runtime experiments have been performed on three different shared-memory systems using a synthetic algorithm, the hierarchical radiosity method, and a volume rendering algorithm.
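A minimal central task pool in the spirit of this abstract: a shared queue of (function, args) tasks, with idle workers pulling work, and running tasks free to enqueue new subtasks, which is how irregular algorithms generate work dynamically. This sketches only the simplest variant the paper compares (a single central queue; it omits the distributed per-processor queues and custom memory managers the paper studies):

```python
import queue
import threading

def run_task_pool(tasks, n_workers=4):
    """Run (fn, args) tasks from a shared pool on n_workers threads.
    Each task receives the pool as its first argument so it can put()
    new subtasks; queue.join() waits until all tasks and subtasks end."""
    pool = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            fn, args = pool.get()          # block until a task is available
            out = fn(pool, *args)          # the task may enqueue subtasks
            with lock:
                results.append(out)
            pool.task_done()               # after any subtasks are enqueued

    for _ in range(n_workers):
        threading.Thread(target=worker, daemon=True).start()
    for t in tasks:
        pool.put(t)
    pool.join()                            # unfinished-task count reaches 0
    return results
```

Termination relies on `queue.Queue`'s unfinished-task counter: a subtask's `put()` increments it before the parent's `task_done()` decrements it, so `join()` cannot return while descendants remain; the daemon workers simply idle afterwards.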
Load Balancing and Data Locality in the Parallelization of the Fast Multipole Algorithm
, 1996
Abstract

Cited by 13 (9 self)
Scientific problems are often irregular, large and computationally intensive. Efficient parallel implementations of algorithms that are employed in finding solutions to these problems play an important role in the development of science. This thesis studies the parallelization of a certain class of irregular scientific problems, the N-body problem, using a classical hierarchical algorithm: the Fast Multipole Algorithm (FMA). Hierarchical N-body algorithms in general, and the FMA in particular, are amenable to parallel execution. However, performance gains are difficult to obtain, due to load imbalances that are primarily caused by the irregular distribution of bodies and of computation domains. Understanding application characteristics is essential for obtaining high performance implementations on parallel machines. After surveying the available parallelism in the FMA, we address the problem of exploiting this parallelism with partitioning and scheduling techniques that optimally map i...
Experiences With Fractiling In N-Body Simulations
 In Proceedings of High Performance Computing'98 Symposium
, 1998
Abstract

Cited by 9 (6 self)
N-body simulations pose load balancing problems mainly due to the irregularity of the distribution of particles and to the different processing requirements of particles in the interior and of those near the boundary of the computation space. In the past, most of the methods to overcome performance degradation due to load imbalance used profiling work from a previous time step. The overhead of these methods increases with the problem size and the number of processors. Moreover, these methods are not robust to load imbalances due to systemic variances (data access latency and operating system interference). Recently, Fractiling, a new dynamic scheduling technique based on a probabilistic analysis, has considerably improved performance on N-body simulations in a distributed-memory, shared-address-space environment. This technique adapts to algorithmic as well as systemic variances. Our goal is to experimentally extend this technique and evaluate its benefits in a message passing environment...