Results 1–10 of 10
Localizing Nonaffine Array References
 In Proceedings of Parallel Architectures and Compilation Techniques
, 1999
Abstract

Cited by 56 (10 self)
Existing techniques can enhance the locality of arrays indexed by affine functions of induction variables. This paper presents a technique to localize nonaffine array references, such as the indirect memory references common in sparse-matrix computations. Our optimization combines elements of tiling, data-centric tiling, data remapping and inspector-executor parallelization. We describe our technique, bucket tiling, which includes the tasks of permutation generation, data remapping, and loop regeneration. We show that profitability cannot generally be determined at compile-time, but requires an extension to runtime. We demonstrate our technique on three codes: integer sort, conjugate gradient, and a kernel used in simulating a beating heart. We observe speedups of 1.91 on integer sort, 1.57 on conjugate gradient, and 2.69 on the heart kernel.
1. Introduction
Researchers have long sought to increase data locality and exploit parallelism in loop nests [34, 32, 16, 5, 33, 18]. These wor...
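The inspector/executor split behind bucket tiling can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the function names and the scatter-add loop body are invented for the example:

```python
def bucket_tile(idx, num_buckets, bucket_size):
    """Inspector: group loop iterations by the memory 'bucket' their
    indirect reference idx[i] falls into, yielding a new iteration
    order (a permutation of 0..len(idx)-1)."""
    buckets = [[] for _ in range(num_buckets)]
    for i, target in enumerate(idx):
        buckets[target // bucket_size].append(i)
    # Concatenating the buckets gives the permuted iteration order.
    return [i for b in buckets for i in b]

def scatter_add(dst, idx, src, order):
    """Executor: run the original loop body in the permuted order, so
    successive iterations touch nearby elements of dst."""
    for i in order:
        dst[idx[i]] += src[i]

idx = [7, 0, 5, 2, 7, 1]          # indirect (non-affine) references
src = [1, 2, 3, 4, 5, 6]
dst = [0] * 8
order = bucket_tile(idx, num_buckets=2, bucket_size=4)
scatter_add(dst, idx, src, order)
# Same result as the unpermuted loop: scatter-add is order-independent.
```

Because the permutation depends on the runtime values in `idx`, the inspector cost (and hence profitability) can only be judged at runtime, as the abstract notes.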
Optimizing Sparse Matrix Vector Multiplication on SMPs
 In Ninth SIAM Conference on Parallel Processing for Scientific Computing
, 1999
Abstract

Cited by 36 (2 self)
We describe optimizations of sparse matrix-vector multiplication on uniprocessors and SMPs. The optimization techniques include register blocking, cache blocking, and matrix reordering. We focus on optimizations that improve performance on SMPs, in particular, matrix reordering implemented using two different graph algorithms. We present a performance study of this algorithmic kernel, showing how the optimization techniques affect absolute performance and scalability, how they interact with one another, and how the performance benefits depend on matrix structure.
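The baseline kernel these optimizations target, sparse matrix-vector multiply over compressed sparse row (CSR) storage, looks roughly like this minimal unblocked sketch; the paper's register- and cache-blocked variants restructure this loop, and the blocking itself is not shown here:

```python
def spmv_csr(vals, cols, rowptr, x):
    """y = A @ x for A stored in CSR format: vals holds the nonzeros
    row by row, cols their column indices, and rowptr[r]:rowptr[r+1]
    delimits row r's slice of vals/cols."""
    y = [0.0] * (len(rowptr) - 1)
    for r in range(len(rowptr) - 1):
        acc = 0.0
        for k in range(rowptr[r], rowptr[r + 1]):
            acc += vals[k] * x[cols[k]]
        y[r] = acc
    return y

# A = [[1,0,2],[0,3,0],[4,0,5]] applied to the all-ones vector:
y = spmv_csr([1.0, 2.0, 3.0, 4.0, 5.0], [0, 2, 1, 0, 2],
             [0, 2, 3, 5], [1.0, 1.0, 1.0])
```

The irregular access `x[cols[k]]` is exactly what register blocking (dense sub-blocks) and matrix reordering (clustering nonzeros) aim to make more cache-friendly.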
A Comparison of Locality Transformations for Irregular Codes
, 2000
Abstract

Cited by 35 (8 self)
Researchers have proposed several data and computation transformations to improve locality in irregular scientific codes. We experimentally compare their performance and present GPART, a new technique based on hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighboring nodes with priority on nodes with high degree, and repeating a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. Experimental results show GPART matches the performance of more sophisticated partitioning algorithms to within 6%–8%, with a small fraction of the overhead. It is thus useful for optimizing programs whose running times are not known.
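A single clustering pass in the spirit of GPART might look like the following sketch. This is a hypothetical simplification: the actual algorithm runs several passes over a partition graph, whereas this version does one greedy pass over the original graph:

```python
def cluster_pass(adj, max_size):
    """One greedy clustering pass: visit nodes in decreasing degree
    (priority on high-degree nodes, as in GPART) and merge unassigned
    neighbors into the same partition until it reaches max_size.
    adj maps each node to its list of neighbors."""
    part = {}
    next_id = 0
    for v in sorted(adj, key=lambda v: -len(adj[v])):
        if v in part:
            continue
        part[v] = next_id
        size = 1
        for u in adj[v]:
            if u not in part and size < max_size:
                part[u] = next_id
                size += 1
        next_id += 1
    return part

# A 5-node path-like graph with a triangle at one end.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
part = cluster_pass(graph, max_size=3)
```

Data and computation would then be reordered so that each partition's nodes are contiguous in memory, which is the source of the locality benefit.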
Self-adapting numerical software for next generation applications
 Int. J. High Perf. Comput. Appl
, 2002
Abstract

Cited by 31 (6 self)
The challenge for the development of next generation software is the successful management of the complex grid environment while delivering to the scientist the full power of flexible compositions of the available algorithmic alternatives. Self-Adapting Numerical Software (SANS) systems are intended to meet this significant challenge. A SANS system comprises intelligent next generation numerical software that domain scientists – with disparate levels of knowledge of the algorithmic and programmatic complexities of the underlying numerical software – can use to easily express and efficiently solve their problem. The components of a SANS system are:
• A SANS agent with:
– An intelligent component that automates method selection based on data, algorithm and system attributes.
– A system component that provides intelligent management of and access to the computational grid.
– A history database that records relevant information generated by the intelligent component and maintains past performance data of the interaction (e.g., algorithmic, hardware specific, etc.) between SANS components.
• A simple scripting language that allows a structured multilayered implementation of the SANS while ensuring portability and extensibility of the user interface and underlying libraries.
• An XML/CCA-based vocabulary of metadata to describe behavioural properties of both data and algorithms.
• System components, including a runtime adaptive scheduler, and prototype libraries that automate the process of architecture-dependent tuning to optimize performance on different platforms.
A SANS system can dramatically improve the ability of computational scientists to model complex, interdisciplinary phenomena with maximum efficiency and a minimum of extra-domain expertise. SANS innovations (and their generalizations) will provide to the scientific and engineering community a dynamic computational environment in which the most effective library components are automatically selected based on the problem characteristics, data attributes, and the state of the grid.
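The intelligent-component idea, method selection driven by problem attributes with outcomes logged to a history database, can be illustrated with a toy dispatcher. Everything here is an assumption for illustration: the attribute keys, the decision rules, and the in-memory list standing in for the history database are not SANS interfaces:

```python
# Toy sketch of automated method selection (illustrative only).
HISTORY = []  # stands in for the SANS history database

def select_method(attrs):
    """Pick a Krylov solver from algebraic properties of the system.
    The rules are standard textbook guidance, not the SANS logic."""
    if attrs.get("symmetric") and attrs.get("positive_definite"):
        return "CG"        # conjugate gradient suits SPD systems
    if attrs.get("symmetric"):
        return "MINRES"    # symmetric indefinite
    return "GMRES"         # general nonsymmetric fallback

def solve(attrs):
    """Select a method and record the decision for later reuse."""
    method = select_method(attrs)
    HISTORY.append({"attrs": attrs, "method": method})
    return method
```

A real SANS agent would additionally consult past performance data in the history database and the current state of the grid before committing to a method.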
An efficient block variant of GMRES
 SIAM J. Sci. Comput
Abstract

Cited by 10 (2 self)
Abstract. We present an alternative to the standard restarted GMRES algorithm for solving a single right-hand side linear system Ax = b based on solving the block linear system AX = B. Additional starting vectors and right-hand sides are chosen to accelerate convergence. Algorithm performance, i.e. time to solution, is improved by using the matrix A in operations on groups of vectors, or “multivectors,” thereby reducing the movement of A through memory. The efficient implementation of our method depends on a fast matrix-multivector multiply routine. We present numerical results that show that the time to solution of the new method is up to two and a half times faster than that of restarted GMRES on preconditioned problems. Furthermore, we demonstrate the impact of implementation choices on data movement and, as a result, algorithm performance.
Key words. GMRES, block GMRES, iterative methods, Krylov subspace techniques, restart, nonsymmetric linear systems, memory access costs
AMS subject classifications. 65F10
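The fast matrix-multivector multiply the abstract refers to amounts to reusing each loaded element of A across the whole group of vectors, so A streams through memory once per group instead of once per vector. A dense pure-Python sketch of that access pattern (illustrative only; real implementations use sparse storage and tuned kernels):

```python
def mat_multivec(A, V):
    """Y = A @ V for a dense matrix A and a multivector V (a list of
    rows, each row holding one entry per vector in the group).  Each
    A[i][j] is loaded once and reused for every column of V, which is
    the data-movement saving behind the block variant."""
    n, m = len(A), len(V[0])
    Y = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(len(A[0])):
            a = A[i][j]           # load A[i][j] once...
            for k in range(m):    # ...reuse it for all m vectors
                Y[i][k] += a * V[j][k]
    return Y

# Applying A to the two identity columns recovers A itself.
Y = mat_multivec([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
```

With one vector per call, A would be read from memory m times; grouping the vectors reduces that to a single pass, which matters when A is too large for cache.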
On improving linear solver performance: A block variant of GMRes
 SIAM Journal on Scientific Computing
Abstract

Cited by 8 (1 self)
Abstract. The increasing gap between processor performance and memory access time warrants the reexamination of data movement in iterative linear solver algorithms. For this reason, we explore and establish the feasibility of modifying a standard iterative linear solver algorithm in a manner that reduces the movement of data through memory. In particular, we present an alternative to the restarted GMRES algorithm for solving a single right-hand side linear system Ax = b based on solving the block linear system AX = B. Algorithm performance, i.e. time to solution, is improved by using the matrix A in operations on groups of vectors. Experimental results demonstrate the importance of implementation choices on data movement as well as the effectiveness of the new method on a variety of problems from different application areas.
Automated memory analysis: improving the design and implementation of iterative algorithms
, 2005
ADAPTIVE IRREGULAR APPLICATIONS
Abstract
Many applications access memory in an irregular manner, causing poor cache performance due to the lack of data locality. In complex scientific applications such as computational fluid dynamics and N-body simulation, irregular and/or dynamic accesses are abundant. Most research has been focused on improving locality in regular codes through computation and data reorganization. Unfortunately, since irregular codes have different computation structures, transformation techniques for regular computations are not directly applicable to irregular codes. In my thesis, I present locality transformations of computation and data to exploit locality in irregular computations. Codes are first classified according to the number of irregular accesses in each unit of computation (e.g., loop iteration). Transformations are then applied appropriately. For sequential codes, data and computation reordering is applied. Computations are sorted based on the location of data being accessed. Elements of data are partitioned according to access patterns. For parallel codes, locality-conscious data and computation
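The step "computations are sorted based on the location of data being accessed" reduces, in its simplest form, to sorting loop iterations by the index they touch so the loop walks memory mostly sequentially. A minimal sketch (the function name is invented, and real reorderings sort on multi-dimensional coordinates or space-filling-curve keys rather than a single index):

```python
def reorder_by_data(indices):
    """Return a new iteration order in which iterations are visited
    in increasing order of the data location they access."""
    return sorted(range(len(indices)), key=lambda i: indices[i])

# Iterations touching locations 9, 2, 7, 2, 0 are reordered so the
# accessed addresses are non-decreasing.
order = reorder_by_data([9, 2, 7, 2, 0])
```

Data reordering is the complementary transformation: instead of moving the iterations to the data, the data elements are permuted to match the access pattern.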
Evaluating Locality Optimizations For Adaptive Irregular Scientific Codes
Abstract
 Add to MetaCart
(Show Context)
Irregular scientific codes experience poor cache performance due to their memory access patterns. Researchers have proposed several data and computation transformations to improve locality in irregular scientific codes. We experimentally compare their performance and present GPART, a new technique based on hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighboring nodes with priority on nodes with high degree, and repeating a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. We also experimentally evaluate parameters for partitioning algorithms to balance locality with overhead and explore the impact of locality optimizations on parallel performance. Finally, we evaluate locality optimizations for adaptive codes, where connection patterns dynamically change at intervals during program execution. We derive a simple cost model to guide locality optimizations when access pat...
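The simple cost model mentioned for adaptive codes, where reordering must be repeated whenever the connection pattern changes, can be sketched as a toy predicate: optimize only when the predicted savings over the remaining execution outweigh the one-time reordering overhead. The parameters and the linear savings formula below are illustrative assumptions, not the paper's model:

```python
def should_optimize(accesses_per_interval, intervals_remaining,
                    miss_rate_drop, miss_penalty, reorder_cost):
    """Toy locality-optimization cost model (all units in cycles).
    Predicted savings: accesses over the remaining intervals times
    the expected drop in miss rate times the per-miss penalty."""
    savings = (accesses_per_interval * intervals_remaining
               * miss_rate_drop * miss_penalty)
    return savings > reorder_cost

# Worth reordering early in the run, not near the end:
early = should_optimize(1000, 10, 0.2, 100, 150000)  # savings 200000
late = should_optimize(1000, 5, 0.2, 100, 150000)    # savings 100000
```

This captures why the same optimization can pay off when applied early but not when the remaining run is short or the access pattern changes too often.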