Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models
Proceedings of the IEEE, 2005
Memory-optimal evaluation of expression trees involving large objects
In Proc. Intl. Conf. on High Perf. Comp., 1999
Cited by 32 (21 self)
Abstract. The need to evaluate expression trees involving large objects arises in scientific computing applications such as electronic structure calculations. Often, the tree node objects are so large that only a subset of them can fit in memory at a time. This paper addresses the problem of finding an evaluation order of nodes in a given expression tree that uses the least memory. We develop an efficient algorithm that finds an optimal evaluation order in O(n^2) time for an n-node expression tree.
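The evaluation-order problem described in this abstract can be illustrated with a small brute-force sketch. This is not the paper's O(n^2) algorithm; it simply tries every valid order on a tiny tree. The tree, the node sizes, and the memory model (a node is allocated when it is evaluated, and its children are freed right afterwards) are assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical expression tree: node -> (size, children). Sizes are made up.
TREE = {
    "A": (5, ["B", "C"]),
    "B": (3, ["D"]),
    "C": (4, ["E"]),
    "D": (9, []),
    "E": (16, []),
}

def peak_memory(order, tree):
    """Peak memory when nodes are evaluated in the given order.

    Assumed model: a node's storage is allocated when it is evaluated,
    and its children's storage is released immediately afterwards.
    """
    live, peak = 0, 0
    for node in order:
        size, children = tree[node]
        live += size                      # allocate the node being evaluated
        peak = max(peak, live)
        live -= sum(tree[c][0] for c in children)  # free its operands
    return peak

def is_valid(order, tree):
    """An order is valid if every child is evaluated before its parent."""
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[c] < pos[n] for n, (_, kids) in tree.items() for c in kids)

def best_order(tree):
    """Exhaustive search over valid orders; only feasible for tiny trees."""
    valid = (o for o in permutations(tree) if is_valid(o, tree))
    return min(valid, key=lambda o: peak_memory(o, tree))
```

On this toy tree, evaluating the subtree with the larger leaf first (E, C, D, B, A) peaks at 20 units, while the order D, B, E, C, A peaks at 23, which shows why the choice of order matters.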
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
In Proc. of the Intl. Conf. on High Performance Computing, 2001
Cited by 31 (24 self)
The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry and computational physics. These computations are expressible as a set of tensor contractions and arise in electronic structure calculations. This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures. We focus on an approach to performing data locality optimization in this context. Preliminary experimental results on an SGI Origin 2000 are encouraging and demonstrate that the approach is effective.
A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry
In Proc. of Supercomputing 2002, 2002
Cited by 28 (14 self)
This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of the synthesis system, which transforms a high-level specification of the computation into high-performance parallel code tailored to the characteristics of the target architecture. An example from computational chemistry is used to illustrate how different code structures are generated under different assumptions of available memory on the target computer.
Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations
2002
Cited by 27 (20 self)
The accurate modeling of the electronic structure of atoms and molecules is very computationally intensive. Many models of electronic structure, such as the Coupled Cluster approach, involve collections of tensor contractions. There are usually a large number of alternative ways of implementing the tensor contractions, representing different trade-offs between the space required for temporary intermediates and the total number of arithmetic operations. In this paper, we present an algorithm that starts with an operation-minimal form of the computation and systematically explores the possible space-time trade-offs to identify the form with lowest cost that fits within a specified memory limit. Its utility is demonstrated by applying it to a computation representative of a component in the CCSD(T) formulation in the NWChem quantum chemistry suite from Pacific Northwest National Laboratory.
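The trade-off between temporary storage and arithmetic operations can be seen on a toy contraction, S[i] = sum over j,k of A[i][j] * B[j][k] * C[k]. The sketch below is only an illustration under assumed shapes (A is m-by-n, B is n-by-n, C has length n); the function names are hypothetical and it does not reproduce the search algorithm from the paper.

```python
def direct(A, B, C):
    """No temporary: the full product is recomputed in a triple loop."""
    n = len(C)
    mults = 0
    S = [0.0] * len(A)
    for i in range(len(A)):
        for j in range(n):
            for k in range(n):
                S[i] += A[i][j] * B[j][k] * C[k]
                mults += 2                # two multiplications per term
    return S, mults, 0                    # 0 elements of temporary storage

def with_intermediate(A, B, C):
    """Fewer operations, at the cost of a temporary array T of length n."""
    n = len(C)
    mults = 0
    T = [0.0] * n                         # T[j] = sum_k B[j][k] * C[k]
    for j in range(n):
        for k in range(n):
            T[j] += B[j][k] * C[k]
            mults += 1
    S = [0.0] * len(A)
    for i in range(len(A)):
        for j in range(n):
            S[i] += A[i][j] * T[j]
            mults += 1
    return S, mults, len(T)
```

Both functions return the same S; the second uses on the order of n^2 multiplications instead of n^3, but must hold T in memory. With higher-dimensional tensors the temporary can itself become too large, which is the situation the abstract describes.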
Loop Optimizations for a Class of Memory-Constrained Computations
2001
Cited by 23 (18 self)
Compute-intensive multidimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different space-time trade-offs, are possible. By computing and storing some intermediate arrays, reduction of the number of arithmetic operations is possible, but the size of intermediate temporary arrays may be prohibitively large. Loop fusion can be applied to reduce memory requirements, but that could impede effective tiling to minimize memory access costs. This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays. An algorithm is presented that addresses the selection of tile sizes and choice of loops for fusion, with the objective of minimizing cache misses while keeping the total memory usage within a given limit. Experimental results are reported that demonstrate the effectiveness of the combined loop tiling and fusion transformations performed by using the developed framework.
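How loop fusion shrinks an intermediate array can be shown on a minimal example. The computation below (a matrix-vector product followed by a dot product) and the function names are assumptions chosen for illustration; the sketch covers only fusion, not the tiling or the cost model developed in the paper.

```python
def unfused(A, B, C):
    """Producer and consumer loops are separate, so T must be stored whole."""
    n = len(A)
    T = [0.0] * n                 # intermediate array with n elements
    for i in range(n):
        for j in range(len(B)):
            T[i] += A[i][j] * B[j]
    s = 0.0
    for i in range(n):
        s += T[i] * C[i]
    return s

def fused(A, B, C):
    """The two i-loops are fused, so the intermediate shrinks to a scalar."""
    s = 0.0
    for i in range(len(A)):
        t = 0.0                   # replaces T[i]; only one value is live
        for j in range(len(B)):
            t += A[i][j] * B[j]
        s += t * C[i]             # consume the value as soon as it is ready
    return s
```

Fusing the common loop removes one dimension from the temporary. The abstract's caveat is that the fused loop order is then constrained, which may limit how freely the loops can be tiled for cache reuse.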
Optimization of Memory Usage and Communication Requirements for a Class of Loops Implementing Multi-Dimensional Integrals
1999
Cited by 14 (9 self)
Multidimensional integrals of products of several arrays arise in certain scientific computations. To optimize the performance of such computations on parallel computers, the total amount of communication needs to be minimized, while staying within the available memory on each processor. In the context of these integral calculations, this paper addresses a communication minimization problem with memory constraint. We first investigate the memory usage minimization problem for sequential computers. Based on a framework that models the relationship between loop fusion and memory usage, we propose algorithms for finding loop fusion configurations that minimize memory usage under static and dynamic memory allocation models. We suggest ways to further reduce memory usage, when necessary, at the cost of increased arithmetic operations. A practical example shows the performance improvement obtained by our algorithms on an electronic structure computation. The algorithms are then extended to ...
Memory-Constrained Communication Minimization for a Class of Array Computations
Languages and Compilers for Parallel Computing, 2002
Cited by 8 (6 self)
Abstract. The accurate modeling of the electronic structure of atoms and molecules involves computationally intensive tensor contractions involving large multidimensional arrays. The efficient computation of complex tensor contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, but they can often be generated and used in batches through appropriate loop fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of interprocessor communication must be minimized, subject to the available memory on each processor. In this paper, we address the memory-constrained communication minimization problem in the context of this class of computations. Based on a framework that models the relationship between loop fusion and memory usage, we develop an approach to identify the best combination of loop fusion and data partitioning that minimizes interprocessor communication cost without exceeding the per-processor memory limit. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver
In Proc. of 18th Intl. Parallel & Distributed Processing Symposium (IPDPS), 2004
Cited by 8 (5 self)
We address the problem of efficient out-of-core code generation for a special class of imperfectly nested loops encoding tensor contractions. These loops operate on arrays too large to fit in physical memory. The problem involves determining optimal tiling and placement of disk I/O statements. This entails a search in an explosively large parameter space. We formulate the problem as a nonlinear optimization problem and use a discrete constraint solver to generate optimized out-of-core code. Measurements on sequential and parallel versions of the generated code demonstrate the effectiveness of the proposed approach.
Performance Optimization of a Class of Loops Involving Sums of Products of Sparse Arrays
Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999
Cited by 8 (6 self)
Multidimensional integrals of products of several arrays arise in certain scientific computations. To optimize the performance of such computations on parallel computers, the total number of arithmetic operations and the total amount of communication need to be minimized. This paper addresses the operation minimization subproblem and the communication minimization subproblem. Earlier work had addressed these problems in a restricted context of dense arrays, with additional constraints. In this paper, general solutions are developed that handle sparse arrays and other features such as fast Fourier transforms and multiple use of arrays, that are characteristics of real computational physics. The new algorithm for the operation minimization subproblem has been implemented and used to generate solutions that improve over the best manually optimized ones by a factor of two. 1 Introduction This paper addresses the optimization of a class of loop computations that implement mul...