Results 1 - 10 of 26
Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models
 Proceedings of the IEEE
, 2005
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
 In Proc. of the Intl. Conf. on High Performance Computing
, 2001
Cited by 31 (24 self)
The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry and computational physics. These computations are expressible as a set of tensor contractions and arise in electronic structure calculations. This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures. We focus on an approach to performing data locality optimization in this context. Preliminary experimental results on an SGI Origin 2000 are encouraging and demonstrate that the approach is effective.
A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry
 In Proc. of Supercomputing 2002
, 2002
Cited by 28 (14 self)
This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of the synthesis system, which transforms a high-level specification of the computation into high-performance parallel code tailored to the characteristics of the target architecture. An example from computational chemistry is used to illustrate how different code structures are generated under different assumptions of available memory on the target computer.
Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations
, 2002
Cited by 27 (20 self)
The accurate modeling of the electronic structure of atoms and molecules is very computationally intensive. Many models of electronic structure, such as the Coupled Cluster approach, involve collections of tensor contractions. There are usually a large number of alternative ways of implementing the tensor contractions, representing different trade-offs between the space required for temporary intermediates and the total number of arithmetic operations. In this paper, we present an algorithm that starts with an operation-minimal form of the computation and systematically explores the possible space-time trade-offs to identify the form with lowest cost that fits within a specified memory limit. Its utility is demonstrated by applying it to a computation representative of a component in the CCSD(T) formulation in the NWChem quantum chemistry suite from Pacific Northwest National Laboratory.
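As a rough illustration of the space-time trade-off this abstract describes, the sketch below evaluates one small contraction, R[i] = sum over j,k of A[i][j] * B[j][k] * C[k], in two ways. The contraction itself, the function names, and the cost counters are assumptions chosen for illustration; they are not taken from the paper or from NWChem.

```python
def contract_direct(A, B, C):
    """R[i] = sum_{j,k} A[i][j] * B[j][k] * C[k], with no intermediate array."""
    n = len(A)
    R = [0.0] * n
    mults = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                R[i] += A[i][j] * B[j][k] * C[k]
                mults += 2  # two multiplications per innermost iteration
    return R, mults, 0  # result, multiplication count, extra elements stored


def contract_with_intermediate(A, B, C):
    """Same result, computed through the intermediate T[j] = sum_k B[j][k] * C[k]."""
    n = len(A)
    T = [0.0] * n  # temporary intermediate: costs n extra elements of storage
    mults = 0
    for j in range(n):
        for k in range(n):
            T[j] += B[j][k] * C[k]
            mults += 1
    R = [0.0] * n
    for i in range(n):
        for j in range(n):
            R[i] += A[i][j] * T[j]
            mults += 1
    return R, mults, len(T)
```

For n-by-n inputs the direct form needs 2n^3 multiplications and no temporary, while the form with the intermediate needs 2n^2 multiplications and n extra elements. With larger tensors the intermediates can grow much faster than in this toy case, which is where a memory limit starts to matter.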
Loop Optimizations for a Class of Memory-Constrained Computations
, 2001
Cited by 23 (18 self)
Compute-intensive multidimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different space-time trade-offs, are possible. By computing and storing some intermediate arrays, reduction of the number of arithmetic operations is possible, but the size of intermediate temporary arrays may be prohibitively large. Loop fusion can be applied to reduce memory requirements, but that could impede effective tiling to minimize memory access costs. This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays. An algorithm is presented that addresses the selection of tile sizes and choice of loops for fusion, with the objective of minimizing cache misses while keeping the total memory usage within a given limit. Experimental results are reported that demonstrate the effectiveness of the combined loop tiling and fusion transformations performed by using the developed framework.
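The memory effect of loop fusion mentioned in this abstract can be sketched on a two-step computation, T[i][j] = sum over k of A[i][k] * B[k][j] followed by S[i] = sum over j of T[i][j] * C[j]. The computation and the function names are hypothetical examples, and tiling is not modeled here; this is only a minimal sketch of how fusing the producer and consumer loops shrinks the intermediate.

```python
def unfused(A, B, C):
    """Producer and consumer loop nests kept separate: T is a full n x n array."""
    n = len(A)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                T[i][j] += A[i][k] * B[k][j]
    S = [0.0] * n
    for i in range(n):
        for j in range(n):
            S[i] += T[i][j] * C[j]
    return S, n * n  # result, elements of intermediate storage


def fused(A, B, C):
    """The i and j loops of producer and consumer are fused: T shrinks to a scalar."""
    n = len(A)
    S = [0.0] * n
    for i in range(n):
        for j in range(n):
            t = 0.0  # stands in for T[i][j]; consumed immediately below
            for k in range(n):
                t += A[i][k] * B[k][j]
            S[i] += t * C[j]
    return S, 1  # result, elements of intermediate storage
```

Both versions perform the same arithmetic in the same order. The fused version fixes the loop order around the intermediate, which hints at why fusion can constrain later tiling choices.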
Optimization of Memory Usage and Communication Requirements for a Class of Loops Implementing Multi-Dimensional Integrals
, 1999
Cited by 14 (9 self)
Multidimensional integrals of products of several arrays arise in certain scientific computations. To optimize the performance of such computations on parallel computers, the total amount of communication needs to be minimized, while staying within the available memory on each processor. In the context of these integral calculations, this paper addresses a communication minimization problem with memory constraint. We first investigate the memory usage minimization problem for sequential computers. Based on a framework that models the relationship between loop fusion and memory usage, we propose algorithms for finding loop fusion configurations that minimize memory usage under static and dynamic memory allocation models. We suggest ways to further reduce memory usage, when necessary, at the cost of increased arithmetic operations. A practical example shows the performance improvement obtained by our algorithms on an electronic structure computation. The algorithms are then extended to ...
High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs
Cited by 12 (0 self)
Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact the performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results. Index Terms: Parallel algorithms, reconfigurable hardware.
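The data hazard this abstract refers to can be pictured with a small software model: if an adder takes `depth` cycles to produce a result, a running sum that feeds each result back into the next addition must wait between additions. One common way around this is to interleave several independent partial sums. The sketch below is a simplified model built on that assumption; it does not reproduce the paper's tree-traversal or striding circuits, and the function names and the cycle model are hypothetical.

```python
def interleaved_accumulate(values, depth):
    """Accumulate a stream into `depth` independent partial sums.

    Value number t is added to partial sum t % depth, so a given partial
    sum is touched only once every `depth` steps. In this model that is
    when a `depth`-stage pipelined adder has the previous result ready.
    """
    partials = [0.0] * depth
    for t, x in enumerate(values):
        partials[t % depth] += x
    return partials


def stream_cycles(n_values, depth, interleaved):
    """Cycles needed to issue n_values additions in this simplified model.

    A single running sum waits `depth` cycles per dependent addition;
    the interleaved scheme issues one addition per cycle. The final
    combination of the partial sums is not counted.
    """
    return n_values if interleaved else n_values * depth
```

The partial sums still have to be combined at the end, and doing that without stalling or excessive buffering is part of the design problem the abstract describes.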
Memory-Constrained Communication Minimization for a Class of Array Computations
 Languages and Compilers for Parallel Computing
, 2002
Cited by 8 (6 self)
The accurate modeling of the electronic structure of atoms and molecules involves computationally intensive tensor contractions involving large multidimensional arrays. The efficient computation of complex tensor contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, but they can often be generated and used in batches through appropriate loop fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of interprocessor communication must be minimized, subject to the available memory on each processor. In this paper, we address the memory-constrained communication minimization problem in the context of this class of computations. Based on a framework that models the relationship between loop fusion and memory usage, we develop an approach to identify the best combination of loop fusion and data partitioning that minimizes interprocessor communication cost without exceeding the per-processor memory limit. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver
 In: Proc. of 18th Intl. Parallel & Distributed Processing Symposium (IPDPS)
, 2004
Cited by 8 (5 self)
We address the problem of efficient out-of-core code generation for a special class of imperfectly nested loops encoding tensor contractions. These loops operate on arrays too large to fit in physical memory. The problem involves determining optimal tiling and placement of disk I/O statements. This entails a search in an explosively large parameter space. We formulate the problem as a nonlinear optimization problem and use a discrete constraint solver to generate optimized out-of-core code. Measurements on sequential and parallel versions of the generated code demonstrate the effectiveness of the proposed approach.
A Performance Optimization Framework for Compilation of Tensor Contraction Expressions into Parallel Programs
 In Proc. Int’l Workshop on High-Level Parallel Programming Models and Supportive Environments (held in conjunction with IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS))
, 2002
Cited by 5 (0 self)
This paper discusses a program synthesis system to facilitate the generation of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of the synthesis system under development, which will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures. Several components of the synthesis system are described, focusing on the compile-time optimization issues that they address.