Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
In Proc. of the Intl. Conf. on High Performance Computing, 2001
Cited by 28 (22 self)
The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry and computational physics. These computations are expressible as a set of tensor contractions and arise in electronic structure calculations. This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures. We focus on an approach to performing data locality optimization in this context. Preliminary experimental results on an SGI Origin 2000 are encouraging and demonstrate that the approach is effective.
Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations
2002
Cited by 26 (19 self)
The accurate modeling of the electronic structure of atoms and molecules is very computationally intensive. Many models of electronic structure, such as the Coupled Cluster approach, involve collections of tensor contractions. There are usually a large number of alternative ways of implementing the tensor contractions, representing different trade-offs between the space required for temporary intermediates and the total number of arithmetic operations. In this paper, we present an algorithm that starts with an operation-minimal form of the computation and systematically explores the possible space-time trade-offs to identify the form with lowest cost that fits within a specified memory limit. Its utility is demonstrated by applying it to a computation representative of a component in the CCSD(T) formulation in the NWChem quantum chemistry suite from Pacific Northwest National Laboratory.
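The space-time trade-off described in this abstract can be illustrated on a toy contraction. The sketch below is not the paper's algorithm; it assumes a hypothetical three-matrix chain S[i][l] = sum over j, k of A[i][j] * B[j][k] * C[k][l], and all function names are invented for illustration. It compares a form that needs no temporary with an operation-reduced form that stores an n x n intermediate, then picks the cheapest form that fits a memory limit.

```python
def direct(A, B, C):
    """Single loop nest, no temporary: 2 multiplies in each of n**4 iterations."""
    n = len(A)
    S = [[0] * n for _ in range(n)]
    mults = 0
    for i in range(n):
        for l in range(n):
            for j in range(n):
                for k in range(n):
                    S[i][l] += A[i][j] * B[j][k] * C[k][l]
                    mults += 2
    return S, mults, 0  # result, multiply count, temporary elements


def factored(A, B, C):
    """Form T = A*B first, then S = T*C: 2 * n**3 multiplies, n*n temporary elements."""
    n = len(A)
    T = [[0] * n for _ in range(n)]
    S = [[0] * n for _ in range(n)]
    mults = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                T[i][k] += A[i][j] * B[j][k]
                mults += 1
    for i in range(n):
        for k in range(n):
            for l in range(n):
                S[i][l] += T[i][k] * C[k][l]
                mults += 1
    return S, mults, n * n


def cheapest_form_within(n, memory_limit):
    """Pick the form with the fewest multiplies whose temporary fits the limit."""
    forms = [("factored", 2 * n**3, n * n), ("direct", 2 * n**4, 0)]
    feasible = [f for f in forms if f[2] <= memory_limit]
    return min(feasible, key=lambda f: f[1])[0]
```

With only two candidate forms the search is trivial; the abstract describes a systematic exploration over many more alternatives.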
Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models
Proceedings of the IEEE, 2005
A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry
In Proc. of Supercomputing 2002, 2002
Cited by 24 (13 self)
This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of the synthesis system, which transforms a high-level specification of the computation into high-performance parallel code tailored to the characteristics of the target architecture. An example from computational chemistry is used to illustrate how different code structures are generated under different assumptions of available memory on the target computer.
Loop Optimizations for a Class of Memory-Constrained Computations
2001
Cited by 21 (16 self)
Compute-intensive multi-dimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different space-time trade-offs, are possible. By computing and storing some intermediate arrays, reduction of the number of arithmetic operations is possible, but the size of intermediate temporary arrays may be prohibitively large. Loop fusion can be applied to reduce memory requirements, but that could impede effective tiling to minimize memory access costs. This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays. An algorithm is presented that addresses the selection of tile sizes and choice of loops for fusion, with the objective of minimizing cache misses while keeping the total memory usage within a given limit. Experimental results are reported that demonstrate the effectiveness of the combined loop tiling and fusion transformations performed by using the developed framework.
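The interaction between tiling and fusion can be sketched on a small example. The code below is an illustration under stated assumptions, not the paper's model or algorithm: it assumes a hypothetical chain S = (A*B)*C with square matrices, and the function names and the `tile` parameter are invented here. Fusing the producer and consumer loops at tile granularity shrinks the temporary from n x n elements to at most tile x n elements.

```python
def chain_unfused(A, B, C):
    """Compute T = A*B in full, then S = T*C. The temporary holds n*n elements."""
    n = len(A)
    T = [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(n)]
         for i in range(n)]
    S = [[sum(T[i][k] * C[k][l] for k in range(n)) for l in range(n)]
         for i in range(n)]
    return S, n * n


def chain_tiled_fused(A, B, C, tile):
    """Tile the row loop and fuse producer and consumer inside each tile.

    Only one block of `tile` rows of T is alive at any time.
    """
    n = len(A)
    S = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        rows = range(ii, min(ii + tile, n))
        # Producer: the rows of T needed by this tile only.
        T = [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(n)]
             for i in rows]
        # Consumer: uses the block of T immediately, then discards it.
        for r, i in enumerate(rows):
            for l in range(n):
                S[i][l] = sum(T[r][k] * C[k][l] for k in range(n))
    return S, min(tile, n) * n
```

Pure Python will not show cache effects; the point is only the loop structure and the size of the temporary that results from each choice.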
Optimization of Memory Usage and Communication Requirements for a Class of Loops Implementing Multi-Dimensional Integrals
1999
Cited by 14 (9 self)
Multi-dimensional integrals of products of several arrays arise in certain scientific computations. To optimize the performance of such computations on parallel computers, the total amount of communication needs to be minimized, while staying within the available memory on each processor. In the context of these integral calculations, this paper addresses a communication minimization problem with memory constraint. We first investigate the memory usage minimization problem for sequential computers. Based on a framework that models the relationship between loop fusion and memory usage, we propose algorithms for finding loop fusion configurations that minimize memory usage under static and dynamic memory allocation models. We suggest ways to further reduce memory usage, when necessary, at the cost of increased arithmetic operations. A practical example shows the performance improvement obtained by our algorithms on an electronic structure computation. The algorithms are then extended to ...
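The relationship between loop fusion and memory usage mentioned above can be seen in a minimal sketch. It assumes a hypothetical computation y = sum over i of w[i] * (sum over k of A[i][k] * x[k]); the names are invented for illustration and this is not one of the paper's algorithms. Fusing the two loops over i lets the intermediate array T collapse to a single scalar.

```python
def unfused(A, x, w):
    """Two separate loops over i: the intermediate T needs n elements."""
    n = len(A)
    T = [0] * n
    for i in range(n):
        for k in range(n):
            T[i] += A[i][k] * x[k]
    y = 0
    for i in range(n):
        y += w[i] * T[i]
    return y, n  # result, temporary elements


def fused(A, x, w):
    """The loops over i are fused: T[i] is consumed as soon as it is produced."""
    n = len(A)
    y = 0
    for i in range(n):
        t = 0  # scalar replaces the whole array T
        for k in range(n):
            t += A[i][k] * x[k]
        y += w[i] * t
    return y, 1
```

Both versions perform the same arithmetic; only the lifetime, and therefore the size, of the intermediate changes.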
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver
In Proc. of the 18th Intl. Parallel & Distributed Processing Symposium (IPDPS), 2004
Cited by 6 (4 self)
We address the problem of efficient out-of-core code generation for a special class of imperfectly nested loops encoding tensor contractions. These loops operate on arrays too large to fit in physical memory. The problem involves determining optimal tiling and placement of disk I/O statements. This entails a search in an explosively large parameter space. We formulate the problem as a non-linear optimization problem and use a discrete constraint solver to generate optimized out-of-core code. Measurements on sequential and parallel versions of the generated code demonstrate the effectiveness of the proposed approach.
High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs
Cited by 6 (0 self)
Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact the performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results. Index Terms: Parallel algorithms, reconfigurable hardware.
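The data hazard described above arises because each addition in a running sum depends on the previous result, which a pipelined adder delivers only several cycles later. A common way around this, shown below as a software model, is to interleave several independent partial sums. This sketch is only loosely motivated by the abstract: it is not the paper's circuit, it does not claim to reproduce the striding method, and the `latency` parameter and function name are assumptions made for illustration.

```python
def interleaved_accumulate(values, latency):
    """Accumulate a stream using `latency` independent partial sums.

    The value arriving in cycle c is added to slot c % latency. That slot
    was last updated `latency` cycles earlier, so in a pipelined adder of
    that depth its previous result would already be available.
    """
    partial = [0] * latency
    for cycle, v in enumerate(values):
        partial[cycle % latency] += v
    # Final step: combine the partial sums into one result.
    total = 0
    for p in partial:
        total += p
    return total, partial
```

The cost of this scheme is the extra storage for the partial sums and the final combining step, which is the kind of trade-off among adders, buffer size, and latency that the abstract refers to.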
A Performance Optimization Framework for Compilation of Tensor Contraction Expressions into Parallel Programs
In Proc. Int’l Workshop on High-Level Parallel Programming Models and Supportive Environments (held in conjunction with IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS)), 2002
Cited by 5 (0 self)
This paper discusses a program synthesis system to facilitate the generation of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of the synthesis system under development, which will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures. Several components of the synthesis system are described, focusing on the compile-time optimization issues that they address.
Optimization of memory usage requirement for a class of loops implementing multi-dimensional integrals
In Languages and Compilers for Parallel Computing, 1999
Cited by 4 (4 self)
Multi-dimensional integrals of products of several arrays arise in certain scientific computations. In the context of these integral calculations, this paper addresses a memory usage minimization problem. Based on a framework that models the relationship between loop fusion and memory usage, we propose an algorithm for finding a loop fusion configuration that minimizes memory usage. A practical example shows the performance improvement obtained by our algorithm on an electronic structure computation.

