Results 11–20 of 22
Performance Optimization of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry
Abstract

Cited by 1 (1 self)
Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using the algebraic properties of commutativity and associativity can significantly decrease the number of arithmetic operations required to evaluate these expressions. The identification of common subexpressions among a set of tensor contraction expressions can further reduce the total number of operations required to evaluate the tensor contractions. The first part of the paper describes an effective algorithm for operation minimization …
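The operation-minimization idea the abstract describes can be illustrated with the simplest case: by associativity, the two parenthesizations of a chained contraction produce the same result at very different arithmetic cost. The following minimal sketch (all names and shapes are illustrative, not from the paper) counts multiply-adds for a matrix chain A(m,k) · B(k,n) · C(n,p):

```python
# Illustrative sketch: associativity changes the operation count of a
# chained contraction without changing its result.

def flops_left(m, k, n, p):
    # (A @ B) @ C costs m*k*n + m*n*p multiply-adds
    return m * k * n + m * n * p

def flops_right(m, k, n, p):
    # A @ (B @ C) costs k*n*p + m*k*p multiply-adds
    return k * n * p + m * k * p

m, k, n, p = 100, 100, 100, 2
print(flops_left(m, k, n, p))   # 1020000
print(flops_right(m, k, n, p))  # 40000
```

For these shapes the right-associated order is over 25x cheaper; the paper's algorithm searches for such orderings (and shared subexpressions) across whole sets of tensor contractions.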
FEMS: An Adaptive Finite Element Solver ∗
Abstract
In this paper we investigate how to obtain high-level adaptivity on complex scientific applications such as Finite Element (FE) simulators by building an adaptive version of their computational kernel, which consists of a sparse linear system solver. We present the software architecture of FEMS, a parallel multifrontal solver for FE applications whose main feature is an install-time training phase where adaptation to the computing platform takes place. FEMS relies on a simple model-driven mesh partitioning strategy, which makes it possible to perform efficient static load balancing on both homogeneous and heterogeneous machines.
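The model-driven static load balancing the abstract mentions can be sketched as assigning mesh elements to processors in proportion to per-processor speeds measured during the install-time training phase. This is a hypothetical illustration of the general idea, not FEMS's actual partitioner:

```python
# Hypothetical sketch of speed-proportional static load balancing:
# give each processor a share of the mesh proportional to its measured
# speed, then distribute the rounding remainder one element at a time.

def partition_sizes(n_elements, speeds):
    total = sum(speeds)
    sizes = [int(n_elements * s / total) for s in speeds]  # floor shares
    for i in range(n_elements - sum(sizes)):               # leftover elements
        sizes[i % len(sizes)] += 1
    return sizes

# A heterogeneous machine: two slow nodes and one twice as fast.
print(partition_sizes(1000, [1.0, 1.0, 2.0]))  # [250, 250, 500]
```

On a homogeneous machine the speeds are equal and this degenerates to an even split, which is why one strategy can cover both cases.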
Member of HiPEAC
Abstract
Developing an optimizing compiler for a newly proposed architecture is extremely difficult when there is only a simulator of the machine available. Designing such a compiler requires running many experiments in order to understand how different optimizations interact. Given that simulators are orders of magnitude slower than real processors, such experiments are highly restricted. This paper develops a technique to automatically build a performance model for predicting the impact of program transformations on any architecture, based on a limited number of automatically selected runs. As a result, the time for evaluating the impact of any compiler optimization in early design stages can be drastically reduced such that all selected potential compiler optimizations can be evaluated. This is achieved by first evaluating a small set of sample compiler optimizations on a prior set of benchmarks in order to train a model, followed by a very small number of evaluations, or probes, of the target program. We show that by training on less than 0.7% of all possible transformations (640 samples collected from 10 benchmarks out of 880,000 possible samples, 88,000 per training benchmark) and probing the new program on only 4 transformations, we can predict the performance of all program transformations with an error of just 7.3% on average. As each prediction takes almost no time to generate, this scheme provides an accurate method of evaluating compiler performance, which is several orders of magnitude faster than current approaches.
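The train-then-probe scheme can be sketched in miniature: fit a model on a few probed transformations, then predict the speedup of every candidate transformation without running it. The feature vectors, speedups, and linear model below are entirely synthetic stand-ins for the paper's learned model:

```python
import numpy as np

# Hypothetical sketch: predict the speedup of all candidate
# transformations from only a handful of probe runs, using a
# least-squares fit on synthetic (noiseless, linear) data.

rng = np.random.default_rng(0)
features = rng.random((880, 3))            # one row per candidate transformation
true_w = np.array([0.5, -0.2, 0.1])
speedup = features @ true_w + 1.0          # "ground truth" the model must recover

probe_idx = [0, 100, 400, 700]             # only 4 transformations are actually run
X = np.c_[features[probe_idx], np.ones(len(probe_idx))]
w, *_ = np.linalg.lstsq(X, speedup[probe_idx], rcond=None)

pred = np.c_[features, np.ones(len(features))] @ w
print(float(np.max(np.abs(pred - speedup))))
```

Because the synthetic data is exactly linear, four probes suffice to recover the model; on real programs the paper reports an average error of 7.3% rather than an exact fit.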
Probabilistic AutoTuning for Architectures with Complex Constraints
Abstract
It is hard to optimize applications for coprocessor accelerator architectures, like FPGAs and GPUs, because application parameters must be tuned carefully to the size of the target architecture. Moreover, some combinations of parameters simply do not work, because they lead to overuse of a constrained resource. Applying autotuning—the use of search algorithms and empirical feedback to optimize programs—is an attractive solution, but tuning in the presence of unpredictable failures is not addressed well by existing autotuning methods. This paper describes a new autotuning method that is based on probabilistic predictions of multiple program features (run time, memory consumption, etc.). During configuration selection, these predictions are combined to balance the preference for trying configurations that are likely to be high quality against the preference for trying configurations that are likely to satisfy all constraints. In our experiments, our new autotuning method performed substantially better than the simpler approach of treating all failed configurations as if they succeeded with a “very low” quality. In many cases, the simpler strategy required more than twice as many trials to reach the same quality level.
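The quality/feasibility balance the abstract describes can be sketched as ranking each configuration by its predicted quality weighted by the probability that it satisfies every resource constraint. The scoring rule, configurations, and numbers below are illustrative assumptions, not the paper's actual model:

```python
import math

# Hypothetical sketch: score = predicted quality * P(all constraints hold),
# instead of treating likely-infeasible configurations as merely "very bad".

def feasibility_prob(pred_usage, pred_std, limit):
    # P(usage <= limit) under a normal predictive distribution
    z = (limit - pred_usage) / pred_std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def score(pred_quality, pred_mem, mem_std, mem_limit):
    return pred_quality * feasibility_prob(pred_mem, mem_std, mem_limit)

# (name, predicted quality, predicted memory in GB, memory std-dev)
configs = [
    ("aggressive", 10.0, 4.5, 1.0),  # fastest, but likely over the 4 GB limit
    ("moderate",    7.0, 3.0, 0.5),
    ("safe",        4.0, 1.5, 0.2),
]
best = max(configs, key=lambda c: score(c[1], c[2], c[3], mem_limit=4.0))
print(best[0])  # moderate
```

The "aggressive" configuration has the highest predicted quality but only about a 31% chance of fitting in memory, so the probability-weighted score prefers the moderate one.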
A Comparison of Online and Offline Strategies for Program Adaptation
Abstract
Multithreaded programs executing on modern high-end computing systems have many potential avenues to adapt their execution to improve performance, energy consumption, or both. Program adaptation occurs anytime multiple execution modes are available to the application and one is selected based on information collected during program execution. As a result, some degree of online or offline analysis is required to decide how best to adapt, and there are a variety of tradeoffs to consider when deciding which form of analysis to use, as the overheads they carry with them can vary widely in degree as well as type, as can their effectiveness. In this paper, we attempt to qualitatively and quantitatively analyze the pros and cons of specific types of online and offline forms of information collection and analysis for use in dynamic program adaptation in the context of high performance computing. We focus on providing recommendations of which strategy to employ for users with specific requirements. To justify our recommendations we use data collected from two offline and three online analysis strategies used with a specific power-performance adaptation technique, concurrency throttling. We provide a two-level analysis, comparing online and offline strategies and then comparing strategies within each category. Our results show clear trends in the appropriateness of particular strategies depending on the length of application execution – more specifically the number of iterations in the program – as well as different expected use characteristics ranging from one execution to many, with fixed versus variable program inputs across executions.
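Concurrency throttling, the adaptation technique the study uses, can be sketched as an online search: time a few candidate thread counts during early iterations, then lock in the fastest. The cost model below is synthetic (real systems would measure actual iteration times):

```python
# Hypothetical sketch of online concurrency throttling: probe candidate
# thread counts, time each, and keep the fastest for the rest of the run.

def simulated_iteration_time(threads):
    # Synthetic cost model: parallel speedup that saturates, plus a
    # contention penalty when too many threads share memory bandwidth.
    return 1.0 / threads + 0.02 * threads

def choose_concurrency(candidates):
    timings = {t: simulated_iteration_time(t) for t in candidates}
    return min(timings, key=timings.get)

best = choose_concurrency([1, 2, 4, 8, 16])
print(best)  # 8
```

This kind of online probing pays its overhead during the first iterations, which is why the paper finds that the best strategy depends on how many iterations the program runs.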
A Turing Scholars Honors Thesis
, 2007
Abstract
Matrix multiplication is often treated as a basic unit of computation in terms of which other operations are implemented, yielding high performance. In this paper initial evidence is provided that there is a benefit gained when lower-level kernels, from which matrix multiplication is composed, are exposed. In particular it is shown that matrix multiplication itself can be coded at a high level of abstraction as long as interfaces to such low-level kernels are exposed. Furthermore, it is shown that higher-performance implementations of parallel matrix multiplication can be achieved by exploiting access to these kernels. Experiments on Itanium2 and Pentium 4 servers support these insights.
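The thesis's structure can be sketched as a high-level blocked matrix multiplication written against an exposed inner kernel; only the kernel would be hand-tuned. The code below is an illustrative stand-in (the `inner_kernel` here just delegates to NumPy, whereas a real implementation would be architecture-specific):

```python
import numpy as np

# Illustrative sketch: matrix multiplication coded at a high level of
# abstraction in terms of an exposed low-level block kernel.

def inner_kernel(c_block, a_panel, b_panel):
    # The performance-critical kernel; a tuned library would replace this.
    c_block += a_panel @ b_panel

def blocked_matmul(a, b, nb=64):
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n))
    # Loop over blocks; slicing handles the ragged edges automatically.
    for i in range(0, m, nb):
        for j in range(0, n, nb):
            for p in range(0, k, nb):
                inner_kernel(c[i:i+nb, j:j+nb],
                             a[i:i+nb, p:p+nb],
                             b[p:p+nb, j:j+nb])
    return c

a = np.random.default_rng(1).random((100, 80))
b = np.random.default_rng(2).random((80, 90))
print(np.allclose(blocked_matmul(a, b), a @ b))  # True
```

Because the outer loops are architecture-independent, exposing the kernel interface lets the same high-level code retarget to new machines, which is the thesis's central claim.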
Combining Model and Iterative Compilation for Program Performance Optimization
Abstract
Abstract—The performance gap for high-performance applications has been widening over time. High-level program transformations are critical to improving applications' performance, and many of them concern the determination of optimal values for transformation parameters, such as loop unrolling and blocking factors. Static approaches derive these values from analytical models, which are hard to construct because of increasing architecture complexity and code structures. Recent iterative compilation approaches instead execute different versions of the program on actual platforms and select the one that renders the best performance, outperforming static compilation approaches significantly. But the expensive compilation cost has limited their application scope to embedded applications and a small group of math kernels. This paper proposes a combinative approach: Combining Model and Iterative Compilation for Program Performance Optimization (CMIC). This approach first constructs a program optimization transformation model based on hardware performance counters to decide how and when to apply transformations, and then selects the optimal transformation parameters using the Nelder-Mead simplex algorithm. Experimental results show that our approach can effectively improve programs' floating-point performance and reduce programs' runtime, thereby lessening the performance gap for high-performance applications.
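The parameter-selection step can be sketched as a search over (unroll factor, block size) against a measured runtime. CMIC uses the Nelder-Mead simplex algorithm; a plain exhaustive search over a small discrete grid stands in for it here, and the cost model is entirely synthetic:

```python
# Hypothetical sketch: searching transformation parameters against a
# synthetic runtime model. (The paper uses Nelder-Mead; exhaustive grid
# search over a few candidates stands in for it in this illustration.)

def synthetic_runtime(unroll, block):
    # Toy cost model: moderate unrolling and cache-sized blocks win.
    return (unroll - 4) ** 2 + 0.001 * (block - 128) ** 2 + 1.0

candidates = [(u, b) for u in (1, 2, 4, 8) for b in (32, 64, 128, 256)]
best = min(candidates, key=lambda p: synthetic_runtime(*p))
print(best)  # (4, 128)
```

The point of combining a model with search, as the abstract argues, is that the model prunes which transformations to consider, leaving the search only the numeric parameters.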
A Thesis "zur Erlangung des Grades" (for the attainment of a degree)
, 2008
Abstract
For many numerical codes the transport of data from main memory to the registers is commonly considered the main limiting factor for achieving high performance on present microarchitectures. This fact is referred to as the memory wall. A lot of research targets this point on different levels, covering for example code transformations and architecture-aware data structures that achieve an optimal usage of the memory hierarchy found in all present microarchitectures. This work shows that on modern microarchitectures it is also necessary to take the requirements of the Single Instruction Multiple Data (SIMD) programming paradigm and data prefetching into account to reach high efficiency. In this thesis the chain from high-level algorithmic optimizations, over the code generation process involving the compiler and the limitations and influences of the instruction set architecture, down to the microarchitecture of the underlying hardware is analyzed. As a result we present a strategy to achieve high efficiency for memory-bandwidth-limited algorithms on modern architectures. The success of this strategy is shown on the algorithmic class of grid-based numerical linear equation solvers: a 2D Red-Black Gauss-Seidel smoother imple…
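The Red-Black Gauss-Seidel smoother named in the abstract can be sketched as follows: grid points are split by the parity of (i + j), so every point of one color depends only on points of the other color and a whole color can be updated independently, which is what makes the sweep SIMD- and prefetch-friendly. This is a minimal reference version in plain Python, not the thesis's optimized implementation:

```python
import numpy as np

# Minimal sketch of a 2D Red-Black Gauss-Seidel sweep for -Laplace(u) = f
# on a unit grid with spacing h. Points with (i + j) even are "red",
# odd are "black"; each color is updated in an independent half-sweep.

def redblack_gs_sweep(u, f, h):
    for parity in (0, 1):                       # red sweep, then black sweep
        for i in range(1, u.shape[0] - 1):
            start = 1 + ((i + 1 + parity) % 2)  # first interior j with (i+j)%2 == parity
            for j in range(start, u.shape[1] - 1, 2):
                u[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] +
                                  u[i, j-1] + u[i, j+1] + h * h * f[i, j])
    return u

# Smoke test: with f = 0 and zero boundaries, sweeps damp the error to 0.
n = 17
u = np.zeros((n, n)); u[1:-1, 1:-1] = 1.0
f = np.zeros((n, n))
for _ in range(100):
    redblack_gs_sweep(u, f, 1.0 / (n - 1))
print(float(np.max(np.abs(u))))
```

The per-color independence is exactly what a vectorized implementation exploits: all red updates in a row sit at stride two and can be processed with SIMD loads plus prefetching of the next row.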
Automating the Generation of Composed Linear Algebra Kernels
Abstract
Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, because the BLAS are tuned in isolation, they do not take advantage of opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe the novel algorithm underlying the compiler that searches for the best combination of optimization choices, and we present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of 5% to 140% relative to GotoBLAS and vendor-tuned BLAS on the Intel Core 2 and the AMD Opteron.
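The memory optimization that composition enables can be sketched with one classic pairing: computing z = Aᵀ(Ax) as two isolated BLAS calls reads the matrix A twice, while a fused loop reads each row of A exactly once. The example below illustrates the idea (it is not the paper's compiler output, and the Python loop is of course slower than the BLAS; only the memory-traffic structure is the point):

```python
import numpy as np

# Illustrative sketch: composing y = A @ x and z = A.T @ y.

def unfused(A, x):
    y = A @ x          # first pass over A
    return A.T @ y     # second pass over A

def fused(A, x):
    # One pass over A: each row is loaded once and used for both the
    # dot product (y_i = a_i . x) and the update (z += y_i * a_i).
    z = np.zeros(A.shape[1])
    for i in range(A.shape[0]):
        row = A[i]
        yi = row @ x
        z += yi * row
    return z

A = np.random.default_rng(3).random((50, 40))
x = np.random.default_rng(4).random(40)
print(np.allclose(unfused(A, x), fused(A, x)))  # True
```

For memory-bandwidth-limited sizes, halving the traffic over A is where speedups of the reported magnitude come from, which is why isolated per-routine tuning leaves performance on the table.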