Results 1–10 of 14
Compiler Transformations for High-Performance Computing
ACM Computing Surveys, 1994
Abstract

Cited by 416 (3 self)
In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimizations for ...
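The kind of scalar transformation the survey describes can be sketched with a small, hypothetical before/after pair (the function names and values are illustrative, not drawn from the survey): hoisting the loop-invariant subexpression `a + b` out of the loop removes redundant instructions from every iteration.

```python
# Hypothetical illustration of two classic scalar optimizations:
# common-subexpression elimination and loop-invariant code motion.

def before(a, b, xs):
    out = []
    for x in xs:
        # (a + b) is recomputed twice on every iteration
        out.append(x * (a + b) + (a + b))
    return out

def after(a, b, xs):
    s = a + b  # loop-invariant subexpression hoisted and computed once
    out = []
    for x in xs:
        out.append(x * s + s)
    return out
```

Both versions compute the same values; the second simply executes fewer instructions per iteration, which is the uniprocessor optimization goal the abstract describes.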
Parallelizing Compilers: Implementation and Effectiveness
, 1993
Abstract

Cited by 9 (0 self)
An important thank you goes to one of my undergraduate professors, Ken Kennedy. He proposed the project that led to this thesis, and my desire to know the answer gave me the strength to complete this work. I would like to thank the languages group at Kubota Pacific Computers, Inc. for showing me that I could indeed be productive and that all problems in compilers did not take years to solve. My sanity is thanks to all of my friends from dancing, "O" runs, and everything else. They made it possible to return to work each day and eventually to graduate. I owe my parents a great debt for encouraging me to stay in graduate school even when I thought I would never finish. Last, but certainly not least, I would like to thank Don Ramsey for reading many drafts and listening to many dry runs. His input greatly helped the presentation of this thesis in both oral and written forms.
Serializing Parallel Programs by Removing Redundant Computation
 Master's thesis, MIT
, 1994
Abstract

Cited by 9 (0 self)
Programs often exhibit more parallelism than is actually available in the target architecture. This thesis introduces and evaluates three methods (loop unrolling, loop common expression elimination, and loop differencing) for automatically transforming a parallel algorithm into a less parallel one that takes advantage of only the parallelism available at run time. The resulting program performs less computation to produce its results; the running time is not just improved via second-order effects such as improving use of the memory hierarchy or reducing overhead (such optimizations can further improve performance). The asymptotic complexity is not usually reduced, but the constant factors can be lowered significantly, often by a factor of 4 or more. The basis for these methods is the detection of loop common expressions, or common subexpressions in different iterations of a parallel loop. The loop differencing method also permits computation of just the change in an expression from iteration to iteration.
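Loop differencing as the abstract describes it can be sketched on a toy sliding-window sum (a hypothetical example, not taken from the thesis): instead of recomputing the full window each iteration, the loop carries the previous value forward and applies only the change.

```python
# Naive version: each iteration redoes the whole k-element sum.
def window_sums_naive(xs, k):
    return [sum(xs[i:i + k]) for i in range(len(xs) - k + 1)]

# Differenced version: the window sum is a loop common expression
# shared between iterations; only the delta between consecutive
# windows is computed.
def window_sums_differenced(xs, k):
    s = sum(xs[:k])
    out = [s]
    for i in range(1, len(xs) - k + 1):
        s += xs[i + k - 1] - xs[i - 1]  # change from the previous iteration
        out.append(s)
    return out
```

The differenced loop does O(1) work per iteration instead of O(k), lowering the constant factor while leaving the asymptotic complexity of producing all the sums unchanged.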
Goal-Directed Performance Tuning for Scientific Applications
, 1996
Abstract

Cited by 5 (0 self)
ABSTRACT: Goal-Directed Performance Tuning for Scientific Applications, by Tien-Pao Shih. Chair: Edward S. Davidson.
Performance tuning, as carried out by compiler designers and application programmers to close the performance gap between the achievable peak and delivered performance, becomes increasingly important and challenging as microprocessor speeds and system sizes increase. However, although performance tuning on scientific codes usually deals with relatively small program regions, it is not generally known how to establish a reasonable performance objective and how to efficiently achieve this objective. We suggest a goal-directed approach and develop such an approach for each of three major system performance components: central processing unit (CPU) computation, memory accessing, and communication. For the CPU, we suggest using a machine-application performance model that characterizes workloads on four key function units (memory, floating-point, issue, and a virtual "dependence unit") to produce an upper bound performance objective, and derive a mechanism to approach this objective. A case study shows an average 1.79x speedup achieved by using this approach for the Livermore Fortran Kernels 1-12 running on the IBM RS/6000. For memory, as compulsory and capacity misses are relatively easy to characterize, we derive a method for building application-specific cache behavior models that report the number of misses for all three types of conflict misses: self, cross, and ping-pong. The method uses averaging concepts to determine the expected number of cache misses instead of attempting to count them exactly in each instance, which provides a more rapid, yet realistic assessment of expected cache behavior. For each type of conflict miss, we propose a reduction method that uses one or a combination of three techniques based on modifying or exploiting data layout: array padding, initial address adjustment, and access resequencing.
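Array padding's effect on self-interference can be sketched with a toy direct-mapped cache model (the line size, set count, and leading dimensions below are illustrative assumptions, not the thesis's parameters):

```python
LINE = 8   # elements per cache line (assumed)
SETS = 64  # number of sets in a direct-mapped cache (assumed)

def cache_set(addr):
    # set index for an element address in a direct-mapped cache
    return (addr // LINE) % SETS

def column_sets(lead_dim, n_rows, col=0):
    # cache sets touched while walking down one column of a 2-D array
    # stored row-major with the given leading dimension
    return [cache_set(r * lead_dim + col) for r in range(n_rows)]
```

With a leading dimension of 512 (a multiple of LINE * SETS), `column_sets(512, 8)` maps every element of the column to a single set, giving pure self-interference; padding the leading dimension to 520 spreads the same column across eight distinct sets, which is the layout effect array padding exploits.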
Bulk Synchronous Parallel Scheduling of Uniform Dags
Euro-Par'96 Parallel Processing, Lecture Notes in Computer Science 1124
, 1996
Abstract

Cited by 4 (4 self)
This paper addresses the dag scheduling problem, proposing the bulk synchronous parallel (BSP) model as a framework for the derivation of general-purpose parallel computer schedules of uniform dags, i.e., of dags that stand for tightly-nested loops with computable distance vectors. A general technique for the BSP scheduling of normalised uniform dags is introduced and analysed in terms of the BSP cost model, and methods for the normalisation of generic uniform dags are briefly overviewed in the paper.

1 Introduction
During the last two decades a great deal of research effort has been devoted to the identification and scheduling of potential parallelism. Despite the criticism pointing out its slow pace of progress, this research has led to remarkable advances. Data dependence analysis [2], loop transformation [1, 13], potential parallelism identification [6, 13], and dag scheduling [4, 7] are but a few examples of fields whose tremendous development has provided techniques successful...
Methods for Linear and Nonlinear Array Data Dependence Analysis with the Chains of Recurrences Algebra
, 2007
The Queen's University of Belfast
, 1994
Abstract
It is generally a difficult task to construct efficient implementations of numerical mathematical algorithms for execution on high-performance computer systems. The difficulty arises from the need to express an implementation in a form that reflects the nature of the computer system, rather than a form that reflects the computations performed by the algorithm. This thesis develops the method of program transformation to derive automatically efficient implementations of algorithms from high-level, machine-independent specifications. The primary system considered is the AMT DAP array processor, but sequential and vector systems are also considered. The transformational method is further extended to automatically tailor implementations to use sparse programming techniques.

Acknowledgements
My thanks are due to Dr Terence Harmer for his patience and guidance through the four years this thesis has been in the making; to the members of the Department of Computer Science at QUB for a multitude of stimulating coffee breaks; and to my fellow candidates for their support (and for making my working hours look almost sane).
Detecting Value-Based Scalar Dependence
 Computing. Cornell University
, 1994
Abstract
Precise value-based data dependence analysis for scalars is useful for advanced compiler optimizations. The new method presented here for flow and output dependence uses Factored Use and Def chains (FUD chains), our interpretation and extension of Static Single Assignment. It is precise with respect to conditional control flow and dependence vectors. Our method detects dependences which are independent with respect to arbitrary loop nesting, as well as loop-carried dependences. A loop-carried dependence is further classified as being carried by the previous iteration, with distance 1, or by any previous iteration, with direction <. This precision cannot be achieved by traditional analysis, such as dominator information or reaching definitions. To compute anti-dependence, we use Factored Redef-Use chains, which are related to FUD chains. We are not aware of any prior work which explicitly deals with scalar data dependence utilizing a sparse graph representation.

1 Introduction...
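The classification in the abstract can be made concrete with two hypothetical scalar loops (illustrative code, not from the paper): in the first, the use of `t` always reads the value written by the immediately preceding iteration (distance 1); in the second, the definition is conditional, so the reaching definition may come from any earlier iteration (direction <).

```python
def carried_distance_1(n):
    # flow dependence carried with distance 1: the use of t in
    # iteration i always reads the def from iteration i - 1
    t = 0
    out = []
    for i in range(n):
        out.append(t)
        t = i
    return out

def carried_any_previous(xs):
    # conditional def: the value of t read in iteration i may have
    # been written by any earlier iteration (direction <)
    t = 0
    out = []
    for x in xs:
        out.append(t)
        if x % 2 == 0:
            t = x
    return out
```

In the second loop a traditional reaching-definitions analysis would report the same (imprecise) fact for both cases; the distance-1 versus direction-< distinction is exactly the extra precision the abstract claims for FUD chains.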
A New Approach to Parallelizing Nested Loops Using the Wavefront Method
, 1992
Abstract
This paper presents a novel approach to the problem of parallelizing nested loops using the wavefront method. Though not as powerful as some existing loop transformation theories, the approach has the advantage of not requiring knowledge of either the loop bounds or the direct dependences (distance vectors). Instead, a representation based on affine sets is used which captures all the dependences, both direct and indirect. The key to the wavefront method itself is finding a function that efficiently maps each loop iteration to a wavefront. Using the dependence information contained in the affine set together with a geometric representation of the execution order, we derive algebraic constraints on such a wavefront function. These constraints define a space of permissible functions from which it is possible to select one that optimizes some chosen criterion. Some existing methods for carrying this out are examined. Finally, the technique is illustrated on a couple of examples.

1 Introdu...
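A minimal sketch of the wavefront idea (not the paper's affine-set derivation): for a doubly nested loop in which iteration (i, j) depends on (i-1, j) and (i, j-1), the function f(i, j) = i + j is a valid wavefront function, since both predecessors of any iteration lie on an earlier wavefront; all iterations mapped to the same wavefront are independent and may run in parallel.

```python
def wavefronts(n, m):
    # group iterations of an n x m loop nest by the wavefront
    # function f(i, j) = i + j
    fronts = {}
    for i in range(n):
        for j in range(m):
            fronts.setdefault(i + j, []).append((i, j))
    return [fronts[k] for k in sorted(fronts)]
```

For a 2 x 2 nest this yields three wavefronts: [(0, 0)], then [(0, 1), (1, 0)], then [(1, 1)]. The paper's contribution is deriving constraints that characterize the whole space of such functions rather than fixing one in advance.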