Results 1 - 7 of 7
Automatic thread extraction with decoupled software pipelining
 In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture
, 2005
"... {ottoni, ram, astoler, august}@princeton.edu Abstract Until recently, a steadily rising clock rate and otheruniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance fora wide range of applications. Current difficulties in maintaining this trend ..."
Abstract

Cited by 71 (15 self)
{ottoni, ram, astoler, august}@princeton.edu
Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.
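The core idea, partitioning a loop so that the sequential dependence-carrying stage feeds independent work to a second thread through a decoupling queue, can be sketched by hand (a minimal illustration only; `dswp_example` and its two stages are this sketch's, not the paper's compiler output):

```python
import threading
import queue

def dswp_example(data):
    """Split a loop into a pipeline: stage 1 performs the sequential,
    loop-carried traversal; stage 2 does the independent per-element
    work. The two stages run concurrently, decoupled by a queue."""
    q = queue.Queue()
    results = []

    def producer():
        # Stage 1: the sequential part (e.g. pointer chasing in a list).
        for item in data:
            q.put(item)
        q.put(None)  # sentinel: end of stream

    def consumer():
        # Stage 2: independent work per element, overlapped with stage 1.
        while True:
            item = q.get()
            if item is None:
                break
            results.append(item * item)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

The queue buffers between the stages, so a stall in one stage need not immediately stall the other, which is the latency-tolerance point the abstract makes.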
Transitive Closure of Infinite Graphs and its Applications
, 1995
"... Integer tuple relations can concisely summarize many types of information gathered from analysis of scientific codes. For example they can be used to precisely describe which iterations of a statement are data dependent of which other iterations. It is generally not possible to represent these tuple ..."
Abstract

Cited by 21 (4 self)
Integer tuple relations can concisely summarize many types of information gathered from analysis of scientific codes. For example, they can be used to precisely describe which iterations of a statement are data dependent on which other iterations. It is generally not possible to represent these tuple relations by enumerating the related pairs of tuples. For example, it is impossible to enumerate the related pairs of tuples in the relation {[i] → [i + 2] | 1 ≤ i ≤ n − 2}. Even when it is possible to enumerate the related pairs of tuples, such as for the relation {[i, j] → [i′, j′] | 1 ≤ i, j, i′, j′ ≤ 100}, it is often not practical to do so. We instead use a closed-form description by specifying a predicate consisting of affine constraints on the related pairs of tuples. As we just saw, these affine constraints can be parameterized, so what we are really describing are infinite families of relations (or graphs). Many of our applications of tuple relations rely heavily ...
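For intuition, a naive transitive closure only works once the parameterized relation is instantiated for a concrete n, which is exactly why the paper needs closed-form, symbolic closures (a sketch; `transitive_closure` is an illustrative helper, not the paper's algorithm):

```python
def transitive_closure(pairs):
    """Naive transitive closure of a finite relation given as pairs:
    repeatedly join the relation with itself until a fixed point."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if not new <= closure:
            closure |= new
            changed = True
    return closure

# The relation {[i] -> [i+2] | 1 <= i <= n-2} instantiated for n = 7:
n = 7
r = {(i, i + 2) for i in range(1, n - 1)}
closed = transitive_closure(r)
# The closure relates i to every i + 2k within bounds (e.g. 1 -> 3, 5, 7),
# which the paper describes in closed form for all n at once.
```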
Optimally Synchronizing DOACROSS Loops on Shared Memory Multiprocessors
 In Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques
, 1997
"... We present two algorithms to minimize the amount of synchronization added when parallelizing a loop with loopcarried dependences. In contrast to existing schemes, our algorithms add lesser synchronization, while preserving the parallelism that can be extracted from the loop. Our first algorithm us ..."
Abstract

Cited by 2 (1 self)
We present two algorithms to minimize the amount of synchronization added when parallelizing a loop with loop-carried dependences. In contrast to existing schemes, our algorithms add less synchronization, while preserving the parallelism that can be extracted from the loop. Our first algorithm uses an interval graph representation of the dependence "overlap" to find a synchronization placement in time almost linear in the number of dependences. Although this solution may be suboptimal, it is still better than that obtained using existing methods, which first eliminate redundant dependences and then synchronize the remaining ones. Determining the optimal synchronization is an NP-complete problem. Our second algorithm therefore uses integer programming to determine the optimal solution. We first use a polynomial-time algorithm to find a minimal search space that must contain the optimal solution. Then, we formulate the problem of choosing the minimal synchronization from the search ...
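A DOACROSS loop with the classic post/wait synchronization this line of work seeks to minimize might be hand-coded as follows (illustrative only; `doacross` and its per-iteration events are this sketch's, and an optimized scheme would synchronize far less often):

```python
import threading

def doacross(a, num_threads=2):
    """DOACROSS execution of `a[i] = a[i-1] + 1`: iteration i waits
    until iteration i-1 has posted, serialising only the dependent
    statement while iterations are spread cyclically over threads."""
    n = len(a)
    done = [threading.Event() for _ in range(n)]
    start = threading.Event()
    start.set()  # pseudo-event for the non-existent iteration -1

    def worker(tid):
        for i in range(tid, n, num_threads):   # cyclic schedule
            (done[i - 1] if i > 0 else start).wait()   # wait(i-1)
            if i > 0:
                a[i] = a[i - 1] + 1
            done[i].set()                              # post(i)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a
```

Here every dependence gets its own post/wait pair; the abstract's point is that many of these are redundant and can be removed or merged without losing parallelism.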
Global Instruction Scheduling for Multi-Threaded Architectures
, 2008
"... Recently, the microprocessor industry has moved toward multicore or chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and microarchitectural limitations. Despite this move, CMPs do not directly improve the performance of singlethre ..."
Abstract

Cited by 2 (0 self)
Recently, the microprocessor industry has moved toward multicore or chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and microarchitectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In effect, the move to CMPs has shifted even more of the task of improving performance from the hardware to the software. Since developing parallel applications has long been recognized as significantly harder than developing sequential ones, it is very desirable to have automatic tools to extract thread-level parallelism (TLP) from sequential applications. Unfortunately, automatic parallelization has only been successful in the restricted domains of scientific and data-parallel applications, which usually have regular array-based memory accesses and little control flow. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with lightweight, fine-grained (scalar) communication mechanisms. Despite such support, most existing multithreading compilation techniques have ...
Increasing Parallelism of Loops with the Loop Distribution Technique
, 1995
"... In a loop, the parallelism is bad when the statements in the loop body are involved in a datadependence cycle. How to break datadependence cycles is the key point for increasing the parallelism of loop execution. In this paper, we consider the datadependence relation in the viewpoint of statements ..."
Abstract

Cited by 1 (0 self)
In a loop, parallelism is poor when the statements in the loop body are involved in a data-dependence cycle. How to break data-dependence cycles is the key point for increasing the parallelism of loop execution. In this paper, we consider the data-dependence relation from the viewpoint of statements. We propose two new methods, the modified index shift method and the statement substitution-shift method. They have better parallelism and performance than the index shift method in general. The modified index shift method is obtained by modifying the index shift method and combining it with the loop distribution method. The statement substitution-shift method is obtained by combining the statement substitution method, the index shift method, and the unimodular transformation method with the loop distribution method. Moreover, a topological sort can be applied to determine the parallel execution order of statements.
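The basic loop distribution transformation underlying both proposed methods can be illustrated on a two-statement loop (a hypothetical example; the paper's methods additionally break dependence cycles, which plain distribution cannot):

```python
def fused(b, n):
    """Original loop: S2 reads a value S1 produced in an earlier
    iteration, so the fused loop carries a dependence across
    iterations and cannot run them in parallel."""
    a = [0] * n
    c = [0] * n
    for i in range(1, n):
        a[i] = b[i] + 1          # S1
        c[i] = a[i - 1] * 2      # S2: loop-carried dependence on S1
    return a, c

def distributed(b, n):
    """After loop distribution, neither loop carries a dependence:
    each loop's iterations could execute fully in parallel (DOALL),
    with the S1->S2 dependence satisfied between the two loops."""
    a = [0] * n
    c = [0] * n
    for i in range(1, n):        # loop over S1 only
        a[i] = b[i] + 1
    for i in range(1, n):        # loop over S2 only
        c[i] = a[i - 1] * 2
    return a, c
```

Distribution is legal here because the dependence goes only from S1 to S2; when the statements form a cycle, extra transformations such as the index shift are needed first.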
Profiling Dependence Vectors for Loop Parallelization
, 1995
"... A dependence relation between two data references is linear if it generates dependence vectors that are linear functions of the loop indices. A linear dependence relation often induces a large number of dependence vectors. Empirical studies also show that linear dependencies often intermix with unif ..."
Abstract

Cited by 1 (0 self)
A dependence relation between two data references is linear if it generates dependence vectors that are linear functions of the loop indices. A linear dependence relation often induces a large number of dependence vectors. Empirical studies also show that linear dependences often intermix with uniform dependences in loops [5, 6]. These factors make it difficult to analyze such loops and extract the inherent parallelism. In this paper, we propose to manipulate such dependences in the dependence vector space and summarize the large number of dependence vectors with their convex hull. The convex hull, as a profile of the dependence vectors, can be used to deduce many important properties of the vectors. We will show how to find the convex hull and then apply it to loop parallelization. The proposed approach will be compared with other schemes. 1. Introduction According to an empirical study of scientific and engineering programs [5], 44% of two-dimensional array references have coupled ...
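Summarizing a set of 2-D dependence vectors by their convex hull can be sketched with a standard hull algorithm (Andrew's monotone chain here; the vectors below are made up for illustration and are not the paper's data):

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o):
    positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain: return the extreme points of a 2-D
    point set in counter-clockwise order, dropping interior points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Hypothetical dependence vectors: a linear dependence mixed with a
# uniform one. The hull keeps only the extreme vectors as a profile.
vectors = [(1, 0), (1, 1), (2, 1), (3, 1), (2, 2), (4, 2)]
hull = convex_hull(vectors)
```

Any property that holds for every hull vertex (e.g. lexicographic positivity) then holds for all summarized vectors, which is what makes the hull a useful profile.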
Barrier Synchronisation Optimisation
"... This paper describes a new compiler algorithm to reduce the number of barrier synchronisations in parallelised programs. A preliminary technique to rapidly determine critical data dependences is developed. This forms the basis of the Fast First Sink (FFS) algorithm which places, provably, the mi ..."
Abstract
This paper describes a new compiler algorithm to reduce the number of barrier synchronisations in parallelised programs. A preliminary technique to rapidly determine critical data dependences is developed. This forms the basis of the Fast First Sink (FFS) algorithm, which provably places the minimal number of barriers, in polynomial time, for codes with a regular structure. This algorithm is implemented in a prototype compiler and applied to three well-known benchmarks. Preliminary results show that it outperforms an existing state-of-the-art commercial compiler.
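The flavour of minimal barrier placement can be seen by treating each cross-thread dependence as an interval over program points and covering all intervals with as few barriers as possible (a greedy sketch of the first-sink idea only; `minimal_barriers` is not the published FFS algorithm, which works on dependence graphs of regular codes):

```python
def minimal_barriers(deps):
    """Greedy 'first sink' placement: each dependence is a pair
    (source point, sink point); a barrier at point b covers every
    dependence with source < b <= sink. Scanning dependences in
    sink order and placing a barrier at the earliest uncovered sink
    yields a minimum-size placement for this interval model."""
    barriers = []
    for src, sink in sorted(deps, key=lambda d: d[1]):
        if not barriers or barriers[-1] <= src:
            barriers.append(sink)  # place barrier at the first sink
    return barriers

# Hypothetical dependences between program points:
deps = [(1, 4), (2, 5), (3, 8), (6, 9)]
# One barrier at point 4 covers the first three; one at 9 covers the rest.
```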