Automatic thread extraction with decoupled software pipelining
- In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture
, 2005
Cited by 59 (10 self)
{ottoni, ram, astoler, august}@princeton.edu Abstract: Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.
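The pipeline idea in this abstract can be pictured with a small hand-written sketch. It is only an illustration under stated assumptions, not the paper's compiler algorithm: the function name `pipelined_sum_of_squares`, the queue size, and the choice of workload are all hypothetical. One thread runs the traversal part of a loop, a second thread runs the per-item work, and values flow in one direction through a bounded queue, so the first stage can run ahead of the second.

```python
import queue
import threading


def pipelined_sum_of_squares(values, queue_size=8):
    """Run one sequential loop as two decoupled pipeline stages (illustrative only)."""
    channel = queue.Queue(maxsize=queue_size)  # bounded one-way channel between stages
    done = object()                            # sentinel marking the end of the stream
    result = []

    def traverse():
        # Stage 1: walks the input and forwards each item; never waits on stage 2
        # except when the bounded queue is full.
        for v in values:
            channel.put(v)
        channel.put(done)

    def compute():
        # Stage 2: consumes items in order and does the per-item work.
        total = 0
        while True:
            v = channel.get()
            if v is done:
                break
            total += v * v
        result.append(total)

    threads = [threading.Thread(target=traverse), threading.Thread(target=compute)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

Because dependences cross the queue in one direction only, neither stage needs speculation. In CPython the global interpreter lock means this sketch shows the structure of the decoupling rather than any speedup.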
Transitive Closure of Infinite Graphs and its Applications
, 1995
Cited by 18 (4 self)
Integer tuple relations can concisely summarize many types of information gathered from analysis of scientific codes. For example, they can be used to precisely describe which iterations of a statement are data dependent on which other iterations. It is generally not possible to represent these tuple relations by enumerating the related pairs of tuples. For example, it is impossible to enumerate the related pairs of tuples in the relation $\{[i] \to [i+2] \mid 1 \le i \le n-2\}$. Even when it is possible to enumerate the related pairs of tuples, such as for the relation $\{[i,j] \to [i',j'] \mid 1 \le i, j, i', j' \le 100\}$, it is often not practical to do so. We instead use a closed form description by specifying a predicate consisting of affine constraints on the related pairs of tuples. As we just saw, these affine constraints can be parameterized, so what we are really describing are infinite families of relations (or graphs). Many of our applications of tuple relations rely heavily ...
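To make the first example relation concrete, the sketch below fixes a numeric value of n (which is exactly what the symbolic, parameterized form avoids), enumerates the pairs, and computes the transitive closure by brute force. The closed form it is compared against, pairs [i] to [j] with j - i positive and even and 1 <= i, j <= n, was worked out for this illustration and is not taken from the paper; all function names are hypothetical.

```python
def relation(n):
    """Pairs of { [i] -> [i+2] | 1 <= i <= n-2 } for one concrete n."""
    return {(i, i + 2) for i in range(1, n - 1)}


def transitive_closure(pairs):
    """Brute-force closure: keep composing with the base relation until nothing new appears."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in pairs if b == c}
        if new <= closure:
            return closure
        closure |= new


def closed_form(n):
    """Assumed closed form: { [i] -> [j] | j - i is positive and even, 1 <= i, j <= n }."""
    return {(i, j) for i in range(1, n + 1) for j in range(i + 2, n + 1, 2)}
```

Enumeration works here only because n is a fixed number; for symbolic n the set of pairs has no finite listing, which is the motivation for describing the relation by affine constraints.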
Optimally Synchronizing DOACROSS Loops on Shared Memory Multiprocessors
- In Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques
, 1997
Cited by 2 (1 self)
We present two algorithms to minimize the amount of synchronization added when parallelizing a loop with loop-carried dependences. In contrast to existing schemes, our algorithms add less synchronization, while preserving the parallelism that can be extracted from the loop. Our first algorithm uses an interval graph representation of the dependence "overlap" to find a synchronization placement in time almost linear in the number of dependences. Although this solution may be suboptimal, it is still better than that obtained using existing methods, which first eliminate redundant dependences and then synchronize the remaining ones. Determining the optimal synchronization is an NP-complete problem. Our second algorithm therefore uses integer programming to determine the optimal solution. We first use a polynomial-time algorithm to find a minimal search space that must contain the optimal solution. Then, we formulate the problem of choosing the minimal synchronization from the search ...
Increasing Parallelism of Loops with the Loop Distribution Technique
, 1995
Cited by 1 (0 self)
In a loop, parallelism is poor when the statements in the loop body are involved in a data-dependence cycle. Breaking data-dependence cycles is therefore the key to increasing the parallelism of loop execution. In this paper, we consider the data-dependence relation from the viewpoint of statements. We propose two new methods, the modified index shift method and the statement substitution-shift method. In general, they offer better parallelism and performance than the index shift method. The modified index shift method is obtained by modifying the index shift method and combining it with the loop distribution method. The statement substitution-shift method is obtained by combining the statement substitution method, the index shift method, and the unimodular transformation method with the loop distribution method. Moreover, a topological sort can be applied to determine the parallel execution order of statements.
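As background for the loop distribution method that both proposed methods build on, here is a minimal sketch of plain loop distribution. It does not implement the modified index shift or statement substitution-shift methods; the two statements and the function names are made up for illustration. The first statement carries a dependence from one iteration to the next, while the second only reads the value produced in the same iteration, so splitting the loop leaves a second loop whose iterations are independent.

```python
def fused(b):
    """Original loop: a recurrence (S1) and a dependent statement (S2) in one body."""
    n = len(b)
    a = [0] * n
    c = [0] * n
    for i in range(n):
        a[i] = (a[i - 1] if i > 0 else 0) + b[i]  # S1: depends on the previous iteration
        c[i] = 2 * a[i]                            # S2: depends only on S1 in this iteration
    return a, c


def distributed(b):
    """After loop distribution: S1 and S2 each get their own loop."""
    n = len(b)
    a = [0] * n
    for i in range(n):                             # loop 1: still sequential (recurrence)
        a[i] = (a[i - 1] if i > 0 else 0) + b[i]
    c = [2 * x for x in a]                         # loop 2: iterations are independent
    return a, c
```

The distribution is legal here because the only dependence between the statements runs from S1 to S2, so running all of S1 before any of S2 preserves every dependence.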
Global Instruction Scheduling for Multi-Threaded Architectures
, 2008
Cited by 1 (0 self)
Recently, the microprocessor industry has moved toward multi-core or chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In effect, the move to CMPs has shifted even more the task of improving performance from the hardware to the software. Since developing parallel applications has long been recognized as significantly harder than developing sequential ones, it is very desirable to have automatic tools to extract thread-level parallelism (TLP) from sequential applications. Unfortunately, automatic parallelization has only been successful in the restricted domains of scientific and data-parallel applications, which usually have regular array-based memory accesses and little control flow. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with light-weight, fine-grained (scalar) communication mechanisms. Despite such support, most existing multi-threading compilation techniques have
Barrier Synchronisation Optimisation
This paper describes a new compiler algorithm to reduce the number of barrier synchronisations in parallelised programs. A preliminary technique to rapidly determine critical data dependences is developed. This forms the basis of the Fast First Sink (FFS) algorithm, which provably places the minimal number of barriers in polynomial time for codes with a regular structure. This algorithm is implemented in a prototype compiler and applied to three well-known benchmarks. Preliminary results show that it outperforms an existing state-of-the-art commercial compiler.

