Results 1 - 10
of
36
Pointer Analysis for Multithreaded Programs
- ACM SIGPLAN 99
, 1999
"... This paper presents a novel interprocedural, flow-sensitive, and context-sensitive pointer analysis algorithm for multithreaded programs that may concurrently update shared pointers. For each pointer and each program point, the algorithm computes a conservative approximation of the memory locations ..."
Abstract
-
Cited by 125 (13 self)
- Add to MetaCart
This paper presents a novel interprocedural, flow-sensitive, and context-sensitive pointer analysis algorithm for multithreaded programs that may concurrently update shared pointers. For each pointer and each program point, the algorithm computes a conservative approximation of the memory locations to which that pointer may point. The algorithm correctly handles a full range of constructs in multithreaded programs, including recursive functions, function pointers, structures, arrays, nested structures and arrays, pointer arithmetic, casts between pointer variables of different types, heap and stack allocated memory, shared global variables, and thread-private global variables. We have implemented the algorithm in the SUIF compiler system and used the implementation to analyze a sizable set of multithreaded programs written in the Cilk multithreaded programming language. Our experimental results show that the analysis has good precision and converges quickly for our set of Cilk programs.
Commutativity Analysis: A New Analysis Technique for Parallelizing Compilers
- ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS
, 1997
"... This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granula ..."
Abstract
-
Cited by 61 (7 self)
- Add to MetaCart
This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis technique
Parallelization in Calculational Forms
- In 25th ACM Symposium on Principles of Programming Languages
, 1998
"... The problems involved in developing efficient parallel programs have proved harder than those in developing efficient sequential ones, both for programmers and for compilers. Although program calculation has been found to be a promising way to solve these problems in the sequential world, we believe ..."
Abstract
-
Cited by 28 (21 self)
- Add to MetaCart
The problems involved in developing efficient parallel programs have proved harder than those in developing efficient sequential ones, both for programmers and for compilers. Although program calculation has been found to be a promising way to solve these problems in the sequential world, we believe that it needs much more effort to study its effective use in the parallel world. In this paper, we propose a calculational framework for the derivation of efficient parallel programs with two main innovations: - We propose a novel inductive synthesis lemma based on which an elementary but powerful parallelization theorem is developed. - We make the first attempt to construct a calculational algorithm for parallelization, deriving associative operators from data type definition and making full use of existing fusion and tupling calculations. Being more constructive, our method is not only helpful in the design of efficient parallel programs in general but also promising in the construc...
Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism
- Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA
, 1998
"... The paper envisions an extension to a standard instruction set which efficiently implements PRAM algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms throu ..."
Abstract
-
Cited by 24 (11 self)
- Add to MetaCart
The paper envisions an extension to a standard instruction set which efficiently implements PRAM algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation is introduced; new elements are added where needed. The more detailed presentation is by way of a bridging model. Among other things, a bridging model provides a design space for algorithm designers and programmers, as well as a design space for computer architects. It is convenient to describe our wider vision regarding "parallel-computing-on-a-chip" as a two-stage development and therefore two bridging models are presented: Spawn-based multi-threading (Spawn-MT) and Elastic multi-threading (EMT). The case for Spawn-MT (or, alternatively, EMT) as a bridging model relies on the following evidence. (1) Spawn-MT comprises an "instruction set level", wh...
Compiler Optimization of Implicit Reductions for Distributed Memory Multiprocessors
, 1998
"... This paper presents reduction recognition and parallel code generation strategies for distributed-memory multiprocessors. We describe techniques to recognize a broad range of implicit reduction operations, including those involving statements at multiple loop nesting levels and intermixed with condi ..."
Abstract
-
Cited by 19 (2 self)
- Add to MetaCart
This paper presents reduction recognition and parallel code generation strategies for distributed-memory multiprocessors. We describe techniques to recognize a broad range of implicit reduction operations, including those involving statements at multiple loop nesting levels and intermixed with conditional control flow. We introduce two new optimizations: factoring which increases data locality for SUM and PRODUCT reductions, and index encoding which enables a single global communication to accomplish both an extreme value reduction and an extreme value location reduction. We have implemented these techniques in the dHPF compiler for High Performance Fortran (HPF). We evaluate their effectiveness experimentally by compiling several reduction benchmarks with dHPF and two commercial HPF compilers, and comparing the performance of the generated code on an IBM SP2. Our results show that our recognition techniques are more powerful and that our index encoding and factoring optimizations can ...
Parallelization via Context Preservation
- In IEEE Intl Conference on Computer Languages
, 1998
"... Abstract program schemes, such as scan or homomorphism, can capture a wide range of data parallel programs. While versatile, these schemes are of limited practical use on their own. A key problem is that the more natural sequential specifications may not have associative combine operators required b ..."
Abstract
-
Cited by 17 (16 self)
- Add to MetaCart
Abstract program schemes, such as scan or homomorphism, can capture a wide range of data parallel programs. While versatile, these schemes are of limited practical use on their own. A key problem is that the more natural sequential specifications may not have associative combine operators required by these schemes. As a result, they often fail to be immediately identified. To resolve this problem, we propose a method to systematically derive parallel programs from sequential definitions. This method is special in that it can automatically invent auxiliary functions needed by associative combine operators. Apart from a formalisation, we also provide new theorems, based on the notion of context preservation, to guarantee parallelization for a precise class of sequential programs. 1 Introduction It is well-recognised that a key problem of parallel computing remains the development of efficient and correct parallel software. This task is further complicated by the variety of parallel arc...
Theory, Techniques, And Experiments In Solving Recurrences In Computer Programs
, 1997
"... ... work. In the sixth chapter, we consider the application of these same techniques focused on obtaining parallelism in outer time-stepping loops. In the final chapter, we draw this work to a conclusion and discuss future directions in parallelizing compiler technology. ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
... work. In the sixth chapter, we consider the application of these same techniques focused on obtaining parallelism in outer time-stepping loops. In the final chapter, we draw this work to a conclusion and discuss future directions in parallelizing compiler technology.
The Role of Associativity and Commutativity in the Detection and Transformation of Loop-Level Parallelism
"... The study of theoretical and practical issues in automatic parallelization across application and language boundaries is an appropriate and timely task. In this paper, we discuss theory and techniques that wehave determined useful in parallelizing recurrences and reductions in computer programs. We ..."
Abstract
-
Cited by 13 (4 self)
- Add to MetaCart
The study of theoretical and practical issues in automatic parallelization across application and language boundaries is an appropriate and timely task. In this paper, we discuss theory and techniques that wehave determined useful in parallelizing recurrences and reductions in computer programs. We present a framework for understanding such parallelism based on an approachwhich models loop bodies as coalescing loop operators. Within this framework we distinguish between associative coalescing loop operators and associative and commutative coalescing loop operators. We present the result of the application of this theory in a case study of a modern C++ semantic retrieval application drawn from the digital library field.
Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication
- IN PROCEEDINGS OF 1999 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS '99
, 1999
"... This paper presents a technique, adaptive replication, for automatically eliminating synchronization bottlenecks in multithreaded programs that perform atomic operations on objects. Synchronization bottlenecks occur when multiple threads attempt to concurrently update the same object. It is often po ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
This paper presents a technique, adaptive replication, for automatically eliminating synchronization bottlenecks in multithreaded programs that perform atomic operations on objects. Synchronization bottlenecks occur when multiple threads attempt to concurrently update the same object. It is often possible to eliminate synchronization bottlenecks by replicating objects. Each thread can then update its own local replica without synchronization and without interacting with other threads. When the computation needs to access the original object, it combines the replicas to produce the correct values in the original object. One potential problem is that eagerly replicating all objects may lead to performance degradation and excessive memory consumption.
High-level language support for user-defined reductions
- JOURNAL OF SUPERCOMPUTING
, 2002
"... The optimized handling of reductions on parallel supercomputers or clusters of workstations is critical to high performance because reductions are common in scientific codes and a potential source of bottlenecks. Yet in many high-level languages, a mechanism for writing efficient reductions remain ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
The optimized handling of reductions on parallel supercomputers or clusters of workstations is critical to high performance because reductions are common in scientific codes and a potential source of bottlenecks. Yet in many high-level languages, a mechanism for writing efficient reductions remains surprisingly absent. Further, when such mechanisms do exist, they often do not provide the flexibility a programmer needs to achieve a desirable level of performance. In this paper, we present a new language construct for arbitrary reductions that lets a programmer achieve a level of performance equal to that achievable with the highly flexible, but low-level combination of Fortran and MPI. We have implemented this construct in the ZPL language and evaluate it in the context of the initialization of the NAS MG benchmark. We show a 45 times speedup over the same code written in ZPL without this construct. In addition, performance on a large number of processors surpasses that achieved in the NAS implementation showing that our mechanism provides programmers with the needed flexibility.

