Results 1 - 10 of 52
Maximizing Multiprocessor Performance with the SUIF Compiler
1996
Cited by 229 (22 self)
This paper presents an overview of the SUIF compiler, which automatically parallelizes and optimizes sequential programs for shared-memory multiprocessors. We describe new technology in this system for locating coarse-grain parallelism and for optimizing multiprocessor memory behavior essential to obtaining good multiprocessor performance. These techniques have a significant impact on the performance of half of the NAS and SPECfp95 benchmark suites. In particular, we achieve the highest SPECfp95 ratio to date of 63.9 on an eight-processor 440MHz Digital AlphaServer.
1 Introduction
Affordable shared-memory multiprocessors can potentially deliver supercomputer-like performance to the general public. Today, these machines are mainly used in a multiprogramming mode, increasing system throughput by running several independent applications in parallel. The multiple processors can also be used together to accelerate the execution of single applications. Automatic parallelization is a promis...
Pointer Analysis for Multithreaded Programs
ACM SIGPLAN 99, 1999
Cited by 125 (13 self)
This paper presents a novel interprocedural, flow-sensitive, and context-sensitive pointer analysis algorithm for multithreaded programs that may concurrently update shared pointers. For each pointer and each program point, the algorithm computes a conservative approximation of the memory locations to which that pointer may point. The algorithm correctly handles a full range of constructs in multithreaded programs, including recursive functions, function pointers, structures, arrays, nested structures and arrays, pointer arithmetic, casts between pointer variables of different types, heap and stack allocated memory, shared global variables, and thread-private global variables. We have implemented the algorithm in the SUIF compiler system and used the implementation to analyze a sizable set of multithreaded programs written in the Cilk multithreaded programming language. Our experimental results show that the analysis has good precision and converges quickly for our set of Cilk programs.
Symbolic Bounds Analysis of Pointers, Array Indices, and Accessed Memory Regions
PLDI 2000, 2000
Cited by 100 (14 self)
This paper presents a novel framework for the symbolic bounds analysis of pointers, array indices, and accessed memory regions. Our framework formulates each analysis problem as a system of inequality constraints between symbolic bound polynomials. It then reduces the constraint system to a linear program. The solution to the linear program provides symbolic lower and upper bounds for the values of pointer and array index variables and for the regions of memory that each statement and procedure accesses. This approach eliminates fundamental problems associated with applying standard fixed-point approaches to symbolic analysis problems. Experimental results from our implemented compiler show that the analysis can solve several important problems, including static race detection, automatic parallelization, static detection of array bounds violations, elimination of array bounds checks, and reduction of the number of bits used to store computed values.
Commutativity Analysis: A New Analysis Technique for Parallelizing Compilers
ACM Transactions on Programming Languages and Systems, 1997
Cited by 62 (7 self)
This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis technique.
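The notion of commuting operations in the abstract above can be illustrated with a small sketch. This is not the paper's static analysis: it is a brute-force run-time check over all execution orders, and the `Accumulator` and `Recorder` classes and the `operations_commute` helper are hypothetical names introduced only for illustration.

```python
from itertools import permutations


class Accumulator:
    """Hypothetical object whose only operation adds a value to a running total."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value


class Recorder:
    """Hypothetical object whose operation appends to a list, so order matters."""

    def __init__(self):
        self.items = []

    def add(self, value):
        self.items.append(value)


def operations_commute(make_object, values, observe):
    """Return True if every execution order of the operations gives the same final state."""
    final_states = set()
    for order in permutations(values):
        obj = make_object()
        for value in order:
            obj.add(value)
        final_states.add(observe(obj))
    return len(final_states) == 1


# Integer additions commute; list appends do not.
accumulator_commutes = operations_commute(Accumulator, [1, 2, 3], lambda o: o.total)
recorder_commutes = operations_commute(Recorder, [1, 2, 3], lambda o: tuple(o.items))
```

Under these assumptions, the `Accumulator` operations could run in any order (and so in parallel, given atomic updates), while the `Recorder` operations could not.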
Compiler-directed page coloring for multiprocessors
In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), 1996
Cited by 55 (9 self)
This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This technique uses the compiler’s knowledge of the access patterns of the parallelized applications to direct the operating system’s virtual memory page mapping strategy. We demonstrate that this technique can lead to significant performance improvements over two commonly used page mapping strategies for machines with either direct-mapped or two-way set-associative caches. We also show that it is complementary to latency-hiding techniques such as prefetching. We implemented compiler-directed page coloring in the SUIF parallelizing compiler and on two commercial operating systems. We applied the technique to the SPEC95fp benchmark suite, a representative set of numeric programs. We used the SimOS machine simulator to analyze the applications and isolate their performance bottlenecks. We also validated these results on a real machine, an eight-processor 350MHz Digital AlphaServer. Compiler-directed page coloring leads to significant performance improvements for several applications. Overall, our technique improves the SPEC95fp rating for eight processors by 8% over Digital UNIX’s page mapping policy and by 20% over page coloring, a standard page mapping policy. The SUIF compiler achieves a SPEC95fp ratio of 57.4, the highest ratio to date.
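As a rough illustration of why page mapping affects conflict misses, the sketch below counts how many frequently accessed pages land on the same cache color. The abstract does not describe the paper's actual mapping algorithm, so the round-robin assignment, the cache parameters, and the list of hot pages are all assumptions made for this example.

```python
from collections import Counter


def num_colors(cache_bytes, page_bytes, associativity):
    """Number of page colors in a physically indexed cache."""
    return cache_bytes // (page_bytes * associativity)


def max_pages_per_color(pages, colors):
    """Largest number of the given pages that share one color (page number modulo colors)."""
    counts = Counter(page % colors for page in pages)
    return max(counts.values())


def assign_round_robin(hot_pages, colors):
    """Hypothetical hint-driven mapping: consecutive hot pages get consecutive colors."""
    return {page: index % colors for index, page in enumerate(hot_pages)}


# Assumed machine: 1 MB direct-mapped cache with 8 KB pages, giving 128 colors.
colors = num_colors(cache_bytes=1 << 20, page_bytes=8 << 10, associativity=1)

# Assumed access pattern: four hot virtual pages spaced exactly one cache size apart.
hot_pages = [0, 128, 256, 384]

# Coloring by page number puts all four pages on color 0.
naive_collisions = max_pages_per_color(hot_pages, colors)

# The hinted mapping spreads them over four different colors.
hinted = assign_round_robin(hot_pages, colors)
hinted_collisions = max(Counter(hinted.values()).values())
```

In a direct-mapped cache, pages that share a color compete for the same cache lines, so the second mapping avoids the conflicts that the first one creates for this access pattern.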
SUIF Explorer: an interactive and interprocedural parallelizer
1999
Cited by 55 (5 self)
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.
1. Introduction
Exploiting coarse-grain parallelism i...
Automatic Parallelization of Divide and Conquer Algorithms
In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1999
Cited by 47 (7 self)
Divide and conquer algorithms are a good match for modern parallel machines: they tend to have large amounts of inherent parallelism and they work well with caches and deep memory hierarchies. But these algorithms pose challenging problems for parallelizing compilers. They are usually coded as recursive procedures and often use pointers into dynamically allocated memory blocks and pointer arithmetic. All of these features are incompatible with the analysis algorithms in traditional parallelizing compilers. This paper presents the design and implementation of a compiler that is designed to parallelize divide and conquer algorithms whose subproblems access disjoint regions of dynamically allocated arrays. The foundation of the compiler is a flow-sensitive, context-sensitive, and interprocedural pointer analysis algorithm. A range of symbolic analysis algorithms build on the pointer analysis information to extract symbolic bounds for the memory regions accessed by (potentially recursive) procedures that use pointers and pointer arithmetic. The symbolic bounds information allows the compiler to find procedure calls that can execute in parallel without violating the data dependences. The compiler generates code that executes these calls in parallel. We have used the compiler to parallelize several programs that use divide and conquer algorithms. Our results show that the programs perform well and exhibit good speedup.
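The core condition in this abstract, recursive calls that access disjoint regions of an array, can be sketched as follows. The paper's compiler establishes disjointness statically from symbolic bounds; the sketch instead checks concrete index ranges at run time, and `regions_disjoint` and `scale_in_place` are hypothetical names chosen for illustration.

```python
import threading


def regions_disjoint(first, second):
    """Half-open index ranges [lo, hi) are disjoint when one ends before the other starts."""
    return first[1] <= second[0] or second[1] <= first[0]


def scale_in_place(data, lo, hi, factor, cutoff=4):
    """Multiply data[lo:hi] by factor, recursing on two halves of the range."""
    if hi - lo <= cutoff:
        for i in range(lo, hi):
            data[i] *= factor
        return
    mid = (lo + hi) // 2
    # The two recursive calls touch [lo, mid) and [mid, hi), which never overlap,
    # so running them concurrently cannot violate a data dependence.
    if not regions_disjoint((lo, mid), (mid, hi)):
        raise ValueError("subproblems overlap; calls must run sequentially")
    worker = threading.Thread(target=scale_in_place, args=(data, lo, mid, factor, cutoff))
    worker.start()
    scale_in_place(data, mid, hi, factor, cutoff)
    worker.join()


values = list(range(16))
scale_in_place(values, 0, len(values), 2)
```

Threads are used here only to show the structure of the parallel calls; in CPython they would not be expected to yield a speedup for this kind of loop.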
Enhancing Software DSM for Compiler-Parallelized Applications
In Proceedings of the 11th International Parallel Processing Symposium, 1997
Cited by 42 (15 self)
Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed-shared-memory (DSM) systems. We demonstrate such a system by combining the SUIF parallelizing compiler and the CVM software DSM. Innovations of the system include compiler-directed techniques that: 1) combine synchronization and parallelism information communication on parallel task invocation, 2) employ customized routines for evaluating reduction operations, and 3) select a hybrid update protocol that pre-sends data by flushing updates at barriers. For applications with sufficient granularity of parallelism, these optimizations yield very good speedups on eight processors of an IBM SP-2 and DEC Alpha cluster, usually matching or exceeding the speedup of equivalent HPF and message-passing versions of each program. Based on our experimental ...
Techniques for Speculative Run-Time Parallelization of Loops
In Supercomputing ’98, 1998
Cited by 32 (1 self)
This paper presents a set of new run-time tests for speculative parallelization of loops that defy parallelization based on static analysis alone. It presents a novel method for speculative array privatization that is not only more efficient than previous methods when the speculation is correct, but also does not require rolling back the computation in case the variable is found not to be privatizable. We present another method for speculative parallelization which can overcome all loop-carried anti and output dependences, with even lower overhead than previous techniques which could not break such dependences. Again, in order to ameliorate the problem of paying a heavy penalty for speculatively parallelizing loops that turn out to be serial, we present a technique that enables early detection of loop-carried dependences. Our experimental results from a preliminary implementation of these tests on an IBM G30 SMP machine show a significant reduction in the penalty paid for mis-...
Improving Compiler and Run-Time Support for Adaptive Irregular Codes
In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 1998
Cited by 28 (8 self)
Irregular reductions form the core of adaptive irregular codes. On distributed-memory multiprocessors, they are parallelized either using sophisticated run-time systems (e.g., CHAOS, PILAR) or the shared-memory interface supported by software DSMs (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique based on the owner-computes rule which eliminates the need for buffers or synchronized writes but may replicate computation. We evaluate its performance for irregular codes while varying connectivity, locality, and adaptivity. LOCALWRITE improves performance by 50-150% compared to using replicated buffers, and can match or exceed gather/scatter for applications with low locality or high adaptivity.
1 Introduction
Scientists are beginning to exploit parallelism to provide the computing power they need for research and development. As they attempt to model more complex problems, irregular adaptive computations become increasingly important. The core of these applications is fre...
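The trade-off named in this abstract, no buffers or synchronized writes at the cost of replicated computation, can be seen in a minimal sketch of an owner-computes irregular reduction. It simulates the processors one after another in a single thread and is only an illustration of the rule, not the LOCALWRITE implementation; the function names, the edge list, and the ownership array are assumptions.

```python
def sequential_reduction(num_nodes, edges, weight):
    """Reference irregular reduction: each edge updates both of its endpoints."""
    result = [0.0] * num_nodes
    for a, b in edges:
        w = weight(a, b)
        result[a] += w
        result[b] += w
    return result


def owner_computes_reduction(num_nodes, edges, weight, owner, num_procs):
    """Each simulated processor writes only to the nodes it owns.

    An edge whose endpoints have different owners is evaluated by both owners,
    which replicates the computation of weight(a, b).
    """
    result = [0.0] * num_nodes
    evaluations = 0
    for proc in range(num_procs):  # processors are simulated one after another
        for a, b in edges:
            if owner[a] != proc and owner[b] != proc:
                continue  # this processor owns neither endpoint
            w = weight(a, b)
            evaluations += 1
            if owner[a] == proc:
                result[a] += w
            if owner[b] == proc:
                result[b] += w
    return result, evaluations


edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
owner = [0, 0, 1, 1]  # nodes 0 and 1 belong to processor 0, nodes 2 and 3 to processor 1


def edge_weight(a, b):
    """Hypothetical per-edge contribution."""
    return float(a + b + 1)


expected = sequential_reduction(4, edges, edge_weight)
actual, evaluations = owner_computes_reduction(4, edges, edge_weight, owner, 2)
```

Because no processor ever writes to a node it does not own, no write needs synchronization; the cost is that the two edges crossing the partition boundary are evaluated twice, so six edge evaluations are performed for four edges.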

