Results 1–10 of 83
Optimistic parallelism requires abstractions
In PLDI, 2007
Abstract

Cited by 165 (23 self)
Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and runtime speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications. In this paper, we describe two real-world irregular applications: a Delaunay mesh refinement application and a graphics application that performs agglomerative clustering. By studying the algorithms and data structures used in these applications, we show that there is substantial coarse-grain, data parallelism in these applications, but that this parallelism is very dependent on the input data and therefore cannot be uncovered by compiler analysis. In principle, optimistic techniques such as thread-level speculation can be used to uncover this parallelism, but we argue that current implementations cannot accomplish this because they do not use the proper abstractions for the data structures in these programs. These insights have informed our design of the Galois system, an object-based optimistic parallelization system for irregular applications. There are three main aspects to Galois: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by an optimistic computation. We show that Delaunay mesh generation and agglomerative clustering can be parallelized in a straightforward way using the Galois approach, and we present experimental measurements to show that this approach is practical.
These results suggest that Galois is a practical approach to exploiting data parallelism in irregular programs.
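The three Galois ingredients named in the abstract (set iterators, optimistic execution, conflict detection with rollback) can be caricatured in a few lines. The sketch below is a sequential simulation under invented names (`galois_for_each`, `ConflictError`), not the real Galois API; in the actual system iterations run concurrently, so conflicts arise between threads rather than within one.

```python
class ConflictError(Exception):
    """Raised when an iteration touches an element another iteration holds."""

def galois_for_each(worklist, body):
    """Apply `body(item, lock, undo)` to every item of an unordered worklist.
    A conflicting iteration replays its undo log and is retried later."""
    held = set()                    # abstract locks held by in-flight iterations
    pending = list(worklist)
    while pending:
        item = pending.pop()
        undo = []                   # inverse actions recorded by this iteration
        acquired = []
        def lock(node):
            if node in held:
                raise ConflictError(node)
            held.add(node)
            acquired.append(node)
        try:
            body(item, lock, undo)
        except ConflictError:
            for action in reversed(undo):   # roll back speculative side effects
                action()
            pending.insert(0, item)         # retry the item later
        finally:
            for node in acquired:           # release this iteration's locks
                held.discard(node)

# Toy use: each iteration locks the element it mutates and logs an inverse.
counts = {"a": 0, "b": 0}

def increment(item, lock, undo):
    lock(item)
    old = counts[item]
    undo.append(lambda item=item, old=old: counts.__setitem__(item, old))
    counts[item] = old + 1

galois_for_each(["a", "b", "a"], increment)
```

Sequentially the conflict path never fires; it exists to show where a parallel runtime would detect an unsafe access and recover.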
Pointer Analysis for Multithreaded Programs
In ACM SIGPLAN PLDI '99, 1999
Abstract

Cited by 151 (11 self)
This paper presents a novel interprocedural, flow-sensitive, and context-sensitive pointer analysis algorithm for multithreaded programs that may concurrently update shared pointers. For each pointer and each program point, the algorithm computes a conservative approximation of the memory locations to which that pointer may point. The algorithm correctly handles a full range of constructs in multithreaded programs, including recursive functions, function pointers, structures, arrays, nested structures and arrays, pointer arithmetic, casts between pointer variables of different types, heap and stack allocated memory, shared global variables, and thread-private global variables. We have implemented the algorithm in the SUIF compiler system and used the implementation to analyze a sizable set of multithreaded programs written in the Cilk multithreaded programming language. Our experimental results show that the analysis has good precision and converges quickly for our set of Cilk programs.
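A flow- and context-sensitive analysis is far too involved to sketch here, but the basic shape of a points-to computation can be shown with a drastically simplified, flow-insensitive (Andersen-style) fixpoint. The `andersen` function and its constraint encoding are this sketch's assumptions, not the paper's algorithm.

```python
def andersen(addr_of, copies):
    """Flow-insensitive points-to analysis over two constraint kinds:
    addr_of: (p, x) pairs for `p = &x`; copies: (p, q) pairs for `p = q`.
    Iterates set-inclusion constraints to a fixed point."""
    pts = {}
    for p, x in addr_of:                    # base constraints: pts(p) contains x
        pts.setdefault(p, set()).add(x)
    changed = True
    while changed:                          # propagate until no set grows
        changed = False
        for p, q in copies:                 # subset constraint: pts(p) >= pts(q)
            src = pts.get(q, set())
            dst = pts.setdefault(p, set())
            if not src <= dst:
                dst |= src
                changed = True
    return pts
```

A real analysis for multithreaded C also models `*p = q`, function pointers, and thread interference; this only shows the fixpoint skeleton.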
Symbolic Bounds Analysis of Pointers, Array Indices, and Accessed Memory Regions
In PLDI, 2000
Abstract

Cited by 132 (15 self)
 Add to MetaCart
(Show Context)
This paper presents a novel framework for the symbolic bounds analysis of pointers, array indices, and accessed memory regions. Our framework formulates each analysis problem as a system of inequality constraints between symbolic bound polynomials. It then reduces the constraint system to a linear program. The solution to the linear program provides symbolic lower and upper bounds for the values of pointer and array index variables and for the regions of memory that each statement and procedure accesses. This approach eliminates fundamental problems associated with applying standard fixed-point approaches to symbolic analysis problems. Experimental results from our implemented compiler show that the analysis can solve several important problems, including static race detection, automatic parallelization, static detection of array bounds violations, elimination of array bounds checks, and reduction of the number of bits used to store computed values.
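The paper works with symbolic bound polynomials reduced to a linear program; the numeric special case below only illustrates what a bound on the accessed region buys you for an affine access `a[c*i + d]`, e.g. for eliminating a bounds check. Both function names are invented for this sketch.

```python
def affine_access_region(c, d, lo, hi):
    """Index region touched by a[c*i + d] for i in [lo, hi] (inclusive).
    An affine function attains its extremes at the interval endpoints."""
    ends = (c * lo + d, c * hi + d)
    return min(ends), max(ends)

def bounds_check_removable(c, d, lo, hi, length):
    """The per-access bounds check can be dropped when the whole accessed
    region provably lies inside [0, length)."""
    r_lo, r_hi = affine_access_region(c, d, lo, hi)
    return 0 <= r_lo and r_hi < length
```

In the paper the loop bounds themselves are symbolic polynomials in program variables, so the same question becomes a linear-programming problem rather than two numeric comparisons.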
A type and effect system for Deterministic Parallel Java
In Proc. Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications, 2009
Abstract

Cited by 112 (14 self)
Today’s shared-memory parallel programming models are complex and error-prone. While many parallel programs are intended to be deterministic, unanticipated thread interleavings can lead to subtle bugs and nondeterministic semantics. In this paper, we demonstrate that a practical type and effect system can simplify parallel programming by guaranteeing deterministic semantics with modular, compile-time type checking even in a rich, concurrent object-oriented language such as Java. We describe an object-oriented type and effect system that provides several new capabilities over previous systems for expressing deterministic parallel algorithms. We also describe a language called Deterministic Parallel Java (DPJ) that incorporates the new type system features, and we show that a core subset of DPJ is sound. We describe an experimental validation showing that DPJ can express a wide range of realistic parallel programs; that the new type system features are useful for such programs; and that the parallel programs exhibit good performance gains (coming close to or beating equivalent, nondeterministic multithreaded programs where those are available).
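The core discipline of an effect system like DPJ's — two tasks may run in parallel only if their declared effects on memory regions do not interfere — can be sketched as a tiny check. The `("reads"|"writes", region)` encoding is this sketch's assumption, not DPJ syntax, and DPJ performs the check statically rather than at runtime.

```python
def commute(e1, e2):
    """Two effects are noninterfering unless they touch the same region
    and at least one of them writes."""
    kind1, r1 = e1
    kind2, r2 = e2
    if r1 != r2:
        return True
    return kind1 == "reads" and kind2 == "reads"

def tasks_noninterfering(effects_a, effects_b):
    """A cobegin of two tasks preserves determinism when every effect of
    one commutes with every effect of the other."""
    return all(commute(a, b) for a in effects_a for b in effects_b)
```

DPJ's contribution is making region names and effect summaries part of method types, so this check is modular and happens at compile time.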
Full functional verification of linked data structures
In ACM Conf. on Programming Language Design and Implementation (PLDI), 2008
Abstract

Cited by 100 (18 self)
We present the first verification of full functional correctness for a range of linked data structure implementations, including mutable lists, trees, graphs, and hash tables. Specifically, we present the use of the Jahob verification system to verify formal specifications, written in classical higher-order logic, that completely capture the desired behavior of the Java data structure implementations (with the exception of properties involving execution time and/or memory consumption). Given that the desired correctness properties include intractable constructs such as quantifiers, transitive closure, and lambda abstraction, it is a challenge to successfully prove the generated verification conditions. Our Jahob verification system uses integrated reasoning to split each verification condition into a conjunction of simpler subformulas, then apply a diverse collection of specialized decision procedures.
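Jahob proves such contracts statically; a runtime caricature of the same idea is to pair a linked-structure implementation with an abstract specification state and assert after every operation that the two agree. Class and field names below are invented for illustration.

```python
class SpecList:
    """Singly linked list (cons cells as tuples) whose operations are checked
    against an abstract set-of-elements specification. Jahob-style systems
    prove this agreement for all executions instead of checking one run."""
    def __init__(self):
        self.head = None
        self._spec = set()      # abstract state: the set of stored elements

    def add(self, x):
        self.head = (x, self.head)
        self._spec.add(x)
        # Postcondition: concrete contents realize the abstract set.
        assert self.contents() == self._spec

    def contents(self):
        """The abstraction function: walk the links, collect the elements."""
        out, node = set(), self.head
        while node is not None:
            out.add(node[0])
            node = node[1]
        return out
```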
Transactional boosting: A methodology for highly concurrent transactional objects
, 2007
Abstract

Cited by 80 (3 self)
We describe a methodology for transforming a large class of highly concurrent linearizable objects into highly concurrent transactional objects. As long as the linearizable implementation satisfies certain regularity properties (informally, that every method has an inverse), we define a simple wrapper for the linearizable implementation that guarantees that concurrent transactions without inherent conflicts can synchronize at the same granularity as the original linearizable implementation.
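The "every method has an inverse" requirement can be illustrated with a small wrapper that logs the inverse of each applied operation so an aborting transaction can roll the object back. `BoostedSet` is a sketch under that assumption; a real boosted object also needs abstract locks to keep concurrent transactions from conflicting at the semantic level.

```python
class BoostedSet:
    """Transactional wrapper over a linearizable set: each operation applied
    to the underlying object logs its inverse in the transaction's undo log."""
    def __init__(self):
        self._base = set()          # stands in for a linearizable set

    def begin(self):
        return []                   # a fresh per-transaction undo log

    def add(self, txn, x):
        if x not in self._base:     # only a successful add has work to undo
            self._base.add(x)
            txn.append(lambda: self._base.discard(x))   # inverse of add
        return True

    def remove(self, txn, x):
        if x in self._base:
            self._base.discard(x)
            txn.append(lambda: self._base.add(x))       # inverse of remove
        return True

    def abort(self, txn):
        """Roll back by replaying the logged inverses in reverse order."""
        for inverse in reversed(txn):
            inverse()
        txn.clear()
```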
Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture
In Supercomputing, 1999
Abstract

Cited by 59 (13 self)
Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, reducing the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to effectively support such computations at very high performance levels: (1) the mem...
Effective Fine-Grain Synchronization for Automatically Parallelized Programs Using Optimistic Synchronization Primitives
ACM Transactions on Computer Systems, 1999
Abstract

Cited by 40 (5 self)
This paper presents our experience using optimistic synchronization to implement fine-grain atomic operations in the context of a parallelizing compiler for irregular, object-based computations. Our experience shows that the synchronization requirements of these programs differ significantly from those of traditional parallel computations, which use loop nests to access dense matrices using affine access functions. In addition to coarse-grain barrier synchronization, our irregular computations require synchronization primitives that support efficient fine-grain atomic operations.
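Optimistic synchronization in the load-linked/store-conditional or compare-and-swap style boils down to a read-compute-retry loop: read without holding a lock, compute the new value, and commit only if nothing changed in between. The sketch below simulates the hardware compare-and-swap with a Python lock purely for illustration.

```python
import threading

class OptimisticCell:
    """A cell updated with an optimistic read/compute/compare-and-swap loop,
    in the spirit of fine-grain atomic operations built on LL/SC or CAS."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()   # stands in for the hardware CAS

    def _cas(self, expected, new):
        """Atomically: if the cell still holds `expected`, install `new`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def update(self, f):
        while True:                     # retry until the commit succeeds
            old = self._value           # optimistic read, no lock held
            if self._cas(old, f(old)):
                return
```

The point of the optimistic form is that the common, uncontended case commits without blocking; only a concurrent modification forces a retry.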
The Tao of Parallelism in Algorithms
In PLDI, 2011
Abstract

Cited by 36 (11 self)
For more than thirty years, the parallel programming community has used the dependence graph as the main abstraction for reasoning about and exploiting parallelism in “regular” algorithms that use dense arrays, such as finite differences and FFTs. In this paper, we argue that the dependence graph is not a suitable abstraction for algorithms in new application areas like machine learning and network analysis, in which the key data structures are “irregular” data structures like graphs, trees, and sets. To address the need for better abstractions, we introduce a data-centric formulation of algorithms called the operator formulation, in which an algorithm is expressed in terms of its action on data structures. This formulation is the basis for a structural analysis of algorithms that we call tao-analysis. Tao-analysis can be viewed as an abstraction of algorithms that distills out algorithmic properties.
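The operator formulation can be sketched as a worklist of active nodes to which an operator is repeatedly applied, each application possibly activating further nodes. The BFS example and all names below are this sketch's choices, not taken from the paper.

```python
from collections import deque

def run_worklist(sources, operator):
    """Data-centric execution: repeatedly apply `operator` to an active node;
    the operator returns the nodes it newly activates."""
    work = deque(sources)
    while work:
        work.extend(operator(work.popleft()))

def bfs_levels(graph, src):
    """BFS expressed in operator form: the operator relaxes one active node,
    labeling and activating its unvisited neighbors."""
    level = {src: 0}
    def op(n):
        newly = []
        for m in graph.get(n, []):
            if m not in level:
                level[m] = level[n] + 1
                newly.append(m)
        return newly
    run_worklist([src], op)
    return level
```

With a FIFO worklist this is level-order BFS; swapping the schedule (LIFO, priority) changes performance but, for an unordered algorithm, not the result — which is the kind of structural property tao-analysis exposes.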
A Comparison of Locality Transformations for Irregular Codes
, 2000
Abstract

Cited by 34 (8 self)
Researchers have proposed several data and computation transformations to improve locality in irregular scientific codes. We experimentally compare their performance and present GPART, a new technique based on hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighboring nodes, giving priority to nodes with high degree, and repeating a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. Experimental results show GPART matches the performance of more sophisticated partitioning algorithms to within 6%–8%, with a small fraction of the overhead. It is thus useful for optimizing programs whose running times are not known.
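A single clustering pass in the spirit of the description above — visit nodes in order of decreasing degree and greedily merge each unclustered node with one unclustered neighbor — might look like the following sketch. `gpart_pass` is an invented name, and the real algorithm repeats passes over the coarsened graph and restricts attention to edges between partitions.

```python
def gpart_pass(adj):
    """One greedy clustering pass over an adjacency-list graph.
    Returns a node -> cluster-id map; high-degree nodes choose first."""
    cluster = {}
    next_id = 0
    for node in sorted(adj, key=lambda n: -len(adj[n])):
        if node in cluster:
            continue
        cluster[node] = next_id
        for nbr in adj[node]:
            if nbr not in cluster:
                cluster[nbr] = next_id   # merge one neighbor into the cluster
                break
        next_id += 1
    return cluster
```

Clustering neighbors together is what improves locality: nodes placed in the same cluster can then be laid out adjacently in memory.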