Parallel Programming Must Be Deterministic by Default
Cited by 21 (4 self)
In today’s widely used parallel programming models, subtle programming errors can lead to unintended nondeterministic behavior and hard-to-catch bugs. In contrast, we argue for a parallel programming model that is deterministic by default: deterministic behavior is guaranteed unless the programmer explicitly uses nondeterministic constructs. This goal is particularly challenging for modern object-oriented languages with expressive use of reference aliasing and updates to shared mutable state. We propose a broad research agenda in support of this goal, and we describe some of our own work to further that agenda.
Piccolo: Building Fast, Distributed Programs with Partitioned Tables
Cited by 17 (0 self)
Piccolo is a new data-centric programming model for writing parallel in-memory applications in data centers. Unlike existing data-flow models, Piccolo allows computation running on different machines to share distributed, mutable state via a key-value table interface. Piccolo enables efficient application implementations. In particular, applications can specify locality policies to exploit the locality of shared state access, and Piccolo’s run-time automatically resolves write-write conflicts using user-defined accumulation functions. Using Piccolo, we have implemented applications for several problem domains, including the PageRank algorithm, k-means clustering and a distributed crawler. Experiments using 100 Amazon EC2 instances and a 12-machine cluster show Piccolo to be faster than existing data-flow models for many problems, while providing similar fault-tolerance guarantees and a convenient programming interface.
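The idea of resolving write-write conflicts with a user-defined accumulation function can be illustrated with a minimal single-process sketch. This is not Piccolo's API: the class AccumulatingTable, the function merge_worker_updates, and the sample keys are hypothetical names chosen for illustration, and the "workers" are simulated as plain lists of buffered updates.

```python
class AccumulatingTable:
    """Hypothetical key-value table (not Piccolo's real interface).

    Conflicting writes to the same key are combined with a
    user-supplied accumulation function instead of overwriting.
    """

    def __init__(self, accumulate):
        self._accumulate = accumulate
        self._data = {}

    def update(self, key, value):
        # First write stores the value; later writes are accumulated.
        if key in self._data:
            self._data[key] = self._accumulate(self._data[key], value)
        else:
            self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


def merge_worker_updates(table, buffered_updates):
    """Apply each simulated worker's buffered (key, value) updates."""
    for updates in buffered_updates:
        for key, value in updates:
            table.update(key, value)


# Two simulated workers both write to "page_a"; summation resolves the conflict.
ranks = AccumulatingTable(accumulate=lambda old, new: old + new)
worker_1 = [("page_a", 0.25), ("page_b", 0.5)]
worker_2 = [("page_a", 0.5)]
merge_worker_updates(ranks, [worker_1, worker_2])

# Applying the workers in the opposite order gives the same table contents,
# because the accumulator used here is commutative and associative.
ranks_reversed = AccumulatingTable(accumulate=lambda old, new: old + new)
merge_worker_updates(ranks_reversed, [worker_2, worker_1])
```

With a commutative and associative accumulator such as summation, the final value for a key does not depend on the order in which the updates arrive, which is what makes this kind of conflict resolution safe to apply automatically.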
Memory Models: A Case for Rethinking Parallel Languages and Hardware
- Communications of the ACM, 2010
Cited by 16 (3 self)
The era of parallel computing for the masses is here, but writing correct parallel programs remains far more difficult than writing sequential programs. Aside from a few domains, most parallel programs are written using a shared-memory approach. The memory model, which specifies the meaning of shared variables, is at the heart of this programming model. Unfortunately, it has involved a tradeoff between programmability and performance, and has arguably been one of the most challenging and contentious areas in both hardware architecture and programming language specification. Recent broad community-scale efforts have finally led to a convergence in this debate, with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings in current
A Common Substrate for Cluster Computing
Cited by 8 (2 self)
The success of MapReduce has sparked many efforts to design cluster computing frameworks. We argue that no single framework will be optimal for all applications, and that we should instead enable organizations to run multiple frameworks efficiently in the same cloud. Furthermore, to ease development of new frameworks, it is critical to identify common abstractions and modularize their architectures. To achieve these goals, we propose Nexus, a low-level substrate that provides isolation and efficient resource sharing across frameworks running on the same cluster, while giving each framework freedom to implement its own programming model and fully control the execution of its jobs. Nexus fosters innovation in the cloud by letting organizations run new frameworks alongside existing ones and by letting framework developers focus on specific applications rather than building one-size-fits-all frameworks.
Multifrontal multithreaded rank-revealing sparse QR factorization
Cited by 5 (2 self)
SuiteSparseQR is a sparse QR factorization package based on the multifrontal method. Within each frontal matrix, LAPACK and the multithreaded BLAS enable the method to obtain high performance on multicore architectures. Parallelism across different frontal matrices is handled with Intel’s Threading Building Blocks library. The symbolic analysis and ordering phase pre-eliminates singletons by permuting the input matrix into the form [R11 R12; 0 A22] where R11 is upper triangular with diagonal entries above a given tolerance. Next, the fill-reducing ordering, column elimination tree, and frontal matrix structures are found without requiring the formation of the pattern of A^T A. Rank detection is performed within each frontal matrix using Heath’s method, which does not require column pivoting. The resulting sparse QR factorization obtains a substantial fraction of the theoretical peak performance of a multicore computer.
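The singleton pre-elimination step can be pictured with a small dense sketch. This is an illustration only, not SuiteSparseQR's implementation: it assumes that a column singleton is a column with exactly one nonzero among the rows not yet eliminated, that this entry must exceed the tolerance in magnitude, and that a dense list-of-lists matrix is acceptable for demonstration. The names find_column_singletons and permute are hypothetical.

```python
def find_column_singletons(A, tol=0.0):
    """Return (row_order, col_order, k) for a dense list-of-lists matrix A.

    Permuting A by the returned orders gives [R11 R12; 0 A22], where R11
    is k-by-k and upper triangular with |diagonal entries| > tol.
    Simplified sketch; a sparse package would work on the pattern only.
    """
    m = len(A)
    n = len(A[0]) if m else 0
    live_rows = set(range(m))
    live_cols = set(range(n))
    row_order, col_order = [], []
    found = True
    while found:
        found = False
        for j in sorted(live_cols):
            # Rows still in play that hold a nonzero in column j.
            nonzero_rows = [i for i in sorted(live_rows) if A[i][j] != 0]
            if len(nonzero_rows) == 1 and abs(A[nonzero_rows[0]][j]) > tol:
                i = nonzero_rows[0]
                row_order.append(i)
                col_order.append(j)
                live_rows.remove(i)
                live_cols.remove(j)
                found = True
                break  # rescan: removing row i may create new singletons
    k = len(row_order)
    return row_order + sorted(live_rows), col_order + sorted(live_cols), k


def permute(A, row_order, col_order):
    """Apply the row and column orders to A."""
    return [[A[i][j] for j in col_order] for i in row_order]


A = [[1, 0, 1],
     [2, 5, 1],
     [1, 0, 1]]
rows, cols, k = find_column_singletons(A)
P = permute(A, rows, cols)
```

In this example only the middle column is a singleton, so k is 1 and the remaining 2-by-2 block plays the role of A22, which would then go on to the ordering and factorization phases.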
Lithe: Enabling Efficient Composition of Parallel Libraries
Cited by 4 (1 self)
For the software industry to take advantage of multicore processors, we must allow programmers to arbitrarily compose parallel libraries without sacrificing performance. We argue that high-level task or thread abstractions and a common global scheduler cannot provide effective library composition. Instead, the operating system should expose unvirtualized processing resources that can be shared cooperatively between parallel libraries within an application. In this paper, we describe a system that standardizes and facilitates the exchange of these unvirtualized processing resources between libraries.
Self-Replicating Objects for Multicore Platforms
- The 24th European Conference on Object-Oriented Programming (ECOOP 2010)
Cited by 3 (3 self)
The paper introduces Self-Replicating Objects (SROs), a new concurrent programming abstraction. An SRO is implemented and used much like an ordinary .NET object and can expose arbitrary user-defined APIs, but it is aggressive about automatically exploiting multicore CPUs. It does so by spontaneously and transparently partitioning its state into a set of replicas that handle method calls in parallel and automatically merging replicas before processing calls that cannot execute in the replicated state. Developers need not be concerned about protecting access to shared data; each replica is a monitor and has its own state. The runtime ensures proper synchronization, scheduling, decides when to split/merge, and can transparently migrate replicas to other processes to decrease contention. Compared to threads/locks or toolkits such as .NET Parallel Extensions, SROs offer a simpler, more versatile programming model while delivering comparable, and in some cases even higher performance.
Composing parallel software efficiently with Lithe
- In Proc. of the SIGPLAN 2010 Conference on Programming Language Design and Implementation (PLDI), 2010
Cited by 3 (2 self)
Applications composed of multiple parallel libraries perform poorly when those libraries interfere with one another by obliviously using the same physical cores, leading to destructive resource oversubscription. This paper presents the design and implementation of Lithe, a low-level substrate that provides the basic primitives and a standard interface for composing parallel codes efficiently. Lithe can be inserted underneath the runtimes of legacy parallel libraries to provide bolt-on composability without needing to change existing application code. Lithe can also serve as the foundation for building new parallel abstractions and libraries that automatically interoperate with one another. In this paper, we show that versions of Threading Building Blocks (TBB) and OpenMP perform competitively with their original implementations when ported to Lithe. Furthermore, for two applications composed of multiple parallel libraries, we show that leveraging our substrate outperforms their original, even expertly tuned, implementations.
Capturing and Composing Parallel Patterns with Intel CnC
Cited by 3 (1 self)
The most accessible and successful parallel tools today are those that ask programmers to write only isolated serial kernels, hiding parallelism behind a library interface. Examples include Google’s Map-Reduce [5], CUDA [13], and STAPL [12]. This encapsulation approach applies to a wide range of structured, well-understood algorithms, which we call parallel patterns. Today’s high-level systems tend to encapsulate only a single pattern. Thus we explore the use of Intel CnC as a single framework for capturing and composing multiple patterns.
OpenMP 3.0 Tasking Implementation in OpenUH
Cited by 2 (2 self)
As multicore technology dominates the processor market, new methodologies are being explored to exploit the parallelism inherent to these architectures, and shared memory programming models are gaining in popularity. The ratification of the OpenMP 3.0 API has provided compiler developers with another challenge as the multicore revolution reshapes the landscape in scientific computing. The introduction of explicit tasking in this latest revision of the de facto standard for shared memory programming introduces new capabilities for parallel programming. Tasking abilities in OpenMP now allow irregular applications with pointer-based data and recursive algorithms to be executed in parallel, as well as providing alternative parallelization techniques for traditional loop-centric codes. This paper outlines the implementation of OpenMP 3.0 tasking features in OpenUH, a branch of the Open64 compiler suite.

