Results 1 - 10
of
79
Obstruction-free synchronization: Double-ended queues as an example
- In preparation
, 2003
"... We introduce obstruction-freedom, a new nonblocking property for shared data structure implementations. This property is strong enough to avoid the problems associated with locks, but it is weaker than previous nonblocking properties—specifically lock-freedom and wait-freedom— allowing greater flexi ..."
Abstract
-
Cited by 150 (17 self)
- Add to MetaCart
We introduce obstruction-freedom, a new nonblocking property for shared data structure implementations. This property is strong enough to avoid the problems associated with locks, but it is weaker than previous nonblocking properties—specifically lock-freedom and wait-freedom— allowing greater flexibility in the design of efficient implementations. Obstruction-freedom admits substantially simpler implementations, and we believe that in practice it provides the benefits of wait-free and lock-free implementations. To illustrate the benefits of obstruction-freedom, we present two obstruction-free CAS-based implementations of double-ended queues (deques); the first is implemented on a linear array, the second on a circular array. To our knowledge, all previous nonblocking deque implementations are based on unrealistic assumptions about hardware support for synchronization, have restricted functionality, or have operations that interfere with operations at the opposite end of the deque even when the deque has many elements in it. Our obstruction-free implementations have none of these drawbacks, and thus suggest that it is much easier to design obstruction-free implementations than lock-free and waitfree ones. We also briefly discuss other obstruction-free data structures and operations that we have implemented. 1.
Transactional Lock-Free Execution of Lock-Based Programs
- In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems
, 2002
"... This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness programmer ..."
Abstract
-
Cited by 148 (9 self)
- Add to MetaCart
This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness programmers often rely on conservative locking at the expense of performance. The resulting serialization of threads is a performance bottleneck. Locks also interact poorly with thread scheduling and faults, resulting in poor system performance.
Optimistic parallelism requires abstractions
- In PLDI
, 2007
"... Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and runtime speculative execution have failed to uncover much parallelism in these applications, in spite o ..."
Abstract
-
Cited by 65 (8 self)
- Add to MetaCart
Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and runtime speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications. In this paper, we describe two real-world irregular applications: a Delaunay mesh refinement application and a graphics application that performs agglomerative clustering. By studying the algorithms and data structures used in these applications, we show that there is substantial coarse-grain, data parallelism in these applications, but that this parallelism is very dependent on the input data and therefore cannot be uncovered by compiler analysis. In principle, optimistic techniques such as thread-level speculation can be used to uncover this parallelism, but we argue that current implementations cannot accomplish this because they do not use the proper abstractions for the data structures in these programs. These insights have informed our design of the Galois system, an object-based optimistic parallelization system for irregular applications. There are three main aspects to Galois: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by an optimistic computation. We show that Delaunay mesh generation and agglomerative clustering can be parallelized in a straight-forward way using the Galois approach, and we present experimental measurements to show that this approach is practical. These results suggest that Galois is a practical approach to exploiting data parallelism in irregular programs.
Non-blocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors
- JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1998
"... Most multiprocessors are multiprogrammed in order to achieve acceptable response time and to increase their uti-lization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two pri ..."
Abstract
-
Cited by 65 (1 self)
- Add to MetaCart
Most multiprocessors are multiprogrammed in order to achieve acceptable response time and to increase their uti-lization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for concurrent, atomic update of shared data structures: (1) preemption-safe locking and (2) non-blocking (lock-free) algorithms. Preemption-safe locking requires kernel support. Non-blocking algorithms generally require a universal atomic primitive such as compare-and-swap orload-linked/store-conditional, and are widely regarded as inefficient. We evaluate the performance of preemption-safe lock-based and non-blocking implementations of important data structures—queues, stacks, heaps, and counters—including non-blocking and lock-based queue algorithms of our own, in micro-benchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that our non-blocking queue consistently outperforms the best known alternatives, and that data-structure-specific non-blocking algorithms, which exist for queues, stacks, and counters, can work extremely well. Not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time, since general-purpose non-blocking techniques do not yet appear to be practical, preemption-safe locks remain the preferred alternative for complex data structures: they outperform
Autolocker: Synchronization Inference for Atomic Sections
- In POPL
, 2006
"... The movement to multi-core processors increases the need for simpler, more robust parallel programming models. Atomic sections have been widely recognized for their ease of use. They are simpler and safer to use than manual locking and they increase modularity. But existing proposals have several pr ..."
Abstract
-
Cited by 56 (2 self)
- Add to MetaCart
The movement to multi-core processors increases the need for simpler, more robust parallel programming models. Atomic sections have been widely recognized for their ease of use. They are simpler and safer to use than manual locking and they increase modularity. But existing proposals have several practical problems, including high overhead and poor interaction with I/O. We present pessimistic atomic sections, a fresh approach that retains many of the advantages of optimistic atomic sections as seen in “transactional memory ” without sacrificing performance or compatibility. Pessimistic atomic sections employ the locking mechanisms familiar to programmers while relieving them of most burdens of lock-based programming, including deadlocks. Significantly, pessimistic atomic sections separate correctness from performance: they allow programmers to extract more parallelism via finergrained locking without fear of introducing bugs. We believe this property is crucial for exploiting multi-core processor designs. We describe a tool, Autolocker, that automatically converts pessimistic atomic sections into standard lock-based code. Autolocker relies extensively on program analysis to determine a correct locking policy free of deadlocks and race conditions. We evaluate the expressiveness of Autolocker by modifying a 50,000 line highperformance web server to use atomic sections while retaining the original locking policy. We analyze Autolocker’s performance using microbenchmarks, where Autolocker outperforms software transactional memory by more than a factor of 3. Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming
The repeat offender problem: A mechanism for supporting dynamic-sized, lock-free data structures
- In Proceedings of the 16th International Symposium on Distributed Computing
, 2002
"... We define the Repeat Offender Problem (ROP). Elsewhere, we have presented the first dynamic-sized lock-free data structures that can free memory to any standard memory allocator—even after thread failures—without requiring special support from the operating system, the memory allocator, or the hardw ..."
Abstract
-
Cited by 44 (10 self)
- Add to MetaCart
We define the Repeat Offender Problem (ROP). Elsewhere, we have presented the first dynamic-sized lock-free data structures that can free memory to any standard memory allocator—even after thread failures—without requiring special support from the operating system, the memory allocator, or the hardware. These results depend on a solution to the ROP problem. Here we present the first solution to the ROP problem and its correctness proof. Our solution is implementable in most modern shared memory multiprocessors. M/S MTV29-01
Pragmatic Nonblocking Synchronization for Real-Time Systems
, 2001
"... We present a pragmatic methodology for designing nonblocking real-time systems. Our methodology uses a combination of lock-free and wait-free synchronization techniques and clearly states which technique should be applied in which situation. ..."
Abstract
-
Cited by 36 (12 self)
- Add to MetaCart
We present a pragmatic methodology for designing nonblocking real-time systems. Our methodology uses a combination of lock-free and wait-free synchronization techniques and clearly states which technique should be applied in which situation.
Effective Fine-Grain Synchronization For Automatically Parallelized Programs Using Optimistic Synchronization Primitives
- ACM TRANSACTIONS ON COMPUTER SYSTEMS
, 1999
"... This paper presents our experience using optimistic synchronization to implement fine-grain atomic operations in the context of a parallelizing compiler for irregular, object-based computations. Our experience shows that the synchronization requirements of these programs differ significantly from th ..."
Abstract
-
Cited by 33 (5 self)
- Add to MetaCart
This paper presents our experience using optimistic synchronization to implement fine-grain atomic operations in the context of a parallelizing compiler for irregular, object-based computations. Our experience shows that the synchronization requirements of these programs differ significantly from those of traditional parallel computations, which use loop nests to access dense matrices using affine access functions. In addition to coarsegrain barrier synchronization, our irregular computations require synchronization primitives that support efficient fine-grain atomic operations
Two-Handed Emulation: How to build non-blocking implementations of complex data-structures using DCAS
- In Proceedings of the 21st Annual Symposium on Principles of Distributed Computing
, 2002
"... This paper partly addresses the question of whether, in principle, there is any point in adding richer hardware synchronization primitives when the existing set is \universal", and therefore sucient to synchronize any data structure in a non-blocking manner. The context of this paper is the ongoing ..."
Abstract
-
Cited by 25 (0 self)
- Add to MetaCart
This paper partly addresses the question of whether, in principle, there is any point in adding richer hardware synchronization primitives when the existing set is \universal", and therefore sucient to synchronize any data structure in a non-blocking manner. The context of this paper is the ongoing investigation of the utility of adding a DCAS instruction to modern processors to aid the design and performance of non-blocking algorithms. We add one more piece of evidence in support of this instruction.
Delaunay Triangulation with Transactions and Barriers
- IEEE Intl. Symp. on Workload Characterization
, 2007
"... Transactional memory has been widely hailed as a simpler alternative to locks in multithreaded programs, but few nontrivial transactional programs are currently available. We describe an open-source implementation of Delaunay triangulation that uses transactions as one component of a larger parallel ..."
Abstract
-
Cited by 23 (9 self)
- Add to MetaCart
Transactional memory has been widely hailed as a simpler alternative to locks in multithreaded programs, but few nontrivial transactional programs are currently available. We describe an open-source implementation of Delaunay triangulation that uses transactions as one component of a larger parallelization strategy. The code is written in C++, for use with the RSTM software transactional memory library (also open source). It employs one of the fastest known sequential algorithms to triangulate geometrically partitioned regions in parallel; it then employs alternating, barrier-separated phases of transactional and partitioned work to stitch those regions together. Experiments on multiprocessor and multicore machines confirm excellent single-thread performance and good speedup with increasing thread count. Since execution time is dominated by geometrically partitioned computation, performance is largely insensitive to the overhead of transactions, but highly sensitive to any costs imposed on sharable data that are currently “privatized”. 1.

