Transactional Lock-Free Execution of Lock-Based Programs
In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2002
Cited by 148 (9 self)
This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness, programmers often rely on conservative locking at the expense of performance. The resulting serialization of threads is a performance bottleneck. Locks also interact poorly with thread scheduling and faults, resulting in poor system performance.
Lock-Free Linked Lists Using Compare-and-Swap
In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, 1995
Cited by 84 (1 self)
Lock-free data structures implement concurrent objects without the use of mutual exclusion. This approach can avoid performance problems due to unpredictable delays while processes are within critical sections. Although universal methods are known that give lock-free data structures for any abstract data type, the overhead of these methods makes them inefficient when compared to conventional techniques using mutual exclusion, such as spin locks. We give lock-free data structures and algorithms for implementing a shared singly-linked list, allowing concurrent traversal, insertion, and deletion by any number of processes. We also show how the basic data structure can be used as a building block for other lock-free data structures. Our algorithms use the single-word Compare-and-Swap synchronization primitive to implement the linked list directly, avoiding the overhead of universal methods, and are thus a practical alternative to using spin locks.
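To make the Compare-and-Swap retry pattern concrete, here is a minimal sketch of insertion into a sorted singly-linked list. It is not the algorithm from the paper: it handles insertion only, and concurrent deletion (the hard part the paper addresses) is deliberately left out. Python has no hardware CAS, so `AtomicRef` is a hypothetical stand-in that emulates one atomic word with an internal lock; `Node`, `InsertOnlyList`, and `demo_insert` are likewise illustrative names.

```python
import threading


class AtomicRef:
    """Hypothetical stand-in for one CAS-capable memory word."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()  # emulates hardware atomicity only

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still holds `expected`."""
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = AtomicRef(next_node)


class InsertOnlyList:
    """Sorted list supporting concurrent insertion (no deletion)."""

    def __init__(self):
        self.head = Node(None)  # sentinel node, never removed

    def insert(self, key):
        while True:
            # Find the first node whose key is not smaller than `key`.
            pred = self.head
            curr = pred.next.load()
            while curr is not None and curr.key < key:
                pred, curr = curr, curr.next.load()
            node = Node(key, curr)
            # Link the node only if pred still points at curr.
            if pred.next.compare_and_swap(curr, node):
                return
            # CAS failed: another thread linked a node here, so rescan.

    def keys(self):
        out = []
        curr = self.head.next.load()
        while curr is not None:
            out.append(curr.key)
            curr = curr.next.load()
        return out


def demo_insert(threads=4, per_thread=200):
    """Insert disjoint key ranges from several threads at once."""
    lst = InsertOnlyList()

    def worker(t):
        for i in range(per_thread):
            lst.insert(i * threads + t)

    workers = [threading.Thread(target=worker, args=(t,)) for t in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return lst.keys()
```

A failed CAS never blocks the caller; it only means some other thread made progress, which is what distinguishes this pattern from a spin lock.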
Non-blocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors
Journal of Parallel and Distributed Computing, 1998
Cited by 65 (1 self)
Most multiprocessors are multiprogrammed in order to achieve acceptable response time and to increase their utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for concurrent, atomic update of shared data structures: (1) preemption-safe locking and (2) non-blocking (lock-free) algorithms. Preemption-safe locking requires kernel support. Non-blocking algorithms generally require a universal atomic primitive such as compare-and-swap or load-linked/store-conditional, and are widely regarded as inefficient. We evaluate the performance of preemption-safe lock-based and non-blocking implementations of important data structures (queues, stacks, heaps, and counters), including non-blocking and lock-based queue algorithms of our own, in micro-benchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that our non-blocking queue consistently outperforms the best known alternatives, and that data-structure-specific non-blocking algorithms, which exist for queues, stacks, and counters, can work extremely well. Not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time, since general-purpose non-blocking techniques do not yet appear to be practical, preemption-safe locks remain the preferred alternative for complex data structures: they outperform
Correction of a Memory Management Method for Lock-Free Data Structures
1995
Cited by 31 (1 self)
Memory reuse in link-based lock-free data structures requires special care. Many lock-free algorithms require deleted nodes not to be reused until no active pointers point to them. Also, most lock-free algorithms use the compare-and-swap atomic primitive, which can suffer from the "ABA problem" [1] associated with memory reuse. Valois [3] proposed a memory management method for link-based data structures that addresses these problems. The method associates a reference count with each node of reusable memory. A node is reused only when no processes or data structures point to it. The method solves the ABA problem for acyclic link-based data structures, and allows lock-free algorithms more flexibility as nodes are not required to be freed immediately after a delete operation (e.g. dequeue, pop, delete min, etc.). However, there are race conditions that may corrupt data structures that use this method. In this report we correct these race conditions and present a corrected version of Valoi...
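The ABA problem mentioned above is easier to see in a replayed interleaving than in prose. The sketch below steps through one such interleaving on a stack, single-threaded and deterministic, first with a bare top pointer and then with a version tag attached to it. The version tag is a different, commonly used mitigation and is not the reference-counting method this report corrects. `Word`, `Node`, and `aba_scenario` are hypothetical names, and reusing a Python object stands in for reusing freed memory.

```python
class Word:
    """Hypothetical single-threaded model of one CAS-capable memory word."""

    def __init__(self, value):
        self.value = value

    def compare_and_swap(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False


class Node:
    def __init__(self, name):
        self.name = name
        self.next = None


def aba_scenario(tagged):
    """Replay one ABA interleaving; return (cas_succeeded, name_of_top)."""
    a, b, c = Node("A"), Node("B"), Node("C")
    a.next, b.next = b, c  # stack is A -> B -> C, with A on top
    version = 0

    def pack(node):
        """Value stored in `top`: a bare node, or (node, fresh version)."""
        nonlocal version
        version += 1
        return (node, version) if tagged else node

    def node_of(word_value):
        return word_value[0] if tagged else word_value

    top = Word(pack(a))

    # Thread 1 begins a pop: it snapshots top and top's successor, then stalls.
    seen = top.value
    planned_next = node_of(seen).next  # B

    # Thread 2 runs in the meantime. Each assignment below stands for a
    # successful CAS by thread 2.
    top.value = pack(b)  # pop A
    top.value = pack(c)  # pop B; B is now considered free
    a.next = c
    top.value = pack(a)  # push the recycled A; top is A again

    # Thread 1 resumes and tries to swing top from A to the stale successor B.
    swapped = top.compare_and_swap(seen, pack(planned_next))
    return swapped, node_of(top.value).name
```

With a bare pointer the final CAS succeeds and leaves the freed node B on top; with the tag it fails because the version changed, so thread 1 would retry against the current stack.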
A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems
In Proceedings of the 13th ACM Symposium on Parallel Algorithms and Architectures, 2001
Cited by 22 (6 self)
A non-blocking FIFO queue algorithm for multiprocessor shared memory systems is presented in this paper. The algorithm is very simple, fast and scales very well in both symmetric and non-symmetric multiprocessor shared memory systems. Experiments on a 64-node SUN Enterprise 10000 (a symmetric multiprocessor system) and on a 64-node SGI Origin 2000 (a cache-coherent non-uniform memory access multiprocessor system) indicate that our algorithm considerably outperforms the best of the known alternatives on both multiprocessors at any level of multiprogramming. This work introduces two new, simple algorithmic mechanisms. The first lowers the contention on key variables used by the concurrent enqueue and/or dequeue operations, which results in the good performance of the algorithm. The second deals with the pointer recycling problem, an inconsistency problem that all non-blocking algorithms based on the compare-and-swap synchronisation primitive have to address. In our construction we chose compare-and-swap because it is an atomic primitive that scales well under contention and is either supported by modern multiprocessors or can be implemented efficiently on them.
FastForward for efficient pipeline parallelism: A cache-optimized concurrent lock-free queue
In PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008
Cited by 21 (2 self)
Low-overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, with weak to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward’s effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.
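The abstract does not spell out the queue's layout, so the following is only a sketch of one way a single-producer/single-consumer ring can avoid shared indices: each slot carries its own "empty" marker, the producer owns `head`, and the consumer owns `tail`. Treat that design as an assumption about the general approach, not as the paper's code. The names `SpscRing` and `run_pipeline` are hypothetical, and the sketch relies on CPython executing each slot read or write as one indivisible step; a C or C++ version would need explicit memory ordering and cache-line padding.

```python
import threading
import time

_EMPTY = None  # marks a free slot, so None itself cannot be a payload


class SpscRing:
    """Ring buffer for exactly one producer thread and one consumer thread."""

    def __init__(self, capacity):
        self._slots = [_EMPTY] * capacity
        self._head = 0  # next slot to fill; touched only by the producer
        self._tail = 0  # next slot to drain; touched only by the consumer

    def try_enqueue(self, item):
        if item is _EMPTY:
            raise ValueError("None is reserved as the empty-slot marker")
        if self._slots[self._head] is not _EMPTY:
            return False  # slot not yet drained: the queue is full
        self._slots[self._head] = item
        self._head = (self._head + 1) % len(self._slots)
        return True

    def try_dequeue(self):
        item = self._slots[self._tail]
        if item is _EMPTY:
            return None  # nothing published in this slot yet
        self._slots[self._tail] = _EMPTY  # hand the slot back to the producer
        self._tail = (self._tail + 1) % len(self._slots)
        return item


def run_pipeline(n=500, capacity=16):
    """Push n integers through the ring from one thread to another."""
    ring = SpscRing(capacity)
    received = []

    def producer():
        for i in range(n):
            while not ring.try_enqueue(i):
                time.sleep(0)  # yield so the consumer can drain

    def consumer():
        while len(received) < n:
            item = ring.try_dequeue()
            if item is None:
                time.sleep(0)  # yield so the producer can publish
            else:
                received.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Because the two threads never write the same index variable, the only data they exchange is the slot contents, which is the property that keeps index traffic out of the shared cache lines in a native implementation.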
A Lock-Free Approach to Object Sharing in Real-Time Systems
1997
Cited by 14 (0 self)
This work aims to establish the viability of lock-free object sharing in uniprocessor real-time systems. Naive usage of conventional lock-based object-sharing schemes in real-time systems leads to unbounded priority inversion. A priority inversion occurs when a task is blocked by a lower-priority task that is inside a critical section. Mechanisms that bound priority inversion usually entail kernel overhead that is sometimes excessive. We propose that lock-free objects offer an attractive alternative to lock-based schemes because they eliminate priority inversion and its associated problems. On the surface, lock-free objects may seem to be unsuitable for hard real-time systems because accesses to such objects are not guaranteed to complete in bounded time. Nonetheless, we present scheduling conditions that demonstrate the applicability of lock-free objects in hard real-time systems. Our scheduling conditions are applicable to schemes such as rate-monotonic scheduling and earliest-deadline-...
A Fully Asynchronous Reader/Writer Mechanism for Multiprocessor Real-Time Systems
1997
Cited by 13 (1 self)
Data sharing among tasks within multiprocessor real-time systems is a crucial issue. This report presents a fully asynchronous mechanism of sharing data between a single writer and multiple readers. The writer and all the readers are allowed to access the shared data asynchronously in a loop-free and wait-free manner because neither locking operations nor repeated actions of read-and-check are involved. Its implementation uses only (n + 2) buffer slots for n readers, and employs an atomic 'Store-If-Zero' operation which can be easily simulated with the Compare-and-Swap instruction. Since neither writing nor reading the shared data imposes any effect upon other tasks in the system, this mechanism introduces no impact upon the timing behaviour of tasks. When employed by real-time applications, it helps to reduce blocking and priority inversion problems incurred by the commonly used lock-based synchronization mechanisms.
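The remark that Store-If-Zero can be simulated with Compare-and-Swap can be sketched in a few lines, assuming Store-If-Zero means "write the new value only if the word currently holds zero" and reports whether it did. That reading is an assumption; the abstract does not define the operation. `Word`, `store_if_zero`, and `race` are hypothetical names, and the lock inside `Word` only emulates the atomicity a hardware CAS would provide.

```python
import threading


class Word:
    """Hypothetical stand-in for one CAS-capable memory word."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulates hardware atomicity only

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def store_if_zero(word, value):
    """Assumed semantics: store `value` only if the word holds 0.

    A single CAS with expected value 0 does this without any retry loop.
    """
    return word.compare_and_swap(0, value)


def race(contenders=8):
    """Let several threads try to claim the same zero word."""
    word = Word(0)
    winners = []

    def claim(tag):
        if store_if_zero(word, tag):
            winners.append(tag)

    threads = [threading.Thread(target=claim, args=(t,)) for t in range(1, contenders + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners, word.load()
```

Exactly one contender succeeds and the rest see a failed store, and since no caller loops, the simulation stays consistent with the loop-free property the abstract claims.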
A Three-Slot Asynchronous Reader/Writer Mechanism for Multiprocessor Real-Time Systems
1997
Cited by 11 (2 self)
This report presents an approach to realizing a three-slot asynchronous reader/writer mechanism in multiprocessor real-time systems. The mechanism allows both the reader and the writer to access the shared data object at any time without employing any conventional synchronization protocol. Its implementation takes advantage of the hardware-supported Compare-and-Swap instruction to coordinate buffer access. Since the approach adopts no locking mechanism, it is non-blocking: a delay of either the reader or the writer has no effect on the other. In addition, no repeated reading is needed by the reader. For real-time applications, this mechanism introduces no impact upon the timing behaviour of tasks. It helps to reduce blocking and priority inversion problems incurred by the commonly used lock-based synchronization mechanisms.
The Architectural and Operating System Implications on the Performance of Synchronization on ccNUMA Multiprocessors
International Journal of Parallel Programming, 2001
Cited by 11 (0 self)
This paper investigates the performance of synchronization algorithms on ccNUMA multiprocessors, from the perspectives of the architecture and the operating system. In contrast with previous related studies that emphasized the relative performance of synchronization algorithms, this paper takes a new approach by analyzing the sources of synchronization latency on ccNUMA architectures and how this latency can be reduced by leveraging hardware and software schemes in both dedicated and multiprogrammed execution environments. From the architectural perspective, the paper identifies the implications of directory-based cache coherence on the latency and scalability of synchronization primitives and examines whether and how simple hardware that accelerates synchronization instructions can be leveraged to reduce synchronization latency. From the operating system’s perspective, the paper evaluates, in a unified framework, user-level, kernel-level and hybrid algorithms for implementing scalable synchronization in multiprogrammed execution environments. In addition to addressing these issues, the paper contributes a new methodology for implementing fast synchronization algorithms on ccNUMA multiprocessors. The relevant experiments are conducted on the SGI Origin2000, a popular commercial ccNUMA multiprocessor.

