Transactional Lock-Free Execution of Lock-Based Programs
- In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems
, 2002
Cited by 148 (9 self)
This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness, programmers often rely on conservative locking at the expense of performance. The resulting serialization of threads is a performance bottleneck. Locks also interact poorly with thread scheduling and faults, resulting in poor system performance.
Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications
- ASPLOS X
, 2002
Cited by 75 (7 self)
Barriers, locks, and flags are synchronizing operations widely used by programmers and parallelizing compilers to produce race-free parallel programs. Oftentimes, these operations are placed suboptimally, either because of conservative assumptions about the program, or merely for code simplicity. We propose
Non-blocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors
- Journal of Parallel and Distributed Computing
, 1998
Cited by 65 (1 self)
Most multiprocessors are multiprogrammed in order to achieve acceptable response time and to increase their utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for concurrent, atomic update of shared data structures: (1) preemption-safe locking and (2) non-blocking (lock-free) algorithms. Preemption-safe locking requires kernel support. Non-blocking algorithms generally require a universal atomic primitive such as compare-and-swap or load-linked/store-conditional, and are widely regarded as inefficient. We evaluate the performance of preemption-safe lock-based and non-blocking implementations of important data structures (queues, stacks, heaps, and counters), including non-blocking and lock-based queue algorithms of our own, in micro-benchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that our non-blocking queue consistently outperforms the best known alternatives, and that data-structure-specific non-blocking algorithms, which exist for queues, stacks, and counters, can work extremely well. Not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time, since general-purpose non-blocking techniques do not yet appear to be practical, preemption-safe locks remain the preferred alternative for complex data structures: they outperform
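To illustrate the compare-and-swap retry pattern that the abstract above refers to, here is a minimal sketch of a non-blocking counter. It is not taken from the paper. Python exposes no user-level compare-and-swap, so the `AtomicCell` class and its `compare_and_swap` method are hypothetical stand-ins that emulate the hardware primitive with an internal lock; only the retry loop in `increment` reflects the non-blocking technique.

```python
import threading


class AtomicCell:
    """Emulated atomic word (hypothetical helper).

    Assumption: the internal lock only models the atomicity that a hardware
    compare-and-swap instruction would provide; real non-blocking code would
    use the instruction directly.
    """

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: if the cell still holds `expected`, store `new`.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(cell):
    """Read, compute, and CAS; retry if another thread changed the cell."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old + 1
```

A thread that loses the race simply re-reads the value and tries again, so a preempted thread never leaves the counter locked for the others.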
Shared-memory mutual exclusion: Major research trends since 1986
- Distributed Computing
, 2003
Cited by 38 (7 self)
* Exclusion: At most one process executes its critical section at any time.
Towards Scalable Multiprocessor Virtual Machines
, 2004
Cited by 20 (1 self)
A multiprocessor virtual machine benefits its guest operating system in supporting scalable job throughput and request latency, useful properties in server consolidation where servers require several of the system processors for steady state or to handle load bursts. Typical operating systems, optimized for multiprocessor systems in their use of spin-locks for critical sections, can defeat flexible virtual machine scheduling due to lock-holder preemption and misbalanced load. The virtual machine must assist the guest operating system to avoid lock-holder preemption and to schedule jobs with knowledge of asymmetric processor allocation. We want to support a virtual machine environment with flexible scheduling policies, while maximizing guest performance. This paper presents solutions to avoid lock-holder preemption for both fully virtualized and paravirtualized environments. Experiments show that we can nearly eliminate the effects of lock-holder preemption. Furthermore, the paper presents a scheduler feedback mechanism that, despite the presence of asymmetric processor allocation, achieves optimal and fair load balancing in the guest operating system.
High Performance Synchronization Algorithms for Multiprogrammed Multiprocessors
- In Proceedings of the Fifth ACM Symposium on Principles and Practice of Parallel Programming
, 1995
Cited by 20 (2 self)
Scalable busy-wait synchronization algorithms are essential for achieving good parallel program performance on large scale multiprocessors. Such algorithms include mutual exclusion locks, reader-writer locks, and barrier synchronization. Unfortunately, scalable synchronization algorithms are particularly sensitive to the effects of multiprogramming: their performance degrades sharply when processors are shared among different applications, or even among processes of the same application. In this paper we describe the design and evaluation of scalable scheduler-conscious mutual exclusion locks, reader-writer locks, and barriers, and show that by sharing information across the kernel/application interface we can improve the performance of scheduler-oblivious implementations by more than an order of magnitude.
Scalable Queue-Based Spin Locks with Timeout
, 2001
Cited by 15 (1 self)
Queue-based spin locks allow programs with busy-wait synchronization to scale to very large multiprocessors, without fear of starvation or performance-destroying contention. So-called try locks, traditionally based on non-scalable test-and-set locks, allow a process to abandon its attempt to acquire a lock after a given amount of time. The process can then pursue an alternative code path, or yield the processor to some other process. We demonstrate that it is possible to obtain both scalability and bounded waiting, using variants of the queue-based locks of Craig, Landin, and Hagersten, and of Mellor-Crummey and Scott. A process that decides to stop waiting for one of these new locks can "link itself out of line" atomically. Single-processor experiments reveal performance penalties of 50-100% for the CLH and MCS try locks in comparison to their standard versions; this marginal cost decreases with larger numbers of processors. We have also compared our queue-based locks to a traditional test-and-test-and-set lock with exponential backoff and timeout. At modest (non-zero) levels of contention, the queued locks sacrifice cache locality for fairness, resulting in a worst-case 3X performance penalty. At high levels of contention, however, they display a 1.5-2X performance advantage, with significantly more regular timings and significantly higher rates of acquisition prior to timeout.
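For orientation, the traditional baseline named in the abstract above (a test-and-test-and-set try lock with exponential backoff and timeout) can be sketched as follows. This is not the paper's CLH or MCS try lock, and it is not the authors' code. The class name `TatasTryLock` and the parameters `base_delay` and `max_delay` are hypothetical, and a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set instruction, which Python does not expose.

```python
import random
import threading
import time


class TatasTryLock:
    """Test-and-test-and-set try lock with exponential backoff and timeout.

    Assumption: a non-blocking threading.Lock.acquire() models the atomic
    test-and-set instruction a real spin lock would use.
    """

    def __init__(self, base_delay=1e-6, max_delay=1e-3):
        self._flag = threading.Lock()   # stand-in for the lock word
        self._base_delay = base_delay   # hypothetical tuning parameters
        self._max_delay = max_delay

    def try_acquire(self, timeout):
        """Return True if the lock was acquired within `timeout` seconds."""
        deadline = time.monotonic() + timeout
        delay = self._base_delay
        while True:
            # "Test": inspect the flag without attempting to set it.
            if not self._flag.locked():
                # "Test-and-set": one atomic attempt to take the lock.
                if self._flag.acquire(blocking=False):
                    return True
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False            # timed out; caller takes another path
            # Randomized exponential backoff, capped by the time remaining.
            time.sleep(min(random.uniform(0, delay), remaining))
            delay = min(delay * 2, self._max_delay)

    def release(self):
        self._flag.release()
```

A caller that receives `False` can follow an alternative code path or yield the processor, which is the behavior the abstract attributes to try locks in general.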
The Architectural and Operating System Implications on the Performance of Synchronization on ccNUMA Multiprocessors
- International Journal of Parallel Programming
, 2001
Cited by 11 (0 self)
This paper investigates the performance of synchronization algorithms on ccNUMA multiprocessors, from the perspectives of the architecture and the operating system. In contrast with previous related studies that emphasized the relative performance of synchronization algorithms, this paper takes a new approach by analyzing the sources of synchronization latency on ccNUMA architectures and how this latency can be reduced by leveraging hardware and software schemes in both dedicated and multiprogrammed execution environments. From the architectural perspective, the paper identifies the implications of directory-based cache coherence on the latency and scalability of synchronization primitives and examines whether and how simple hardware that accelerates synchronization instructions can be leveraged to reduce synchronization latency. From the operating system’s perspective, the paper evaluates, in a unified framework, user-level, kernel-level, and hybrid algorithms for implementing scalable synchronization in multiprogrammed execution environments. Along with addressing these issues, the paper contributes a new methodology for implementing fast synchronization algorithms on ccNUMA multiprocessors. The relevant experiments are conducted on the SGI Origin2000, a popular commercial ccNUMA multiprocessor.
Program transformation and runtime support for threaded MPI execution on shared-memory machines
- ACM Transactions on Programming Languages and Systems
, 2000
Cited by 11 (1 self)
Parallel programs written in MPI have been widely used for developing high-performance applications on various platforms. Because of a restriction of the MPI computation model, conventional MPI implementations on shared memory machines map each MPI node to an OS process, which can suffer serious performance degradation in the presence of multiprogramming. This paper studies compile-time and runtime techniques for enhancing performance portability of MPI code running on multiprogrammed shared memory machines. The proposed techniques allow MPI nodes to be executed safely and efficiently as threads. Compile-time transformation eliminates global and static variables in C code using node-specific data. The runtime support includes an efficient and provably correct communication protocol that uses lock-free data structures and takes advantage of address space sharing among threads. The experiments on an SGI Origin 2000 show that our MPI prototype, called TMPI, which uses the proposed techniques, is competitive with SGI’s native MPI implementation in a dedicated environment, and that it has significant performance advantages in a multiprogrammed environment.
Preemption Adaptivity in Time-Published Queue-Based Spin Locks
- In Proceedings of the 11th International Conference on High Performance Computing
, 2005
Cited by 10 (3 self)
The proliferation of multiprocessor servers and multithreaded applications has increased the demand for high-performance synchronization. Traditional scheduler-based locks incur the overhead of a full context switch between threads and are thus unacceptably slow for many applications. Spin locks offer low overhead, but they either scale poorly on large-scale SMPs (test-and-set style locks) or behave poorly in the presence of preemption (queue-based locks). Previous work has shown how to build preemption-tolerant locks using an extended kernel interface, but such locks are neither portable to nor even compatible with most operating systems. In this work, we propose a time-publishing heuristic in which each thread periodically records its current timestamp to a shared memory location. Given the high-resolution, roughly synchronized clocks of modern processors, this convention allows threads to guess accurately which peers are active based on the currency of their timestamps. We implement two queue-based locks, MCS-TP and CLH-TP, and evaluate their performance relative to both traditional spin locks and preemption-safe locks on a 32-processor IBM p690 multiprocessor. Experimental results indicate that time-published locks make it feasible, for the first time, to use queue-based spin locks on multiprogrammed systems with a standard kernel interface.
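The time-publishing heuristic described in the abstract above can be sketched in a few lines. This shows only the idea of publishing and checking timestamps, not the MCS-TP or CLH-TP locks themselves. The names `PublishedTime`, `seems_active`, `spinner`, and `staleness_threshold` are hypothetical, and `time.monotonic()` stands in for the processor's high-resolution clock.

```python
import threading
import time


class PublishedTime:
    """Shared slot holding one thread's most recently published timestamp."""

    def __init__(self):
        self.stamp = time.monotonic()

    def publish(self):
        # Called periodically by the owning thread, e.g. inside its spin loop.
        self.stamp = time.monotonic()


def seems_active(slot, staleness_threshold):
    """Guess whether the slot's owner is running: a fresh stamp suggests so."""
    return time.monotonic() - slot.stamp <= staleness_threshold


def spinner(slot, stop):
    """Toy waiting thread that keeps publishing until asked to stop."""
    while not stop.is_set():
        slot.publish()
        time.sleep(0.001)
```

A peer whose stamp has gone stale is presumed preempted, so a lock built on this idea could skip over it rather than hand the lock to a thread that is not running. The threshold is a tuning choice: too small and running threads are misjudged as preempted, too large and preempted threads are detected late.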

