Results 1 - 10
of
30
Virtualizing Transactional Memory
, 2005
"... Writing concurrent programs is difficult because of the complexity of ensuring proper synchronization. Conventional lock-based synchronization suffers from wellknown limitations, so researchers have considered nonblocking transactions as an alternative. Recent hardware proposals have demonstrated ho ..."
Abstract
-
Cited by 224 (2 self)
- Add to MetaCart
Writing concurrent programs is difficult because of the complexity of ensuring proper synchronization. Conventional lock-based synchronization suffers from wellknown limitations, so researchers have considered nonblocking transactions as an alternative. Recent hardware proposals have demonstrated how transactions can achieve high performance while not suffering limitations of lock-based mechanisms. However, current hardware proposals require programmers to be aware of platform-specific resource limitations such as buffer sizes, scheduling quanta, as well as events such as page faults, and process migrations. If the transactional model is to gain wide acceptance, hardware support for transactions must be virtualized to hide these limitations in much the same way that virtual memory shields the programmer from platform-specific limitations of physical memory. This paper proposes Virtual Transactional Memory (VTM), a user-transparent system that shields the programmer from various platform-specific resource limitations. VTM maintains the performance advantage of hardware transactions, incurs low overhead in time, and has modest costs in hardware support. While many system-level challenges remain, VTM takes a step toward making transactional models more widely acceptable.
Logtm: Log-based transactional memory
- in HPCA
, 2006
"... Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (r ..."
Abstract
-
Cited by 173 (8 self)
- Add to MetaCart
Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (retained if the transaction aborts) values. Most (hardware) TM systems leave old values “in place” (the target memory address) and buffer new values elsewhere until commit. This makes aborts fast, but penalizes (the much more frequent) commits. In this paper, we present a new implementation of transactional memory, Log-based Transactional Memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place. LogTM makes two additional contributions. First, LogTM extends a MOESI directory protocol to enable both fast conflict detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32way multiprocessor support our decision to optimize for commit by showing that only 1-2 % of transactions abort. 1.
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
, 2001
"... Serialization of threads due to critical sections is a fundamental bottleneck to achieving high performance in multithreaded programs. Dynamically, such serialization may be unnecessary because these critical sections could have safely executed concurrently without locks. Current processors cannot f ..."
Abstract
-
Cited by 161 (9 self)
- Add to MetaCart
Serialization of threads due to critical sections is a fundamental bottleneck to achieving high performance in multithreaded programs. Dynamically, such serialization may be unnecessary because these critical sections could have safely executed concurrently without locks. Current processors cannot fully exploit such parallelism because they do not have mechanisms to dynamically detect such false inter-thread dependences. We propose Speculative Lock Elision (SLE), a novel micro-architectural technique to remove dynamically unnecessary lock-induced serialization and enable highly concurrent multithreaded execution. The key insight is that locks do not always have to be acquired for a correct execution. Synchronization instructions are predicted as being unnecessary and elided. This allows multiple threads to concurrently execute critical sections protected by the same lock. Misspeculation due to inter-thread data conflicts is detected using existing cache mechanisms and rollback is used for recovery. Successful speculative elision is validated and committed without acquiring the lock. SLE can be implemented entirely in microarchitecture without instruction set support and without system-level modifications, is transparent to programmers, and requires only trivial additional hardware support. SLE can provide programmers a fast path to writing correct high-performance multithreaded programs.
Transactional Lock-Free Execution of Lock-Based Programs
- In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems
, 2002
"... This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness programmer ..."
Abstract
-
Cited by 148 (9 self)
- Add to MetaCart
This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness programmers often rely on conservative locking at the expense of performance. The resulting serialization of threads is a performance bottleneck. Locks also interact poorly with thread scheduling and faults, resulting in poor system performance.
McRT-STM: a High Performance Software Transactional Memory System for a Multi-Core Runtime
- In Proc. of the 11th ACM Symp. on Principles and Practice of Parallel Programming
, 2006
"... Applications need to become more concurrent to take advantage of the increased computational power provided by chip level multiprocessing. Programmers have traditionally managed this concurrency using locks (mutex based synchronization). Unfortunately, lock based synchronization often leads to deadl ..."
Abstract
-
Cited by 138 (9 self)
- Add to MetaCart
Applications need to become more concurrent to take advantage of the increased computational power provided by chip level multiprocessing. Programmers have traditionally managed this concurrency using locks (mutex based synchronization). Unfortunately, lock based synchronization often leads to deadlocks, makes fine-grained synchronization difficult, hinders composition of atomic primitives, and provides no support for error recovery. Transactions avoid many of these problems, and therefore, promise to ease concurrent programming. We describe a software transactional memory (STM) system that is part of McRT, an experimental Multi-Core RunTime. The McRT-STM implementation uses a number of novel algorithms, and supports advanced features such as nested transactions with partial aborts, conditional signaling within a transaction, and object based conflict detection for C/C++ applications. The McRT-STM exports interfaces that can be used from C/C++ programs directly or as a target for compilers translating higher level linguistic constructs. We present a detailed performance analysis of various STM design tradeoffs such as pessimistic versus optimistic concurrency, undo logging versus write buffering, and cache line based versus object based conflict detection. We also show a MCAS implementation that works on arbitrary values, coexists with the STM, and can be used as a more efficient form of transactional memory. To provide a baseline we compare the performance of the STM with that of fine-grained and coarsegrained locking using a number of concurrent data structures on a 16-processor SMP system. We also show our STM performance on a non-synthetic workload – the Linux sendmail application.
The Synergy Between Non-blocking Synchronization and Operating System Structure
, 1996
"... Non-blocking synchronization has significant advantages over blocking synchronization: however, it has not been used to a significant degree in practice. We designed and implemented a multiprocessor operating system kernel and run-time library for high-performance, reliability and modularity. We use ..."
Abstract
-
Cited by 98 (1 self)
- Add to MetaCart
Non-blocking synchronization has significant advantages over blocking synchronization: however, it has not been used to a significant degree in practice. We designed and implemented a multiprocessor operating system kernel and run-time library for high-performance, reliability and modularity. We used nonblocking synchronization, not because it was an objective in itself, but because it became the approach of choice. It was an attractive approach because of the synergy between other structuring techniques we used to achieve our primary goals and the benefits of non-blocking synchronization. This paper
Adaptive Software Transactional Memory
- In Proc. of the 19th Intl. Symp. on Distributed Computing
, 2005
"... Abstract. Software Transactional Memory (STM) is a generic synchronization construct that enables automatic conversion of correct sequential objects into correct nonblocking concurrent objects. Recent STM systems, though significantly more practical than their predecessors, display inconsistent perf ..."
Abstract
-
Cited by 94 (15 self)
- Add to MetaCart
Abstract. Software Transactional Memory (STM) is a generic synchronization construct that enables automatic conversion of correct sequential objects into correct nonblocking concurrent objects. Recent STM systems, though significantly more practical than their predecessors, display inconsistent performance: differing design decisions cause different systems to perform best in different circumstances, often by dramatic margins. In this paper we consider four dimensions of the STM design space: (i) when concurrent objects are acquired by transactions for modification; (ii) how they are acquired; (iii) what they look like when not acquired; and (iv) the non-blocking semantics for transactions (lock-freedom vs. obstruction-freedom). In this 4-dimensional space we highlight the locations of two leading STM systems: the DSTM of Herlihy et al. and the OSTM of Fraser and Harris. Drawing motivation from the performance of a series of application benchmarks, we then present a new Adaptive STM (ASTM) system that adjusts to the offered workload, allowing it to match the performance of the best known existing system on every tested workload. 1
Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications
- ASPLOS X
, 2002
"... Barriers, locks, and flags are synchronizing operations widely used by programmers and parallelizing compilers to produce race-free parallel programs. Often times, these operations are placed suboptimally, either because of conservative assumptions about the program, or merely for code simplicity. W ..."
Abstract
-
Cited by 75 (7 self)
- Add to MetaCart
Barriers, locks, and flags are synchronizing operations widely used by programmers and parallelizing compilers to produce race-free parallel programs. Often times, these operations are placed suboptimally, either because of conservative assumptions about the program, or merely for code simplicity. We propose
Toward a Theory of Transactional Contention Managers
- In Proceedings of the 24th Annual ACM Symposium on Principles of Distributed Computing (PODC
, 2005
"... In recent software transactional memory proposals, a contention manager module is responsible for ensuring that the system as a whole makes progress. A number of contention manager algorithms have been proposed and empirically evaluated. In this paper we lay some foundations for a theory of contenti ..."
Abstract
-
Cited by 66 (13 self)
- Add to MetaCart
In recent software transactional memory proposals, a contention manager module is responsible for ensuring that the system as a whole makes progress. A number of contention manager algorithms have been proposed and empirically evaluated. In this paper we lay some foundations for a theory of contention management. We present the greedy contention manager, the first to combine non-trivial provable properties with good practical performance. In a model where transaction delays are finite, the greedy manager guarantees that every transaction commits within a bounded time, and the time to complete n concurrent transactions that share s objects is within a factor of s(s + 1) + 2 of the time that would have been taken by an optimal offline list scheduler. No contention manager reviewed in the literature satisfies both the properties. Benchmark results convey our claim of the practicality of the greedy manager.
Non-blocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors
- JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1998
"... Most multiprocessors are multiprogrammed in order to achieve acceptable response time and to increase their uti-lization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two pri ..."
Abstract
-
Cited by 65 (1 self)
- Add to MetaCart
Most multiprocessors are multiprogrammed in order to achieve acceptable response time and to increase their uti-lization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for concurrent, atomic update of shared data structures: (1) preemption-safe locking and (2) non-blocking (lock-free) algorithms. Preemption-safe locking requires kernel support. Non-blocking algorithms generally require a universal atomic primitive such as compare-and-swap orload-linked/store-conditional, and are widely regarded as inefficient. We evaluate the performance of preemption-safe lock-based and non-blocking implementations of important data structures—queues, stacks, heaps, and counters—including non-blocking and lock-based queue algorithms of our own, in micro-benchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that our non-blocking queue consistently outperforms the best known alternatives, and that data-structure-specific non-blocking algorithms, which exist for queues, stacks, and counters, can work extremely well. Not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time, since general-purpose non-blocking techniques do not yet appear to be practical, preemption-safe locks remain the preferred alternative for complex data structures: they outperform

