Results 1 - 10
of
341
Transactional Memory: Architectural Support for Lock-Free Data Structures
"... A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent systems, lock-free data structures avoid common problems asso ..."
Abstract
-
Cited by 597 (19 self)
- Add to MetaCart
A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent systems, lock-free data structures avoid common problems associated with conventional locking techniques, including priority inversion, convoying, and difficulty of avoiding deadlock. This paper introduces transactional memory, a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion. Transactional memory allows programmers to define customized read-modify-write operations that apply to multiple, independently-chosen words of memory. It is implemented by straightforward extensions to any multiprocessor cache-coherence protocol. Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
A methodology for implementing highly concurrent data structures
- In 2nd Symp. Principles & Practice of Parallel Programming
, 1990
"... A con.curren.t object is a data structure shared by concurrent processes. Conventional techniques for implementing concurrent objects typically rely on criticaI sections: ensuring that only one process at a time can operate on the object. Nevertheless, critical sections are poorly suited for asynchr ..."
Abstract
-
Cited by 295 (12 self)
- Add to MetaCart
A con.curren.t object is a data structure shared by concurrent processes. Conventional techniques for implementing concurrent objects typically rely on criticaI sections: ensuring that only one process at a time can operate on the object. Nevertheless, critical sections are poorly suited for asynchronous systems: if one process is halted or delayed in a critical section, other, non-faulty processes will be unable to progress. By contrast, a concurrent object implementation is non-blocking if it always guarantees that some process will complete an operation in a finite number of steps, and it is wait-free if it guarantees that each process will complete an operation in a finite number of steps. This paper proposes a new methodology for constructing non-blocking aud wait-free implementations of concurrent objects. The object’s representation and operations are written as st,ylized sequential programs, with no explicit synchronization. Each sequential operation is automatically transformed into a non-blocking or wait-free operation usiug novel synchronization and memory management algorithms. These algorithms are presented for a multiple instruction/multiple data (MIM D) architecture in which n processes communicate by applying read, write, and comparekYswa,p operations to a shared memory. 1
Logtm: Log-based transactional memory
- in HPCA
, 2006
"... Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (r ..."
Abstract
-
Cited by 173 (8 self)
- Add to MetaCart
Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (retained if the transaction aborts) values. Most (hardware) TM systems leave old values “in place” (the target memory address) and buffer new values elsewhere until commit. This makes aborts fast, but penalizes (the much more frequent) commits. In this paper, we present a new implementation of transactional memory, Log-based Transactional Memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place. LogTM makes two additional contributions. First, LogTM extends a MOESI directory protocol to enable both fast conflict detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32way multiprocessor support our decision to optimize for commit by showing that only 1-2 % of transactions abort. 1.
Better Verification Through Symmetry
, 1996
"... A fundamental difficulty in automatic formal verification of finite-state systems is the state explosion problem -- even relatively simple systems can produce very large state spaces, causing great difficulties for methods that rely on explicit state enumeration. We address the problem by exploiting ..."
Abstract
-
Cited by 173 (8 self)
- Add to MetaCart
A fundamental difficulty in automatic formal verification of finite-state systems is the state explosion problem -- even relatively simple systems can produce very large state spaces, causing great difficulties for methods that rely on explicit state enumeration. We address the problem by exploiting structural symmetries in the description of the system to be verified. We make symmetries easy to detect by introducing a new data type scalarset, a finite and unordered set, to our description language. The operations on scalarsets are restricted so that states are guaranteed to have the same future behaviors, up to permutation of the elements of the scalarsets. Using the symmetries implied by scalarsets, a verifier can automatically generate a reduced state space, on the fly. We provide a proof of the soundness of the new symmetry-based verification algorithm based on a definition of the formal semantics of a simple description language with scalarsets. The algorithm has been implemented ...
Transactional Lock-Free Execution of Lock-Based Programs
- In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems
, 2002
"... This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness programmer ..."
Abstract
-
Cited by 148 (9 self)
- Add to MetaCart
This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness programmers often rely on conservative locking at the expense of performance. The resulting serialization of threads is a performance bottleneck. Locks also interact poorly with thread scheduling and faults, resulting in poor system performance.
Synchronization and Communication in the T3E Multiprocessor
, 1996
"... This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include pe ..."
Abstract
-
Cited by 127 (1 self)
- Add to MetaCart
This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include performance measurements for various aspects of communication and synchronization. The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers). E-registers are used as the source or target for all remote communication. They provide a highly pipelined interface to global memory that allows dozens of requests per processor to be outstanding. Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. The T3E also provides a set of virtual hardware barrier/ eureka networks that can be arbitrarily embedded into the 3D torus interconnect.
Thin Locks: Featherweight Synchronization for Java
, 1998
"... Language-supported synchronization is a source of serious performance problems in many Java programs. Even singlethreaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new ..."
Abstract
-
Cited by 105 (5 self)
- Add to MetaCart
Language-supported synchronization is a source of serious performance problems in many Java programs. Even singlethreaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new algorithm that allows lock and unlock operations to be performed with only a few machine instructions in the most common cases. Our locks only require a partial word per object, and were implemented without increasing object size. We present measurements from our implementation in the JDK 1.1.2 for AIX, demonstrating speedups of up to a factor of 5 in micro-benchmarks and up to a factor of 1.7 in real programs. 1 Introduction Monitors [5] are a language-level construct for providing mutually exclusive access to shared data structures in a multithreaded environment. However, the overhead required by the necessary locking has generally restricted their use to relatively "heavy-weight" object...
A Unified Formalization of Four Shared-Memory Models
- IEEE Transactions on Parallel and Distributed Systems
, 1993
"... This paper presents a shared-memory model, data-race-free-1, that unifies four earlier models: weak ordering, release consistency (with sequentially consistent special operations), the VAX memory model, and datarace -free-0. The most intuitive and commonly assumed shared-memory model, sequential con ..."
Abstract
-
Cited by 105 (9 self)
- Add to MetaCart
This paper presents a shared-memory model, data-race-free-1, that unifies four earlier models: weak ordering, release consistency (with sequentially consistent special operations), the VAX memory model, and datarace -free-0. The most intuitive and commonly assumed shared-memory model, sequential consistency, limits performance. The models of weak ordering, release consistency, the VAX, and data-race-free-0 are based on the common intuition that if programs synchronize explicitly and correctly, then sequential consistency can be guaranteed with high performance. However, each model formalizes this intuition differently and has different advantages and disadvantages with respect to the other models. Data-race-free-1 unifies the models of weak ordering, release consistency, the VAX, and data-race-free-0 by formalizing the above intuition in a manner that retains the advantages of each of the four models. A multiprocessor is data-race-free-1 if it guarantees sequential consistency to data-...
Integrating Message-Passing and Shared-Memory: Early Experience
, 1993
"... This paper discusses some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. While such a programming model can be implemented on both shared-memory and messagepassing architectures, we argue that the transparent, coher ..."
Abstract
-
Cited by 85 (14 self)
- Add to MetaCart
This paper discusses some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. While such a programming model can be implemented on both shared-memory and messagepassing architectures, we argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of crucial importance. Because message-passing mechanisms are much more efficient than shared-memory loads and stores for certain types of interprocessor communication and synchronization operations, however, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms. We describe an architecture, Alewife, that integrates support for shared-memory and message-passing through a simple interface; we expect the compiler and runtime system to cooperate in using appropriate hardware mechanisms that are most efficient for specific operations. We report on both integrated and exclusively shared-memory implementations of our runtime system and two applications. The integrated runtime system drastically cuts down the cost of communication incurred by the scheduling, load balancing, and certain synchronization operations. We also present preliminary performance results comparing the two systems.
Application-Specific Protocols for User-Level Shared Memory
- In Proceedings of Supercomputing '94
, 1994
"... Recent distributed shared memory (DSM) systems and proposed shared-memory machines have implemented some or all of their cache coherence protocols in software. One way to exploit the flexibility of this software is to tailor a coherence protocol to match an application's communication patterns and m ..."
Abstract
-
Cited by 84 (24 self)
- Add to MetaCart
Recent distributed shared memory (DSM) systems and proposed shared-memory machines have implemented some or all of their cache coherence protocols in software. One way to exploit the flexibility of this software is to tailor a coherence protocol to match an application's communication patterns and memory semantics. This paper presents evidence that this approach can lead to large performance improvements. It shows that application-specific protocols substantially improved the performance of three application programs---appbt, em3d, and barnes---over carefully tuned transparent shared memory implementations. The speed-ups were obtained on Blizzard, a fine-grained DSM system running on a 32-node Thinking Machines CM-5. 1 Introduction A shared address space is central to many parallel languages and models of parallel computation. It provides the global names for data that enable a proces- This work is supported in part by NSF PYI/NYI Awards MIP-8957278, CCR-9157366, and CCR-9357779,...

