Results 1 - 10
of
191
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
- ACM Transactions on Computer Systems
, 1991
"... Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become marke ..."
Abstract
-
Cited by 433 (29 self)
- Add to MetaCart
Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible ag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. We present a new scalable algorithm for spin locks that generates O(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than
The Implementation of the Cilk-5 Multithreaded Language
- In Proceedings of the SIGPLAN '98 Conference on Program Language Design and Implementation
, 1998
"... The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear st ..."
Abstract
-
Cited by 248 (20 self)
- Add to MetaCart
The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design...
A Fast Mutual Exclusion Algorithm
, 1986
"... A new solution to the mutual exclusion problem is presented that, in the absence of contention, requires only seven memory accesses. It assumes atomic reads and atomic writes to shared registers. Capsule Review To build a useful computing system from a collection of processors that communicate by ..."
Abstract
-
Cited by 202 (6 self)
- Add to MetaCart
A new solution to the mutual exclusion problem is presented that, in the absence of contention, requires only seven memory accesses. It assumes atomic reads and atomic writes to shared registers. Capsule Review To build a useful computing system from a collection of processors that communicate by sharing memory, but lack any atomic operation more complex than a memory read or write, it is necessary to implement mutual exclusion using only these operations. Solutions to this problem have been known for twenty years, but they are linear in the number of processors. Lamport presents a new algorithm which takes constant time (five writes and two reads) in the absence of contention, which is the normal case. To achieve this performance it sacrifices fairness, which is probably unimportant in practical applications. The paper gives an informal argument that the algorithm's performance in the absence of contention is optimal, and a fairly formal proof of safety and freedom from deadlock, u...
Logtm: Log-based transactional memory
- in HPCA
, 2006
"... Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (r ..."
Abstract
-
Cited by 173 (8 self)
- Add to MetaCart
Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (retained if the transaction aborts) values. Most (hardware) TM systems leave old values “in place” (the target memory address) and buffer new values elsewhere until commit. This makes aborts fast, but penalizes (the much more frequent) commits. In this paper, we present a new implementation of transactional memory, Log-based Transactional Memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place. LogTM makes two additional contributions. First, LogTM extends a MOESI directory protocol to enable both fast conflict detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32way multiprocessor support our decision to optimize for commit by showing that only 1-2 % of transactions abort. 1.
Thin Locks: Featherweight Synchronization for Java
, 1998
"... Language-supported synchronization is a source of serious performance problems in many Java programs. Even singlethreaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new ..."
Abstract
-
Cited by 105 (5 self)
- Add to MetaCart
Language-supported synchronization is a source of serious performance problems in many Java programs. Even singlethreaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new algorithm that allows lock and unlock operations to be performed with only a few machine instructions in the most common cases. Our locks only require a partial word per object, and were implemented without increasing object size. We present measurements from our implementation in the JDK 1.1.2 for AIX, demonstrating speedups of up to a factor of 5 in micro-benchmarks and up to a factor of 1.7 in real programs. 1 Introduction Monitors [5] are a language-level construct for providing mutually exclusive access to shared data structures in a multithreaded environment. However, the overhead required by the necessary locking has generally restricted their use to relatively "heavy-weight" object...
Closure and Convergence: A Foundation of Fault-Tolerant Computing
- IEEE Transactions on Software Engineering
, 1993
"... We give a formal definition of what it means for a system to "tolerate" a class of "faults". The definition consists of two conditions: One, if a fault occurs when the system state is within a set of "legal" states, the resulting state is within some larger set and, if faults continue occurring, the ..."
Abstract
-
Cited by 103 (28 self)
- Add to MetaCart
We give a formal definition of what it means for a system to "tolerate" a class of "faults". The definition consists of two conditions: One, if a fault occurs when the system state is within a set of "legal" states, the resulting state is within some larger set and, if faults continue occurring, the system state remains within that larger set (Closure). And two, if faults stop occurring, the system eventually reaches a state within the legal set (Convergence). We demonstrate the applicability of our definition for specifying and verifying the fault-tolerance properties of a variety of digital and computer systems. Further, using the definition, we obtain a simple classification of fault-tolerant systems and discuss methods for their systematic design. as traditionally been studied in the context of specifi...
Empirical Studies of Competitive Spinning for a Shared-Memory Multiprocessor
- IN PROCEEDINGS OF THE 13TH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES
, 1991
"... A common operation in multiprocessor programs is acquiring a lock to protect access to shared data. Typically, the requesting thread is blocked if the lock it needs is held by another thread. The cost of blocking one thread and activating another can be a substantial part of program execution time. ..."
Abstract
-
Cited by 99 (1 self)
- Add to MetaCart
A common operation in multiprocessor programs is acquiring a lock to protect access to shared data. Typically, the requesting thread is blocked if the lock it needs is held by another thread. The cost of blocking one thread and activating another can be a substantial part of program execution time. Alternatively, the thread could spin until the lock is free, or spin for a while and then block. This may avoid context-switch overhead, but processor cycles may be wasted in unproductive spinning. This paper studies seven strategies for determining whether and how long to spin before blocking. Of particular interest are competitive strategies, for which the performance can be shown to be no worse than some constant factor times an optimal off-line strategy. The performance of five competitive strategies is compared with that of always blocking, always spinning, or using the optimal off-line algorithm. Measurements of lock-waiting time distributions for five parallel programs were used to co...
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
, 1981
"... In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also pr ..."
Abstract
-
Cited by 84 (2 self)
- Add to MetaCart
In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
Secure and Scalable Replication in Phalanx
- In Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems
, 1998
"... ) Dahlia Malkhi Michael K. Reiter AT&T Labs Research, Florham Park, NJ, USA fdalia,reiterg@research.att.com Abstract Phalanx is a software system for building a persistent, survivable data repository that supports shared data abstractions (e.g., variables, mutual exclusion) for clients. Phalanx ..."
Abstract
-
Cited by 83 (8 self)
- Add to MetaCart
) Dahlia Malkhi Michael K. Reiter AT&T Labs Research, Florham Park, NJ, USA fdalia,reiterg@research.att.com Abstract Phalanx is a software system for building a persistent, survivable data repository that supports shared data abstractions (e.g., variables, mutual exclusion) for clients. Phalanx implements data abstractions that ensure useful properties without trusting the servers supporting these abstractions or the clients accessing them, i.e., Phalanx can survive even the arbitrarily malicious corruption of clients and (some number of) servers. At the core of the system are survivable replication techniques that enable efficient scaling to hundreds of Phalanx servers. In this paper we describe the implementation of some of the data abstractions provided by Phalanx, discuss their ability to scale to large systems, and describe an example application. 1. Introduction In this paper we introduce Phalanx, a software system for building persistent services that support shared data ab...
A Correctness Condition for High-Performance Multiprocessors (Extended Abstract)
, 1992
"... Hybrid consistency, a new consistency condition for shared memory multiprocessors, attempts to cap ture the guarantees provided by contemporary high-performance architectures. It combines the exprea-siveneas of strong consistency conditions (e.g., sequen-tial consistency, linearizability) and the ef ..."
Abstract
-
Cited by 65 (7 self)
- Add to MetaCart
Hybrid consistency, a new consistency condition for shared memory multiprocessors, attempts to cap ture the guarantees provided by contemporary high-performance architectures. It combines the exprea-siveneas of strong consistency conditions (e.g., sequen-tial consistency, linearizability) and the efficiency of weak consistency conditions (e.g., Pipelined RAM, causal memory). Memory access operations are classified as either strong or weak. A global ordering of strong operations at different processes is guaran-teed, but there is very little guarantee on the ordering of weak operations at different processes, except for what is implied by their interleaving with the strong operations. A formal and precise definition of this condition is given. An efficient implementation of hy-brid consistency on distributed memory machines is presented. In this implementation, weak operations are executed instantaneously, while the response time for strong operations is linear in the network delay. (It is proven that this is within a constant factor of the optimal time bounds.) To motivate hybrid consistency it is shown that weakly consistent memories do not support non-cooperative (in particular, non-centralized) alg~ rithms for mutual exclusion.

