Contention in Shared Memory Algorithms
, 1993
Cited by 57 (1 self)
Most complexity measures for concurrent algorithms for asynchronous shared-memory architectures focus on process steps and memory consumption. In practice, however, performance of multiprocessor algorithms is heavily influenced by contention, the extent to which processes access the same location at the same time. Nevertheless, even though contention is one of the principal considerations affecting the performance of real algorithms on real multiprocessors, there are no formal tools for analyzing the contention of asynchronous shared-memory algorithms. This paper introduces the first formal complexity model for contention in multiprocessors. We focus on the standard multiprocessor architecture in which n asynchronous processes communicate by applying read, write, and read-modify-write operations to a shared memory. We use our model to derive two kinds of results: (1) lower bounds on contention for well-known basic problems such as agreement and mutual exclusion, and (2) trade-offs betwe...
Diffracting trees
- In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM
, 1994
Cited by 52 (10 self)
Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrent-data-structure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a distributed/parallel environment. Empirical evidence, collected on a simulated distributed shared-memory machine and several simulated message-passing architectures, shows that diffracting trees scale better and are more robust than both combining trees and counting networks, currently the most effective known methods for implementing concurrent counters in software. The use of a randomized coordination method together with a combinatorial data structure overcomes the resiliency drawbacks of combining trees. Our simulations show that to handle the same load, diffracting trees and counting networks should have a similar width w, yet the depth of a diffracting tree is O(log w), whereas counting networks have depth O(log^2 w). Diffracting trees have already been used to implement highly efficient producer/consumer queues, and we believe diffraction will prove to be an effective alternative paradigm to combining and queue-locking in the design of many concurrent data structures.
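As a rough illustration of counting with a tree of width w and depth log w, the sketch below simulates, sequentially, a binary tree of toggle balancers whose leaves hand out counter values. It is an assumption-laden simplification and not the paper's algorithm: it omits the randomized diffraction step entirely, it is not thread-safe, and the names Balancer, CountingTree, and next_value are hypothetical.

```python
class Balancer:
    """Toggle that routes successive tokens alternately to output 0 and output 1."""

    def __init__(self):
        self.toggle = 0

    def route(self):
        out = self.toggle
        self.toggle ^= 1
        return out


class CountingTree:
    """Sequential simulation of a depth-`depth` tree of balancers (width 2**depth).

    Assumption: no diffraction and no concurrency; this only shows how the
    leaves of such a tree can hand out distinct counter values.
    """

    def __init__(self, depth):
        self.depth = depth
        self.width = 1 << depth
        # Balancers are created lazily, keyed by (level, path taken so far).
        self.balancers = {}
        # Leaf i hands out the values i, i + width, i + 2 * width, ...
        self.leaf_next = list(range(self.width))

    def next_value(self):
        leaf = 0
        for level in range(self.depth):
            balancer = self.balancers.setdefault((level, leaf), Balancer())
            # The bit chosen at the root becomes the least significant bit.
            leaf |= balancer.route() << level
        value = self.leaf_next[leaf]
        self.leaf_next[leaf] += self.width
        return value


if __name__ == "__main__":
    tree = CountingTree(depth=2)
    print([tree.next_value() for _ in range(8)])
```

Each token passes through exactly `depth` balancers, which matches the O(log w) depth quoted in the abstract for a tree of width w.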
Reactive Synchronization Algorithms for Multiprocessors
Cited by 49 (2 self)
Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice of protocols to use for implementing the synchronization operation. For example, candidate protocols for locks include test-and-set protocols and queueing protocols. Frequently, the best choice of protocols depends on the level of contention: previous research has shown that test-and-set protocols for locks outperform queueing protocols at low contention, while the opposite is true at high contention. This paper investigates reactive synchronization algorithms that dynamically choose protocols in response to the level of contention. We describe reactive algorithms for spin locks and fetch-and-op that choose among several shared-memory and message-passing protocols. Dynamically choosing protocols presents a challenge: a reactive algorithm needs to select and change protocols efficiently, and has to allow for the possibility that multiple processes may be executing different protocols at the same time. We describe the notion of consensus objects that the reactive algorithms use to preserve correctness in the face of dynamic protocol changes. Experimental measurements demonstrate that reactive algorithms perform close to the best static choice of protocols at all levels of contention. Furthermore, with mixed levels of contention, reactive algorithms outperform passive algorithms with fixed protocols, provided that contention levels do not change too frequently. Measurements of several parallel applications show that reactive algorithms result in modest performance gains for spin locks and significant gains for fetch-and-op.
Computation Migration: Enhancing Locality for Distributed-Memory Parallel Systems
Cited by 47 (4 self)
We describe computation migration, a new technique that is based on compile-time program transformations, for accessing remote data in a distributed-memory parallel system. In contrast with RPC-style access, where the access is performed remotely, and with data migration, where the data is moved so that it is local, computation migration moves part of the current thread to the processor where the data resides. The access is performed at the remote processor, and the migrated thread portion continues to run on that same processor; this makes subsequent accesses in the thread portion local. We describe an implementation of computation migration that consists of two parts: an implementation that migrates single activation frames, and a high-level language annotation that allows a programmer to express when migration is desired. We performed experiments using two applications; these experiments demonstrate that computation migration is a valuable alternative to RPC and data migration.
TIGHT ANALYSES OF TWO LOCAL LOAD BALANCING ALGORITHMS
- SIAM J. COMPUT.
, 1999
Cited by 45 (5 self)
This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d + 1 fewer tokens, where d is the maximum degree of any node in the network. We show that within O(∆/α) steps, the algorithm reduces the maximum difference in tokens between any two nodes to at most O((d^2 log n)/α), where ∆ is the global imbalance in tokens (i.e., the maximum difference between the number of tokens at any node initially and the average number of tokens), n is the number of nodes in the network, and α is the edge expansion of the network. The time bound is tight in the sense that for any graph with edge expansion α, and for any value ∆, there exists an initial distribution of tokens with imbalance ∆ for which the time to reduce the imbalance to even ∆/2 is at least Ω(∆/α). The bound on the final imbalance is tight in the sense that there exists a class of networks that can be locally balanced everywhere (i.e., the maximum difference in tokens between any two neighbors is at most 2d), while the global imbalance remains Ω((d^2 log n)/α). Furthermore, we show that upon reaching a state with a global imbalance of O((d^2 log n)/α), the time for this algorithm to locally balance the network can be as large as Ω(n^{1/2}). We extend our analysis to a variant of this algorithm for dynamic and asynchronous
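The local rule stated at the start of this abstract can be sketched as a synchronous simulation. This is a minimal sketch under stated assumptions: the graph is an undirected adjacency dictionary, all nodes act simultaneously on the token counts from the start of the step, and the names balance_step and run_until_stable are hypothetical rather than taken from the paper.

```python
def balance_step(adj, tokens):
    """Apply one synchronous step of the local rule.

    adj maps each node to a list of its neighbors (undirected graph);
    tokens maps each node to its current token count.
    Every node sends one token to each neighbor holding at least
    2d + 1 fewer tokens, where d is the maximum degree in the graph.
    """
    d = max(len(neighbors) for neighbors in adj.values())
    change = {node: 0 for node in adj}
    for node, neighbors in adj.items():
        for neighbor in neighbors:
            # Decisions use the counts from the start of the step.
            if tokens[node] - tokens[neighbor] >= 2 * d + 1:
                change[node] -= 1
                change[neighbor] += 1
    return {node: tokens[node] + change[node] for node in adj}


def run_until_stable(adj, tokens):
    """Repeat balance_step until no token moves; return (final tokens, steps)."""
    steps = 0
    while True:
        updated = balance_step(adj, tokens)
        if updated == tokens:
            return tokens, steps
        tokens = updated
        steps += 1


if __name__ == "__main__":
    # Path graph 0 - 1 - 2, so d = 2 and the sending threshold is 5.
    path = {0: [1], 1: [0, 2], 2: [1]}
    print(run_until_stable(path, {0: 10, 1: 0, 2: 0}))
```

On this three-node path the process stops once every pair of neighbors differs by at most 2d = 4 tokens, which is the "locally balanced" condition the abstract refers to.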
Waiting Algorithms for Synchronization in Large-Scale Multiprocessors
- ACM Transactions on Computer Systems
, 1991
Cited by 42 (4 self)
Through analysis and experiments, this paper investigates two-phase waiting algorithms to minimize the cost of waiting for synchronization in large-scale multiprocessors. In a two-phase algorithm, a thread first waits by polling a synchronization variable. If the cost of polling reaches a limit L_poll and further waiting is necessary, the thread is blocked, incurring an additional fixed cost, B. The choice of L_poll is a critical determinant of the performance of two-phase algorithms.
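The cost of a single wait under the two-phase scheme described above can be written out directly. The sketch below assumes that polling costs one unit per unit of waiting time, which the abstract does not state; the function name two_phase_wait_cost and its parameter names are hypothetical.

```python
def two_phase_wait_cost(wait_time, l_poll, block_cost):
    """Cost of one wait under a two-phase waiting algorithm.

    Assumption: polling costs one unit per unit of time waited.
    The thread polls until the polling cost reaches l_poll (the limit
    L_poll); if the wait is not over by then, it blocks and pays the
    additional fixed cost block_cost (B).
    """
    if wait_time <= l_poll:
        # The wait ended during the polling phase.
        return wait_time
    # Polling ran up to the limit, then the thread blocked.
    return l_poll + block_cost


if __name__ == "__main__":
    for wait in (3, 5, 8, 100):
        print(wait, two_phase_wait_cost(wait, l_poll=5, block_cost=10))
```

Under this model a short wait costs only its polling time, while any wait longer than the limit costs L_poll + B regardless of its length, which is why the choice of the limit matters.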
Approximate Load Balancing on Dynamic and Asynchronous Networks
- In Proceedings of the 25th Annual ACM Symposium on Theory of Computing
, 1993
Cited by 39 (3 self)
This paper presents a simple local algorithm for load balancing in a distributed network. The algorithm makes no assumption about the structure of the network. It can be executed on a synchronous network with fixed topology, a synchronous network with dynamically changing topology, or an asynchronous network. It works quickly and balances well when the network has an expansion property. In particular, we show that in an n-node network with maximum degree d whose live edges, at every time step, form a μ-expander, the algorithm will balance the load to within an additive O(d log n/μ) term in O(∆ log(n∆)/μ) time, where ∆ is the initial imbalance. The algorithm improves upon previous approaches that yield O(n) time bounds in dynamic and asynchronous networks.
1 Introduction
One of the most fundamental problems to solve on a parallel computer or distributed network is to balance the load or work that must be performed among the various processors. This paper analyzes a sim...
Small-Depth Counting Networks
, 1992
Cited by 36 (2 self)
Generalizing the notion of a sorting network, Aspnes, Herlihy, and Shavit recently introduced a class of so-called "counting" networks, and established an O(lg^2 n) upper bound on the depth complexity of such networks. Their work was motivated by a number of practical applications arising in the domain of asynchronous shared memory machines. This paper continues the analysis of counting networks, providing a number of new upper bounds. In particular, we present an explicit construction of an O(c^{lg* n} lg n)-depth counting network, a randomized construction of an O(lg n)-depth network (that works with extremely high probability), and using the random construction we present an existential proof of a deterministic O(lg n)-depth network. The latter result matches the trivial Ω(lg n)-depth lower bound to within a constant factor.
Local Divergence of Markov Chains and the Analysis of Iterative Load-Balancing Schemes
- In Proceedings of the 39th IEEE Symposium on Foundations of Computer Science (FOCS '98)
, 1998
Cited by 34 (0 self)
We develop a general technique for the quantitative analysis of iterative distributed load balancing schemes. We illustrate the technique by studying two simple, intuitively appealing models that are prevalent in the literature: the diffusive paradigm, and periodic balancing circuits (or the dimension exchange paradigm). It is well known that such load balancing schemes can be roughly modeled by Markov chains, but also that this approximation can be quite inaccurate. Our main contribution is an effective way of characterizing the deviation between the actual loads and the distribution generated by a related Markov chain, in terms of a natural quantity which we call the local divergence. We apply this technique to obtain bounds on the number of rounds required to achieve coarse balancing in general networks, cycles and meshes in these models. For balancing circuits, we also present bounds for the stronger requirement of perfect balancing, or counting.
Scalable Concurrent Counting
- ACM Transactions on Computer Systems
, 1995
Cited by 22 (10 self)
The notion of counting is central to a number of basic multiprocessor coordination problems, such as dynamic load balancing, barrier synchronization, and concurrent data structure design. In this paper, we investigate the scalability of a variety of counting techniques for large-scale multiprocessors. We compare counting techniques based on: (1) spin locks, (2) message passing, (3) distributed queues, (4) software combining trees, and (5) counting networks. Our comparison is based on a series of simple benchmarks on a simulated 64-processor Alewife machine, a distributed-memory multiprocessor currently under development at MIT. Although locking techniques are known to perform well on small-scale, bus-based multiprocessors, serialization limits performance and contention can degrade performance. Both counting networks and combining trees substantially outperform the other methods by avoiding serialization and alleviating contention, although combining tree throughput is more sensitive t...

