Results 1  10
of
32
Contention in Shared Memory Algorithms
, 1993
"... Most complexitymeasures for concurrent algorithms for asynchronous sharedmemory architectures focus on process steps and memory consumption. In practice, however, performance of multiprocessor algorithms is heavily influenced by contention, the extent to which processes access the same location at t ..."
Abstract

Cited by 63 (1 self)
 Add to MetaCart
Most complexitymeasures for concurrent algorithms for asynchronous sharedmemory architectures focus on process steps and memory consumption. In practice, however, performance of multiprocessor algorithms is heavily influenced by contention, the extent to which processes access the same location at the same time. Nevertheless, even though contention is one of the principal considerations affecting the performance of real algorithms on real multiprocessors, there are no formal tools for analyzing the contention of asynchronous sharedmemory algorithms. This paper introduces the first formal complexity model for contention in multiprocessors. We focus on the standard multiprocessor architecture in which n asynchronous processes communicate by applying read, write, and readmodifywrite operations to a shared memory. We use our model to derive two kinds of results: (1) lower bounds on contention for well known basic problems such as agreement and mutual exclusion, and (2) tradeoffs betwe...
Diffracting trees
 In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM
, 1994
"... Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrentdatastructure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a dis ..."
Abstract

Cited by 58 (11 self)
 Add to MetaCart
Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrentdatastructure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a distributed/parallel environment. Empirical evidence, collected on a simulated distributed sharedmemory machine and several simulated messagepassing architectures, shows that diffracting trees scale better and are more robust than both combining trees and counting networks, currently the most effective known methods for implementing concurrent counters in software. The use of a randomized coordination method together with a combinatorial data structure overcomes the resiliency drawbacks of combining trees. Our simulations show that to handle the same load, diffracting trees and counting networks should have a similar width w, yet the depth of a diffracting tree is O(log w), whereas counting networks have depth O(log 2 w). Diffracting trees have already been used to implement highly efficient producer/consumer queues, and we believe diffraction will prove to be an effective alternative paradigm to combining and queuelocking in the design of many concurrent data structures.
TIGHT ANALYSES OF TWO LOCAL LOAD BALANCING ALGORITHMS
 SIAM J. COMPUT.
, 1999
"... This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d + 1 fewer tokens, where d is the maximum degree of any node in the network. We ..."
Abstract

Cited by 51 (5 self)
 Add to MetaCart
This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d + 1 fewer tokens, where d is the maximum degree of any node in the network. We show that within O(∆/α) steps, the algorithm reduces the maximum difference in tokens between any two nodes to at most O((d 2 log n)/α), where ∆ is the global imbalance in tokens (i.e., the maximum difference between the number of tokens at any node initially and the average number of tokens), n is the number of nodes in the network, and α is the edge expansion of the network. The time bound is tight in the sense that for any graph with edge expansion α, and for any value ∆, there exists an initial distribution of tokens with imbalance ∆ for which the time to reduce the imbalance to even ∆/2 is at least Ω(∆/α). The bound on the final imbalance is tight in the sense that there exists a class of networks that can be locally balanced everywhere (i.e., the maximum difference in tokens between any two neighbors is at most 2d), while the global imbalance remains Ω((d 2 log n)/α). Furthermore, we show that upon reaching a state with a global imbalance of O((d 2 log n)/α), the time for this algorithm to locally balance the network can be as large as Ω(n 1/2). We extend our analysis to a variant of this algorithm for dynamic and asynchronous
Reactive Synchronization Algorithms for Multiprocessors
"... Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable runtime factors. The designer of a synchronization algorithm has a choice of protocols to use for implementing the synchro ..."
Abstract

Cited by 50 (2 self)
 Add to MetaCart
Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable runtime factors. The designer of a synchronization algorithm has a choice of protocols to use for implementing the synchronization operation. For example, candidate protocols for locks include testandset protocols and queueing protocols. Frequently, the best choice of protocols depends on the level of contention: previous research has shown that testandset protocols for locks outperform queueing protocols at low contention, while the opposite is true at high contention. This paper investigates reactive synchronization algorithms that dynamically choose protocols in response to the level of contention. We describe reactive algorithms for spin locks and fetchandop that choose among several sharedmemory and messagepassing protocols. Dynamically choosing protocols presents a challenge: a reactive algorithm needs to select and change protocols efficiently, and has to allow for the possibility that multiple processes may be executing different protocols at the same time. We describe the notion of consensus objects that the reactive algorithms use to preserve correctness in the face of dynamic protocol changes. Experimental measurements demonstrate that reactive algorithms perform close to the best static choice of protocols at all levels of contention. Furthermore, with mixed levels of contention, reactive algorithms outperform passive algorithms with fixed protocols, provided that contention levels do not change too frequently. Measurements of several parallel applications show that reactive algorithms result in modest performance gains for spin locks and significant gains for fetchandop.
Computation Migration: Enhancing Locality for DistributedMemory Parallel Systems
"... We describe computation migration, a new technique that is based on compiletime program transformations, for accessing remote data in a distributedmemory parallel system. In contrast with RPCstyle access, where the access is performed remotely, and with data migration, where the data is moved so ..."
Abstract

Cited by 49 (4 self)
 Add to MetaCart
We describe computation migration, a new technique that is based on compiletime program transformations, for accessing remote data in a distributedmemory parallel system. In contrast with RPCstyle access, where the access is performed remotely, and with data migration, where the data is moved so that it is local, computation migration moves part of the current thread to the processor where the data resides. The access is performed at the remote processor, and the migrated thread portion continues to run on that same processor; this makes subsequent accesses in the thread portion local. We describe an implementation of computation migration that consists of two parts: an implementation that migrates single activation frames, and a highlevel language annotation that allows a programmer to express when migration is desired. We performed experiments using two applications; these experiments demonstrate that computation migration is a valuable alternative to RPC and data migration.
Waiting Algorithms for Synchronization in LargeScale Multiprocessors
 ACM Transactions on Computer Systems
, 1991
"... Through analysis and experiments, this paper investigates twophase waiting algorithms to minimize the cost of waiting for synchronization in largescale multiprocessors. In a twophase algorithm, a thread #rst waits by polling a synchronization variable. If the cost of polling reaches a limit L ..."
Abstract

Cited by 45 (4 self)
 Add to MetaCart
Through analysis and experiments, this paper investigates twophase waiting algorithms to minimize the cost of waiting for synchronization in largescale multiprocessors. In a twophase algorithm, a thread #rst waits by polling a synchronization variable. If the cost of polling reaches a limit L poll and further waiting is necessary, the thread is blocked, incurring an additional #xed cost, B. The choice of L poll is a critical determinant of the performance of twophase algorithms.
Local Divergence of Markov Chains and the Analysis of Iterative LoadBalancing Schemes
 IN PROCEEDINGS OF THE 39TH IEEE SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS ’98
, 1998
"... We develop a general technique for the quantitative analysis of iterative distributed load balancing schemes. We illustrate the technique by studying two simple, intuitively appealing models that are prevalent in the literature: the diffusive paradigm, and periodic balancing circuits (or the dimensi ..."
Abstract

Cited by 42 (0 self)
 Add to MetaCart
We develop a general technique for the quantitative analysis of iterative distributed load balancing schemes. We illustrate the technique by studying two simple, intuitively appealing models that are prevalent in the literature: the diffusive paradigm, and periodic balancing circuits (or the dimension exchange paradigm). It is well known that such load balancing schemes can be roughly modeled by Markov chains, but also that this approximation can be quite inaccurate. Our main contribution is an effective way of characterizing the deviation between the actual loads and the distribution generated by a related Markov chain, in terms of a natural quantity which we call the local divergence. We apply this technique to obtain bounds on the number of rounds required to achieve coarse balancing in general networks, cycles and meshes in these models. For balancing circuits, we also present bounds for the stronger requirement of perfect balancing, or counting.
Approximate Load Balancing on Dynamic and Asynchronous Networks
 In Proceedings of the 25th Annual ACM Symposium on Theory of Computing
, 1993
"... This paper presents a simple local algorithm for load balancing in a distributed network. The algorithm makes no assumption about the structure of the network. It can be executed on a synchronous network with fixed topology, a synchronous network with dynamically changing topology, or an asynchronou ..."
Abstract

Cited by 41 (3 self)
 Add to MetaCart
This paper presents a simple local algorithm for load balancing in a distributed network. The algorithm makes no assumption about the structure of the network. It can be executed on a synchronous network with fixed topology, a synchronous network with dynamically changing topology, or an asynchronous network. It works quickly and balances well when the network has an expansion property. In particular, we show that in an nnode networkwith maximumdegree d whose live edges, at every time step, form a ¯expander, the algorithm will balance the load to within an additive O(d log n=¯) term in O(\Delta log(n\Delta)=¯) time, where \Delta is the initial imbalance. The algorithm improves upon previous approaches that yield O(n) time bounds in dynamic and asynchronous networks. 1 Introduction One of the most fundamental problems to solve on a parallel computer or distributed network is to balance the load or work that must be performed among the various processors. This paper analyzes a sim...
SmallDepth Counting Networks
, 1992
"... Generalizing the notion of a sorting network, Aspnes, Herlihy, and Shavit recently introduced a class of socalled "counting" networks, and established an O(lg 2 n) upper bound on the depth complexity of such networks. Their work was motivated by a number of practical applications arising in the dom ..."
Abstract

Cited by 41 (2 self)
 Add to MetaCart
Generalizing the notion of a sorting network, Aspnes, Herlihy, and Shavit recently introduced a class of socalled "counting" networks, and established an O(lg 2 n) upper bound on the depth complexity of such networks. Their work was motivated by a number of practical applications arising in the domain of asynchronous shared memory machines. This paper continues the analysis of counting networks, providing a number of new upper bounds. In particular, we present an explicit construction of an O(c lg* lg n) depth counting network, a randomized construction of an O(lgn)depth network (that works with extremely high probability), and using the random con struction we present an existential proof of a de terministic O(lgn)depth network. The latter result matches the trivial (lgn)depth lower bound to within a constant factor.
Scalable Concurrent Counting
 ACM Transactions on Computer Systems
, 1995
"... The notion of counting is central to a number of basic multiprocessor coordination problems, such as dynamic load balancing, barrier synchronization, and concurrent data structure design. In this paper, we investigate the scalability of a variety of counting techniques for largescale multiprocessor ..."
Abstract

Cited by 24 (11 self)
 Add to MetaCart
The notion of counting is central to a number of basic multiprocessor coordination problems, such as dynamic load balancing, barrier synchronization, and concurrent data structure design. In this paper, we investigate the scalability of a variety of counting techniques for largescale multiprocessors. We compare counting techniques based on: (1) spin locks, (2) message passing, (3) distributed queues, (4) software combining trees, and (5) counting networks. Our comparison is based on a series of simple benchmarks on a simulated 64processor Alewife machine, a distributedmemory multiprocessor currently under development at MIT. Although locking techniques are known to perform well on smallscale, busbased multiprocessors, serialization limits performance and contention can degrade performance. Both counting networks and combining trees substantially outperform the other methods by avoiding serialization and alleviating contention, although combining tree throughput is more sensitive t...