Diffracting trees
- In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM
, 1994
Cited by 52 (10 self)
Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrent-data-structure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a distributed/parallel environment. Empirical evidence, collected on a simulated distributed shared-memory machine and several simulated message-passing architectures, shows that diffracting trees scale better and are more robust than both combining trees and counting networks, currently the most effective known methods for implementing concurrent counters in software. The use of a randomized coordination method together with a combinatorial data structure overcomes the resiliency drawbacks of combining trees. Our simulations show that to handle the same load, diffracting trees and counting networks should have a similar width w, yet the depth of a diffracting tree is O(log w), whereas counting networks have depth O(log^2 w). Diffracting trees have already been used to implement highly efficient producer/consumer queues, and we believe diffraction will prove to be an effective alternative paradigm to combining and queue-locking in the design of many concurrent data structures.
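The shared-counting idea behind a tree of width w and depth log w can be illustrated with a small sequential sketch. This is not the paper's algorithm: it models only a plain binary tree of toggle-bit balancers feeding per-leaf counters, with no randomized diffraction step and no concurrency. The names CountingTree and next_value are hypothetical, chosen for this example.

```python
class CountingTree:
    """Sequential sketch of a binary tree of toggle-bit balancers.

    Assumption: this only models how tokens are routed to leaves; the
    randomized "diffraction" step and all concurrency are left out.
    """

    def __init__(self, depth):
        self.depth = depth
        self.width = 1 << depth  # number of leaf counters, w = 2**depth
        # One toggle bit per balancer; level d holds 2**d balancers.
        self.toggles = [[0] * (1 << d) for d in range(depth)]
        # How many tokens each leaf has handed a value to so far.
        self.leaf_counts = [0] * self.width

    def next_value(self):
        leaf = 0
        for d in range(self.depth):
            # The low d bits of `leaf` identify the balancer reached at level d.
            bit = self.toggles[d][leaf]
            self.toggles[d][leaf] ^= 1  # flip so the next token goes the other way
            leaf |= bit << d
        # Leaf i hands out i, i + w, i + 2w, ...
        value = leaf + self.width * self.leaf_counts[leaf]
        self.leaf_counts[leaf] += 1
        return value
```

Each call visits log w balancers, and calling next_value() repeatedly on a tree of depth 3 yields 0, 1, 2, ... in order, with consecutive tokens spread across the 8 leaf counters.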
Scheduler-Conscious Synchronization
- ACM Transactions on Computer Systems
, 1994
Cited by 35 (7 self)
Efficient synchronization is important for achieving good performance in parallel programs, especially on large-scale multiprocessors. Most synchronization algorithms have been designed to run on a dedicated machine, with one application process per processor, and can suffer serious performance degradation in the presence of multiprogramming. Problems arise when running processes block or, worse, busy-wait for action on the part of a process that the scheduler has chosen not to run. In this paper we describe and evaluate a set of scheduler-conscious synchronization algorithms that perform well in the presence of multiprogramming while maintaining good performance on dedicated machines. We consider both large and small machines, with a particular focus on scalability, and examine mutual-exclusion locks, reader-writer locks, and barriers. The algorithms we study fall into two classes: those that heuristically determine appropriate behavior and those that use scheduler information to guid...
Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors
, 1997
Cited by 26 (5 self)
Shared-memory multiprocessors are becoming increasingly popular as a high-performance, easy to program, and relatively inexpensive choice for parallel computation. However, the performance of shared-memory multiprocessors is limited by memory latency. Memory latencies are higher in multiprocessors due to physical constraints and cache coherence overheads. In addition, synchronization operations, which are necessary to ensure correctness in parallel programs, add further communication overhead in shared-memory multiprocessors. Software-controlled non-binding data prefetching is a widely used consumer-initiated mechanism to hide communication latency and is currently supported on most architectures. However, on an invalidation-based cache-coherent multiprocessor, prefetching is inapplicable or insufficient for some communication patterns such as irregular communication, fine-grain pipelined loops, and synchronization. For these cases, a combination of two fine-grain, producer-initiated pr...
Synchronization Transformations for Parallel Computing
- In Proceedings of the 24th Annual ACM Symposium on the Principles of Programming Languages
, 1997
Cited by 24 (8 self)
ion Transformations Since the synchronization transformations deal primarily with the movement and manipulation of synchronization nodes, it is appropriate for the compiler to use an abstract, simplified representation of the actual computation in the ICFG. The compiler can therefore apply several transformations that replace concrete representations of computation with more abstract representations. The end result is a simpler and smaller ICFG, which improves the performance and functionality of the synchronization optimization algorithms. The transformations are as follows: (1) Node Abstraction: A connected set of assignment, conditional nodes or summary nodes with a single incoming edge and a single outgoing edge is replaced by a single summary node. Figure 2 presents this transformation. (2) Procedure Abstraction: The invocation of a procedure that consists only of assignment, conditional nodes or summary nodes is replaced with a single node summarizing the execution of t...
Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology and Performance
, 1999
Cited by 23 (2 self)
Synchronization is an area that exhibits rich hardware-software interactions in multiprocessors. It was studied extensively using microbenchmarks a decade ago. However, its performance implications are not well understood on modern systems or on real applications. We study the impact of synchronization primitives and algorithms on a modern, 64-processor, hardware-coherent shared address space multiprocessor: the SGI Origin 2000. In addition to the actual results on a modern system, we examine the key methodological issues in studying synchronization, for both microbenchmarks and applications. We find that although the efficient hardware support (Fetch&Op) for synchronization provided on our machine usually helps lock and barrier microbenchmarks, it does not help in improving application performance when compared to good software algorithms that use the processor-provided LL-SC instructions. This is true even in applications that spend a significant amount of time in synchronization operations. More elaborate hardware support is unlikely to have a significant benefit either. From the applications’ perspective, it is usually the waiting time due to load imbalance or serialization that dominates synchronization time, not the overhead of the synchronization operations themselves, even in apparently balanced cases where the overhead may be expected to be substantial.
Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
- Journal of Parallel and Distributed Computing
, 1996
Cited by 23 (7 self)
Atomic operations are a key primitive in parallel computing systems. The standard implementation mechanism for atomic operations uses mutual exclusion locks. In an object-based programming system the natural granularity is to give each object its own lock. Each operation can then make its execution atomic by acquiring and releasing the lock for the object that it accesses. But this fine lock granularity may have high synchronization overhead because it maximizes the number of executed acquire and release constructs. To achieve good performance it may be necessary to reduce the overhead by coarsening the granularity at which the computation locks objects. In this paper we describe a static analysis technique --- lock coarsening --- designed to automatically increase the lock granularity in object-based programs with atomic operations. We have implemented this technique in the context of a parallelizing compiler for irregular, object-based programs and used it to improve the generated pa...
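The effect of lock granularity described above can be seen in a small hand-written sketch. It is not the paper's static analysis, which a compiler applies automatically; here the coarsened version is simply written by hand for comparison. CountingLock, Account, deposit_fine and deposit_coarse are hypothetical names introduced for this example.

```python
import threading


class CountingLock:
    """A mutex that records how many times it has been acquired."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1  # updated while holding the lock
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


class Account:
    """An object with its own lock, as in the per-object granularity."""

    def __init__(self):
        self.lock = CountingLock()
        self.balance = 0


def deposit_fine(account, amounts):
    # Fine granularity: one acquire/release pair per atomic operation.
    for amount in amounts:
        with account.lock:
            account.balance += amount


def deposit_coarse(account, amounts):
    # Coarsened by hand: one acquire/release pair around the whole sequence.
    with account.lock:
        for amount in amounts:
            account.balance += amount
```

Both functions leave the same balance, but for n deposits the fine-grained version acquires the lock n times and the coarsened version once. The trade-off is that the coarsened version holds the lock longer, which can serialize other threads that need the same object.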
Integrating Non-blocking Synchronisation in Parallel Applications: Performance Advantages and Methodologies
- In Proceedings of the 3rd ACM Workshop on Software and Performance (WOSP'02)
, 2002
Cited by 18 (2 self)
In this paper we investigate how the performance and speedup of applications would be affected by using non-blocking rather than blocking synchronisation. The results obtained show that for many applications, non-blocking synchronisation leads to significant speedups for a fairly large number of processors, while it never slows the applications down. As part of this investigation, this paper also provides a set of efficient and simple translations that show how typical blocking operations found in parallel applications, such as simple locks, queues and lock trees, can be translated into non-blocking equivalents that use hardware primitives common in modern multiprocessor systems. With these translations, this paper clearly demonstrates that it is easy for the application designer/programmer to replace the commonly found blocking operations with non-blocking equivalents. For the empirical results, a set of representative applications running on a large-scale ccNUMA machine was used.
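The general shape of such a translation, from a lock-protected update to a compare-and-swap retry loop, can be sketched as follows. This is an illustration only and not one of the paper's translations. Python exposes no hardware compare-and-swap, so the AtomicInt class below emulates one with an internal lock; the emulation itself is therefore blocking, and only the structure of the retry loop carries over to real hardware primitives. All names are hypothetical.

```python
import threading


class AtomicInt:
    """Emulated atomic cell.

    Assumption: the internal lock stands in for a hardware compare-and-swap
    instruction, so this class is itself blocking.
    """

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: store `new` only if the cell still holds `expected`.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def locked_increment(counter, lock):
    # Blocking version: the lock protects the read-modify-write.
    with lock:
        counter[0] += 1


def cas_increment(cell):
    # Retry-loop version: read, compute, then try to publish the result.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old  # success: no other thread updated the cell in between


def run_threads(workers, per_worker, body):
    # Run `body` per_worker times on each of `workers` threads.
    def loop():
        for _ in range(per_worker):
            body()

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In the retry loop, a thread that is delayed between load and compare_and_swap simply fails its swap and tries again, rather than holding a lock that others must wait for.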
Non-Blocking Timeout in Scalable Queue-Based Spin Locks
- In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing
, 2002
Cited by 13 (4 self)
Queue-based spin locks allow programs with busy-wait synchronization to scale to very large multiprocessors, without fear of starvation or performance-destroying contention. Timeout-capable spin locks allow a thread to abandon its attempt to acquire a lock; they are used widely in real-time systems to avoid overshooting a deadline, and in database systems to recover from transaction deadlock and to tolerate preemption of the thread that holds a lock. In previous work we showed how to incorporate timeout in scalable queue-based locks. Technological trends suggest that this combination will be of increasing commercial importance. Our previous solutions, however, require a thread that is timing out to handshake with its neighbors in the queue, a requirement that may lead to indefinite delay in a preemptively multiprogrammed system. In the current paper we present new queue-based locks in which the timeout code is non-blocking. These locks sacrifice the constant worst-case space per thread of our previous algorithms, but allow us to bound the time that a thread may be delayed by preemption of its peers. We present empirical results indicating that space needs are modest in practice, and that performance scales well to large machines. We also argue that constant per-thread space cannot be guaranteed together with non-blocking timeout in a queue-based lock.
Scalable Concurrent Priority Queue Algorithms
- In Proceedings of the Eighteenth Annual ACM Symposium on Principles of Distributed Computing
, 1999
Cited by 11 (3 self)
This paper addresses the problem of designing bounded range priority queues, that is, queues that support a fixed range of priorities. Bounded range priority queues are fundamental in the design of modern multiprocessor algorithms -- from the application level to the lowest levels of the operating system kernel. While most of the available priority queue literature is directed at existing small-scale machines, we chose to evaluate algorithms on a broader concurrency scale using a simulated 256-node shared memory multiprocessor architecture similar to the MIT Alewife. Our empirical evidence suggests that the priority queue algorithms currently available in the literature do not scale. Based on these findings, we present two simple new algorithms, LinearFunnels and FunnelTree, that provide true scalability throughout the concurrency range. 1 Introduction Priority queues are a fundamental class of data structures used in the design of modern multiprocessor algorithms. Their uses r...
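What "bounded range" means can be shown with a minimal sequential sketch: because the set of priorities is fixed in advance, the queue can keep one bin per priority. This is not LinearFunnels or FunnelTree and has no concurrency control; BoundedRangePQ and its methods are hypothetical names for this example.

```python
from collections import deque


class BoundedRangePQ:
    """Sequential priority queue over the fixed priorities 0..num_priorities-1."""

    def __init__(self, num_priorities):
        # One FIFO bin per priority; the range is fixed at construction time.
        self.bins = [deque() for _ in range(num_priorities)]

    def insert(self, priority, item):
        if not 0 <= priority < len(self.bins):
            raise ValueError("priority outside the supported range")
        self.bins[priority].append(item)

    def delete_min(self):
        # Scan from the smallest priority and return the first item found.
        for items in self.bins:
            if items:
                return items.popleft()
        return None  # the queue is empty
```

Insertion touches only the bin for its own priority, and delete_min scans at most num_priorities bins, so the cost of each operation depends on the size of the priority range rather than on the number of queued items.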
The Architectural and Operating System Implications on the Performance of Synchronization on ccNUMA Multiprocessors
- International Journal of Parallel Programming
, 2001
Cited by 11 (0 self)
This paper investigates the performance of synchronization algorithms on ccNUMA multiprocessors, from the perspectives of the architecture and the operating system. In contrast with previous related studies that emphasized the relative performance of synchronization algorithms, this paper takes a new approach by analyzing the sources of synchronization latency on ccNUMA architectures and how this latency can be reduced by leveraging hardware and software schemes in both dedicated and multiprogrammed execution environments. From the architectural perspective, the paper identifies the implications of directory-based cache coherence on the latency and scalability of synchronization primitives and examines whether and how simple hardware that accelerates synchronization instructions can be leveraged to reduce synchronization latency. From the operating system’s perspective, the paper evaluates, in a unified framework, user-level, kernel-level and hybrid algorithms for implementing scalable synchronization in multiprogrammed execution environments. Along with addressing the aforementioned issues, the paper contributes a new methodology for implementing fast synchronization algorithms on ccNUMA multiprocessors. The relevant experiments are conducted on the SGI Origin2000, a popular commercial ccNUMA multiprocessor.

