The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor
In Proceedings of Workshop on Scalable Shared Memory Multiprocessors, 1991
Cited by 138 (22 self)
The Alewife multiprocessor project focuses on the architecture and design of a large-scale parallel machine. The machine uses a low-dimensional direct interconnection network to provide scalable communication bandwidth, while allowing the exploitation of locality. Despite its distributed-memory architecture, Alewife allows efficient shared-memory programming through a multilayered approach to locality management. A new scalable cache-coherence scheme called LimitLESS directories allows the use of caches for reducing communication latency and network bandwidth requirements. Alewife also employs run-time and compile-time methods for partitioning and placement of data and processes to enhance communication locality. While the above methods attempt to minimize communication latency, communication with distant processors cannot be completely avoided. Alewife's processor, Sparcle, is designed to tolerate these latencies by rapidly switching between threads of computation. This paper describe...
Reactive Synchronization Algorithms for Multiprocessors
Cited by 49 (2 self)
Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice of protocols to use for implementing the synchronization operation. For example, candidate protocols for locks include test-and-set protocols and queueing protocols. Frequently, the best choice of protocols depends on the level of contention: previous research has shown that test-and-set protocols for locks outperform queueing protocols at low contention, while the opposite is true at high contention. This paper investigates reactive synchronization algorithms that dynamically choose protocols in response to the level of contention. We describe reactive algorithms for spin locks and fetch-and-op that choose among several shared-memory and message-passing protocols. Dynamically choosing protocols presents a challenge: a reactive algorithm needs to select and change protocols efficiently, and has to allow for the possibility that multiple processes may be executing different protocols at the same time. We describe the notion of consensus objects that the reactive algorithms use to preserve correctness in the face of dynamic protocol changes. Experimental measurements demonstrate that reactive algorithms perform close to the best static choice of protocols at all levels of contention. Furthermore, with mixed levels of contention, reactive algorithms outperform passive algorithms with fixed protocols, provided that contention levels do not change too frequently. Measurements of several parallel applications show that reactive algorithms result in modest performance gains for spin locks and significant gains for fetch-and-op.
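The idea of choosing a lock protocol from the observed contention level can be illustrated with a toy selector. This is a minimal sketch under stated assumptions, not the paper's algorithm: the class name `ReactiveSelector`, the thresholds, and the `patience` hysteresis rule are all hypothetical, and the consensus objects that the abstract says preserve correctness during a protocol change are not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class ReactiveSelector:
    """Toy protocol selector driven by contention samples.

    All names and thresholds are hypothetical, chosen only for illustration.
    """
    high_water: int = 4   # waiters at or above this suggest high contention
    low_water: int = 1    # waiters at or below this suggest low contention
    patience: int = 3     # consecutive agreeing samples needed before switching
    protocol: str = "test-and-set"
    _streak: int = 0

    def observe(self, waiters: int) -> str:
        """Record one contention sample and return the protocol now in use."""
        if self.protocol == "test-and-set":
            wants_switch = waiters >= self.high_water
        else:
            wants_switch = waiters <= self.low_water
        # Hysteresis: a single spike or lull resets nothing but the streak.
        self._streak = self._streak + 1 if wants_switch else 0
        if self._streak >= self.patience:
            if self.protocol == "test-and-set":
                self.protocol = "queue"
            else:
                self.protocol = "test-and-set"
            self._streak = 0
        return self.protocol


def run(samples):
    """Feed a sequence of waiter counts to a fresh selector."""
    selector = ReactiveSelector()
    return [selector.observe(w) for w in samples]
```

The hysteresis reflects the abstract's caveat that reactive algorithms pay off only when contention levels do not change too frequently: a selector that switched on every sample would spend its time changing protocols.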
Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems
ACM Transactions on Computer Systems, 1998
Cited by 44 (2 self)
In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing naturally-occurring local events and their corresponding implicit information, i.e., information available outside of a defined interface. Many systems, particularly in distributed and networked environments, have leveraged implicit control to simplify the implementation of services with autonomous components. To concretely demonstrate the advantages of implicit control, we propose and implement implicit coscheduling, an algorithm for dynamically coordinating the time...
Register Relocation: Flexible Contexts for Multithreading
In 20th Annual International Symposium on Computer Architecture, 1993
Cited by 42 (1 self)
Multithreading is an important technique that improves processor utilization by allowing computation to be overlapped with the long latency operations that commonly occur in multiprocessor systems. This paper presents register relocation, a new mechanism that efficiently supports flexible partitioning of the register file into variable-size contexts with minimal hardware support. Since the number of registers required by thread contexts varies, this flexibility permits a better utilization of scarce registers, allowing more contexts to be resident, which in turn allows applications to tolerate shorter run lengths and longer latencies. Our experiments show that compared to fixed-size hardware contexts, register relocation can improve processor utilization by a factor of two for many workloads.
1 Introduction
Multithreading is an important technique for tolerating latency in multiprocessor systems [3, 7, 19, 21]. Support for multiple contexts and rapid context switching permits high lat...
Scheduler-Conscious Synchronization
ACM Transactions on Computer Systems, 1994
Cited by 35 (7 self)
Efficient synchronization is important for achieving good performance in parallel programs, especially on large-scale multiprocessors. Most synchronization algorithms have been designed to run on a dedicated machine, with one application process per processor, and can suffer serious performance degradation in the presence of multiprogramming. Problems arise when running processes block or, worse, busy-wait for action on the part of a process that the scheduler has chosen not to run. In this paper we describe and evaluate a set of scheduler-conscious synchronization algorithms that perform well in the presence of multiprogramming while maintaining good performance on dedicated machines. We consider both large and small machines, with a particular focus on scalability, and examine mutual-exclusion locks, reader-writer locks, and barriers. The algorithms we study fall into two classes: those that heuristically determine appropriate behavior and those that use scheduler information to guid...
A scalable lock-free stack algorithm
In SPAA’04: Symposium on Parallelism in Algorithms and Architectures, 2004
Cited by 30 (5 self)
The literature describes two high-performance concurrent stack algorithms based on combining funnels and elimination trees. Unfortunately, the funnels are linearizable but blocking, and the elimination trees are non-blocking but not linearizable. Neither is used in practice since they perform well only at exceptionally high loads. The literature also describes a simple lock-free linearizable stack algorithm that works at low loads but does not scale as the load increases. The question of designing a stack algorithm that is non-blocking, linearizable, and scales well throughout the concurrency range has thus remained open. This paper presents such a concurrent stack algorithm. It is based on the following simple observation: that a single elimination array used as a backoff scheme for a simple lock-free stack is lock-free, linearizable, and scalable. As our empirical results show, the resulting elimination-backoff stack performs as well as the simple stack at low loads, and increasingly outperforms all other methods (lock-based and non-blocking) as concurrency increases. We believe its simplicity and scalability make it a viable practical alternative to existing constructions for implementing concurrent stacks.
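The "simple lock-free stack" that the abstract uses as its starting point is commonly written as a compare-and-swap retry loop on the top pointer. The sketch below shows only that loop structure and is not the paper's code. Python exposes no hardware compare-and-swap, so the hypothetical `AtomicRef` class emulates one with a lock, which means this version is not actually lock-free. The names `AtomicRef`, `Node`, and `SimpleStack` are assumptions made for illustration.

```python
import threading


class AtomicRef:
    """Emulated atomic reference (assumption: a lock stands in for hardware CAS)."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def get(self):
        with self._guard:
            return self._value

    def compare_and_set(self, expected, new):
        """Install `new` only if the current value is still `expected`."""
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class Node:
    __slots__ = ("value", "next")

    def __init__(self, value, next_node):
        self.value = value
        self.next = next_node


class SimpleStack:
    """CAS retry-loop stack; pop returns None when the stack is empty."""

    def __init__(self):
        self._top = AtomicRef(None)

    def push(self, value):
        while True:
            old_top = self._top.get()
            if self._top.compare_and_set(old_top, Node(value, old_top)):
                return
            # CAS failed: another thread changed the top, so retry.
            # An elimination-backoff design would use this moment to try
            # pairing the push with a concurrent pop instead of retrying.

    def pop(self):
        while True:
            old_top = self._top.get()
            if old_top is None:
                return None
            if self._top.compare_and_set(old_top, old_top.next):
                return old_top.value
```

Every operation contends on the single top reference, which is why this design stops scaling as load grows. The failed-CAS branch marks the point where the abstract's elimination array would act as the backoff scheme.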
The Performance Implications of Locality Information Usage in Shared-Memory . . .
1996
Cited by 21 (8 self)
This paper examines the performance implications of locality information usage in thread scheduling algorithms for scalable shared-memory multiprocessors. A prototype implementation shows that a locality-conscious scheduler outperforms approaches ignoring locality information.
1 Introduction
Cache-coherent multiprocessors with non-uniform memory access (NUMA architectures) have become quite attractive as compute servers for parallel applications in the field of scientific computing. They combine scalability and the shared-memory programming model, relieving the application designer of data distribution and coherency maintenance. But locality of reference, load balancing and scheduling are still of crucial importance. One goal of software development is a high degree of locality of reference from the system up to the application level. Even if application designers develop code with high lo
Time/Contention Trade-offs for Multiprocessor Synchronization
Information and Computation, 1996
Cited by 20 (6 self)
We establish trade-offs between time complexity and write- and access-contention for solutions to the mutual exclusion problem. The write-contention (access-contention) of a concurrent program is the number of processes that may be simultaneously enabled to write (access by reading and/or writing) the same shared variable. Our notion of time complexity distinguishes between local and remote accesses of shared memory. We show that, for any N-process mutual exclusion algorithm, if write-contention is w, and if at most v remote variables can be accessed by a single atomic operation, then there exists an execution involving only one process in which that process executes $\Omega(\log_{vw} N)$ remote operations for entry into its critical section. We further show that, among these operations, $\Omega(\sqrt{\log_{vw} N})$ distinct remote variables are accessed. For algorithms with access-contention c, we show that the latter bound can be improved to $\Omega(\log_{vc} N)$. The last two of thes...
The Named-State Register File: Implementation and Performance
1995
Cited by 14 (1 self)
Context switches are slow in conventional processors because the entire processor state must be saved and restored, even if much of the state is not used before the next context switch. This paper introduces the Named-State Register File, a fine-grain associative register file. The NSF uses hardware and software techniques to efficiently manage registers among sequential or parallel procedure activations. The NSF holds more live data per register than conventional register files, and requires much less spill and reload traffic to switch between concurrent contexts. The NSF speeds execution of some sequential and parallel programs by 9% to 17% over alternative register file organizations. The NSF has access time comparable to a conventional register file and only adds 5% to the area of a typical processor chip.
Keywords: multithreaded, processor, register, context switch.
NOTE: This is a draft copy of a paper that has been submitted for publication. Please do not reference or redistrib...

