Results 1 - 10
of
55
Wait-Free Synchronization
- ACM Transactions on Programming Languages and Systems
, 1993
"... A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. The problem of constructing a wait-free implementation of one data object from another lie ..."
Abstract
-
Cited by 660 (26 self)
- Add to MetaCart
A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. The problem of constructing a wait-free implementation of one data object from another lies at the heart of much recent work in concurrent algorithms, concurrent data structures, and multiprocessor architectures. In the first part of this paper, we introduce a simple and general technique, based on reduction to a consensus protocol, for proving statements of the form "there is no wait-free implementation of X by Y ." We derive a hierarchy of objects such that no object at one level has a wait-free implementation in terms of objects at lower levels. In particular, we show that atomic read/write registers, which have been the focus of much recent attention, are at the bottom of the hierarchy: they cannot be used to construct wait-free implementations of many simple and familiar da...
Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors
- In Proceedings of the 17th Annual International Symposium on Computer Architecture
, 1990
"... Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the f ..."
Abstract
-
Cited by 628 (17 self)
- Add to MetaCart
Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the fast processors. Unless carefully controlled, such architectural optimizations can cause memory accesses to be executed in an order different from what the programmer expects. The set of allowable memory access orderings forms the memory consistency model or event ordering model for an architecture.
Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors
"... The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, al ..."
Abstract
-
Cited by 147 (9 self)
- Add to MetaCart
The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, allowing very limited buffering, to release consistency on the other end, allowing extensive buffering and pipelining. The processor consistency and weak consistency models fall in between. The advantage of the less strict models is increased performance potential. The disadvantage is increased hardware complexity and a more complex programming model. To make an informed decision on the above tradeoff requires performance data for the various models. This paper addresses the issue of performance benefits from the above four consistency models. Our results are based on simulation studies done for three applications. The results show that in an environment where processor reads are blocking and writes are buffered, a significant performance increase is achieved from allowing reads to bypass previous writes. Pipelining of writes, which determines the rate at which writes are retired from the write buffer, is of secondary importance. As a result, we show that the sequential consistency model performs poorly relative to all other models, while the processor consistency model provides most of the benefits of the weak and release consistency models.
False Sharing and Spatial Locality in Multiprocessor Caches
- IEEE Transactions on Computers
, 1992
"... The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache mi ..."
Abstract
-
Cited by 115 (4 self)
- Add to MetaCart
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in this paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We...
Comparative Evaluation of Latency Reducing and Tolerating Techniques
- In Proceedings of the 18th Annual International Symposium on Computer Architecture
, 1991
"... Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization in large-scale shared-memory multiprocessors. In this paper, we consider four architectural techniques that address the latency problem: (i) hardware coherent caches, (ii) relaxe ..."
Abstract
-
Cited by 103 (6 self)
- Add to MetaCart
Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization in large-scale shared-memory multiprocessors. In this paper, we consider four architectural techniques that address the latency problem: (i) hardware coherent caches, (ii) relaxed memory consistency, (iii) softwarecontrolled prefetching, and (iv) multiple-context support. While some studies of benefits of the individual techniques have been done, no study evaluates all of the techniques within a consistent framework. This paper attempts to remedy this by providing a comprehensive evaluation of the benefits of the four techniques, both individually and in combinations, using a consistent set of architectural assumptions. The results in this paper have been obtained using detailed simulations of a large-scale shared-memory multiprocessor. Our results show that caches and relaxed consistency uniformly improve performance. The improvements due to prefetching and multiple contexts are sizeable, but are much more applicationdependent. Combinations of the various techniques generally attain better performance than each one on its own. Overall, we show that using suitable combinations of the techniques, performance can be improved by 4 to 7 times.
The M-Machine Multicomputer
, 1995
"... The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M-Machine computing nodes are con- nected with a 3-D mesh network; each node is a multithreaded pr ..."
Abstract
-
Cited by 100 (13 self)
- Add to MetaCart
The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M-Machine computing nodes are con- nected with a 3-D mesh network; each node is a multithreaded processor incorporating 12 function units, on-chip cache, and local memory. The multiple function units are used to exploit both instruction-level and thread-level parallelism. A user accessible message passing system yields fast communication and synchronization between nodes. RapM access to remote memory is provided transparently to the user with a combination of hardware and software mechanisms. This paper presents the architecture of the M-Machine and describes how its mechanisms attempt to maximize both single thread performance and overall system throughput. The architecture is complete and the MAP chip, which will serve as the M-Machine processing node, is currently being implemented.
Exploring interconnections in multi-core architectures
, 2005
"... This paper examines the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor, attempting to present a comprehensive view of a class of interconnect architectures. It shows that the design choices for the interconnect have significant effect on the rest o ..."
Abstract
-
Cited by 73 (4 self)
- Add to MetaCart
This paper examines the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor, attempting to present a comprehensive view of a class of interconnect architectures. It shows that the design choices for the interconnect have significant effect on the rest of the chip, potentially consuming a significant fraction of the real estate and power budget. This research shows that designs that treat interconnect as an entity that can be independently architected and optimized would not arrive at the best multicore design. Several examples are presented showing the need for careful co-design. For instance, increasing interconnect bandwidth requires area that then constrains the number of cores or cache sizes, and does not necessarily increase performance. Also, shared level-2 caches become significantly less attractive when the overhead of the resulting crossbar is accounted for. A hierarchical bus structure is examined which negates some of the performance costs of the assumed baseline architecture. 1
The Impact of Synchronization and Granularity on Parallel Systems
- In Int'l. Symp. on Computer Architecture
, 1990
"... In this paper, we study the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. We find that even though there can be a lot of parallelism at the fine grain level, synchronization and scheduling strategies determine the ult ..."
Abstract
-
Cited by 40 (4 self)
- Add to MetaCart
In this paper, we study the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. We find that even though there can be a lot of parallelism at the fine grain level, synchronization and scheduling strategies determine the ultimate performance of the system. Loop-iteration level parallelism seems to be a more appropriate level when those factors are considered. We also study barrier synchronization and data synchronization at the loopiteration level and found both schemes are needed for a better performance.
Randomized Algorithms For Multiprocessor Page Migration
- SIAM Journal on Computing
"... . The page migration problem is to manage a globally addressed shared memory in a multiprocessor system. Each physical page of memory is located at a given processor, and memory references to that page by other processors incur a cost proportional to the network distance. At times the page may migra ..."
Abstract
-
Cited by 27 (2 self)
- Add to MetaCart
. The page migration problem is to manage a globally addressed shared memory in a multiprocessor system. Each physical page of memory is located at a given processor, and memory references to that page by other processors incur a cost proportional to the network distance. At times the page may migrate between processors at cost proportional to the distance times D, a page size factor. The problem is to schedule movements on-line so that the total cost of memory references is within a constant factor c of the best off-line schedule. An algorithm that does so is called c-competitive. Black and Sleator gave 3-competitive deterministic on-line algorithms for uniform networks (complete graphs with unit edge lengths) and for trees with arbitrary edge lengths. No good deterministic algorithm is known for general networks with arbitrary edge lengths. We present randomized algorithms for the migration problem that are both simple and better than 3-competitiveagainst an oblivious adversary. We ...
The wakeup problem
- SIAM Journal on Computing
, 1996
"... We study a new problem, the wakeup problem, that seems to be fundamental in distributed computing. We present efficient solutions to the problem and show how these solutions can be used to solve the consensus problem, the leader election problem, and other related problems. The main question we try ..."
Abstract
-
Cited by 22 (5 self)
- Add to MetaCart
We study a new problem, the wakeup problem, that seems to be fundamental in distributed computing. We present efficient solutions to the problem and show how these solutions can be used to solve the consensus problem, the leader election problem, and other related problems. The main question we try to answer is, how much memory is needed to solve the wakeup problem? We assume a model that captures important properties of real systems that have been largely ignored by previous work on cooperative problems.

