False Sharing and its Effect on Shared Memory Performance
- In Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), 1993
Cited by 50 (3 self)
False sharing occurs when processors in a shared-memory parallel system make references to different data objects within the same coherence block (cache line or page), thereby inducing "unnecessary" coherence operations. False sharing is widely believed to be a serious problem for parallel program performance, but a precise definition and quantification of the problem has proven to be elusive. We explain why. In the process, we present a variety of possible definitions for false sharing, and discuss the merits and drawbacks of each. Our discussion is based on experience gained during a four-year study of multiprocessor memory architecture and its effect on the behavior of applications in a sixteen-program suite. Using...
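As a rough illustration of the phenomenon described in this abstract, the sketch below flags coherence blocks that look falsely shared in a small access trace. It uses one deliberately simple definition, which is an assumption and not one of the paper's definitions: a block counts as falsely shared when two or more processors reference it, at least one reference is a write, and no individual address in it is referenced by more than one processor. The function name, the trace format, and the 64-byte default block size are all hypothetical.

```python
from collections import defaultdict


def falsely_shared_blocks(accesses, block_size=64):
    """Return block numbers that look falsely shared (simplified definition).

    accesses: iterable of (processor_id, byte_address, is_write) tuples.
    Each distinct address is treated as a separate data object (assumption).
    """
    procs_by_block = defaultdict(set)  # block number -> processors touching it
    procs_by_addr = defaultdict(set)   # address -> processors touching it
    written_blocks = set()             # blocks with at least one write

    for proc, addr, is_write in accesses:
        block = addr // block_size
        procs_by_block[block].add(proc)
        procs_by_addr[addr].add(proc)
        if is_write:
            written_blocks.add(block)

    result = set()
    for block, procs in procs_by_block.items():
        # Blocks used by one processor, or never written, cause no coherence traffic.
        if len(procs) < 2 or block not in written_blocks:
            continue
        addrs = [a for a in procs_by_addr if a // block_size == block]
        # If any single address is used by several processors, the sharing is real.
        if all(len(procs_by_addr[a]) == 1 for a in addrs):
            result.add(block)
    return result


# Processors 0 and 1 write different addresses (0 and 8) in the same 64-byte block,
# but both touch address 64, which is true sharing.
trace = [(0, 0, True), (1, 8, True), (0, 64, True), (1, 64, False), (0, 128, True)]
print(falsely_shared_blocks(trace))
```

Rerunning the same trace with `block_size=8` puts addresses 0 and 8 in different blocks, so no block is flagged. That shows how the result depends on the coherence granularity.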
An Analysis of Degenerate Sharing and False Coherence
- Journal of Parallel and Distributed Computing, 1996
Cited by 15 (3 self)
False sharing reduces system performance in distributed shared memory systems. A major impediment to solving the problem of false sharing has been that no satisfactory definition for this problem exists. In this paper we provide definitions for several types of degenerate sharing, including false sharing. We also provide an algorithm that computes the cost of unnecessary coherence (false coherence) in a shared memory system using a single memory trace. Finally, we provide a counterintuitive example demonstrating that the elimination of degenerate sharing can reduce performance.
Region-Oriented Main Memory Management in Shared-Memory NUMA Multiprocessors
- 1992
Cited by 9 (1 self)
The need to achieve higher performance through greater degrees of parallelism necessitates distributing the memory throughout a multiprocessor system to reduce contention and increase scalability. Unfortunately, such Non-Uniform Memory Access time (NUMA) multiprocessors introduce complications for the programmers, who must now be concerned with the physical distribution of their data in order to extract good performance from the system. The impact of remote memory accesses can be reduced through replication and migration, either in processor caches or in main memory. Unfortunately, the effectiveness of caches is limited for large data sets due to capacity misses, while dynamic virtual memory page management suffers from a mismatch between the pages being replicated and the data structures in programs. In this thesis we propose that data be partitioned into Shared Regions reflecting the granularity of data sharing in programs, and that special synchronization calls be added to enforce...
Hot Spot Analysis in Large Scale Shared Memory Multiprocessors
- 1993
Cited by 5 (2 self)
Scalable multiprocessors that support a shared-memory image to application programmers are typically based on physical memory modules that are distributed. Consequently, the access times for a particular processor to various parts of physical memory differ. In this paper, we explore the implications of this non-uniformity in memory access times. In particular, we study the effect of hot-spots in hierarchical large scale NUMA multiprocessors. Hot-spot analysis is of interest because coordinated threads of parallel programs lead to hot spots whose impact on performance may be substantial or even dominant. We have developed an analytical model of access latencies and contention for shared resources in the interconnection network that links the processors and memory modules. Our objective is to provide a better understanding of non-uniform memory access times in scalable architectures. We show the extent to which a variable can be shared before it becomes a performance bottleneck, and asse...
False Sharing and its Effect on Shared Memory
- In Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), 1993
Cited by 2 (0 self)
False sharing occurs when processors in a shared-memory parallel system make references to different data objects within the same coherence block (cache line or page), thereby inducing "unnecessary" coherence operations. False sharing is widely believed to be a serious problem for parallel program performance, but a precise definition and quantification of the problem has proven to be elusive. We explain why. In the process, we present a variety of possible definitions for false sharing, and discuss the merits and drawbacks of each. Our discussion is based on experience gained during a four-year study of multiprocessor memory architecture and its effect on the behavior of applications in a sixteen-program suite.
Feedback-Directed Page Placement for ccNUMA via Hardware-generated Memory Traces
Cited by 1 (0 self)
Non-uniform memory architectures with cache coherence (ccNUMA) are becoming increasingly common, not just for large-scale high-performance platforms but also in the context of multi-core architectures. Under ccNUMA, data placement may influence overall application performance significantly, as references resolved locally to a processor/core impose lower latencies than remote ones. This work develops a novel hardware-assisted page placement paradigm based on automated tracing of the memory references made by application threads. Two placement schemes, modeling both single-level and multi-level latencies, allocate pages near the processors that access them most frequently. These schemes leverage the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. The method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation. Experiments show that this method, although based on lossy tracing, can efficiently and effectively improve page placement, leading to an average wall-clock execution time saving of over 20% for the tested benchmarks on the SGI Altix with a 2x remote access penalty and 12% on AMD Opterons with a 1.3-2.0x access penalty. This is accompanied by a one-time tracing overhead of 2.7% over the overall original program wall-clock time.
Index Terms: Hardware performance monitoring, NUMA, trace-guided optimization, page placement
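The page-affinity decision described in this abstract can be sketched in a few lines. This is a minimal illustration of the single-level idea only (bind each page to the node that accesses it most often), not the authors' implementation: the function name, the sample format, and the 4096-byte page size are assumptions, and the step that actually binds a page to a node is not shown.

```python
from collections import Counter, defaultdict


def page_affinity(samples, page_size=4096):
    """Map each page number to the node that accessed it most often.

    samples: iterable of (node_id, byte_address) pairs, e.g. an approximate
    (lossy) trace of memory accesses. Ties go to the node seen first.
    """
    counts = defaultdict(Counter)  # page number -> per-node access counts
    for node, addr in samples:
        counts[addr // page_size][node] += 1
    # most_common(1) yields the (node, count) pair with the highest count.
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}


# Node 1 touches page 0 twice and node 0 once; only node 0 touches page 1.
samples = [(0, 100), (1, 200), (1, 300), (0, 5000)]
print(page_affinity(samples))
```

A multi-level scheme would weight each access by the latency between the accessing node and the candidate node instead of counting accesses equally. That weighting is left out here because the abstract does not give the latency model.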

