MemSpy: Analyzing Memory System Bottlenecks in Programs
- In Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, 1992
Cited by 99 (9 self)
To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior: if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data-oriented, in addition to code-oriented, performance tuning. Thus, for both source level code objects and data objects, Mem...
Some Considerations About Passive Sharing in Shared-Memory Multiprocessors
- IEEE TCCA Newsletter, 1997
Cited by 12 (11 self)
In a multiprocessor system, process migration guarantees load balance between processors but causes passive sharing, since private data blocks of a process can become resident in multiple caches and generate useless coherence-related overhead. We proposed a selective invalidation strategy to eliminate these passive shared copies. The results of trace-driven simulation show that our strategy can be successful in a number of situations, such as the typical case of a general-purpose workstation.
A Framework for Multiprocessor Performance Characterization and Calibration
- 1992
Cited by 2 (2 self)
By Arun K. Nanda. In parallel programs using the shared-variable paradigm, run-time communication overhead manifests itself along three principal dimensions, namely, shared data accesses (including memory contention, cache misses and non-local memory access latencies), inter-process synchronization operations, and global barrier synchronizations. Measuring the rate at which communication costs for an algorithm increase as more processors are used is integral to the study of an algorithm's efficiency and scalability. In this thesis, we explore the problem of performance characterization of a multiprocessor in the context of the shared-variable programming model, with emphasis on characterizing the dynamic run-time behavior. We have developed a hierarchical model to characterize multiprocessor system performance using a multi-phase computation structure with concurrent asynchronous exec...
Analysis of sharing overhead in Shared Memory Multiprocessors
- IEEE Hawaii Int. Conf. on Systems, Kohala Coast, HI, January 1998
Cited by 2 (2 self)
coherence protocol can be made by considering the traffic induced by the two approaches under different kinds of sharing. The coherence overhead induced by a WU (write-update) protocol is due to all the operations needed to update the remote copies, whereas a WI (write-invalidate) protocol invalidates remote copies, and processors then generate a miss on access to the invalidated copy. Invalidation and block fetching (due to invalidation misses) contribute to the coherence overhead of WI protocols. Considering the costs of invalidation, block fetching and updating, the potential penalty for choosing WU incorrectly is much higher than the potential penalty for choosing WI incorrectly. Three different sources of sharing may be observed: i) active sharing, which occurs when the same cached data item is referenced by processes concurrently running on different processors; ii) false sharing [Torrellas94, Tomasevic96], which occurs when several processors reference separate data items belonging
Performance Analysis of Parallel Applications Running on SMP
In this work, by using dynamic analysis techniques, we analyze how a workload can be accelerated in the case of a shared-bus shared-memory multiprocessor. It is well known that, in this kind of system, the bus is the critical element that can limit the scalability of the machine. Nevertheless, many factors that influence bus utilization have not yet been investigated for this kind of workload, in particular the effects of thread migration. The operating system effects are also considered in our evaluation. We analyzed a basic four-processor and a high-end sixteen-processor machine, implementing three different coherence protocols (including MESI and another solution from the literature). We show that even in the four-processor case, the overhead induced by the sharing of private data as a consequence of process migration, namely passive sharing, cannot be neglected. Indeed, the analysis shows that a protocol based on a selective strategy for dealing with private and shared data performs better than protocols that either rely on the detection of migratory access patterns or use a pure Write-Invalidate strategy, like MESI. We varied the architectural parameters to show how passive sharing and other coherence overhead are influenced by different cache choices. Then, we considered the sixteen-processor case, where the effects on performance are more evident. We also conclude that performance can benefit from large caches and cache-affinity scheduling. However, even with affinity scheduling, a selective protocol delivers better performance.
Can High Bandwidth and Latency Justify Large Cache Blocks in Scalable Multiprocessors?
- Computer Science Department, University of Rochester, 1994
In this paper, we examine the relationship between these factors in the context of large-scale, network-based, cache-coherent, shared-memory multiprocessors.

