Results 1 - 10
of
64
Effects of communication latency, overhead, and bandwidth in a cluster architecture
- In Proceedings of the 24th Annual International Symposium on Computer Architecture
, 1997
"... This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on ..."
Abstract
-
Cited by 98 (5 self)
- Add to MetaCart
This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines results in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 s. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence to both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance. 1
Cooperative caching for chip multiprocessors
- In Proceedings of the 33nd Annual International Symposium on Computer Architecture
, 2006
"... Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these ch ..."
Abstract
-
Cited by 87 (1 self)
- Add to MetaCart
Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these challenge, we propose CMP cooperative caching (CC), a unified framework to efficiently organize and manage on-chip cache resources. By forming a globally managed, shared cache using cooperative private caches. CC can effectively support two important caching applications: (1) reduction of average memory access latency and (2) isolation of destructive inter-thread interference. CC reduces the average memory access latency by balancing between cache latency and capacity opti-mizations. Based private caches, CC naturally exploits their access latency benefits. To improve the effective cache capacity, CC forms a “shared ” cache using replication control and LRU-based global replacement policies. Via cooperation throttling, CC provides a spectrum of caching behaviors between the two extremes of private and shared caches, thus enabling dynamic adaptation to suit workload requirements. We show that CC can achieve a robust performance advantage over private and shared cache schemes across different processor, cache and memory configurations, and a wide selection of multithreaded and multiprogrammed
Memory sharing predictor: The key to a speculative coherent DSM
- In Proceedings of the 26th annual international symposium on Computer architecture
, 1999
"... Recent research advocates using general message predictors to learn and predict the coherence activity in distributed shared memory (DSM). By accurately predicting a message and timely invoking the necessary coherence actions, a DSM can hide much of the remote access latency. This paper proposes the ..."
Abstract
-
Cited by 57 (6 self)
- Add to MetaCart
Recent research advocates using general message predictors to learn and predict the coherence activity in distributed shared memory (DSM). By accurately predicting a message and timely invoking the necessary coherence actions, a DSM can hide much of the remote access latency. This paper proposes the Memory Sharing Predictors (MSPs), pattern-based predictors that significantly improve prediction accuracy and implementation cost over general message predictors. An MSP is based on the key observation that to hide the remote access latency, a predictor must accurately predict only the remote memory accesses (i.e., request messages) and not the subsequent coherence messages invoked by an access. Simulation results indicate that MSPs improve prediction accuracy over general message predictors from 81 % to 93 % while requiring less storage overhead. This paper also presents the first design and evaluation for a speculative coherent DSM using pattern-based predictors. We identify simple techniques and mechanisms to trigger prediction timely and perform speculation for remote read accesses. Our speculation hardware readily works with a conventional full-map write-invalidate coherence protocol without any modifications. Simulation results indicate that performing speculative read requests alone reduces execution times by 12 % in our shared-memory applications. 1
Using Prediction to Accelerate Coherence Protocols
, 1998
"... Most large shared-memory multiprocessors use directory protocols to keep per-processor caches coherent. Some memory references in such systems, however, suffer long latencies for misses to remotely cached blocks. To ameliorate this latency, researchers have augmented standard coherence protocols wit ..."
Abstract
-
Cited by 52 (4 self)
- Add to MetaCart
Most large shared-memory multiprocessors use directory protocols to keep per-processor caches coherent. Some memory references in such systems, however, suffer long latencies for misses to remotely cached blocks. To ameliorate this latency, researchers have augmented standard coherence protocols with optimizations for specific sharing patterns, such as read-modify-write, producer-consumer, and migratory sharing. This paper seeks to replace these directed solutions with general prediction logic that monitors coherence activity and triggers appropriate coherence actions. This paper takes the first step toward using general prediction to accelerate coherence protocols by developing and evaluating the Cosmos coherence message predictor. Cosmos predicts the source and type of the next coherence message for a cache block using logic that is an extension of Yeh and Patt's two-level PAp branch predictor. For five scientific applications running on 16 processors, Cosmos has prediction accuracie...
Supporting nested transactional memory in LogTM
- In ASPLOS-XII. 2006
, 2006
"... Nested transactional memory (TM) facilitates software composition by letting one module invoke another without either knowing whether the other uses transactions. Closed nested transactions extend isolation of an inner transaction until the toplevel transaction commits. Implementations may flatten n ..."
Abstract
-
Cited by 50 (9 self)
- Add to MetaCart
Nested transactional memory (TM) facilitates software composition by letting one module invoke another without either knowing whether the other uses transactions. Closed nested transactions extend isolation of an inner transaction until the toplevel transaction commits. Implementations may flatten nested transactions into the top-level one, resulting in a complete abort on conflict, or allow partial abort of inner transactions. Open nested transactions allow a committing inner transaction to immediately release isolation, which increases parallelism and expressiveness at the cost of both software and hardware complexity. This paper extends the recently-proposed flat Log-based Transactional Memory (LogTM) with nested transactions. Flat LogTM saves pre-transaction values in a log, detects conflicts with read (R) and write (W) bits per cache block, and, on abort, invokes a software handler to unroll the log. Nested LogTM supports nesting by segmenting the log into a stack of activation records and modestly replicating R/W bits. To facilitate composition with nontransactional code, such as language runtime and operating system services, we propose escape actions that allow trusted code to run outside the confines of the transactional memory system.
Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in . . .
- In Proceedings of the 30th Annual International Symposium on Computer Architecture
, 2003
"... Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory multiprocessors. The destination set is the collection of processors that receive a particular coherence request. Snooping protocols send requests to the maximal destination set (i.e., all processors) , reducing l ..."
Abstract
-
Cited by 48 (7 self)
- Add to MetaCart
Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory multiprocessors. The destination set is the collection of processors that receive a particular coherence request. Snooping protocols send requests to the maximal destination set (i.e., all processors) , reducing latency for cache-to-cache misses at the expense of increased traffic. Directory protocols send requests to the minimal destination set, reducing bandwidth at the expense of an indirection through the directory for cache-to-cache misses. Recently proposed hybrid protocols trade-off latency and bandwidth by directly sending requests to a predicted destination set.
Coherence Decoupling: Making Use of Incoherence
, 2004
"... This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cac ..."
Abstract
-
Cited by 37 (2 self)
- Add to MetaCart
This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce --- if not eliminate --- the e#ects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness.
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
- In Proc. of the 32nd Annual International Symposium on Computer Architecture
, 2005
"... To maintain coherence in conventional shared-memory multiprocessor systems, processors first check other processors’ caches before obtaining data from memory. This coherence checking adds latency to memory requests and leads to large amounts of interconnect traffic in broadcastbased systems. Our res ..."
Abstract
-
Cited by 29 (5 self)
- Add to MetaCart
To maintain coherence in conventional shared-memory multiprocessor systems, processors first check other processors’ caches before obtaining data from memory. This coherence checking adds latency to memory requests and leads to large amounts of interconnect traffic in broadcastbased systems. Our results for a set of commercial, scientific and multiprogrammed workloads show that on average 67 % (and up to 94%) of broadcasts are unnecessary. Coarse-Grain Coherence Tracking is a new technique that supplements a conventional coherence mechanism and optimizes the performance of coherence enforcement. The Coarse-Grain Coherence mechanism monitors the coherence status of large regions of memory, and uses that information to avoid unnecessary broadcasts. Coarse-Grain Coherence Tracking is shown to eliminate 55-97% of the unnecessary broadcasts, and improve performance by 8.8 % on average (and up to 21.7%). 1.
Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors
, 1997
"... Shared-memory multiprocessors are becoming increasingly popular as a highperformance, easy to program, and relatively inexpensive choice for parallel computation. However, the performance of shared-memory multiprocessors is limited by memory latency. Memory latencies are higher in multiprocessors du ..."
Abstract
-
Cited by 27 (5 self)
- Add to MetaCart
Shared-memory multiprocessors are becoming increasingly popular as a highperformance, easy to program, and relatively inexpensive choice for parallel computation. However, the performance of shared-memory multiprocessors is limited by memory latency. Memory latencies are higher in multiprocessors due to physical constraints and cache coherence overheads. In addition, synchronization operations, which are necessary to ensure correctness in parallel programs, add further communication overhead in shared-memory multiprocessors. Software-controlled non-binding data prefetching is a widely used consumerinitiated mechanism to hide communication latency and is currently supported on most architectures. However, on an invalidation-based cache-coherent multiprocessor, prefetching is inapplicable or insufficient for some communication patterns such as irregular communication, fine-grain pipelined loops, and synchronization. For these cases, a combination of two fine-grain, producer-initiated pr...
Interconnect-Aware Coherence Protocols for Chip Multiprocessors
- In Proceedings of ISCA-33
, 2006
"... Improvements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip Multi-Processors (CMP) are an attractive choice for future billion transistor architectures due to their low design complexity, high clock frequency, and high throughput. In a typ ..."
Abstract
-
Cited by 26 (6 self)
- Add to MetaCart
Improvements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip Multi-Processors (CMP) are an attractive choice for future billion transistor architectures due to their low design complexity, high clock frequency, and high throughput. In a typical CMP architecture, the L2 cache is shared by multiple cores and data coherence is maintained among private L1s. Coherence operations entail frequent communication over global on-chip wires. In future technologies, communication between different L1s will have a significant impact on overall processor performance and power consumption. On-chip wires can be designed to have different latency, bandwidth, and energy properties. Likewise, coherence protocol messages have different latency and bandwidth needs. We propose an interconnect composed of wires with varying latency, bandwidth, and energy characteristics, and advocate intelligently mapping coherence operations to the appropriate wires. In this paper, we present a comprehensive list of techniques that allow coherence protocols to exploit a heterogeneous interconnect and evaluate a subset of these techniques to show their performance and power-efficiency potential. Most of the proposed techniques can be implemented with a minimum complexity overhead. 1.

