The detection and elimination of useless misses in multiprocessors
In Proceedings of the 20th International Symposium on Computer Architecture, 1993
Cited by 64 (3 self)
In this paper we introduce a classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting program execution. Based on the new classification we evaluate miss reduction techniques in hardware, based on delaying and combining invalidations. We compare the effectiveness of five different protocols for combining invalidations leading to useless misses for cache-based multiprocessors and for multiprocessors with virtual shared memory. In cache-based systems these techniques are very effective and lead to miss rates which are close to the minimum. In virtual shared memory systems, the techniques are also effective but leave room for additional improvements.
Assisted Execution
1998
Cited by 58 (1 self)
We introduce a new execution paradigm called assisted execution. In this model, a set of auxiliary "assistant" threads, called nanothreads, is attached to each thread of an application. Nanothreads are very lightweight threads which run on the same processor as the main (application) thread and help execute the main thread as fast as possible. Nanothreads exploit resources that are idled in the processor because of dependencies and memory access delays. Assisted execution has the potential to alter the current trade-offs between static and dynamic execution mechanisms. Nanothreads can monitor and reconfigure the underlying hardware, can emulate hardware and can profile applications with little or no interference to improve the program on-line or off-line. We demonstrate the power of assisted execution with an important application, namely data prefetching to fight the memory wall problem. Simulation results on several SPEC95 benchmarks show that sequential and stride prefet...
Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Cited by 48 (2 self)
We study the efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes to reduce read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of six of the applications we study, we find that sequential prefetching does as well as and in some cases even better than stride prefetching for five applications. This is because (i) most strides are shorter than the block size (we assume 32-byte blocks), which means that sequential prefetching is as effective for these stride accesses, and (ii) sequential prefetching also exploits the locality of read misses with non-stride accesses. However, since stride prefetching in general results in fewer useless prefetches, it offers the extra advantage of consuming less memory-system bandwidth. Corresponding author: Fredrik Dahlgren. Keywords: Hardware-Controlled Prefetching, Latency Tolerance, Performance Evaluation, Relaxed Memory Consiste...
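The block-size argument in this abstract can be illustrated with a small toy model. The sketch below is not the simulator or the prefetching hardware evaluated in the paper: it assumes an infinitely large cache, zero prefetch latency, a prefetch issued on every read, and a stride that is known in advance. All function names are hypothetical; only the 32-byte block size comes from the abstract.

```python
BLOCK_SIZE = 32  # bytes; the block size assumed in the abstract


def block_of(addr):
    """Map a byte address to its cache block number."""
    return addr // BLOCK_SIZE


def simulate(addresses, prefetch):
    """Return (read misses, useless prefetches) for a read trace.

    Assumptions of this toy model: infinite cache, prefetched blocks
    arrive instantly, and prefetch(addr) is consulted on every read.
    """
    cached, prefetched, used = set(), set(), set()
    misses = 0
    for addr in addresses:
        blk = block_of(addr)
        if blk not in cached:
            misses += 1
            cached.add(blk)
        used.add(blk)
        for p in prefetch(addr):
            if p not in cached:
                cached.add(p)
                prefetched.add(p)
    # A prefetch is useless if the block it brought in is never read.
    return misses, len(prefetched - used)


def sequential(addr):
    """Sequential prefetching: fetch the next consecutive block."""
    return [block_of(addr) + 1]


def make_stride(stride):
    """Stride prefetching with a stride assumed to be already detected."""
    def stride_prefetch(addr):
        return [block_of(addr + stride)]
    return stride_prefetch


def strided_reads(stride, count, base=0):
    """Generate `count` read addresses separated by `stride` bytes."""
    return [base + i * stride for i in range(count)]


short = strided_reads(stride=8, count=64)    # stride shorter than a block
long_ = strided_reads(stride=128, count=16)  # stride spanning four blocks

print(simulate(short, sequential))        # (1, 1)
print(simulate(short, make_stride(8)))    # (1, 1)
print(simulate(long_, sequential))        # (16, 16)
print(simulate(long_, make_stride(128)))  # (1, 1)
```

Under these assumptions, an 8-byte stride always lands in the current block or the next one, so both schemes remove the same misses. With a 128-byte stride, sequential prefetching fetches blocks that are never read, which mirrors the abstract's remark about useless prefetches and bandwidth.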
Efficient Memory Simulation in SimICS
1995
Cited by 39 (6 self)
We describe novel techniques used for efficient simulation of memory in SimICS, an instruction level simulator developed at SICS. The design has focused on efficiently supporting the simulation of multiprocessors, analyzing complex memory hierarchies and running large binaries with a mixture of system-level and user-level code. A software caching
Performance Evaluation of a Cluster-Based Multiprocessor Built from ATM Switches and Bus-Based Multiprocessor Servers
In the 2nd IEEE Symposium on High-Performance Computer Architecture, 1996
Cited by 31 (1 self)
We consider a network of workstations (NOW) organization consisting of a number of bus-based multiprocessor servers interconnected by an ATM switch. A shared-memory model is supported by distributed virtual shared memory (DVSM) and this paper focuses on the access penalties incurred by (1) ATM and (2) the DVSM software. First, through detailed architectural simulations we find that while the bandwidth and the latency of the ATM switch fabrics are acceptable, the latency incurred by commercially available ATM interfaces has a first-order effect on the performance. We also study the effects of various scheduling policies for the coherence handlers. Our data suggest that since the probability of finding an idle processor within a cluster is high, a good policy is to schedule it there instead of letting an extra compute processor execute coherence handlers. Overall, by adjusting the adaptation layer of ATM to a DVSM system we find that ATM is a promising technology for these kinds of systems.
Limes: a multiprocessor simulation environment for PC platforms
TCCA Newsletter, 1999
Cited by 25 (1 self)
This paper presents a multiprocessor simulation environment, developed with the aim of facilitating research on multiprocessor systems at the University of Belgrade. It comprises a simulation tool applicable to architecture studies of shared-address-space multiprocessors, and a detailed model of a bus-based cache-coherent multiprocessor system. The environment executes on PC platforms and, while showing comparable performance with other tools of similar purpose, benefits from potential reusability of its realistic simulated models. The tool provides a simple general interface that allows for the hardware lying underneath the simulated processors to be modeled in a manner similar to the VHDL approach, but with far greater efficiency. The included model is easily extendible to other bus-based systems, and the basics of the design methodology are presented.
Implementation and Evaluation of Update-Based Cache Protocols Under Relaxed Memory Consistency Models
Future Generation Computer Systems, 1995
Cited by 23 (5 self)
Invalidation-based cache coherence protocols have been extensively studied in the context of large-scale shared-memory multiprocessors. Under a relaxed memory consistency model, most of the write latency can be hidden whereas cache misses still incur a severe performance problem. By contrast, update-based protocols have the potential to reduce both write and read penalties under relaxed memory consistency models because coherence misses can be completely eliminated. The purpose of this paper is to compare update- and invalidation-based protocols for their ability to reduce or hide memory access latencies and for their ease of implementation under relaxed memory consistency models. Based on a detailed simulation study, we find that write-update protocols augmented with simple competitive mechanisms (we call such protocols competitive-update protocols) can hide all the write latency and cut the read penalty by as much as 46% at the cost of some increase in the memory traff...
RPM: A rapid prototyping engine for multiprocessor systems
IEEE Computer, 1995
Cited by 17 (5 self)
In multiprocessor systems, processing nodes contain a processor, some cache and a share of the system memory, and are connected through a scalable interconnect. The system memory partitions may be shared (shared-memory systems) or disjoint (message-passing systems). Within each class of systems many architectural variations are possible. Fair comparisons among systems are difficult because of the lack of a common hardware platform to implement the different architectures. RPM (Rapid Prototyping engine for Multiprocessors) is a hardware emulator for the rapid prototyping of various multiprocessor architectures. In RPM, the hardware of the target machine is emulated by reprogrammable controllers implemented with Field-Programmable Gate Arrays (FPGAs). The processors, memories and interconnect are off-the-shelf and their relative speeds can be modified to emulate various component technologies. Every emulation is an actual incarnation of the target machine and therefore software written for the target machine can be easily ported to it with little modification and without instrumentation of the code. In this paper, we describe the architecture of RPM, its performance and the prototyping methodology. We also compare our approach with simulation and breadboard prototyping. Keywords: Field-Programmable Gate Arrays (FPGAs), message-passing multicomputers, shared-memory multiprocessors, design verification, performance evaluation, simulation.
Efficient Strategies for Software-Only Directory Protocols in Shared-Memory Multiprocessors
Proc. of the 22nd Annual International Symposium on Computer Architecture, 1995
Cited by 16 (1 self)
The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to study the performance implications of protocols that emulate directory management using software handlers executed on the compute processors. An important performance limitation of such software-only protocols is that software latency associated with directory management ends up on the critical memory access path for read miss transactions. We propose five strategies that support efficient data transfers in hardware whereas directory management is handled at a slower pace in the background by software handlers. Simulations show that this approach can remove the directory-management latency from the memory access path. Whereas the directory is managed in software, the hardware mechanisms must access the memory state in order to enable data transfers at a high speed. Overall, our strategies reach between 60% and 86% of the hardware-based protocol performance.
Effectiveness of Dynamic Prefetching in Multiple-Writer Distributed Virtual Shared Memory Systems
1997
Cited by 15 (2 self)
We consider a network of workstations (NOW) organization consisting of bus-based multiprocessors interconnected by an ATM interconnect on which a shared-memory programming model is imposed by using a multiple-writer distributed virtual shared memory system. The latencies associated with bringing data into the local memory are a severe performance limitation of such systems. To tolerate the access latencies, we propose a novel prefetch approach and show how it can be integrated into the software-based coherence layer of a multiple-writer protocol. This approach uses the access history of each page to guide which pages to prefetch. Based on detailed architectural simulations and seven scientific applications we find that our prefetch algorithm can remove a vast majority of the remote operations, which improves the performance of all applications. We also find that the bandwidth provided by ATM switches available today is sufficient to accommodate prefetching. However, the protocol processing overhead of available ATM interfaces limits the gain of the prefetching algorithms.

