Results 11 - 20 of 257
Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes
- In International Conference on Parallel Processing, 1990
"... As multiprocessors are scaled beyond single bus systems, there is renewed interest in directory-based cache coherence schemes. These schemes rely on a directory to keep track of all processors caching a memory block. When a write to that block occurs, pointto -point invalidation messages are sent to ..."
Abstract
-
Cited by 121 (4 self)
- Add to MetaCart
(Show Context)
As multiprocessors are scaled beyond single bus systems, there is renewed interest in directory-based cache coherence schemes. These schemes rely on a directory to keep track of all processors caching a memory block. When a write to that block occurs, point-to-point invalidation messages are sent to keep the caches coherent. A straightforward way of recording the identities of processors caching a memory block is to use a bit vector per memory block, with one bit per processor. Unfortunately, when the main memory grows linearly with the number of processors, the total size of the directory memory grows as the square of the number of processors, which is prohibitive for large machines. To remedy this problem several schemes that use a limited number of pointers per directory entry have been suggested. These schemes often cause excessive invalidation traffic. In this paper, we propose two simple techniques that significantly reduce invalidation traffic and directory memory requirements. ...
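The full-map scheme the abstract starts from can be pictured as a bit vector per memory block. A minimal sketch in C, assuming an illustrative 64-processor machine; send_invalidate stands in for a point-to-point network message:

    #include <stdint.h>

    #define NPROCS 64   /* illustrative machine size; fits in one 64-bit word */

    /* Full-map directory entry: one presence bit per processor per memory block. */
    typedef struct {
        uint64_t presence;   /* bit p set => processor p caches this block */
        int      dirty;      /* nonzero if one processor holds the block exclusively */
    } dir_entry_t;

    /* Placeholder for a point-to-point invalidation message on the network. */
    void send_invalidate(int proc);

    /* Record a read miss serviced for processor p. */
    static void dir_add_sharer(dir_entry_t *e, int p) {
        e->presence |= (uint64_t)1 << p;
    }

    /* On a write by processor p, invalidate every other cached copy. */
    static void dir_handle_write(dir_entry_t *e, int p) {
        for (int q = 0; q < NPROCS; q++)
            if (q != p && (e->presence & ((uint64_t)1 << q)))
                send_invalidate(q);          /* one message per stale copy */
        e->presence = (uint64_t)1 << p;      /* writer becomes the sole sharer */
        e->dirty = 1;
    }

With one such entry per memory block, directory storage is NPROCS bits per block, which is exactly the quadratic growth (plus the per-write invalidation fan-out) that the paper's techniques aim to reduce.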
Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors
- In Proc. of the 22nd Annual Int’l Symp. on Computer Architecture (ISCA’95), 1995
"... This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another pr ..."
Abstract
-
Cited by 112 (4 self)
- Add to MetaCart
(Show Context)
This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program’s critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. In this paper we evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks—which eliminate both invalidation and acknowledgment messages— for a total reduction in messages of up to 26%.
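A rough sketch of the self-invalidation idea, not the paper's protocol: the coherence mechanism flags copies it expects to conflict, and the cache drops them on its own (here, at a synchronization point), so no invalidation message or acknowledgment is ever needed for them. Field and function names are hypothetical.

    #include <stdbool.h>

    /* Hypothetical cache-line bookkeeping for a self-invalidation scheme. */
    typedef struct {
        bool valid;
        bool self_invalidate;   /* set when this copy was identified for self-invalidation */
        /* tag, data, ... */
    } cache_line_t;

    /* Drop flagged copies locally, e.g. at a lock release or barrier, instead of
       waiting for the directory to send an invalidation and collect an ack. */
    static void self_invalidate_flagged(cache_line_t *lines, int nlines) {
        for (int i = 0; i < nlines; i++)
            if (lines[i].valid && lines[i].self_invalidate)
                lines[i].valid = false;
    }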
Techniques for reducing consistency-related communication in distributed shared memory systems
- ACM Transactions on Computer Systems, to appear
"... Distributed shared memory (DSM) is a software abstraction of shared memory on a distributed memory machine. One of the key problems in building an e cient DSM system is to reduce the amount ofcommunication needed to keep the distributed memories consistent. In this paper we presentfour techniques fo ..."
Abstract
-
Cited by 106 (13 self)
- Add to MetaCart
(Show Context)
Distributed shared memory (DSM) is a software abstraction of shared memory on a distributed memory machine. One of the key problems in building an efficient DSM system is to reduce the amount of communication needed to keep the distributed memories consistent. In this paper we present four techniques for doing so: (1) software release consistency; (2) multiple consistency protocols; (3) write-shared protocols; and (4) an update-with-timeout mechanism. These techniques have been implemented in the Munin DSM system. We compare the performance of seven application programs on Munin, first to their performance when implemented using message passing and then to their performance when running on a conventional DSM system that does not embody the above techniques. On a 16-processor cluster of workstations, Munin's performance is within 5% of message passing for four out of the seven applications. For the other three, performance is within 29% to 33%. Detailed analysis of two of these three applications indicates that the addition of a function shipping capability would bring their performance within 7% of the message passing performance. Compared to a conventional DSM system, Munin achieves performance improvements ranging from a few to several hundred percent, depending on the application.
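As one concrete illustration of the write-shared (multiple-writer) technique listed above, the sketch below uses a twin-and-diff scheme of the kind Munin applies to write-shared data: snapshot a page at the first local write, then at release time ship only the bytes that changed, so concurrent writers of the same page do not invalidate each other. send_diff, the fixed page size, and the byte-granularity comparison are placeholders.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096             /* placeholder page size */

    typedef struct {
        char *page;                    /* live, writable copy of the shared page */
        char *twin;                    /* pristine snapshot taken at the first write */
    } shared_page_t;

    /* Placeholder for sending one modified byte range to the other copies. */
    void send_diff(size_t offset, const char *data, size_t len);

    /* Called from the fault handler on the first write to a write-shared page. */
    static void make_twin(shared_page_t *sp) {
        sp->twin = malloc(PAGE_SIZE);
        memcpy(sp->twin, sp->page, PAGE_SIZE);
    }

    /* At a release, compare the page against its twin and ship only the
       ranges this node actually modified. */
    static void flush_diffs(shared_page_t *sp) {
        size_t i = 0;
        while (i < PAGE_SIZE) {
            if (sp->page[i] != sp->twin[i]) {
                size_t start = i;
                while (i < PAGE_SIZE && sp->page[i] != sp->twin[i])
                    i++;
                send_diff(start, sp->page + start, i - start);
            } else {
                i++;
            }
        }
        free(sp->twin);
        sp->twin = NULL;
    }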
Software Caching and Computation Migration in Olden
1995
"... The goal of the Olden project is to build a system that provides parallelism for general purpose C programs with minimal programmer annotations. We focus on programs using dynamic structures such as trees, lists, and DAGs. We demonstrate that providing both software caching and computation migratio ..."
Abstract
-
Cited by 105 (0 self)
- Add to MetaCart
The goal of the Olden project is to build a system that provides parallelism for general purpose C programs with minimal programmer annotations. We focus on programs using dynamic structures such as trees, lists, and DAGs. We demonstrate that providing both software caching and computation migration can improve the performance of these programs, and provide a compile-time heuristic that selects between them for each pointer dereference. We have implemented a prototype system on the Thinking Machines CM-5. We describe our implementation and report on experiments with ten benchmarks.
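The caching-versus-migration choice is made per pointer dereference by a compile-time heuristic; the sketch below only illustrates the shape of such a decision, with made-up inputs and criteria rather than Olden's actual heuristic.

    /* Hypothetical per-dereference decision between the two mechanisms the
       abstract describes; the inputs and threshold are illustrative only. */

    enum mechanism { SOFTWARE_CACHE, MIGRATE_COMPUTATION };

    struct deref_estimate {
        int bytes_used_at_target;   /* data the continuation will touch at the remote node */
        int live_state_bytes;       /* thread state that would travel with a migration */
        int data_reused_locally;    /* nonzero if the fetched data will be re-referenced here */
    };

    static enum mechanism choose_mechanism(const struct deref_estimate *d) {
        /* Moving a small continuation to a lot of remote data favors migration;
           otherwise fetch the data and let the software cache exploit reuse. */
        if (d->bytes_used_at_target > d->live_state_bytes && !d->data_reused_locally)
            return MIGRATE_COMPUTATION;
        return SOFTWARE_CACHE;
    }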
Data Replication for Mobile Computers
1994
"... Users of mobile computers will soon have online access to a large number of databases via wireless networks. Because of limited bandwidth, wireless communication is more expensive than wire communication. In this paper we present and analyze various static and dynamic data allocation methods. The ob ..."
Abstract
-
Cited by 89 (5 self)
- Add to MetaCart
(Show Context)
Users of mobile computers will soon have online access to a large number of databases via wireless networks. Because of limited bandwidth, wireless communication is more expensive than wire communication. In this paper we present and analyze various static and dynamic data allocation methods. The objective is to optimize the communication cost between a mobile computer and the stationary computer that stores the online database. Analysis is performed in two cost models. One is connection (or time) based, as in cellular telephones, where the user is charged per minute of connection. The other is message based, as in packet radio networks, where the user is charged per message. Our analysis addresses both the average case and the worst case for determining the best allocation method.
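A toy version of the trade-off the cost models capture, under the message-based model only and with made-up per-operation costs; the paper's actual analysis (including the connection-based model and worst-case bounds) is more involved.

    /* Back-of-the-envelope message-cost comparison (illustrative, not the
       paper's analysis): should the mobile computer hold a replica of a data
       item, given expected read and write counts over some window? */

    typedef struct {
        double mobile_reads;     /* reads issued at the mobile computer */
        double server_writes;    /* updates applied at the stationary database */
    } workload_t;

    /* With a replica, mobile reads are local, but every server write must be
       propagated over the wireless link. */
    static double cost_with_replica(const workload_t *w) {
        return w->server_writes;
    }

    /* Without a replica, every mobile read costs a request and a reply. */
    static double cost_without_replica(const workload_t *w) {
        return 2.0 * w->mobile_reads;
    }

    static int should_replicate(const workload_t *w) {
        return cost_with_replica(w) < cost_without_replica(w);
    }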
An Argument for Simple COMA
- In Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture, 1995
"... ..."
(Show Context)
Transactional Client-Server Cache Consistency: Alternatives and Performance
- ACM Transactions on Database Systems, 1997
"... ing with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept, ACM Inc., 1515 Broadway, New York, N ..."
Abstract
-
Cited by 80 (4 self)
- Add to MetaCart
(Show Context)
Advances in distributed computing and object-orientation have combined to bring about the development of a new class of database systems. These systems employ a client-server computing model to provide both responsiveness to users and support for complex, shared data in a distributed environment. Current relational DBMS products are based on a query-shipping approach in which most query processing is performed at servers; clients are primarily used to manage the user interface. In contrast, object-oriented database systems (OODBMS), whi...
Graphite: A Distributed Parallel Simulator for Multicores
- In HPCA, 2010
"... Abstract-This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multicore processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast desi ..."
Abstract
-
Cited by 73 (6 self)
- Add to MetaCart
(Show Context)
This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multicore processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this, including: direct execution, seamless multicore and multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them across multiple commodity Linux machines. When using multiple machines, it provides the illusion of a single process with a single, shared address space, allowing it to run off-the-shelf pthread applications with no source code modification. Our results demonstrate that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers. Performance scales well as more machines are added with near linear speedup in many cases. Simulation slowdown is as low as 41× versus native execution.
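Of the techniques listed, lax synchronization is the easiest to picture: each simulated core advances its own cycle count and only stalls when it drifts too far ahead of the slowest core, rather than synchronizing every cycle. The skew bound and busy-wait below are illustrative; Graphite's real mechanism differs in detail.

    #include <stdint.h>

    #define NCORES 16                     /* illustrative target size */
    #define MAX_SKEW_CYCLES 10000         /* allowed drift before a core must wait */

    static volatile uint64_t local_cycles[NCORES];   /* per-core simulated clocks */

    static uint64_t slowest_clock(void) {
        uint64_t min = local_cycles[0];
        for (int i = 1; i < NCORES; i++)
            if (local_cycles[i] < min)
                min = local_cycles[i];
        return min;
    }

    /* Called by the host thread simulating core c after it advances its clock:
       clocks may drift within the bound; stall only when too far ahead. */
    static void maybe_throttle(int c) {
        while (local_cycles[c] > slowest_clock() + MAX_SKEW_CYCLES)
            ;                             /* busy-wait until the laggards catch up */
    }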
Anatomy of a Message in the Alewife Multiprocessor
1993
"... Shared-memory provides a uniform and attractive mechanism for communication. For efficiency, it is often implemented with a layer of interpretive hardware on top of a message-passingcommunications network. This interpretive layer is responsible for data location, data movement, and cache coherence. ..."
Abstract
-
Cited by 70 (16 self)
- Add to MetaCart
(Show Context)
Shared-memory provides a uniform and attractive mechanism for communication. For efficiency, it is often implemented with a layer of interpretive hardware on top of a message-passing communications network. This interpretive layer is responsible for data location, data movement, and cache coherence. It uses patterns of communication that benefit common programming styles, but which are only heuristics. This suggests that certain styles of communication may benefit from direct access to the underlying communications substrate. The Alewife machine, a shared-memory multiprocessor being built at MIT, provides such an interface. The interface is an integral part of the shared memory implementation and affords direct, user-level access to the network queues, supports an efficient DMA mechanism, and includes fast trap handling for message reception. This paper discusses the design and implementation of the Alewife message-passing interface and addresses the issues and advantages of using such ...
Efficient Distributed Shared Memory Based On Multi-Protocol Release Consistency
1994
"... A distributed shared memory (DSM) system allows shared memory parallel programs to be executed on distributed memory multiprocessors. The challenge in building a DSM system is to achieve good performance over a wide range of shared memory programs without requiring extensive modifications to the s ..."
Abstract
-
Cited by 68 (5 self)
- Add to MetaCart
A distributed shared memory (DSM) system allows shared memory parallel programs to be executed on distributed memory multiprocessors. The challenge in building a DSM system is to achieve good performance over a wide range of shared memory programs without requiring extensive modifications to the source code. The performance challenge translates into reducing the amount of communication performed by the DSM system to that performed by an equivalent message passing program. This thesis describes four novel techniques for reducing the communication overhead of DSM, including: (i) the use of software release consistency, (ii) support for multiple consistency protocols, (iii) a multiple writer protocol, and (iv) an update timeout mechanism. Release consistency allows modifications of shared data to be handled via a delayed update queue, which masks network latencies. Providing multiple cons...
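The delayed update queue mentioned above can be pictured as a per-node buffer of shared writes that is drained when the writer performs a release. The sketch below is only that picture, with illustrative sizes and a placeholder propagate_update standing in for the actual network send.

    #include <stddef.h>
    #include <string.h>

    #define MAX_PENDING 128                /* illustrative queue depth */

    typedef struct {
        void  *addr;                       /* shared location that was modified */
        char   value[8];                   /* new contents (word-sized for brevity) */
        size_t len;
    } update_t;

    static update_t pending[MAX_PENDING];
    static int      npending;

    /* Placeholder for pushing one update to the other copies over the network. */
    void propagate_update(void *addr, const void *val, size_t len);

    /* Drain the queue: under release consistency this only has to happen
       before the release becomes visible to the next acquirer. */
    static void release_flush(void) {
        for (int i = 0; i < npending; i++)
            propagate_update(pending[i].addr, pending[i].value, pending[i].len);
        npending = 0;
    }

    /* Buffer a shared write locally instead of sending it right away, so the
       network latency of many writes is overlapped and paid at the release. */
    static void record_update(void *addr, const void *val, size_t len) {
        if (npending == MAX_PENDING)       /* queue full: flush early */
            release_flush();
        update_t *u = &pending[npending++];
        u->addr = addr;
        u->len  = len < sizeof u->value ? len : sizeof u->value;
        memcpy(u->value, val, u->len);
    }

Batching the updates this way is what lets the queue mask network latencies, since many small writes become one burst of messages at the release point.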