DDM - A Cache-Only Memory Architecture
- IEEE Computer, 1992
Cited by 137 (8 self)
The long latencies introduced by remote accesses in a large multiprocessor can be hidden by caching. Caching also decreases the network load. We introduce a new class of architectures called Cache Only Memory Architectures (COMA). These architectures provide the programming paradigm of the shared-memory architectures, but have no physically shared memory; instead, the caches attached to the processors contain all the memory in the system, and their size is therefore large. A datum is allowed to be in any or many of the caches, and will automatically be moved to where it is needed by a cache-coherence protocol, which also ensures that the last copy of a datum is never lost. The location of a datum in the machine is completely decoupled from its address. We also introduce one example of COMA: the Data Diffusion Machine (DDM), and its simulated performance for large applications. The DDM is based on a hierarchical network structure, with processor/memory pairs at its tips. Remote accesses...
Performance Evaluation of Hierarchical Ring-Based Shared Memory Multiprocessors
- IEEE Trans. on Computers, 1992
Cited by 18 (7 self)
This paper investigates the performance of word-packet, slotted unidirectional ring-based hierarchical direct networks in the context of large-scale shared memory multiprocessors. Slotted unidirectional rings are attractive because their electrical characteristics and simple interfaces allow for fast cycle times and large bandwidths. For large-scale systems, it is necessary to use multiple rings for increased aggregate bandwidth. Hierarchies are attractive because the topology ensures unique paths between nodes, simple node interfaces and simple inter-ring connections. To ensure that a realistic region of the design space is examined, the architecture of the network used in the Hector prototype is adopted as the initial design point. A simulator of that architecture has been developed and validated with measurements from the prototype. The system and workload parameterization reflects conditions expected in the near future. The results of our study show the importance of system balance...
Effect of Node Size on the Performance of Cache-Conscious B+-Trees
- In Proc. of SIGMETRICS, 2003
Cited by 9 (1 self)
In main-memory databases, the number of processor cache misses has a critical impact on the performance of the system. Cache-conscious indices are designed to improve performance by reducing the number of processor cache misses that are incurred during a search operation. Conventional wisdom suggests that the index’s node size should be equal to the cache line size in order to minimize the number of cache misses and improve performance. As we show in this paper, this design choice ignores additional effects, such as the number of instructions executed and the number of TLB misses, which play a significant role in determining the overall performance. To capture the impact of node size on the performance of a cache-conscious B+-tree (CSB+-tree), we first develop an analytical model based on the fundamental components of the search process. This model is then validated with an actual implementation, demonstrating that the model is accurate. Both the analytical model and experiments confirm that using node sizes much larger than the cache line size can result in better search performance for the CSB+-tree.
An Analytical Model of High Performance Superscalar-Based Multiprocessors
- In Proceedings of Conference on Parallel Architectures and Compilation Technology (PACT), 1995
Cited by 9 (1 self)
Several shared memory multiprocessor models using approximate Mean Value Analysis (MVA) have been developed and used to evaluate a number of system architectures. Since this time, the complexity of multiprocessor systems has increased as superscalar processors and latency reduction techniques are employed in these systems. We present an MVA multiprocessor performance model which incorporates these new features and in addition, increases the level of modeling detail to improve flexibility and accuracy. We describe in detail extensions present in our model that allow us to analyze the impact of these new features. We then use the model to demonstrate some of the tradeoffs involved in designing modern multiprocessors, including the impact of highly superscalar architectures on the scalability of multiprocessor systems. 1 Introduction An analytical modeling technique that has been frequently used to evaluate shared memory multiprocessors is approximate Mean Value Analysis (MVA)[12]. In M...
A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques
- Int’l Journal of Parallel Programming, 1996
Cited by 7 (0 self)
Several approximate Mean Value Analysis (MVA) shared memory multiprocessor models have been developed and used to evaluate a number of system architectures. In recent years, the use of superscalar processors, multilevel cache hierarchies, and latency tolerating techniques has significantly increased the complexity of multiprocessor system modeling. We present an analytical performance model which extends previous multiprocessor MVA models by incorporating these new features and in addition, increases the level of modeling detail to improve flexibility and accuracy. The extensions required to analyze the impact of these new features are described in detail. We then use the model to demonstrate some of the tradeoffs involved in designing modern multiprocessors, including the impact of highly superscalar architectures on the scalability of multiprocessor systems. Key Words: Shared memory multiprocessors, Mean Value Analysis, performance evaluation, latency tolerating techniques, supersc...
Toward The Design Of Large-Scale, Shared-Memory Multiprocessors
- Dept. of Comput. Sci., Univ. of Wisconsin-Madison, 1992
Cited by 5 (0 self)
The state-of-the-art in multiprocessing today employs thousands of high-performance microprocessors. As system sizes continue to grow, increasing care must be taken to design cost-efficient, balanced (i.e. scalable) systems. This thesis addresses the scalability of shared-memory multiprocessors, presenting a practical treatment of scalability, and proceeding to focus on aspects of two critical areas of large-scale system design: interconnection networks and cache coherence mechanisms. In these areas, pipelined-channel interconnection networks and pruning-cache directories are investigated, respectively. Pipelined-channel interconnection networks allow multiple bits to be simultaneously in flight on a single wire, decoupling channel throughput from channel latency. The first published performance analysis of the SCI ring, a new IEEE standard employing pipelined channels, is presented. This study serves as a proof-of-concept for pipelined-channel networks, demonstrating their very high p...
Evaluation of NUMA Memory Management Through Modeling and Measurements
- IEEE Transactions on Parallel and Distributed Systems, 1991
Cited by 4 (0 self)
The class of NUMA (nonuniform memory access time) shared memory architectures is becoming increasingly important with the desire for larger scale multiprocessors. In such machines, the placement and movement of code and data are crucial to performance. The operating system can play a role in managing placement through the policies and mechanisms of the virtual memory subsystem. In this paper, we explore dynamic page placement policies using two approaches that complement each other in important ways. On one hand, we measure the performance of parallel programs running on the experimental DUnX operating system kernel for the BBN GP1000 which supports a highly parameterized dynamic page placement policy. We also develop and apply an analytic model of memory system performance of a Local/Remote NUMA architecture based on approximate mean-value analysis techniques. The model assumes that a simple workload model based on a few parameters can often provide insight into the general behavior o...
PHD: A Hierarchical Cache Coherent Protocol
- SB Thesis. MIT AI lab, 1992
Cited by 2 (0 self)
As the number of processors in distributed-memory multiprocessors grows, efficiently supporting a shared-memory programming model becomes difficult. We have designed the Protocol for Hierarchical Directories (PHD) to allow shared-memory support for systems containing massive numbers of processors. PHD eliminates bandwidth problems by using a scalable network, decreases hot-spots by not relying on a single point to distribute blocks, and uses a scalable amount of space for its directories. PHD provides a shared-memory model by synthesizing a global shared memory from the local memories of processors. PHD supports sequentially consistent read, write, and test-and-set operations.
Tradeoffs in the Design of Single Chip Multiprocessors
- 2nd International Conference on Parallel Architectures and Compilation Techniques (PACT94), 1994
Cited by 2 (1 self)
By the end of the decade, as VLSI integration levels continue to increase, building a multiprocessor system on a single chip will become feasible. In this paper, we propose to analyze the tradeoffs involved in designing such a chip, and specifically address whether to allocate available chip area to larger caches or to large numbers of processors. Using the dimensions of the Alpha 21064 microprocessor as a basis, we determine several candidate configurations which vary in cache size and number of processors, and evaluate them in terms of both processing power and cycle time. We then investigate fine tuning the architecture in order to further improve performance, by trading off the number of processors for a larger TLB size. Our results show that for a coarse-grain execution environment, adding processors at the expense of cache size improves performance up to a point. We then show that increasing TLB size at the expense of the number of processors can further improve performance. K...
Memory System Design For Bus Based Multiprocessors
- 1991
Cited by 1 (0 self)
This dissertation studies the design of single bus, shared memory multiprocessors. The purpose of the studies is to find optimum points in the design space for different memory system components that include private caches, shared bus and main memory. Two different methodologies are used based on the operating environment of a multiprocessor. For a multiprocessor operating in the throughput-oriented environment, Customized Mean Value Analysis (CMVA) models are developed to evaluate the performance of the multiprocessor. The accuracy of the models is validated by comparing their results to those generated by actual trace-driven simulation over several thousand multiprocessor configurations. The comparison results show that the CMVA models can be as accurate as trace-driven simulation in predicting the multiprocessor throughput and bus utilization. The validated models are then used to evaluate design choices that include cache size, cache block size, cache set-associativity, bus switch...

