Results 1 - 10
of
17
Virtual Memory Primitives for User Programs
, 1991
"... Memory Management Units (MMUs) are traditionally used by operating systems to implement disk-paged virtual memory. Some operating systems allow user programs to specify the protection level (inaccessible, readonly. read-write) of pages, and allow user programs t.o handle protection violations. bur. ..."
Abstract
-
Cited by 170 (2 self)
- Add to MetaCart
Memory Management Units (MMUs) are traditionally used by operating systems to implement disk-paged virtual memory. Some operating systems allow user programs to specify the protection level (inaccessible, readonly. read-write) of pages, and allow user programs t.o handle protection violations. bur. these mechanisms are not. always robust, efficient, or well-mat. ched to the needs of applications.
A Fully Associative SoftwareManaged Cache Design
- In Proceedings of the 27th Annual International Symposium on Computer Architecture
, 2000
"... As DRAM access latencies approach a thousand instruction-execution times and onchip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features⎯full associativity and software management⎯have been used successfully in the virtual ..."
Abstract
-
Cited by 73 (2 self)
- Add to MetaCart
As DRAM access latencies approach a thousand instruction-execution times and onchip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features⎯full associativity and software management⎯have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8 % to 85 % relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today’s DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow. 1.
Optimizing Instruction Cache Performance for Operating System Intensive Workloads
- IEEE Transactions on Computers
, 1995
"... High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though ther ..."
Abstract
-
Cited by 61 (11 self)
- Add to MetaCart
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. Therefore, it is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference...
ParaDiGM: A Highly Scalable Shared-Memory Multi-Computer Architecture
- IEEE Computer
, 1991
"... ParaDiGM is a highly scalable shared-memory multi-computer architecture. It is being developed to demonstrate the feasibility of building a relatively low-cost shared-memory parallel computer that scales to large configurations, and yet provides sequential programs with performance comparable to a h ..."
Abstract
-
Cited by 39 (4 self)
- Add to MetaCart
ParaDiGM is a highly scalable shared-memory multi-computer architecture. It is being developed to demonstrate the feasibility of building a relatively low-cost shared-memory parallel computer that scales to large configurations, and yet provides sequential programs with performance comparable to a high-end microprocessor. A key problem is building a scalable memory hierarchy. In this paper we describe the ParaDiGM architecture, highlighting the innovations of our approach and presenting results of our evaluation of the design. We envision that scalable shared-memory multiprocessors like ParaDiGM will soon become the dominant form of parallel processing, even for very large-scale computation, providing a uniform platform for parallel programming systems and applications. 1 Introduction ParaDiGM (PARAllel DIstributed Global Memory) 1 is a highly scalable shared-memory multi-computer architecture. By multi-computer, we mean a system interconnected by both bus and network technology. B...
Hardware-Software Trade-Offs in a Direct Rambus Implementation of the RAMpage Memory Hierarchy
- In Proc. 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII
, 1998
"... The RAMpage memory hierarchy is an alternative to the traditional division between cache and main memory: main memory is moved up a level and DRAM is used as a paging device. The idea behind RAMpage is to reduce hardware complexity, if at the cost of software complexity, with a view to allowing more ..."
Abstract
-
Cited by 19 (9 self)
- Add to MetaCart
The RAMpage memory hierarchy is an alternative to the traditional division between cache and main memory: main memory is moved up a level and DRAM is used as a paging device. The idea behind RAMpage is to reduce hardware complexity, if at the cost of software complexity, with a view to allowing more flexible memory system design. This paper investigates some issues in choosing between RAMpage and a conventional cache architecture, with a view to illustrating trade-offs which can be made in choosing whether to place complexity in the memory system in hardware or in software. Performance results in this paper are based on a simple Rambus implementation of DRAM, with performance characteristics of Direct Rambus, which should be available in 1999. This paper explores the conditions under which it becomes feasible to perform a context switch on a miss in the RAMpage model, and the conditions under which RAMpage is a win over a conventional cache architecture: as the CPU-DRAM speed gap grows...
Virtual Shared Memory: A Survey of Techniques and Systems
, 1992
"... Shared memory abstraction on distributed memory hardware has become very popular recently. The abstraction can be provided at various levels in the architecture e.g. hardware, software, employing special mechanisms to maintain coherence of data. In this paper we present a survey of basic techniques ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
Shared memory abstraction on distributed memory hardware has become very popular recently. The abstraction can be provided at various levels in the architecture e.g. hardware, software, employing special mechanisms to maintain coherence of data. In this paper we present a survey of basic techniques and review a large number of architectures that provide such an abstraction. We also propose new terminology which is more consistent and orderly as compared with the existing use of terminology for such architectures. 1 Introduction Virtual Shared Memory (VSM) in its most general sense refers to a provision of a shared address space on distributed memory hardware. Such architectures contain no physically shared memory. Instead the distributed local memories collectively provide a virtual address space shared by all the processors. VSM combines the benefits of the ease of programming found in shared-memory multiprocessors with the scalability of message-passing multiprocessors. The implemen...
Computing Per-Process Summary Side-Effect Information
- Fifth Workshop on Languages and Compilers for Parallelism
, 1992
"... This paper presents a static algorithm, based on standard iterative data-flow techniques, for computing per-process memory references to shared data in coarse-grained parallel programs. The algorithm constructs control flow graphs for families of processes by recognizing predicates used in control s ..."
Abstract
-
Cited by 11 (5 self)
- Add to MetaCart
This paper presents a static algorithm, based on standard iterative data-flow techniques, for computing per-process memory references to shared data in coarse-grained parallel programs. The algorithm constructs control flow graphs for families of processes by recognizing predicates used in control statements whose values are invariant relative to any one process, but vary across processes. It is used in conjunction with interprocedural flow-insensitive side-effect analysis techniques to identify data objects accessed by each process. Dynamically allocated objects and descriptors for sections of arrays are incorporated into the general framework. The motivation for the work was to reduce coherency overhead in shared memory multiprocessors. However, the algorithm can also be applied to local/global memory allocation decisions in distributed systems. 1 Introduction It is well known that parallel programs executing on shared memory multiprocessors fail to achieve linear speedup. In bus-ba...
Mechanisms for Distributed Shared Memory
, 1996
"... Distributed shared memory (DSM) systems simplify the task of writing distributedmemory parallel programs by automating data distribution and communication. Unfortunately, DSM systems control memory and communication using fixed policies, even when programmers or compilers could manage these resource ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
Distributed shared memory (DSM) systems simplify the task of writing distributedmemory parallel programs by automating data distribution and communication. Unfortunately, DSM systems control memory and communication using fixed policies, even when programmers or compilers could manage these resources more efficiently. This thesis proposes a new approach that lets users efficiently manage communication and memory on DSM systems. Systems provide primitive DSM mechanisms without binding them to fixed protocols (policies). Standard shared-memory programs use default protocols similar to those found in current DSM machines. Unlike current systems, these protocols are implemented in unprivileged software. Programmers and compilers are free to modify or replace them with optimized custom protocols that manage memory and communication directly and efficiently. To explore this new approach, this thesis: . identifies a set of mechanisms for distributed shared memory, . develops Tempest, a portab...
Improving the Data Cache Performance of Multiprocessor Operating Systems
- 2nd International Symposium on High-Performance Computer Architecture
, 1996
"... Bus-based shared-memory multiprocessors with coherent caches have recently become very popular. To achieve high performance, these systems rely on increasingly sophisticated cache hierarchies. However, while these machines often run loads with substantial operating system activity, performance measu ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
Bus-based shared-memory multiprocessors with coherent caches have recently become very popular. To achieve high performance, these systems rely on increasingly sophisticated cache hierarchies. However, while these machines often run loads with substantial operating system activity, performance measurements have consistently indicated that the operating system uses the data cache hierarchy poorly. In this paper, we address the issue of how to eliminate most of the data cache misses in a multiprocessor operating system while still using off-the-shelf processors. We use a performance monitor to examine traces of a 4processor machine running four system-intensive loads under UNIX. Based on our observations, we propose hardware and software support that targets block operations, coherence activity, and cache conflicts. For block operations, simple cache bypassing or prefetching schemes are undesirable. Instead, it is best to use a DMA-like scheme that pipelines the data transfer in the bus ...
A Dynamic Cache Sub-block Design to Reduce False Sharing
, 1995
"... Parallel applications suffer from significant bus traffic due to the transfer of shared data. Large block sizes exploit locality and decrease the effective memory access time. It also has a tendency to group data together even though only a part of it is needed by any one processor. This is known as ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Parallel applications suffer from significant bus traffic due to the transfer of shared data. Large block sizes exploit locality and decrease the effective memory access time. It also has a tendency to group data together even though only a part of it is needed by any one processor. This is known as the false sharing problem. This research presents a dynamic sub-block coherence protocol which minimizes false sharing by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer. Larger blocks exploit locality while coherence is maintained on sub-blocks which minimize bus traffic due to shared misses. The simulation results indicate that the dynamic sub-block protocol reduces the false sharing misses by 20 to 40 percent over the fixed sub-block scheme.

