Results 1 - 10 of 41
Practical, transparent operating system support for superpages
- SIGOPS Oper. Syst. Rev., 2002
Cited by 60 (5 self)
Most general-purpose processors provide support for memory pages of large sizes, called superpages. Superpages enable each entry in the translation lookaside buffer (TLB) to map a large physical memory region into a virtual address space. This dramatically increases TLB coverage, reduces TLB misses, and promises performance improvements for many applications. However, supporting superpages poses several challenges to the operating system, in terms of superpage allocation and promotion tradeoffs, fragmentation control, etc. We analyze these issues, and propose the design of an effective superpage management system. We implement it in FreeBSD on the Alpha CPU, and evaluate it on real workloads and benchmarks. We obtain substantial performance benefits, often exceeding 30%; these benefits are sustained even under stressful workload scenarios.
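The TLB-coverage argument in this abstract is simple arithmetic, sketched below with invented but plausible numbers (a 128-entry TLB, 8 KB base pages, 4 MB superpages; none of these values are from the paper):

```python
# Illustrative arithmetic, not the paper's data: how superpages multiply TLB reach.
def tlb_reach(entries, page_size):
    """Total memory mapped when every TLB entry holds a page of the given size."""
    return entries * page_size

KB, MB = 1024, 1024 * 1024

base_reach = tlb_reach(128, 8 * KB)        # 128 entries x 8 KB  = 1 MB covered
super_reach = tlb_reach(128, 4 * MB)       # 128 entries x 4 MB  = 512 MB covered

print(base_reach // MB, "MB vs.", super_reach // MB, "MB")
```

A working set larger than the TLB reach forces refills on almost every new page touched, which is why the 512x jump in coverage translates into the paper's measured miss reductions.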
Scalable Vector Media-processors for Embedded Systems
, 2002
Cited by 40 (3 self)
Over the past twenty years, processor designers have concentrated on superscalar and VLIW architectures that exploit the instruction-level parallelism (ILP) available in engineering applications for workstation systems. Recently, however, the focus in computing has shifted from engineering to multimedia applications and from workstations to embedded systems. In this new computing environment, the performance, energy consumption, and development cost of ILP processors render them ineffective despite their theoretical generality. This thesis ...
A look at several memory management units, TLB-refill mechanisms, and page table organizations
- In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 1998
Cited by 38 (3 self)
Virtual memory is a staple in modern systems, though there is little agreement on how its functionality is to be implemented on either the hardware or software side of the interface. The myriad of design choices and incompatible hardware mechanisms suggests potential performance problems, especially since increasing numbers of systems (even embedded systems) are using memory management. A comparative study of the implementation choices in virtual memory should therefore aid system-level designers. This paper compares several virtual memory designs, including combinations of hierarchical and inverted page tables on hardware-managed and software-managed translation lookaside buffers (TLBs). The simulations show that systems are fairly sensitive to TLB size; that interrupts already account for a large portion of memory-management overhead and can become a significant factor as processors execute more concurrent instructions; and that if one includes the cache misses inflicted on applications by the VM system, the total VM overhead is roughly twice what was thought (10-20% rather than 5-10%).
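One of the hierarchical page-table organizations the paper compares can be sketched minimally as a two-level walk. The layout below (32-bit addresses, 4 KB pages, 10+10+12 bit split, dicts standing in for table frames) is an assumed textbook arrangement for illustration, not the paper's code:

```python
# Minimal sketch of a two-level hierarchical page-table walk (assumed layout:
# 32-bit virtual address = 10-bit root index | 10-bit leaf index | 12-bit offset).
PAGE_SHIFT = 12        # 4 KB pages
L1_SHIFT = 22          # top 10 bits index the root table
L2_MASK = 0x3FF        # next 10 bits index a leaf table
OFFSET_MASK = 0xFFF

def walk(root, vaddr):
    """Translate vaddr via a dict-of-dicts page table; None models a miss
    that a hardware walker or software refill handler would have to service."""
    leaf = root.get(vaddr >> L1_SHIFT)
    if leaf is None:
        return None
    frame = leaf.get((vaddr >> PAGE_SHIFT) & L2_MASK)
    if frame is None:
        return None
    return (frame << PAGE_SHIFT) | (vaddr & OFFSET_MASK)

# Map virtual page (root index 1, leaf index 2) to physical frame 0x80.
root = {1: {2: 0x80}}
print(hex(walk(root, (1 << 22) | (2 << 12) | 0x34)))  # 0x80034
```

Each level of the walk is a dependent memory access, which is why the paper's accounting of the cache misses the VM system itself induces roughly doubles the apparent overhead.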
Virtual Memory: Issues of Implementation
, 1998
Cited by 29 (9 self)
ign to look like another. Inserting this hardware abstraction layer hides hardware particulars from the higher levels of software but can also compromise performance and compatibility; the higher levels of software often make unwitting assumptions about those hardware particulars, creating inconsistencies between expected and actual behavior. Here we present the software mechanisms of virtual memory from a hardware perspective and then describe several hardware examples and how they support virtual-memory software. Our focus is on the mechanisms and structures popular in today's OSs and microprocessors, which are geared toward demand-paged virtual memory. However, this focus in no way impedes our goal: to show
(Computing Practices: Bruce Jacob, University of Maryland; Trevor Mudge, University of Michigan)
Deconstructing Process Isolation
- In Proceedings of the 2006 Workshop on Memory System Performance and Correctness, 2006
Cited by 28 (4 self)
Most operating systems enforce process isolation through hardware protection mechanisms such as memory segmentation, page mapping, and differentiated user and kernel instructions. Singularity is a new operating system that uses software mechanisms to enforce process isolation. A software isolated process (SIP) is a process whose boundaries are established by language safety rules and enforced by static type checking. SIPs provide a low cost isolation mechanism that provides failure isolation and fast inter-process communication. To compare the performance of Singularity's SIPs against traditional isolation techniques, we implemented an optional hardware isolation mechanism. Protection domains are hardware-enforced address spaces, which can contain one or more SIPs. Domains can either run at the kernel's privilege level or be fully isolated from the kernel and run at the normal application privilege level. With protection domains, we can construct Singularity configurations that are similar to micro-kernel and monolithic kernel systems. We found that hardware-based isolation incurs non-trivial performance costs (up to 25-33%) and complicates system implementation. Software isolation has less than 5% overhead on these benchmarks. The lower run-time cost of SIPs makes their use feasible at a finer granularity than conventional processes. However, hardware isolation remains valuable as a defense-in-depth against potential failures in software isolation mechanisms. Singularity's ability to employ hardware isolation selectively enables careful balancing of the costs and benefits of each isolation technique.
Characterizing the d-TLB Behavior of SPEC CPU2000 Benchmarks
- In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2002
Cited by 17 (0 self)
Despite the numerous optimization and evaluation studies that have been conducted with TLBs over the years, there is still a deficiency in an in-depth understanding of TLB characteristics from an application angle. This paper presents a detailed characterization study of the TLB behavior of the SPEC CPU2000 benchmark suite. The contributions of this work are in identifying important application characteristics for TLB studies, quantifying the SPEC2000 application behavior for these characteristics, as well as making pronouncements and suggestions for future research based on these results.
Efficient Virtual Memory for Big Memory Servers
Cited by 15 (1 self)
Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory. To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear virtual address space with a direct segment, while page mapping the rest of the virtual address space. Direct segments use minimal hardware (base, limit, and offset registers per core) to map contiguous virtual memory regions directly to contiguous physical memory. They eliminate the possibility of TLB misses for key data structures such as database buffer pools and in-memory key-value stores. Memory mapped by a direct segment may be converted back to paging when needed. We prototype direct-segment software support for x86-64 in Linux and emulate direct-segment hardware. For our workloads, direct segments eliminate almost all TLB misses and reduce the execution time wasted on TLB misses to less than 0.5%.
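The base/limit/offset check at the heart of the direct-segment proposal is a single range comparison plus an add, sketched below with invented register values (the real mechanism is hardware registers consulted in parallel with the TLB; this is only a software model of the decision):

```python
# Hedged sketch of direct-segment translation: addresses inside [base, limit)
# bypass the TLB entirely; everything else takes the ordinary paged path.
def translate(vaddr, base, limit, offset, page_walk):
    """Model of the per-core BASE/LIMIT/OFFSET check; values are illustrative."""
    if base <= vaddr < limit:
        return vaddr + offset        # direct mapping: a TLB miss is impossible here
    return page_walk(vaddr)          # fall back to conventional page mapping

# Hypothetical segment covering virtual [0x10000, 0x50000) at physical offset +0x90000.
paddr = translate(0x12345, 0x10000, 0x50000, 0x90000, page_walk=lambda v: None)
print(hex(paddr))  # 0xa2345
```

Because the in-range case never consults the TLB, placing the buffer pool or key-value store inside the segment removes its translation misses by construction, which matches the near-total miss elimination the abstract reports.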
Uniprocessor Virtual Memory Without TLBs
- IEEE Transactions on Computers, 2001
Cited by 15 (0 self)
We present a feasibility study for performing virtual address translation without specialized translation hardware. Removing address translation hardware and instead managing address translation in software has the potential to make the processor design simpler, smaller, and more energy-efficient at little or no cost in performance. The purpose of this study is to describe the design and quantify its performance impact. Trace-driven simulations show that software-managed address translation is just as efficient as hardware-managed address translation. Moreover, mechanisms to support such features as shared memory, superpages, fine-grained protection, and sparse address spaces can be defined completely in software, allowing for more flexibility than in hardware-defined mechanisms. Index Terms: Virtual memory, virtual address translation, virtual caches, memory management, software-managed address translation, translation lookaside buffers.
In-Line Interrupt Handling for Software-Managed TLBs
, 2001
Cited by 11 (2 self)
The general-purpose precise interrupt mechanism, which has long been used to handle exceptional conditions that occur infrequently, is now being used increasingly often to handle conditions that are neither exceptional nor infrequent. One example is the use of interrupts to perform memory management, e.g., to handle translation lookaside buffer (TLB) misses in today's microprocessors. Because the frequency of TLB misses tends to increase with memory footprint, there is pressure on the precise interrupt mechanism to become more lightweight. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline. Doing so makes the CPU available to execute handler instructions, but it wastes potentially hundreds of cycles of execution time. However, if the handler code is small, it could potentially fit in the reorder buffer along with the user-level code already there. This essentially in-lines the interrupt-handler code. One good example of where this would be both possible and useful is in the TLB-miss handler in a software-managed TLB implementation. The benefits of doing so are two-fold: (1) the instructions that would otherwise be flushed from the pipe need not be re-fetched and re-executed; and (2) any instructions that are independent of the exceptional instruction can continue to execute in parallel with the handler code. In effect, doing so provides us with lockup-free TLBs. We simulate a lockup-free data-TLB facility on a processor model with a 4-way out-of-order core reminiscent of the Alpha 21264. We find that, by using lockup-free TLBs, one can get the performance of a fully associative TLB with a lockup-free TLB of one-fourth the size.
Cache Design for Embedded Real-Time Systems
- In Proceedings of the Embedded Systems Conference, Summer 1999
Cited by 9 (0 self)
Caches have long been a mechanism for speeding memory access and are popular in embedded hardware architectures from microcontrollers to core-based ASIC designs. However, caches are considered ill-suited for embedded real-time systems because they provide a probabilistic performance boost: a cache may or may not contain the desired data at any given moment. Analysis that guarantees when an item will or will not be in the cache has proven difficult, so many real-time systems simply disable caching and schedule tasks based on worst-case memory access time. Yet there are several cache organizations that provide the benefit of caching without the real-time drawbacks of hardware-managed caches. These are software-managed caches, and several different examples can be found, from DSP-style on-chip RAM to academic designs. This paper compares the operation and organization of caches as found in general-purpose processors, microcontrollers, and DSPs; it also discusses designs for embedded real-time systems.