Results 1 - 10
of
62
Xen and the art of virtualization
- In SOSP (2003
"... Numerous systems have been designed which use virtualization to subdivide the ample resources of a modern computer. Some require specialized hardware, or cannot support commodity operating systems. Some target 100 % binary compatibility at the expense of performance. Others sacrifice security or fun ..."
Abstract
-
Cited by 990 (27 self)
- Add to MetaCart
Numerous systems have been designed which use virtualization to subdivide the ample resources of a modern computer. Some require specialized hardware, or cannot support commodity operating systems. Some target 100 % binary compatibility at the expense of performance. Others sacrifice security or functionality for speed. Few offer resource isolation or performance guarantees; most provide only best-effort provisioning, risking denial of service. This paper presents Xen, an x86 virtual machine monitor which allows multiple commodity operating systems to share conventional hardware in a safe and resource managed fashion, but without sacrificing either performance or functionality. This is achieved by providing an idealized virtual machine abstraction to which operating systems such as Linux, BSD and Windows XP, can be ported with minimal effort. Our design is targeted at hosting up to 100 virtual machine instances simultaneously on a modern server. The virtualization approach taken by Xen is extremely efficient: we allow operating systems such as Linux and Windows XP to be hosted simultaneously for a negligible performance overhead — at most a few percent compared with the unvirtualized case. We considerably outperform competing commercial and freely available solutions in a range of microbenchmarks and system-wide tests.
Exokernel: An Operating System Architecture for Application-Level Resource Management
, 1995
"... We describe an operating system architecture that securely multiplexes machine resources while permitting an unprecedented degree of application-specific customization of traditional operating system abstractions. By abstracting physical hardware resources, traditional operating systems have signifi ..."
Abstract
-
Cited by 561 (20 self)
- Add to MetaCart
We describe an operating system architecture that securely multiplexes machine resources while permitting an unprecedented degree of application-specific customization of traditional operating system abstractions. By abstracting physical hardware resources, traditional operating systems have significantly limited the performance, flexibility, and functionality of applications. The exokernel architecture removes these limitations by allowing untrusted software to implement traditional operating system abstractions entirely at application-level. We have implemented a prototype exokernel-based system that includes Aegis, an exokernel, and ExOS, an untrusted application-level operating system. Aegis defines the low-level interface to machine resources. Applications can allocate and use machine resources, efficiently handle events, and participate in resource revocation. Measurements show that most primitive Aegis operations are 10–100 times faster than Ultrix,a mature monolithic UNIX operating system. ExOS implements processes, virtual memory, and inter-process communication abstractions entirely within a library. Measurements show that ExOS’s application-level virtual memory and IPC primitives are 5–50 times faster than Ultrix’s primitives. These results demonstrate that the exokernel operating system design is practical and offers an excellent combination of performance and flexibility. 1
Power Aware Page Allocation
- In Architectural Support for Programming Languages and Operating Systems
, 2000
"... One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that p ower these mobile devices. Memory is a particularly important tar get for e orts to improve energy e ciency. Memory technolo gy is becoming available that ..."
Abstract
-
Cited by 121 (9 self)
- Add to MetaCart
One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that p ower these mobile devices. Memory is a particularly important tar get for e orts to improve energy e ciency. Memory technolo gy is becoming available that o ers power management featur es such as the ability to put individual chips in any one of several di erent power modes. In this paper we explor e the interaction of page plac ement with static and dynamic hardware policies to exploit these emer ginghardwar efeatur es. In p articular, we c onsider p age allo cation p olicies that ancbe employed by an informed operating system to complement the hardware power management strategies. We perform experiments using two complementary simulation envir onments: a tracedriven simulator with workload traces that are representative of mobile computing and an execution-driven simulator with a detaile d processor/memory model and a more memoryintensive set of benchmarks (SPEC2000). Our r esults make a compelling case for a cooperative hardwar e/software approach for exploiting power-aware memory, with down to as little as 45 % of the Energy Delay for the best static policy and 1 % to 20 % of the Ener gyDelay for a traditional fullpower memory. 1.
An analysis of database workload performance on simultaneous multithreaded processors
- In Proceedings of the 25th Annual International Symposium on Computer Architecture
, 1998
"... Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In particular, databa ..."
Abstract
-
Cited by 119 (13 self)
- Add to MetaCart
Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In particular, database systems have poor cache performance, and the addition of multithreading has the potential to exacerbate cache conflicts. This paper examines database performance on SMT processors using traces of the Oracle database management system. Our research makes three contributions. First, it characterizes the memory-system behavior of database systems running on-line transaction processing and decision support system workloads. Our data show that while DBMS workloads have large memory footprints, there is substantial data reuse in a small, cacheable “critical ” working set. Second, we show that the additional data cache conflicts caused by simultaneousmultithreaded instruction scheduling can be nearly eliminated by the proper choice of software-directed policies for virtual-to-physical page mapping and per-process address offsetting. Our results demonstrate that with the best policy choices, D-cache miss rates on an 8-context SMT are roughly equivalent to those on a single-threaded superscalar. Multithreading also leads to better interthread instruction cache sharing, reducing I-cache miss rates by up to 35%. Third, we show that SMT’s latency tolerance is highly effective for database applications. For example, using a memory-intensive OLTP workload, an 8context SMT processor achieves a 3-fold increase in instruction throughput over a single-threaded superscalar with similar resources. 1
A Fully Associative SoftwareManaged Cache Design
- In Proceedings of the 27th Annual International Symposium on Computer Architecture
, 2000
"... As DRAM access latencies approach a thousand instruction-execution times and onchip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features⎯full associativity and software management⎯have been used successfully in the virtual ..."
Abstract
-
Cited by 73 (2 self)
- Add to MetaCart
As DRAM access latencies approach a thousand instruction-execution times and onchip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features⎯full associativity and software management⎯have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8 % to 85 % relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today’s DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow. 1.
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
- IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE
, 2006
"... This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect to overcome non-uniform cache access latency for good program performance and to reduce on-c ..."
Abstract
-
Cited by 61 (9 self)
- Add to MetaCart
This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect to overcome non-uniform cache access latency for good program performance and to reduce on-chip network traffic and related power consumption. Unlike previously studied hardwarebased private and shared cache designs implementing a "fixed" caching policy, the proposed OS-microarchitecture approach is flexible; it can easily implement a wide spectrum of L2 caching policies without complex hardware support. Furthermore, our approach can provide differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees. We discuss key design issues of the proposed approach and present preliminary experimental results showing the promise of our approach.
Characterizing the Memory Behavior of Java Workloads: A Structured View and Opportunities for Optimizations
, 2000
"... This paper studies the memory behavior of important Java workloads used in benchmarking Java Virtual Machines (JVMs), based on instrumentation of both application and library code in a state-of-theart JVM, and provides structured information about these workloads to help guide systems' design. We be ..."
Abstract
-
Cited by 54 (3 self)
- Add to MetaCart
This paper studies the memory behavior of important Java workloads used in benchmarking Java Virtual Machines (JVMs), based on instrumentation of both application and library code in a state-of-theart JVM, and provides structured information about these workloads to help guide systems' design. We begin by characterizing the inherent memory behavior of the benchmarks, such as information on the breakup of heap accesses among different categories and on the hotness of references to fields and methods. We then provide detailed information about misses in the data TLB and caches, including the distribution of misses over different kinds of accesses and over different methods. In the process, we make interesting discoveries about TLB behavior and limitations of data prefetching schemes discussed in the literature in dealing with pointer-intensive Java codes. Throughout this paper, we develop a set of recommendations to computer architects and compiler writers on how to optimize computer systems and system software to run Java programs more efficiently. This paper also makes the first attempt to compare the characteristics of SPECjvm98 to those of a server-oriented benchmark, pBOB, and explain why the current set of SPECjvm98 benchmarks may not be adequate for a comprehensive and objective evaluation of JVMs and just-in-time (JIT) compilers. We discover that the fraction of accesses to array elements is quite significant, demonstrate that the number of "hot spots" in the benchmarks is small, and show that field reordering cannot yield significant performance gains. We also show that even a fairly large L2 data cache is not effective for many Java benchmarks. We observe that instructions used to prefetch data into the L2 data cache are often squashed because of high TLB miss ...
Thread Scheduling for Cache Locality
- IN PROCEEDINGS OF THE SEVENTH ACM CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS
, 1996
"... This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache misses. This technique may be particularly valua ..."
Abstract
-
Cited by 49 (9 self)
- Add to MetaCart
This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache misses. This technique may be particularly valuable when compiler-directed tiling is not feasible. Experiments with several application programs, on two systems with different cache structures, show that our thread scheduling method can improve program performance by reducing second-level cache misses.
Design and Implementation of Code Optimizations for a Type-Directed Compiler for Standard ML
, 1996
"... Abstract The trends in software development are towards larger programs, more complex programs, and more use of programs as "component software. " These trends mean that the features of modern programming languages are becoming more important than ever before. Programming languages need to ..."
Abstract
-
Cited by 47 (2 self)
- Add to MetaCart
Abstract The trends in software development are towards larger programs, more complex programs, and more use of programs as "component software. " These trends mean that the features of modern programming languages are becoming more important than ever before. Programming languages need to have features such as strong typing, a module system, polymorphism, automatic storage management, and higher-order functions. In short, modern programming languages are becoming more important than ever before.
The Measured Performance of Personal Computer Operating Systems
- ACM Transactions on Computer Systems
, 1996
"... This paper presents a comparative study of the performance of three operating systems that run on the personal computer archi-tecture derived from the IBM-PC, The operating systems, Windows for Workgroups, Windows NT, and NetBSD (a freely available variant of the UNIX operating system), cover a broa ..."
Abstract
-
Cited by 41 (4 self)
- Add to MetaCart
This paper presents a comparative study of the performance of three operating systems that run on the personal computer archi-tecture derived from the IBM-PC, The operating systems, Windows for Workgroups, Windows NT, and NetBSD (a freely available variant of the UNIX operating system), cover a broad range ofs ystem functionalist y and user requirements, from a single address space model to full protection with preemptive multi-tasking. Our measurements were enabled by hardware counters in Intel’s Pentium processor that permit measurement of a broad range of processor events including instruction counts and on-chip cache miss counts. We used both microbenchmarks, which expose specific differences between the systems, and application workloads, which provide an indication of expected end-to-end performance. Our microbenchmark results show that accessing system functionality is often more expensive in Windows for Workgroups than in the other two systems due to frequent changes in machine mode and the use of system call hooks. When running native applications, Windows NT is more efficient than Windows, but it incurs overhead similar to that of a microkemel since its application interface (the Wln32 API) is implemented as a user-level server. Overall, system functionality can be accessed most efficiently in NetBSD; we attribute this to its monolithic structure, and to the absence of the complications created by hardware backwards compatibility requirements in the other systems. Measurements of application performance show that although the impact of these differences is significant in terms of instruction counts and other hardware events (often a factor of 2 to 7 difference between the systems), overall performance is sometimes determined by the functionality provided by specific subsystems, such as the graphics subsystem or the file system buffer cache. 1.

