Cache Profiling and the SPEC Benchmarks: A Case Study
- IEEE Computer
, 1994
Cited by 137 (7 self)
As VLSI technology improvements continue to widen the gap between processor and main memory cycle times, cache performance becomes increasingly important to overall system performance. Cache memories help alleviate the cycle time disparity, but only for programs that exhibit sufficient spatial and temporal locality. Programs with unruly access patterns spend much of their time transferring data to and from the cache. To fully exploit the performance potential of fast processors, programmers must explicitly consider cache behavior, restructuring their codes to increase locality. As these fast processors proliferate, techniques for improving cache performance must move beyond the supercomputer and multiprocessor communities and into the mainstream of computing. In this paper, we examine some of the techniques that programmers can use to improve cache performance. We show how to use CPROF, a cache profiler, to identify cache performance bottlenecks and gain insight into their o...
Avoiding Conflict Misses Dynamically in Large Direct-Mapped Caches
- In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems
, 1994
Cited by 96 (4 self)
This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects conflicts by recording and summarizing a history of cache misses, and a software policy within the operating system's virtual memory system that removes conflicts by dynamically remapping pages whenever large numbers of conflict misses are detected. Using trace-driven simulation of applications and the operating system, we show that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
1 Introduction
In this paper we describe a dynamic method to eliminate conflict misses in large direct-mapped physically indexed caches. Conflicts are caused by interleaved references to words in memory that are...
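The two components named in the abstract above (a miss-history recorder and a software remapping policy) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class `MissHistory`, the function `conflicting_pages`, the threshold, and the 4 KB page size are all hypothetical and are not taken from the paper, whose actual hardware design and operating-system policy are not given in this abstract.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes (not taken from the paper)


class MissHistory:
    """Toy stand-in for a miss-history buffer: counts cache misses per physical page."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical tuning parameter
        self.counts = Counter()

    def record_miss(self, phys_addr):
        # Summarize each miss by the physical page it falls in.
        self.counts[phys_addr // PAGE_SIZE] += 1

    def remap_candidates(self):
        # Pages with at least `threshold` recorded misses, sorted by page number.
        return sorted(p for p, n in self.counts.items() if n >= self.threshold)


def conflicting_pages(history, num_colors):
    """Group candidate pages by cache color (page number modulo num_colors).

    Two or more high-miss pages sharing a color map to the same region of a
    direct-mapped, physically indexed cache, so they are plausible targets
    for remapping by a software policy.
    """
    groups = {}
    for page in history.remap_candidates():
        groups.setdefault(page % num_colors, []).append(page)
    return {color: pages for color, pages in groups.items() if len(pages) > 1}
```

In this sketch a policy would call `conflicting_pages` periodically and move one page of each reported group to a frame of a different color; how and when the real system does this is not described here.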
Toward Real Microkernels
, 1996
Cited by 69 (0 self)
[From Communications of the ACM, September 1996, Vol. 39, No. 9, p. 71.]
(IPC), and basic scheduling. All servers, even device drivers, run in user mode and are treated exactly like any other application by the kernel. Since each server has its own address space, all these objects are protected from one another. When the microkernel idea was introduced in the late 1980s, the software technology advantages seemed obvious: Different application program interfaces (APIs), different file systems, and perhaps even different basic operating system strategies can coexist in one system. They are implemented as competing or cooperating servers. The system becomes more flexible and extensible. It can be more easily and effectively adapted to new hardware or new applications. Only selected servers need to be modified or added to the system. In particular, the impact of such modifications can be restricted to a subset of the system, so all other processes ar
Instruction Fetching: Coping with Code Bloat
- In Proceedings of the 22nd Annual International Symposium on Computer Architecture
, 1995
Cited by 62 (9 self)
Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes into full account both the application and operating-system components of the workloads. This paper then reexamines a collection of previously-proposed hardware mechanisms for improving instruction-fetch performance
System Support for Automatic Profiling and Optimization
Cited by 59 (6 self)
The Morph system provides a framework for automatic collection and management of profile information and application of profile-driven optimizations. In this paper, we focus on the operating system support that is required to collect and manage profile information on an end-user’s workstation in an automatic, continuous, and transparent manner. Our implementation for a Digital Alpha machine running Digital UNIX 4.0 achieves run-time overheads of less than 0.3% during profile collection. Through the application of three code layout optimizations, we further show that Morph can use statistical profiles to improve application performance. With appropriate system support, automatic profiling and optimization is both possible and effective.
Compiler-directed page coloring for multiprocessors
- In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII)
, 1996
Cited by 55 (9 self)
This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This technique uses the compiler’s knowledge of the access patterns of the parallelized applications to direct the operating system’s virtual memory page mapping strategy. We demonstrate that this technique can lead to significant performance improvements over two commonly used page mapping strategies for machines with either direct-mapped or two-way set-associative caches. We also show that it is complementary to latency-hiding techniques such as prefetching. We implemented compiler-directed page coloring in the SUIF parallelizing compiler and on two commercial operating systems. We applied the technique to the SPEC95fp benchmark suite, a representative set of numeric programs. We used the SimOS machine simulator to analyze the applications and isolate their performance bottlenecks. We also validated these results on a real machine, an eight-processor 350MHz Digital AlphaServer. Compiler-directed page coloring leads to significant performance improvements for several applications. Overall, our technique improves the SPEC95fp rating for eight processors by 8% over Digital UNIX’s page mapping policy and by 20% over page coloring, a standard page mapping policy. The SUIF compiler achieves a SPEC95fp ratio of 57.4, the highest ratio to date.
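For readers unfamiliar with the term, the "color" of a page in a physically indexed cache is conventionally the part of the cache index that lies above the page offset. The Python sketch below shows that standard calculation only; it does not reproduce the compiler-directed policy of the paper. The function names and the example cache geometry are illustrative assumptions, not the configuration of the AlphaServer mentioned above.

```python
def num_colors(cache_size, associativity, page_size):
    """Number of distinct page colors in a physically indexed cache.

    One way of the cache spans cache_size / associativity bytes; dividing
    that span by the page size gives the number of page-sized regions
    (colors). At least one color always exists.
    """
    return max(1, cache_size // (associativity * page_size))


def page_color(phys_addr, cache_size, associativity, page_size):
    """Color of the physical page containing phys_addr."""
    frame_number = phys_addr // page_size
    return frame_number % num_colors(cache_size, associativity, page_size)


if __name__ == "__main__":
    # Hypothetical geometry: 1 MB direct-mapped cache with 8 KB pages.
    cache, assoc, page = 1 << 20, 1, 8192
    for frame in (0, 1, 128, 129):
        print(frame, page_color(frame * page, cache, assoc, page))
```

Pages with the same color compete for the same cache sets, so a mapping policy that spreads heavily used pages across different colors reduces conflict misses.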
The Measured Performance of Personal Computer Operating Systems
- ACM Transactions on Computer Systems
, 1996
Cited by 41 (4 self)
This paper presents a comparative study of the performance of three operating systems that run on the personal computer architecture derived from the IBM-PC. The operating systems, Windows for Workgroups, Windows NT, and NetBSD (a freely available variant of the UNIX operating system), cover a broad range of system functionality and user requirements, from a single address space model to full protection with preemptive multi-tasking. Our measurements were enabled by hardware counters in Intel’s Pentium processor that permit measurement of a broad range of processor events including instruction counts and on-chip cache miss counts. We used both microbenchmarks, which expose specific differences between the systems, and application workloads, which provide an indication of expected end-to-end performance. Our microbenchmark results show that accessing system functionality is often more expensive in Windows for Workgroups than in the other two systems due to frequent changes in machine mode and the use of system call hooks. When running native applications, Windows NT is more efficient than Windows, but it incurs overhead similar to that of a microkernel since its application interface (the Win32 API) is implemented as a user-level server. Overall, system functionality can be accessed most efficiently in NetBSD; we attribute this to its monolithic structure, and to the absence of the complications created by hardware backwards compatibility requirements in the other systems. Measurements of application performance show that although the impact of these differences is significant in terms of instruction counts and other hardware events (often a factor of 2 to 7 difference between the systems), overall performance is sometimes determined by the functionality provided by specific subsystems, such as the graphics subsystem or the file system buffer cache.
Memory Subsystem Performance of Programs with Intensive Heap Allocation
- In 21st Annual ACM Symposium on Principles of Programming Languages
, 1994
Cited by 28 (8 self)
Heap allocation with copying garbage collection is a general storage management technique for modern programming languages. It is believed to have poor memory subsystem performance. To investigate this, we conducted an in-depth study of the memory subsystem performance of heap allocation for memory subsystems found on many machines. We studied the performance of mostly-functional Standard ML programs which made heavy use of heap allocation. We found that most machines support heap allocation poorly. However, with the appropriate memory subsystem organization, heap allocation can have good performance. The memory subsystem property crucial for achieving good performance was the ability to allocate and initialize a new object into the cache without a penalty. This can be achieved by having subblock placement with a subblock size of one word with a write allocate policy, along with fast page-mode writes or a write buffer. For caches with subblock placement, the data cache overhead was under 9% for a 64K or larger data cache; without subblock placement the overhead was often higher than 50%.
Trap-driven simulation with Tapeworm II
- 6th Int. Conf. Architectural Support for Programming Languages and
, 1994
Cited by 24 (7 self)
Tapeworm II is a software-based simulation tool that evaluates the cache and TLB performance of multiple-task and operating system intensive workloads. Tapeworm resides in an OS kernel and causes a host machine's hardware to drive simulations with kernel traps instead of with address traces, as is conventionally done. This allows Tapeworm to quickly and accurately capture complete memory referencing behavior with a limited degradation in overall system performance. This paper compares trap-driven simulation, as implemented in Tapeworm, with the more common technique of trace-driven memory simulation with respect to speed, accuracy, portability and flexibility.
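For contrast with the trap-driven approach, the conventional trace-driven technique mentioned above can be sketched in a few lines of Python: a recorded list of addresses is replayed through a software model of the cache. This is a minimal illustration of trace-driven simulation for a direct-mapped cache, not a model of Tapeworm itself (which drives the simulation with kernel traps rather than a trace). The function name and the example parameters are assumptions made for the sketch.

```python
def simulate_direct_mapped(trace, cache_size, line_size):
    """Replay an address trace through a direct-mapped cache model.

    trace      -- iterable of byte addresses (integers)
    cache_size -- total cache capacity in bytes
    line_size  -- cache line size in bytes
    Returns the number of misses.
    """
    num_lines = cache_size // line_size
    tags = [None] * num_lines  # one stored tag per cache line; None means empty
    misses = 0
    for addr in trace:
        block = addr // line_size   # memory block holding this address
        index = block % num_lines   # the single line this block can occupy
        tag = block // num_lines    # distinguishes blocks that share a line
        if tags[index] != tag:
            misses += 1
            tags[index] = tag       # fill the line on a miss
    return misses


if __name__ == "__main__":
    # Addresses 0 and 1024 map to the same line of a 1 KB cache and evict each other.
    print(simulate_direct_mapped([0, 4, 64, 0, 1024, 0], cache_size=1024, line_size=64))
```

Every reference in the trace passes through the Python loop, which hints at why trace-driven simulation tends to be slower than letting the host hardware run at full speed and trapping only on simulated misses.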

