Results 1 - 10
of
29
The Multi-Queue Replacement Algorithm for Second Level Buffer Caches
- In Proceedings of the 2001 USENIX Annual Technical Conference
, 2001
"... This paper reports our research results that improve second level buer cache performance. Several previous studies have shown that a good single level cache replacement algorithm such as LRU does not work well with second level buer caches. Second level buer caches have dierent access pattern from ..."
Abstract
-
Cited by 86 (12 self)
- Add to MetaCart
This paper reports our research results that improve second level buer cache performance. Several previous studies have shown that a good single level cache replacement algorithm such as LRU does not work well with second level buer caches. Second level buer caches have dierent access pattern from rst level buer caches because Accesses to second level buer caches are actually misses from rst level buer caches.
OS-controlled cache predictability for real-time systems
- In Third IEEE Real-time Technology and Applications Symposium (RTAS
, 1997
"... Cache-partitioning techniques have been invented to make modern processors with an extensive cache structure useful in real-time systems where task switches disrupt cache working sets and hence make execution times unpredictable. This paper describes an OS-controlled application-transparent cache-pa ..."
Abstract
-
Cited by 65 (12 self)
- Add to MetaCart
Cache-partitioning techniques have been invented to make modern processors with an extensive cache structure useful in real-time systems where task switches disrupt cache working sets and hence make execution times unpredictable. This paper describes an OS-controlled application-transparent cache-partitioning technique. The resulting partitions can be transparently assigned to tasks for their exclusive use. The major drawbacks found in other cache-partitioning techniques, namely waste of memory and additions on the critical performance path within CPUs, are avoided using memory coloring techniques that do not require changes within the chips of modern CPUs or on the critical path for performance. A simple filter algorithm commonly used in real-time systems, a matrixmultiplication algorithm and the interaction of both are analysed with regard to cache-induced worst case penalties. Worst-case penalties are determined for different widely-used cache architectures. Some insights regarding the impact of cache architectures on worst-case execution are described.
Architectural Support for Single Address Space Operating Systems
, 1992
"... Recent microprocessor announcements show a trend toward wide-address computers: architectures that support 64 bits of virtual address space. Such architectures facilitate fundamentally new operating system organizations that promote efficient data sharing and cooperation, both between complex applic ..."
Abstract
-
Cited by 63 (5 self)
- Add to MetaCart
Recent microprocessor announcements show a trend toward wide-address computers: architectures that support 64 bits of virtual address space. Such architectures facilitate fundamentally new operating system organizations that promote efficient data sharing and cooperation, both between complex applications and between parts of the operating system itself. One such organization is the single address space operating system, in which all processes run within a single global virtual address space; protection is provided not through conventional address space boundaries, but through protection domains that dictate which pages of the global address space a process can reference. This paper focuses on the architectural implications of single address space operating systems, specifically the interaction between the memory system architecture and the operating system's use of addressing and protection. Our purpose is to explore certain architectural opportunities created by single address space ...
Instruction Fetching: Coping with Code Bloat
- In Proceedings of the 22nd Annual International Symposium on Computer Architecture
, 1995
"... Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmar ..."
Abstract
-
Cited by 62 (9 self)
- Add to MetaCart
Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes into full account both the application and operating-system components of the workloads. This paper then reexamines a collection of previously-proposed hardware mechanisms for improving instruction-fetch performance
Out-of-Order Vector Architectures
, 1997
"... Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace d ..."
Abstract
-
Cited by 46 (21 self)
- Add to MetaCart
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace driven simulation we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register renaming, vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24--1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts -- generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15--20%.
Software-Managed Address Translation
, 1997
"... In this paper we explore software-managed address translation. The purpose of the study is to specify the memory management design for a high clock-rate PowerPC implementation in which a simple design is a prerequisite for a fast clock and a short design cycle. We show that softwaremanaged address t ..."
Abstract
-
Cited by 39 (7 self)
- Add to MetaCart
In this paper we explore software-managed address translation. The purpose of the study is to specify the memory management design for a high clock-rate PowerPC implementation in which a simple design is a prerequisite for a fast clock and a short design cycle. We show that softwaremanaged address translation is just as efficient as hardware -managed address translation, and it is much more flexible. Operating systems such as OSF/1 and Mach charge between 0.10 and 0.28 cycles per instruction (CPI) for address translation using dedicated memory-management hardware. Software-managed translation requires 0.05 CPI. Mechanisms to support such features as shared memory, superpages, sub-page protection, and sparse address spaces can be defined completely in software, allowing much more flexibility than in hardware-defined mechanisms. 1 Introduction In many commercial architectures the hardware support for memory management is unnecessarily complicated, places constraints on the operating sys...
Practical, transparent operating system support for superpages
- SIGOPS Oper. Syst. Rev
, 2002
"... Most general-purpose processors provide support for memory pages of large sizes, called superpages. Superpages enable each entry in the translation lookaside buffer (TLB) to map a large physical memory region into a virtual address space. This dramatically increases TLB coverage, reduces TLB misses, ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
Most general-purpose processors provide support for memory pages of large sizes, called superpages. Superpages enable each entry in the translation lookaside buffer (TLB) to map a large physical memory region into a virtual address space. This dramatically increases TLB coverage, reduces TLB misses, and promises performance improvements for many applications. However, supporting superpages poses several challenges to the operating system, in terms of superpage allocation and promotion tradeoffs, fragmentation control, etc. We analyze these issues, and propose the design of an effective superpage management system. We implement it in FreeBSD on the Alpha CPU, and evaluate it on real workloads and benchmarks. We obtain substantial performance benefits, often exceeding 30%; these benefits are sustained even under stressful workload scenarios. 1
Randomized Cache Placement for Eliminating Conflicts
, 1999
"... Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking which are introduced within a ..."
Abstract
-
Cited by 25 (1 self)
- Add to MetaCart
Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache where miss ratio depends solely on working set behavior. We examine the impact of pseudo-random cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulati...
Page placement algorithms for large real-indexed caches
- ACM Transactions on Computer Systems
, 1992
"... When a computer system supports both paged virtual memory and large real-indexed caches, cache performance depends in part on the main memory page placement. To date, most operating systems place pages by selecting an arbitrary page frame from a pool of page frames that have been made available by t ..."
Abstract
-
Cited by 21 (0 self)
- Add to MetaCart
When a computer system supports both paged virtual memory and large real-indexed caches, cache performance depends in part on the main memory page placement. To date, most operating systems place pages by selecting an arbitrary page frame from a pool of page frames that have been made available by the page replacement algorithm. We give a simple model that shows that this naive (arbitrary) page placement leads to up to 30 % unnecessary cache conflicts. We develop several page placement algorithms, called careful-mappmg algonthnzs, that try to select a page frame (from the pool of available page frames) that is likely to reduce cache contention. Using trace-driven simulation, we find that careful mappmg results in 1O–2OTO fewer (dynamic) cache misses than naive mapping (for a direct-mapped real-indexed multimegabyte cache). Thus, our results suggest that careful mapping by the operating system can get about half the cache miss reduction that a cache size (or associativity) doubling can.

