Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors
- Journal of Parallel and Distributed Computing
, 1991
Cited by 264 (17 self)
The large latency of memory accesses is a major obstacle in obtaining high processor utilization in large scale shared-memory multiprocessors. Although the provision of coherent caches in many recent machines has alleviated the problem somewhat, cache misses still occur frequently enough that they significantly lower performance. In this paper we evaluate the effectiveness of non-binding software-controlled prefetching, as proposed in the Stanford DASH Multiprocessor, to address this problem. The prefetches are non-binding in the sense that the prefetched data is brought to a cache close to the processor, but is still available to the cache coherence protocol to keep it consistent. Prefetching is software-controlled since the program must explicitly issue prefetch instructions.
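The "non-binding" property described in this abstract can be illustrated with a toy model. The sketch below is not the DASH mechanism; it assumes a simple invalidation-based protocol, and the names `ToyCache` and `remote_write` are hypothetical. It shows that a prefetched line stays visible to coherence actions, so a later load still observes a remote write, whereas a value copied early (a binding prefetch, for instance into a register) would be stale.

```python
class ToyCache:
    """A one-level cache over a shared memory dict (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> cached value

    def prefetch(self, addr):
        # Non-binding: the data is placed in the cache, not bound to a register.
        self.lines[addr] = self.memory[addr]

    def invalidate(self, addr):
        # Coherence action triggered by a write from another processor.
        self.lines.pop(addr, None)

    def load(self, addr):
        # Returns (value, hit). A miss refetches the current memory value.
        hit = addr in self.lines
        if not hit:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr], hit


def remote_write(memory, caches, addr, value):
    """Model another processor writing addr: update memory, invalidate copies.

    Assumes an invalidation-based protocol; this is a simplification.
    """
    memory[addr] = value
    for cache in caches:
        cache.invalidate(addr)
```

In this model, a prefetch followed by a remote write followed by a load yields a miss that returns the new value: the prefetch lost its benefit but correctness is preserved.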
Cache Write Policies and Performance
, 1991
Cited by 122 (3 self)
This paper investigates issues involving writes and caches. First, tradeoffs between write-through and write-back caching when writes hit in a cache are considered. A mixture of these two alternatives, called write caching, is proposed. Write caching places a small fully-associative cache behind a write-through cache. A write cache can eliminate almost as much write traffic as a write-back cache. Second, tradeoffs on writes that miss in the cache are investigated. In particular, whether the missed cache block is fetched on a write miss, whether the missed cache block is allocated in the cache, and whether the cache line accessed is invalidated are considered. Depending on the combination of these policies chosen, the entire cache miss rate can vary by a factor of two on some applications. Furthermore, the combination of no-fetch-on-write and write-allocate can provide better performance than cache line allocation instructions. Finally, the traffic at the back side of write-through and wr...
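The write-cache idea in this abstract (a small fully-associative cache behind a write-through cache that absorbs repeated writes) can be sketched as a traffic counter. This is a minimal model under assumptions not taken from the paper: LRU replacement, a hypothetical `block_size` of 8 bytes, traffic counted as one transaction per write or per evicted entry, and a final flush of all remaining entries. The function name `write_traffic` is illustrative.

```python
from collections import OrderedDict


def write_traffic(addresses, entries, block_size=8):
    """Count write transactions that reach the next memory level.

    addresses: byte addresses of the writes leaving the write-through cache.
    entries:   number of write-cache entries; 0 models plain write-through,
               where every write goes straight to the next level.
    """
    if entries == 0:
        return len(addresses)

    cache = OrderedDict()  # block number -> True, ordered LRU first
    traffic = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            # Write hits an existing entry: coalesce, no traffic generated.
            cache.move_to_end(block)
        else:
            if len(cache) == entries:
                # Evict the least recently used entry; it is written out.
                cache.popitem(last=False)
                traffic += 1
            cache[block] = True
    # Flush: each entry still held must eventually be written out once.
    return traffic + len(cache)
```

For the write stream `[0, 1, 2, 3, 8, 9, 0, 1]`, plain write-through sends 8 transactions, a one-entry write cache sends 3, and a two-entry write cache sends 2, since both blocks stay resident and all repeated writes coalesce.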
Data Prefetch Mechanisms
, 2000
Cited by 79 (4 self)
The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching
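The timeliness requirement mentioned in this abstract can be made concrete with a small timing model. The sketch below is an illustration under stated assumptions, not a mechanism from the survey: a sequential scan over blocks, one time unit of computation per block, a fixed fetch `latency`, unlimited outstanding fetches, and no cache pollution or bandwidth limits. The function name `stall_time` and the `distance` parameter (how many blocks ahead a prefetch is issued) are hypothetical.

```python
def stall_time(n_blocks, latency, distance):
    """Total time the processor waits on memory during a sequential scan.

    distance=0 disables prefetching, so every block is a demand miss.
    """
    ready_at = {}  # block index -> time at which its data arrives
    now = 0
    stalls = 0
    for i in range(n_blocks):
        # Issue a prefetch for the block `distance` iterations ahead.
        if distance > 0:
            target = i + distance
            if target < n_blocks and target not in ready_at:
                ready_at[target] = now + latency
        # Demand miss: nothing has requested this block yet.
        if i not in ready_at:
            ready_at[i] = now + latency
        # Wait if the data has not arrived (late prefetch or demand miss).
        if ready_at[i] > now:
            stalls += ready_at[i] - now
            now = ready_at[i]
        now += 1  # compute on the block
    return stalls
```

With 100 blocks and a latency of 10, no prefetching stalls for 1000 time units. A prefetch distance of 10 leaves only the 100 units spent on the first ten blocks, which nothing could have prefetched. A shorter distance of 5 falls in between because some prefetches arrive late. The model ignores the costs of prefetching too early, such as the cache pollution the abstract warns about.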
Cache Conscious Algorithms for Relational Query Processing
- In Proceedings of the 20th VLDB Conference
, 1994
Cited by 75 (2 self)
The current main memory (DRAM) access speeds lag far behind CPU speeds. Cache memory, made of static RAM, is being used in today's architectures to bridge this gap. It provides access latencies of 2-4 processor cycles, in contrast to main memory, which requires 15-25 cycles. Therefore, the performance of the CPU depends upon how well the cache can be utilized. We show that there are significant benefits in redesigning our traditional query processing algorithms so that they can make better use of the cache. The new algorithms run 8%-200% faster than the traditional ones.
1 Introduction
DRAM access speeds have not improved much compared to the CPU cycle time reduction resulting from the improvements in VLSI technology. Cache memories, made of fast static RAM, help alleviate this disparity by exploiting the spatial and temporal locality in the data accesses of a program. However, programs with poor access locality waste a significant number of cycles transferring the data to and from th...
Reducing DRAM latencies with an integrated memory hierarchy design
- In Proceedings of the 7th International Symposium on High Performance Computer Architecture (HPCA-7)
, 2001
Cited by 74 (5 self)
In this paper, we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half of its time stalling for L2 misses. Large cache blocks can improve performance, but only when coupled with wide memory channels. DRAM address mappings also affect performance significantly. We evaluate an aggressive prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 43% speedup across 10 of the 26 SPEC2000 benchmarks, without degrading performance on the others. With eight Rambus channels, these ten benchmarks improve to within 10% of the performance of a perfect L2 cache.
GreedyDual* Web Caching Algorithm -- Exploiting the Two Sources of Temporal Locality in Web Request Streams
- In Proceedings of the 5th International Web Caching and Content Delivery Workshop
, 2000
Cited by 57 (3 self)
The relative importance of long-term popularity and short-term temporal correlation of references for Web cache replacement policies has not been studied thoroughly. This is partially due to the lack of accurate characterization of temporal locality that enables the identification of the relative strengths of these two sources of temporal locality in a reference stream. In [21], we have proposed such a metric and have shown that Web reference streams differ significantly in the prevalence of these two sources of temporal locality. These findings underscore the importance of a Web caching strategy that can adapt in a dynamic fashion to the prevalence of these two sources of temporal locality. In this paper, we propose a novel cache replacement algorithm, GreedyDual*, which is a generalization of GreedyDual-Size. GreedyDual* uses the metrics proposed in [21] to adjust the relative worth of long-term popularity versus short-term temporal correlation of references. Our trace-driven simulati...
Inexpensive implementations of set-associativity
- In 16th Annual International Symposium on Computer Architecture
, 1989
Cited by 49 (0 self)
The traditional approach to implementing wide set-associativity is expensive, requiring a wide tag memory (directory) and many comparators. Here we examine alternative implementations of associativity that use hardware similar to that used to implement a direct-mapped cache. One approach scans tags serially from most-recently used to least-recently used. Another uses a partial compare of a few bits from each tag to reduce the number of tags that must be examined serially. The drawback of both approaches is that they increase cache access time by a factor of two or more over the traditional implementation of set-associativity, making them inappropriate for cache designs in which a fast access time is crucial (e.g. level one caches, caches directly servicing processor requests). These schemes are useful, however, if (1) the low miss ratio of wide set-associative caches is desired, (2) the low cost of a direct-mapped implementation is preferred, and (3) the slower access time of these approaches can be tolerated. We expect these conditions to be true for caches in multiprocessors designed to reduce memory interconnection traffic, caches implemented with large, narrow memory chips, and level two (or higher) caches in a cache hierarchy.
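The first approach in this abstract, scanning the tags of a set serially from most-recently used to least-recently used, can be sketched in software by counting tag probes. This is a minimal illustration, not the paper's hardware design: the function name `mru_lookup`, the list representation of a set, and the LRU replacement on a miss are assumptions.

```python
def mru_lookup(set_tags, tag, ways=4):
    """Look up `tag` in one cache set, probing tags one at a time.

    set_tags: list of tags in this set, most recently used first.
              It is updated in place.
    Returns (hit, probes), where probes is the number of tags compared.
    """
    for position, stored in enumerate(set_tags):
        if stored == tag:
            # Hit: move the tag to the front so it is probed first next time.
            set_tags.insert(0, set_tags.pop(position))
            return True, position + 1

    # Miss: every stored tag was compared before the miss was known.
    probes = len(set_tags)
    if len(set_tags) == ways:
        set_tags.pop()  # evict the least recently used tag
    set_tags.insert(0, tag)
    return False, probes
```

A hit on the most recently used tag costs one probe, the same as a direct-mapped lookup, while a hit deeper in the set or a miss costs more. That is consistent with the abstract's point that the scheme trades access time for lower hardware cost.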
Cache Coherence in Distributed Systems
, 1987
Cited by 24 (0 self)
Caching has long been recognized as a powerful performance enhancement technique in many areas of computer design. Most modern computer systems include a hardware cache between the processor and main memory, and many operating systems include a software cache between the file system routines and the disk hardware. In a distributed file system, where the file systems of several client machines are separated from the server backing store by a communications network, it is desirable to have a cache of recently used file blocks at the client, to avoid some of the communications overhead. In this configuration, special care must be taken to maintain consistency between the client caches, as some disk blocks may be in use by more than one client. For this reason, most current distributed file systems do not provide a cache at the client machine. Those systems that do place restrictions on the types of file blocks that may be shared, or require extra communication to confirm that...
Temporal Locality in Web Request Streams: Sources, Characteristics, and Caching Implications (Extended Abstract)
- In Proceedings of SIGMETRICS
, 2000
Cited by 18 (4 self)
Shudong Jin and Azer Bestavros, Computer Science Department, Boston University, 111 Cummington St, Boston, MA 02215, {jins,bestavros}@cs.bu.edu
1. INTRODUCTION
Web access patterns exhibit a number of unique properties that have been identified and characterized. The prevalence of some of these properties has motivated the development of many protocols (and optimizations thereof) that exploit such properties. One such property is the temporal locality of reference exhibited in Web request streams. Temporal locality in Web request streams emerges from two distinct phenomena, the long-term popularity [1, 2, 3] of Web documents and the short-term temporal correlations of references. Delineating between these two sources is important because they have different implications for caching and replication protocols. The highly skewed popularity of Web documents suggests the use of long-term frequency in caching and replication algorithms, while the temporal correlations of references suggest t...
Stride-directed Prefetching for Secondary Caches
- In Proceedings of the 1997 International Conference on Parallel Processing
, 1997
"... skim@austin.ibm.com ..."

