Results 1 - 10
of
454
Value Locality and Load Value Prediction
, 1996
"... Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently p ..."
Abstract
-
Cited by 331 (18 self)
- Add to MetaCart
Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describe how to effectively capture and exploit it in order to perform load value prediction. Temporal and spatial locality are attributes of storage locations, and describe the future likelihood of references to those locations or their close neighbors. In a similar vein, value locality describes the likelihood of the recurrence of a previously-seen value within a storage location. Modern processors already exploit value locality in a very restricted sense through the use of control speculation (i.e. branch prediction), which seeks to predict the future value of a single condition bit based on previously-seen values. Our work extends this to predict entire 32- and 64-bit register values based on previously-seen values. We find that, just as condition bits are fairly predictable on a per-static-branch basis, full register values being loaded from memory are frequently predictable as well. Furthermore, we show that simple microarchitectural enhancements to two modern microprocessor implementations (based on the PowerPC 620 and Alpha 21164) that enable load value prediction can effectively exploit value locality to collapse true dependencies, reduce average memory latency and bandwidth requirements, and provide measurable performance gains. 1. Introduction and Related
Optimally Profiling and Tracing Programs
- ACM Transactions on Programming Languages and Systems
, 1994
"... copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others ..."
Abstract
-
Cited by 256 (17 self)
- Add to MetaCart
copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications
Cache coherence protocols: Evaluation using a multiprocessor simulation model
- ACM Transactions on Computer Systems
, 1986
"... Using simulation, we examine the efficiency of several distributed, hardware-based solutions to the cache coherence problem in shared-bus multiprocessors. For each of the approaches, the associated protocol is outlined. The simulation model is described, and results from that model are presented. Th ..."
Abstract
-
Cited by 230 (5 self)
- Add to MetaCart
Using simulation, we examine the efficiency of several distributed, hardware-based solutions to the cache coherence problem in shared-bus multiprocessors. For each of the approaches, the associated protocol is outlined. The simulation model is described, and results from that model are presented. The magnitude of the potential performance difference between the various approaches indicates that the choice of coherence solution is very important in the design of an efficient shared-bus multiprocessor, since it may limit the number of processors in the system.
An Evaluation of Directory Schemes for Cache Coherence
- In Proceedings of the 15th Annual International Symposium on Computer Architecture
, 1988
"... The problem of cache coherence in shared-memory multiprocessors has been addressed using two basic approaches: directory schemes and snoopy cache schemes. Directory schemes have been given less attention in the past several years, while snoopy cache methods have become extremely popular. Directory s ..."
Abstract
-
Cited by 216 (13 self)
- Add to MetaCart
The problem of cache coherence in shared-memory multiprocessors has been addressed using two basic approaches: directory schemes and snoopy cache schemes. Directory schemes have been given less attention in the past several years, while snoopy cache methods have become extremely popular. Directory schemes for cache coherence are potentially attractive in large multiprocessor systems that are beyond the scaling limits of the snoopy cache schemes. Slight modifications to directory schemes can make them competitive in performance with snoopy cache schemes for small multiprocessors. Trace driven simulation, using data collected from several real multiprocessor applications, is used to compare the performance of standard directory schemes, modifications to these schemes, and snoopy cache protocols. 1 Introduction In the past several years, shared-memory multiprocessors have gained wide-spread attention due to the simplicity of the shared-memory parallel programming model. However, allowing...
Tile Size Selection Using Cache Organization and Data Layout
, 1995
"... When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algori ..."
Abstract
-
Cited by 193 (2 self)
- Add to MetaCart
When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses. We measured simulated miss rates and execution times for our algorithm and two others on a variety of problem sizes and cache organizations. At higher set associativity, our algorithm does not always achieve the best performance. However on direct-mapped caches, our algorithm improves simulated miss rates and measured execution times when compared with previous work. 1 Introduction Due to the wide gap between processor and memory speed in current architectures, achieving good performance requires high cache efficiency. Compiler optimizations to improve data locality for uniprocessors is increas...
Data Caching Issues in an Information Retrieval System
- ACM Transactions on Database Systems
, 1990
"... Currently, a variety of information retrieval systems are available to potential users. These services are provided by commercial enterprises (such as Dow Jones [6] and The Source [7]), while others are research efforts (the Boston Community Information System [S]). While in many cases these systems ..."
Abstract
-
Cited by 191 (6 self)
- Add to MetaCart
Currently, a variety of information retrieval systems are available to potential users. These services are provided by commercial enterprises (such as Dow Jones [6] and The Source [7]), while others are research efforts (the Boston Community Information System [S]). While in many cases these systems are accessed from personal computers, typically no advantage is taken of the computing resources of those machines (such as local processing and storage). In this paper we explore the possibility of using the user’s local storage capabilities to cache data at the user’s site. This would improve the response time of user queries albeit at the cost of incurring the overhead required in maintaining multiple copies. In order to reduce this overhead it may be appropriate to allow copies to diverge in a controlled fashion. This would not only make caching less costly, but would also make it possible to propagate updates to the copies more efficiently, for example, when the system is lightly loaded, when communication tariffs are lower, or by batching updates together. Just as importantly, it also makes it possible to access the copies even when the communication lines or the central site are down. Thus, we introduce the notion of quasi-copies, which embodies the ideas sketched above. We also define the types of deviations that seem useful, and discuss the available implementation strategies.
Memory bandwidth limitations of future microprocessors
- IN PROCEEDINGS OF THE 23RD ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 1996
"... This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for ..."
Abstract
-
Cited by 184 (10 self)
- Add to MetaCart
This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches—implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
Characterizing Reference Locality in the WWW
, 1996
"... As the World Wide Web (Web) is increasingly adopted as the infrastructure for large-scale distributed information systems, issues of performance modeling become ever more critical. In particular, locality of reference is an important property in the performance modeling of distributed information sy ..."
Abstract
-
Cited by 184 (18 self)
- Add to MetaCart
As the World Wide Web (Web) is increasingly adopted as the infrastructure for large-scale distributed information systems, issues of performance modeling become ever more critical. In particular, locality of reference is an important property in the performance modeling of distributed information systems. In the case of the Web, understanding the nature of reference locality will help improve the design of middleware, such as caching, prefetching, and document dissemination systems. For example, good measurements of reference locality would allow us to generate synthetic reference streams with accurate performance characteristics, would allow us to compare empirically measured streams to explain differences, and would allow us to predict expected performance for system design and capacity planning. In this paper we propose models for both temporal and spatial locality of reference in streams of requests arriving at Web servers. We show that simple models based only on document popularity (likelihood of reference) are insufficient for capturing either temporal or spatial locality. Instead, we rely on an equivalent, but numerical, representation of a reference stream: a stack distance trace. We show that temporal locality can be characterized by
Exploring the Bounds of Web Latency Reduction from Caching and Prefetching
, 1997
"... Prefetching and caching are techniques commonly used in I/O systems to reduce latency. Many researchers have advocated the use of caching and prefetching to reduce latency in the Web. We derive several bounds on the performance improvements seen from these techniques, and then use traces of Web prox ..."
Abstract
-
Cited by 184 (7 self)
- Add to MetaCart
Prefetching and caching are techniques commonly used in I/O systems to reduce latency. Many researchers have advocated the use of caching and prefetching to reduce latency in the Web. We derive several bounds on the performance improvements seen from these techniques, and then use traces of Web proxy activity taken at Digital Equipment Corporation to quantify these bounds. We found that for these traces, local proxy caching could reduce latency by at best 26%, prefetching could reduce latency by at best 57%, and a combined caching and prefetching proxy could provide at best a 60% latency reduction. Furthermore, we found that how far in advance a prefetching algorithm was able to prefetch an object was a significant factor in its ability to reduce latency. We note that the latency reduction from caching is significantly limited by the rapid changes of objects in the Web. We conclude that for the workload studied caching offers moderate assistance in reducing latency. Prefetching can of...
Effective Hardware-based Data Prefetching for High-performance Processors
- IEEE Transactions on Computers
, 1995
"... Abstract-Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is ..."
Abstract
-
Cited by 180 (2 self)
- Add to MetaCart
Abstract-Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a Reference Prediction Table (RPT) organized as an instruction cache. The three designs differ mostly on the timing of the prefetching. In the simplest scheme (basic), prefetches can be generated one iteration ahead of actual use. The lookahead variation takes advantage of a lookahead pro-gram counter that ideally stays one memory latency time ahead of the real program counter and that is used as the control mecha-nism to generate the prefetches. Finally the correlated scheme uses a more sophisticated design to detect patterns across loop levels. These designs are evaluated by simulating the ten SPEC benchmarks on a cycle-by-cycle basis. The results show that 1) the three hardware prefetching schemes all yield significant reductions in the data access penalty when compared with regu-lar caches, 2) the benefits are greater when the hardware assist augments small on-chip caches, and 3) the lookahead scheme is the preferred one cost-performance wise. Index Terms-Prefetching, hardware function unit, reference prediction, branch prediction, data cache, cycle-by-cycle simulations. I.

