Cache-Conscious Data Placement
In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, 1998
Cited by 131 (3 self)
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache performance by mapping code with temporal locality to different cache blocks in the virtual address space, eliminating cache conflicts. These code placement techniques can be applied directly to the problem of placing data for improved data cache performance. In this paper we present a general framework for Cache Conscious Data Placement. This is a compiler-directed approach that creates an address placement for the stack (local variables), global variables, heap objects, and constants in order to reduce data cache misses. The placement of data objects is guided by a temporal relationship graph between objects generated via profiling. Our results show that profile-driven data placement significantly reduces the data miss rate by 24% on average. 1 Introduction Much effort has b...
A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality
In Proceedings of the 33rd Annual International Symposium on Microarchitecture, 2000
Cited by 44 (8 self)
DRAM row-buffer conflicts occur when a sequence of requests on different rows goes to the same memory bank, causing much higher memory access latency than requests to the same row or to different banks. In this paper, we analyze the sources of row-buffer conflicts in the context of superscalar processors, and propose a permutation-based page interleaving scheme to reduce row-buffer conflicts and to exploit data access locality in the row-buffer. Compared with several existing schemes, we show that the permutation-based scheme dramatically increases the hit rates on DRAM row-buffers and reduces memory stall time of the SPEC95 and TPC-C workloads. The memory stall times of the workloads are reduced up to 68% and 50%, compared with the conventional cache line and page interleaving schemes, respectively. 1 Introduction Concurrent accesses to multiple interleaved memory banks are supported in modern computer systems, where each bank has a row-buffer holding a page of data. 1 With the si...
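The abstract above does not spell out the permutation function, so the following is only a minimal sketch under an assumption: the bank index of conventional page interleaving is XORed with a few higher-order address bits. The constants `PAGE_BITS`, `BANK_BITS`, and `HIGH_SHIFT` are hypothetical values chosen for illustration, not parameters taken from the paper.

```python
# Hypothetical parameters (assumptions, not from the paper).
PAGE_BITS = 11    # 2 KiB row-buffer page
BANK_BITS = 2     # 4 memory banks
HIGH_SHIFT = 18   # position of the higher-order bits mixed into the bank index

BANK_MASK = (1 << BANK_BITS) - 1


def page_interleaved_bank(addr):
    """Conventional page interleaving: bank index sits just above the page offset."""
    return (addr >> PAGE_BITS) & BANK_MASK


def permuted_bank(addr):
    """Assumed permutation: XOR the bank index with higher-order address bits."""
    bank = (addr >> PAGE_BITS) & BANK_MASK
    high = (addr >> HIGH_SHIFT) & BANK_MASK
    return bank ^ high


# Four addresses separated by a large power-of-two stride: different rows,
# identical bank bits under the conventional mapping.
addrs = [i << HIGH_SHIFT for i in range(4)]
conventional = [page_interleaved_bank(a) for a in addrs]
permuted = [permuted_bank(a) for a in addrs]
```

Under these assumed parameters, the four addresses all fall in one bank with the conventional mapping, while the XOR step spreads them over all four banks. Because XOR with a fixed value is a bijection on the bank index, consecutive pages that share the same higher-order bits still map to distinct banks.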
Access Order and Memory-Conscious Cache Utilization
In Proceedings of the First Annual Symposium on High Performance Computer Architecture, 1995
Cited by 39 (11 self)
As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to determine bounds on memory performance. We present several access-ordering schemes, and compare their performance, developing analytic models and partially validating these with benchmark timings on the Intel i860XR. 1. Introduction Processor speeds are increasing much faster than memory speeds, thus memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly scientific computations. Proposed solutions range from software prefetching [4, 16, 27] and iteration space tiling [5, 8, 9, 18, 32, 38], to address transformations [12, 13], unusual memory systems [3, 10, 33, 36], and prefetching or non-blocking caches [1, 6, 34]. Here we take one technique, ...
Dynamic Access Ordering for Streamed Computations
2000
Cited by 24 (3 self)
Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does no...
Mini-rank: Adaptive DRAM architecture for improving memory power efficiency
In Proceedings of the 40th International Symposium on Microarchitecture, 2008
Cited by 5 (1 self)
The widespread use of multicore processors has dramatically increased the demand on high memory bandwidth and large memory capacity. As DRAM subsystem designs stretch to meet the demand, memory power consumption is now approaching that of processors. However, the conventional DRAM architecture prevents any meaningful power and performance trade-offs for memory-intensive workloads. We propose a novel idea called mini-rank for DDRx (DDR/DDR2/DDR3) DRAMs, which uses a small bridge chip on each DRAM DIMM to break a conventional DRAM rank into multiple smaller mini-ranks so as to reduce the number of devices involved in a single memory access. The design dramatically reduces the memory power consumption with only a slight increase on the memory idle latency. It does not change the DDRx bus protocol and its configuration can be adapted for the best performance-power trade-offs. Our experimental results using four-core multiprogramming workloads show that using x32 mini-ranks reduces memory power by 27.0% with 2.8% performance penalty and using x16 mini-ranks reduces memory power by 44.1% with 7.4% performance penalty on average for memory-intensive workloads, respectively.
Dynamic Access Ordering for Symmetric Shared-Memory Multiprocessors
1994
Cited by 4 (4 self)
Memory bandwidth is rapidly becoming the performance bottleneck in the application of high performance microprocessors to vector-like algorithms, including the "Grand Challenge" scientific problems. Caching is not the sole solution for these applications due to the poor temporal and spatial locality of their data accesses. Moreover, the nature of memories themselves has changed. Achieving greater bandwidth requires exploiting the characteristics of memory components "on the other side of the cache" -- they should not be treated as uniform access-time RAM. This paper describes the use of hardware-assisted access ordering in symmetric multiprocessor (SMP) systems. Our technique combines compile-time detection of memory access patterns with a memory subsystem (called a Stream Memory Controller, or SMC) tha...
Hardware And Software Mechanisms For Reducing Load Latency
1996
Cited by 2 (1 self)
As processor demands quickly outpace memory, the performance of load instructions becomes an increasingly critical component to good system performance. This thesis contributes four novel load latency reduction techniques, each targeting a different component of load latency: address calculation, data cache access, address translation, and data cache misses. The contributed techniques are as follows:
• Fast Address Calculation employs a stateless set index predictor to allow address calculation to overlap with data cache access. The design eliminates the latency of address calculation for many loads.
• Zero-Cycle Loads combine fast address calculation with an early-issue mechanism to produce pipeline designs capable of hiding the latency of many loads that hit in the data cache.
• High-Bandwidth Address Translation develops address translation mechanisms with better latency and area characteristics than a multi-ported TLB. The new designs provide multiple-issue processors with ...
Memory scheduling for modern microprocessors
ACM Transactions on Computer Systems, 2007
Cited by 1 (1 self)
The need to carefully schedule memory operations has increased as memory performance has become increasingly important to overall system performance. This article describes the adaptive history-based (AHB) scheduler, which uses the history of recently scheduled operations to provide three conceptual benefits: (1) it allows the scheduler to better reason about the delays associated with its scheduling decisions, (2) it provides a mechanism for combining multiple constraints, which is important for increasingly complex DRAM structures, and (3) it allows the scheduler to select operations so that they match the program’s mixture of Reads and Writes, thereby avoiding certain bottlenecks within the memory controller. We have previously evaluated this scheduler in the context of the IBM Power5. When compared with the state of the art, this scheduler improves performance by 15.6%, 9.9%, and 7.6% for the Stream, NAS, and commercial benchmarks, respectively. This article expands our understanding of the AHB scheduler in a variety of ways. Looking backwards, we describe the scheduler in the context of prior work that focused exclusively on avoiding bank conflicts, and we show that the AHB scheduler is superior for the IBM Power5, which we argue will be representative of future microprocessor memory controllers. Looking forwards, we evaluate this scheduler in the context of future systems by varying a number of microarchitectural features and hardware parameters. For example, we show that the benefit of this scheduler increases as we move to multithreaded environments.
Nonprime Memory Systems and Error Correction in Address Translation
Using a prime number p of memory banks on a vector processor allows a conflict-free access for any slice of p consecutive elements of a vector stored with a stride that is not a multiple of p. To reject the use of a prime number of memory banks, it is generally advanced that address computation for such a memory system would require systematic Euclidean division by the number p. The Chinese Remainder Theorem allows a simple mapping of data onto the memory banks for which address computation does not require any Euclidean division. However, this requires that the number of words in each memory module m and p be relatively prime. We propose a method based on the Chinese Remainder Theorem for moduli with common factors that does not have such a restriction. The proposed method does not require Euclidean division and also results in an efficient error detection/correction mechanism for address translation.
Index Terms: Address translation, error correction, error detection, logical address, memory systems, physical address, vector processors.
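The coprimality requirement mentioned in this abstract can be checked with a small sketch. It assumes the textbook Chinese Remainder Theorem mapping, in which logical address a goes to bank a mod p at offset a mod m; the abstract does not give the exact mapping, and the function names below are hypothetical. Python's `%` operator stands in for whatever division-free hardware logic a real design would use, so the sketch only demonstrates the one-to-one property, not the cost of address computation.

```python
def crt_map(addr, p, m):
    """Assumed textbook CRT mapping: (bank, offset) = (addr mod p, addr mod m)."""
    return addr % p, addr % m


def is_one_to_one(p, m):
    """Check whether all p*m logical addresses land in distinct (bank, offset) cells."""
    cells = {crt_map(a, p, m) for a in range(p * m)}
    return len(cells) == p * m


coprime_ok = is_one_to_one(7, 8)      # 7 banks, 8 words per module: gcd is 1
common_factor = is_one_to_one(6, 8)   # 6 banks, 8 words per module: gcd is 2
```

With 7 banks and 8 words per module every address gets its own cell, whereas with 6 banks and 8 words per module distinct addresses collide, which is the restriction the proposed method is said to remove.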
Bounding on the Gain of Optimizing Data Layout in Vector Processors
In vector processors, the number of memory banks (m) is generally larger than or equal to the memory access time divided by the processor cycle time. This ratio is denoted t, i.e. m ≥ t. Data is moved between the vector registers and the memory using long sequences of memory accesses for which the addresses are separated by a fixed distance called the stride. For some strides, the performance is seriously degraded due to memory bank conflicts. Many scientific applications are based on large matrices, and for such programs it is well known that the most unfavorable strides can be avoided by adding a number of dummy columns or by using hardware skewing. We present an optimal upper bound on the number of access conflicts when optimizing the data layout in this way. Programs are categorized according to their strides, and the worst-case behavior for each such category is given in a theorem. The result shows that for worst-case scenarios the number of conflicts increases rapidly when t grows, e.g. if we want to keep the worst-case behavior relatively constant when t grows from 6 to 10, we need to at least double the number of memory banks. The result is valid for skewed as well as for non-skewed memory systems.
Keywords: Performance bound, Skewing, Vector processors, Matrix computation, Memory bank conflicts
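The effect of an unfavorable stride, and of padding a matrix with a dummy column, can be illustrated with a short sketch. It assumes simple low-order interleaving (bank = address mod number of banks) and a hypothetical 8-bank memory; it does not reproduce the paper's bound, only the conflict behavior that motivates it.

```python
from math import gcd


def bank_of(addr, num_banks):
    """Assumed low-order interleaving: consecutive words go to consecutive banks."""
    return addr % num_banks


def banks_touched(stride, num_banks):
    """Number of distinct banks hit by num_banks consecutive strided accesses."""
    return len({bank_of(i * stride, num_banks) for i in range(num_banks)})


NUM_BANKS = 8  # hypothetical bank count

# Walking down a column of a row-major matrix with 8 columns gives stride 8.
unpadded = banks_touched(8, NUM_BANKS)
# Adding one dummy column changes the stride to 9.
padded = banks_touched(9, NUM_BANKS)
```

Under these assumptions the stride-8 walk hits a single bank, while the padded stride-9 walk spreads over all eight. In general the count equals the number of banks divided by gcd(stride, number of banks), so strides sharing a large factor with the bank count are the unfavorable ones.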

