Memory Controller Optimizations for Web Servers
Cited by 37 (1 self)
This paper analyzes memory access scheduling and virtual channels as mechanisms to reduce the latency of main memory accesses by the CPU and peripherals in web servers. Despite the address filtering effects of the CPU's cache hierarchy, there is significant locality and bank parallelism in the DRAM access stream of a web server, which includes traffic from the operating system, application, and peripherals. However, a sequential memory controller leaves much of this locality and parallelism unexploited, as serialization and bank conflicts affect the realizable latency. Aggressive scheduling within the memory controller to exploit the available parallelism and locality can reduce the average read latency of the SDRAM. However, bank conflicts and the limited ability of the SDRAM's internal row buffers to act as a cache hinder further latency reduction. Virtual channel SDRAM overcomes these limitations by providing a set of channel buffers that can hold segments from rows of any internal SDRAM bank. This paper presents memory controller policies that can make effective use of these channel buffers to further reduce the average read latency of the SDRAM.
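The gap between a sequential controller and one that schedules for row locality can be illustrated with a toy model. The sketch below is not the paper's controller: it assumes one open row per bank, treats the whole request list as the pending window, and counts only row activations. The function name and the request format are hypothetical.

```python
def count_row_activations(requests, num_banks, row_hit_first):
    """Count row activations for a list of (bank, row) requests.

    Toy model (an assumption, not the paper's design): each bank keeps
    one open row, and a request to a different row costs one activation.
    """
    open_row = [None] * num_banks
    pending = list(requests)
    activations = 0
    while pending:
        pick = 0  # sequential policy: always serve the oldest request
        if row_hit_first:
            # Prefer the oldest request that hits an already open row.
            for i, (bank, row) in enumerate(pending):
                if open_row[bank] == row:
                    pick = i
                    break
        bank, row = pending.pop(pick)
        if open_row[bank] != row:
            activations += 1
            open_row[bank] = row
    return activations


# Two rows of the same bank, interleaved in arrival order.
trace = [(0, 1), (0, 2), (0, 1), (0, 2)]
in_order = count_row_activations(trace, num_banks=1, row_hit_first=False)
reordered = count_row_activations(trace, num_banks=1, row_hit_first=True)
```

On this trace the sequential policy reopens a row for every request, while the row-hit-first policy opens each row once.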
Adaptive History-Based Memory Schedulers
Cited by 34 (2 self)
As memory performance becomes increasingly important to overall system performance, the need to carefully schedule memory operations also increases. This paper presents a new approach to memory scheduling that considers the history of recently scheduled operations. This history-based approach provides two conceptual advantages: (1) it allows the scheduler to better reason about the delays associated with its scheduling decisions, and (2) it allows the scheduler to select operations so that they match the program's mixture of Reads and Writes, thereby avoiding certain bottlenecks within the memory controller. We evaluate our solution using a cycle-accurate simulator for the recently announced IBM Power5. When compared with an in-order scheduler, our solution achieves IPC improvements of 10.9% on the NAS benchmarks and 63% on the data-intensive Stream benchmarks. Using microbenchmarks, we illustrate the growing importance of memory scheduling in the context of CMPs, hardware-controlled prefetching, and faster CPU speeds.
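The second advantage, matching the issued commands to the program's Read/Write mixture, can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the scheduler evaluated in the paper: the history is reduced to running counts of issued commands, and the function name and the target_read_fraction parameter are hypothetical.

```python
from collections import deque


def schedule_by_mix(reads, writes, target_read_fraction):
    """Interleave queued reads and writes to track a target read share.

    Simplified stand-in for a history-based policy: the history is just
    the running count of reads and of all commands issued so far.
    """
    reads, writes = deque(reads), deque(writes)
    issued_reads = 0
    issued_total = 0
    order = []
    while reads or writes:
        if not writes:
            want_read = True
        elif not reads:
            want_read = False
        else:
            # Issue a read when the read share so far is below the target.
            want_read = issued_reads < target_read_fraction * (issued_total + 1)
        if want_read:
            order.append(reads.popleft())
            issued_reads += 1
        else:
            order.append(writes.popleft())
        issued_total += 1
    return order


order = schedule_by_mix(["R0", "R1"], ["W0", "W1"], target_read_fraction=0.5)
```

With an even target mix the sketch alternates reads and writes instead of draining one queue first.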
ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers
Cited by 13 (6 self)
Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory
Detecting and exploiting spatial regularity in data memory references
, 2003
Cited by 12 (5 self)
The growing processor/memory performance gap causes the performance of many codes to be limited by memory accesses. If known to exist in an application, strided memory accesses forming streams can be targeted by optimizations such as prefetching, relocation, remapping, and vector loads. Undetected, they can be a significant source of memory stalls in loops. Existing stream-detection mechanisms either require special hardware, which may not gather statistics for subsequent analysis, or are limited to compile-time detection of array accesses in loops. Formally, little treatment has been accorded to the subject; the concept of locality fails to capture the existence of streams in a program’s memory accesses. The contributions of this paper are as follows. First, we define spatial regularity as a means to discuss the presence and effects of streams. Second, we develop measures to quantify spatial regularity, and we design and implement an on-line, parallel algorithm to detect streams, and hence regularity, in running applications. Third, we use examples from real codes and common benchmarks to illustrate how derived stream statistics can be used to guide the application of profile-driven optimizations. Overall, we demonstrate the benefits of our novel regu-
This work was performed under the auspices of the U.S. Department of
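As a rough illustration of what detecting a strided stream means, the sketch below scans an address trace for runs of consecutive accesses with a constant non-zero stride. It is an offline, sequential simplification, not the on-line, parallel algorithm the paper describes; the function name and the min_length threshold are hypothetical.

```python
def detect_streams(addresses, min_length=3):
    """Find runs of consecutive accesses with a constant non-zero stride.

    Returns a list of (start_address, stride, length) tuples. Interleaved
    streams are not separated here (an assumption made for brevity).
    """
    streams = []
    i = 0
    n = len(addresses)
    while i < n - 1:
        stride = addresses[i + 1] - addresses[i]
        j = i + 1
        # Extend the run while the next difference equals the stride.
        while j + 1 < n and addresses[j + 1] - addresses[j] == stride:
            j += 1
        length = j - i + 1
        if stride != 0 and length >= min_length:
            streams.append((addresses[i], stride, length))
            i = j + 1  # continue after the detected stream
        else:
            i += 1
    return streams


trace = [100, 108, 116, 124, 7, 4000, 4004, 4008]
streams = detect_streams(trace)
```

The trace contains one stream of four accesses with stride 8 and one of three accesses with stride 4, separated by an unrelated access.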
Implementation and Evaluation of the Complex Streamed Instruction Set
, 2001
Cited by 12 (5 self)
An architectural paradigm designed to accelerate streaming operations on mixed-width data is presented and evaluated. The described Complex Streamed Instruction (CSI) set contains instructions that process data streams of arbitrary length. The number of bits or elements that will be processed in parallel is, therefore, not visible to the programmer, so no recompilation is needed in order to benefit from a wider datapath. CSI also replaces many overhead instructions (such as those needed for data alignment and reorganization), which are often required in applications that use media ISA extensions such as MMX and VIS, with a hardware mechanism. Simulation results using several multimedia kernels demonstrate that CSI provides a factor of up to 9.9 (4.0 on average) performance improvement when compared to Sun's VIS extension. For complete applications, the performance gain is 9% to 36% with an average of 20%.
Beyond Performance: Secure and Fair Memory Management for Multiple Systems on a Chip
- In FPT
, 2003
Cited by 5 (0 self)
Developments in VLSI technologies create the possibility of hosting several independent (sub)systems on a single chip. These systems must share a number of resources, especially off-chip resources, which creates new constraints in the design process. Although performance is still a key constraint, sharing implies that secure access to those resources and QoS guarantees are needed. In this paper, an architecture is presented that achieves the goals listed above. The Embedded Hardware Manager acts as middleware between the applications and the resources, taking the role of resource manager and security agent. The results show that it can prevent resource misuse and the unauthorized reading or even alteration of information while maintaining individual QoS guarantees. At the same time, high performance is still achieved.
Exploiting locality to ameliorate packet queue contention and serialization
- In CF ’06: Proceedings of the 3rd Conference on Computing Frontiers
, 2006
Cited by 3 (0 self)
Packet processing systems maintain high throughput despite relatively high memory latencies by exploiting the coarse-grained parallelism available between packets. In particular, multiple processors are used to overlap the processing of multiple packets. Packet queuing (the fundamental mechanism enabling packet scheduling, differentiated services, and traffic isolation) requires a read-modify-write operation on a linked list data structure to enqueue and dequeue packets; this operation represents a potential serializing bottleneck. If all packets awaiting service are destined for different queues, these read-modify-write cycles can proceed in parallel. However, if all or many of the incoming packets are destined for the same queue, or for a small number of queues, then system throughput will be serialized by these sequential external memory operations. For this reason, low-latency SRAMs are used to implement the queue data structures. This reduces the absolute cost of serialization but does not eliminate it; SRAM latencies determine system throughput. In this paper we observe that the worst-case scenario for packet queuing coincides with the best-case scenario for caches: i.e., when locality exists and the majority of packets are destined for a small number of queues. The main contribution of this work is the queuing cache, which consists of a hardware cache and a closely coupled queuing engine that implements queue operations. The queuing cache improves performance dramatically by moving the bottleneck from external memory onto the packet processor, where clock rates are higher and latencies are lower. We compare the queuing cache to a number of alternatives, specifically, SRAM controllers with: no queuing support, a software-controlled cache plus a queuing engine (like that used on Intel’s IXP network processor), and a hardware cache.
Relative to these models, we show that a queuing cache improves worst-case throughput by factors of 3.1, 1.5, and 2.1 and the throughput of real-world traffic traces by factors of 2.6, 1.3, and 1.75, respectively. We also show that the queuing cache decreases external memory bandwidth usage, on-chip communication, and the num-
A Compiler Algorithm for Exploiting Page-Mode Memory Access in Embedded-DRAM Devices
This paper presents a compiler algorithm and several optimization techniques to exploit a DRAM memory characteristic (page mode) automatically. A page-mode memory access exploits a form of spatial locality, where the data item is in the same row of the memory buffer as the previous access. Thus, access time is reduced because the cost of row selection is eliminated. The algorithm increases the frequency of page-mode accesses by reordering data accesses, grouping together accesses to the same memory row. We implemented this algorithm and present speedup results for four multimedia kernels ranging from 1.25 to 2.19 for a Processing-In-Memory (PIM) embedded DRAM device.
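The effect of grouping accesses by memory row can be shown with a small sketch. It is an illustration only, not the compiler algorithm from the paper: it assumes the accesses are independent and may be freely reordered, and ROW_SIZE and the function names are hypothetical.

```python
ROW_SIZE = 1024  # hypothetical DRAM row size in bytes


def row_of(addr):
    """Map a byte address to its row index (assumes a linear layout)."""
    return addr // ROW_SIZE


def page_mode_hits(addrs):
    """Count accesses that fall in the same row as the previous access."""
    return sum(1 for prev, cur in zip(addrs, addrs[1:])
               if row_of(prev) == row_of(cur))


def group_by_row(addrs):
    """Reorder accesses so that those to the same row are adjacent.

    Rows keep their order of first appearance, and accesses within a
    row keep their original relative order.
    """
    groups = {}
    for addr in addrs:
        groups.setdefault(row_of(addr), []).append(addr)
    return [addr for group in groups.values() for addr in group]


accesses = [0, 2048, 8, 2056, 16, 2064]  # alternates between two rows
before = page_mode_hits(accesses)
after = page_mode_hits(group_by_row(accesses))
```

In the original order every access changes rows; after grouping, only the single switch between the two rows misses page mode.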
Research Statement
o transmit stream parameters to the SMC hardware at run-time, and the SMC prefetched read data and buffered write data, dynamically ordering the memory accesses to avoid bank conflicts and to take advantage of DRAM component features. A set of programmable stream registers (FIFOs) buffered data between the CPU and memory, and the processor accessed these in the natural order of the computation. I studied the performance tradeoffs of numerous SMC configurations for both uniprocessor [TC 00] and SMP systems [EUROPAR 95], and our research group built two proof-of-concept uniprocessor implementations whose performance matched the results of my analysis and simulations: the SMC decreased the execution times of inner loops by factors of up to 13 over normal caching. Subsequent studies have looked at how such a system can be adapted to exploit Rambus DRAMs, using either statically [HPCA 99] or dynamically detected streams [ICS 00]. Current research focuses on the design and implementation of

