Cache-Conscious Structure Layout, 1999
Cited by 164 (8 self)
Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source (poor reference locality) of the problem rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and, consequently, their performance. It explores two placement techniques, clustering and coloring, that improve cache performance by increasing a pointer structure's spatial and temporal locality, and by reducing cache conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies, cache-conscious reorganization and cache-conscious allocation, and describes two semi-automatic tools, ccmorph and ccmalloc, that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefits, in most cases significantly outperforming state-of-the-art prefetching.
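The clustering idea in this abstract can be illustrated with a toy model. The sketch below is not ccmorph; it is a minimal illustration under stated assumptions: a complete binary tree with heap-style node numbering, a hypothetical cache block that holds exactly three nodes, and hypothetical helper names (`bfs_layout`, `clustered_layout`, `blocks_touched`). It counts how many distinct blocks a root-to-leaf search touches when nodes are placed in breadth-first order versus when each parent is packed next to its two children.

```python
NODES_PER_BLOCK = 3  # assumption: one cache block holds exactly three tree nodes


def bfs_layout(depth):
    """Map heap-numbered node id -> memory slot, in breadth-first order."""
    node_count = 2 ** depth - 1
    return {node: node - 1 for node in range(1, node_count + 1)}


def clustered_layout(depth):
    """Pack each even-level node and its two children into one block."""
    slot = {}
    cluster_index = 0
    for level in range(0, depth, 2):
        for node in range(2 ** level, 2 ** (level + 1)):
            base = cluster_index * NODES_PER_BLOCK  # start on a block boundary
            members = [m for m in (node, 2 * node, 2 * node + 1) if m < 2 ** depth]
            for offset, member in enumerate(members):
                slot[member] = base + offset
            cluster_index += 1
    return slot


def blocks_touched(layout, leaf):
    """Count distinct blocks visited on the path from the root to `leaf`."""
    blocks = set()
    node = leaf
    while node >= 1:
        blocks.add(layout[node] // NODES_PER_BLOCK)
        node //= 2  # move to the parent in heap numbering
    return len(blocks)


if __name__ == "__main__":
    depth = 4  # levels 0..3, nodes 1..15, leaves 8..15
    for leaf in range(8, 16):
        print(leaf,
              blocks_touched(bfs_layout(depth), leaf),
              blocks_touched(clustered_layout(depth), leaf))
```

In this toy setting every root-to-leaf path in the four-level tree touches three blocks under the breadth-first layout and two under the clustered layout. Real hardware effects (block size, associativity, node size) are deliberately left out.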
Runahead execution: An alternative to very large instruction windows for out-of-order processors
- In HPCA-9, 2003
Cited by 123 (19 self)
Today’s high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large, in terms of both design complexity and power consumption. And, the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor, without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations, allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel® Pentium® 4 processor, having a 128-entry instruction window, adding runahead execution improves the IPC (Instructions Per Cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.
Using generational garbage collection to implement cache-conscious data placement
- In Proceedings of the International Symposium on Memory Management, 1998
Cited by 90 (11 self)
The cost of accessing main memory is increasing. Machine designers have tried to mitigate the consequences of the processor and memory technology trends underlying this increasing gap with a variety of techniques to reduce or tolerate memory latency. These techniques, unfortunately, are only occasionally successful for pointer-manipulating programs. Recent research has demonstrated the value of a complementary approach, in which pointer-based data structures are reorganized to improve cache locality. This paper studies a technique for using a generational garbage collector to reorganize data
Integrated Parallel Prefetching and Caching, 1995
Cited by 63 (5 self)
Recently there has been a great deal of interest in prefetching from parallel disks, as a technique for enabling serial applications to improve I/O performance. Studies have also shown that for optimal performance, it is important to properly integrate prefetching and caching. In this paper, we study integrated prefetching and caching strategies for multiple disks. We present two algorithms, regular aggressive and reverse aggressive, and show that reverse aggressive is close to optimal. Using trace-driven simulation on a collection of file access traces, we evaluated these algorithms under a variety of data placement alternatives. Our results show that both algorithms can achieve near linear speedup when the load is distributed evenly on the disks, and reverse aggressive performs well even when the placement of blocks on disks distributes the load unevenly. Our simulations also show that, for file system traces, replicating data, even across all of the disks, offers little performance ...
Out-of-Order Vector Architectures, 1997
Cited by 46 (21 self)
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace-driven simulation we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register renaming, vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24 to 1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts, generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15 to 20%.
Caches As Filters: A Framework for the Analysis of Caching Systems, 2001
Cited by 15 (4 self)
This dissertation describes the Cache Filter Model, an analytical framework for cache system analysis. This framework provides a language and formal notation that enables researchers to reason and communicate about systems in an insightful new way. There are four major components that form the framework. First, the TSpec notation is a formal way for researchers to communicate with clarity about memory references generated by a processor. Second, the concept of an equivalence class of memory references provides an abstraction for eliminating artifacts due to chance address bindings or specific inputs. Third, the functional cache filter model uses the TSpec notation and equivalence class concept to allow designers to more clearly understand the effects of cache systems on particular memory references. Fourth, new metrics provide more insight into cache system behavior than current measures such as hit rate or average memory access time. This dissertation presents the cache filter framework in detail and demonstrates its use on several example kernels.
Third generation computer systems
- ACM Computing Surveys, 1971
Cited by 13 (0 self)
The common features of third generation operating systems are surveyed from a general view, with emphasis on the common abstractions that constitute at least the basis for a "theory" of operating systems. Properties of specific systems are not discussed except where examples are useful. The technical aspects of issues and concepts are stressed, the nontechnical aspects mentioned only briefly. A perfunctory knowledge of third generation systems is presumed.
Key words and phrases: multiprogramming systems, operating systems, supervisory systems, time-sharing systems, programming, storage allocation, memory allocation, processes, concurrency, parallelism, resource allocation, protection
CR categories: 1.3, 4.0, 4.30, 6.20
It has been the custom to divide the era of electronic computing into "generations" whose approximate dates are:
Before Memory Was Virtual, 1997
Cited by 6 (0 self)
This paper celebrated the successful birth of virtual memory. Object-Oriented Virtual Memory
Smart Register Files for High-Performance Microprocessors, 1999
Cited by 4 (1 self)
This report examines how the compiler can more efficiently use a large number of processor registers. The placement of data items into registers, called register allocation, is known to be one of the most important compiler optimizations for high-speed computers because registers are the fastest storage devices in the computer system. However, register allocation has been limited in scope because of aliasing in the memory system. To break this limitation and allow more data to be placed into registers, new compiler and microarchitecture support is needed. We propose the modification of register access semantics to include an indirect access mode. We call this optimization the Smart Register File. The smart register file allows the relaxation of overly-conservative assumptions in the compiler by having the hardware provide support for aliased data items in processor registers. As a result, the compiler can allocate data from a larger pool of candidates than in a conventional system. An...
Microarchitectural Innovations: Boosting Microprocessor Performance beyond Semiconductor Technology Scaling
- Proc. IEEE 89(11): 1560–1575, 2001
Cited by 3 (1 self)
plentiful transistors to build microprocessors, and applications continue to drive the demand for more powerful microprocessors. Weaving the “raw” semiconductor material into a microprocessor that offers the performance needed by modern and future applications is the role of computer architecture. This paper overviews some of the microarchitectural techniques that empower modern high-performance microprocessors. The techniques are classified into: 1) techniques meant to increase the concurrency in instruction processing, while maintaining the appearance of sequential processing and 2) techniques that exploit program behavior. The first category includes pipelining, superscalar execution, out-of-order execution, register renaming, and techniques to overlap memory-accessing instructions. The second category includes memory hierarchies, branch predictors, trace caches, and memory-dependence predictors. The paper also discusses microarchitectural techniques likely to be used in future microprocessors, including data value speculation and instruction reuse, microarchitectures with multiple sequencers and thread-level speculation, and microarchitectural techniques for tackling the problems of power consumption and reliability. Keywords: Branch prediction, high-performance microprocessors, memory dependence speculation, microarchitecture, out-of-order execution, speculative execution, thread-level speculation.

