Making Pointer-Based Data Structures Cache Conscious
- IEEE Computer, 2000
- Cited by 19 (1 self)
Processor and memory technology trends portend a continual increase in the relative cost of accessing main memory. Machine designers have tried to mitigate the effect of this trend through a hierarchy of caches and a variety of other hardware and software techniques. These techniques, unfortunately, have only been partially successful for pointer-manipulating programs. This paper explores a complementary approach of enlisting programmers and tool writers in the task of improving the cache locality of accesses to pointer-based data structures. Throughout, we exploit the location transparency of pointer-based data structures, which allows changes to the memory (and cache) layout of nodes, records, fields, etc. We discuss how programmers can manually improve cache performance with techniques such as clustering, compression, and coloring. We then explore how to lessen a programmer's burden with the help of semi-automatic and automatic tools for changing structure layout to improve cache per...
Improving Locality for Adaptive Irregular Scientific Codes
- In 13th Int'l Workshop on Languages and Compilers for Parallel Computing, 1999
- Cited by 14 (2 self)
An important class of scientific codes accesses memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can significantly reduce cache miss rates by increasing spatial locality. In this paper, we investigate techniques for using partitioning algorithms to improve locality in adaptive irregular codes. We develop parameters to guide both geometric (RCB) and graph partitioning (METIS) algorithms, and develop a new graph partitioning algorithm based on hierarchical clustering (GPART) which achieves good locality with low overhead. We also examine the effectiveness of locality optimizations for adaptive codes, where connection patterns dynamically change at intervals during program execution. We use a simple cost model to guide locality optimizations when access patterns change. Experiments on irregul...
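The "consecutive packing" idea that this abstract cites as prior work can be illustrated with a minimal Python sketch. It is not the paper's RCB, METIS, or GPART partitioning; it only shows first-touch reordering, under the assumption that the irregular accesses are given as a list of node-index pairs. The function names `first_touch_order` and `apply_reordering` are hypothetical.

```python
def first_touch_order(num_nodes, edges):
    """Map each old node index to a new slot, in order of first access."""
    new_index = [-1] * num_nodes
    next_slot = 0
    for u, v in edges:
        for node in (u, v):
            if new_index[node] == -1:      # first time this node is touched
                new_index[node] = next_slot
                next_slot += 1
    for node in range(num_nodes):          # nodes never accessed go last
        if new_index[node] == -1:
            new_index[node] = next_slot
            next_slot += 1
    return new_index


def apply_reordering(data, edges, new_index):
    """Move node data to its new slots and renumber the edge list to match."""
    packed = [None] * len(data)
    for old, new in enumerate(new_index):
        packed[new] = data[old]
    new_edges = [(new_index[u], new_index[v]) for u, v in edges]
    return packed, new_edges


edges = [(4, 0), (0, 2), (2, 4), (1, 3)]
data = ["a", "b", "c", "d", "e"]
new_index = first_touch_order(len(data), edges)
packed, new_edges = apply_reordering(data, edges, new_index)
```

After reordering, nodes that are accessed together early in the traversal occupy adjacent slots, which is the spatial-locality effect the abstract describes. An adaptive code would have to repeat this step when the edge list changes, which is where a cost model becomes relevant.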
Memory-Side Prefetching for Linked Data Structures
- Journal of Parallel and Distributed Computing, 2001
- Cited by 12 (0 self)
This work studies a memory-side prefetching technique to hide latency incurred by inherently serial accesses to linked data structures (LDS). A programmable prefetch engine sits close to memory and traverses LDS independently from the processor. The prefetch engine can run ahead of the processor because of its low-latency, high-bandwidth path to memory. This allows the prefetch engine to initiate data transfers earlier than the processor and pipeline multiple such transfers over the network. We evaluate
Quantifying Load Stream Behavior
- In Proceedings of the Eighth International Symposium on High-Performance Computer Architecture, 2002
- Cited by 11 (3 self)
The increasing performance gap between processors and memory will force future architectures to devote significant resources towards removing and hiding memory latency. The two major architectural features used to address this growing gap are caches and prefetching. In this paper
Access Pattern based Local Memory Customization for Low Power Embedded Systems
- In DATE, 2001
- Cited by 10 (3 self)
Memory accesses represent a major bottleneck in embedded systems power and performance. Traditionally, the local memory relied on a large cache to store all the variables in the application. However, especially in large real-life applications, different types of data exhibit divergent types of locality and access patterns, with diverse locality and bandwidth needs. Traditional caches had to compromise between the different types of locality required by the access patterns, and trade off performance against bandwidth requirements. Instead, our approach customizes the local memory architecture to match the diverse access patterns and locality types present in the application, reducing the main memory bandwidth requirement and significantly improving power consumption without sacrificing performance. Our approach generated an average 30% memory power reduction without degrading performance on a set of large multimedia/general purpose applications and scientific kernels, over the best traditional cache configuration of similar size, demonstrating the utility of our algorithm.
Profile Guided Compiler Optimizations
- 2002
- Cited by 5 (0 self)
Over the past several decades numerous compile-time optimizations have been developed to speed up the execution of programs. Application of a typical optimization can be viewed as consisting of two primary tasks: uncovering optimization opportunities through static analysis of the program; and transforming the program to exploit the uncovered opportunities. Most of the early work on classical code optimizations is based upon a very simple performance model. Optimizations are defined such that their application is always considered to have a positive impact on performance. Therefore, the focus of the research has been on developing aggressive analysis techniques for uncovering opportunities for optimization in greater numbers and designing powerful program transformations to exploit most if not all of the uncovered opportunities. While the simplicity of the approach is attractive, in recent years it has been recognized that the above approach for optimizing program...
Compressing heap data for improved memory performance
- 2006
- Cited by 4 (0 self)
We introduce a class of transformations that modify the representation of dynamic data structures used in programs with the objective of compressing their sizes. Based upon a profiling study of data value characteristics, we have developed the common-prefix and narrow-data transformations that respectively compress a 32-bit address pointer and a 32-bit integer field into 15-bit entities. A pair of fields that have been compressed by the above compression transformations are packed together into a single 32-bit word. The above transformations are designed to apply to data structures that are partially compressible, that is, they compress portions of data structures to which transformations apply and provide a mechanism to handle the data that is not compressible. The accesses to compressed data are efficiently implemented by designing data compression extensions (DCX) to the processor’s instruction set. We have observed average reductions in heap allocated storage of 25% and average reductions in execution time and power consumption of 30%. If DCX support is not provided the reductions in execution times fall from 30% to
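The packing step described in this abstract can be sketched in Python. This is an illustration under stated assumptions, not the paper's implementation: it assumes a pointer is compressible when it shares its upper 17 bits with the address of the node that holds it, that an integer is compressible when it fits in 15-bit two's complement, and that the two spare bits of the word are simply left at zero. The bit layout and the names `pack` and `unpack` are hypothetical, and the paper's mechanism for incompressible data is not modeled (here `pack` just returns `None`).

```python
LOW_BITS = 15
LOW_MASK = (1 << LOW_BITS) - 1          # 0x7FFF
WORD_MASK = 0xFFFFFFFF


def pack(node_addr, ptr, value):
    """Pack a pointer field and an integer field into one 32-bit word.

    Returns None when either field is not compressible.
    """
    shares_prefix = (node_addr >> LOW_BITS) == (ptr >> LOW_BITS)
    is_narrow = -(1 << 14) <= value < (1 << 14)
    if not (shares_prefix and is_narrow):
        return None
    # Assumed layout: pointer low bits in bits 15..29, integer in bits 0..14.
    return ((ptr & LOW_MASK) << LOW_BITS) | (value & LOW_MASK)


def unpack(node_addr, word):
    """Recover (ptr, value) from a packed word and the holder's address."""
    prefix = node_addr & ~LOW_MASK & WORD_MASK
    ptr = prefix | ((word >> LOW_BITS) & LOW_MASK)
    value = word & LOW_MASK
    if value & (1 << 14):                # sign-extend the 15-bit integer
        value -= 1 << LOW_BITS
    return ptr, value


word = pack(0x10008000, 0x10008ABC, -5)
restored = unpack(0x10008000, word)
```

The sketch shows why the scheme is only partially applicable: a pointer to a distant heap region or a large integer cannot be represented, so a real system needs a fallback path for such fields.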
Architectural support for uniprocessor and multiprocessor active memory systems
- IEEE Transactions on Computers, 2004
- Cited by 4 (0 self)
We introduce an architectural approach to improve memory system performance in both uniprocessor and multiprocessor systems. The architectural innovation is a flexible active memory controller backed by specialized cache coherence protocols that permit the transparent use of address remapping techniques. The resulting system shows significant performance improvement across a spectrum of machine configurations, from uniprocessors through single-node multiprocessors (SMPs) to distributed shared memory clusters (DSMs). Address remapping techniques exploit the data access patterns of applications to enhance their cache performance. However, they create coherence problems since the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we present a new approach to solve the coherence problem. We leverage and extend the cache coherence protocol so that our techniques work transparently to efficiently support uniprocessor, SMP and DSM active memory systems. We detail the coherence protocol extensions to support our active memory techniques and present simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. We also show remarkable performance improvement on small to medium-scale SMP and DSM multiprocessors, allowing some parallel applications to continue to scale long after their performance levels off on normal systems. Index Terms: Active memory systems, address remapping, cache coherence protocol, distributed shared memory, flexible memory controller architecture.
Using Squids to Address Forwarding Pointer Aliasing
- Project Aries Tech Note 4, Massachusetts Institute of Technology AI Lab, http://www.ai.mit.edu/projects/aries, 2002
- Cited by 3 (3 self)
Forwarding pointers allow safe and efficient data migration. However, they also introduce a new source of aliasing and as a result can have a serious impact on program performance. In this paper we introduce short quasi-unique IDs (squids), a simple hardware mechanism for capability architectures that mitigates the problems associated with aliasing. In the common case, squids can be used to prove that there is no aliasing and therefore avoid any overhead. The probability of having to perform expensive dereferencing operations to check for aliasing when comparing pointers to different objects is exponentially small in the number of bits used to implement squids. Benchmark programs on a simulated architecture show that squids can, in extreme cases, reduce execution time by more than a factor of four.
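The pointer-comparison shortcut described in this abstract can be modeled in software with a small Python sketch. The paper proposes a hardware mechanism; everything below is an assumption made for illustration: each pointer carries the squid of the object it refers to, migration is recorded in a forwarding table, and the class and method names (`Pointer`, `Heap`, `same_object`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    address: int   # where this pointer currently points
    squid: int     # short ID assigned to the target object at allocation


class Heap:
    def __init__(self):
        self.forward = {}      # old address -> new address after migration
        self.slow_checks = 0   # number of expensive dereferencing comparisons

    def migrate(self, old_address, new_address):
        """Record that the object at old_address now lives at new_address."""
        self.forward[old_address] = new_address

    def resolve(self, address):
        """Follow forwarding pointers to the object's current location."""
        while address in self.forward:
            address = self.forward[address]
        return address

    def same_object(self, p, q):
        if p.address == q.address:
            return True            # identical addresses: same object
        if p.squid != q.squid:
            return False           # different squids prove there is no aliasing
        self.slow_checks += 1      # squids match: must dereference to decide
        return self.resolve(p.address) == self.resolve(q.address)


heap = Heap()
heap.migrate(100, 200)
stale = Pointer(100, squid=5)     # points at the pre-migration address
fresh = Pointer(200, squid=5)     # same object, post-migration address
other = Pointer(300, squid=9)     # different object, different squid
clash = Pointer(400, squid=5)     # different object, squid happens to collide
results = [heap.same_object(stale, fresh),
           heap.same_object(stale, other),
           heap.same_object(stale, clash)]
```

If squids are assumed to be drawn uniformly at random from n bits, two distinct objects collide with probability 2^-n, which is the sense in which the slow path becomes exponentially rare as n grows.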
Design and Evaluation of the Hamal Parallel Computer
- 2002
- Cited by 3 (0 self)
Over the years there has been an enormous amount of hardware research in parallel computation. It is a testament...

