Results 11 - 20
of
104
Improving Cache Behavior of Dynamically Allocated Data Structures
- In International Conference on Parallel Architectures and Compilation Techniques
, 1998
"... Poor data layout in memory may generate weak data locality and poor performance. Code transformations such as loop blocking or interchanging and array padding have addressed this issue for scientific applications. However many generalist applications do not use data arrays, but dynamically allocated ..."
Abstract
-
Cited by 50 (0 self)
- Add to MetaCart
Poor data layout in memory may generate weak data locality and poor performance. Code transformations such as loop blocking or interchanging and array padding have addressed this issue for scientific applications. However many generalist applications do not use data arrays, but dynamically allocated heterogeneous data structures. In this paper, we explore two data layout techniques for dynamically allocated data structures: field reorganization, and instance interleaving. The application of these techniques may be guided by program profiling. This allows significant cache behavior improvements on some applications. To support instance interleaving, we developed a specific memory allocation library called ialloc. An ialloc-like library may be of great help in a toolbox for performance tuning of general-purpose applications.
Locality vs. Criticality
, 2001
"... Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance. Locality based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses. This leads to a potential mis-match between load latency requirements ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance. Locality based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses. This leads to a potential mis-match between load latency requirements and latencies realized using a traditional memory system. To bridge this gap, we partition loads as critical and non-critical. A load that needs to complete early to prevent processor stalls is classified as critical, while a load that can tolerate a long latency is considered non-critical. In this paper, we investigate if it is worth violating locality to exploit information on criticality to improve processor performance. We present a dynamic critical load classification scheme and show that 40% performance improvements are possible on average, if all critical loads are guaranteed to hit in the L1 cache. We then compare the two properties, locality and criticality, in the context of several cache organization and prefetching schemes. We find that the working set of critical loads is large, and hence practical cache organization schemes based on criticality are unable to reduce the critical load miss ratios enough to produce performance gains. Although criticality-based prefetching can help for some resource constrained programs, its benefit over locality-based prefetching is small and may not be worth the added complexity.
Pretenuring for Java
- In Proceedings of SIGPLAN 2001 Conference on Object-Oriented Programming, Languages, & Applications
, 2001
"... Pretenuring is a technique for reducing copying costs in garbage collectors. When pretenuring, the allocator places long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on profiling-driven pretenuring as follows. (1) We develop a collector ..."
Abstract
-
Cited by 39 (8 self)
- Add to MetaCart
Pretenuring is a technique for reducing copying costs in garbage collectors. When pretenuring, the allocator places long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on profiling-driven pretenuring as follows. (1) We develop a collector-neutral approach to obtaining object lifetime profile information. We show that our collection of Java programs exhibits a very high degree of homogeneity of object lifetimes at each allocation site. This result is robust with respect to different inputs, and is similar to previous work on ML, but is in contrast to C programs, which require dynamic call chain context information to extract homogeneous lifetimes. Call-site homogeneity considerably simplifies the implementation of pretenuring and makes it more efficient. (2) Our pretenuring advice is neutral with respect to the collector algorithm, and we use it to improve two quite different garbage collectors: a traditional generational collector and an older-first collector. The system is also novel because it classifies and allocates objects into 3 categories: we allocate immortal objects into a permanent region that the collector will never consider, long-lived objects into a region in which the collector placed survivors of the most recent collection, and shortlived objects into the nursery, i.e., the default region. (3) We evaluate pretenuring on Java programs. Our simulation results show that pretenuring significantly reduces collector copying for generational and older-first collectors. 1.
Dynamically Allocating Processor Resources between Nearby and Distant ILP
, 2001
"... Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such a ..."
Abstract
-
Cited by 38 (2 self)
- Add to MetaCart
Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such as the register file, issue queue, and reorder buffer. Simultaneously, cycle time constraints limit the sizes of these structures, resulting in conflicting design requirements. In this paper, we present a novel microarchitecture designed to overcome the limitations of a register file size dictated by cycle time constraints. Available registers are dynamically allocated between the primary program thread and a future thread. The future thread executes instructions when the primary thread is limited by resource availability. The future thread is not constrained by in-order commit requirements. It is therefore able to examine a much larger instruction window and jump far ahead to execute ready instructions. Results are communicated back to the primary thread by warming up the register file, instruction cache, data cache, and instruction reuse buffer, and by resolving branch mispredicts early. The proposed microarchitecture is able to get an overall speedup of 1.17 over the base processor for our benchmark set, with speedups of up to 1.64. 1
Continuous Program Optimization: A Case Study
- ACM Transactions on Programming Languages and Systems
, 2003
"... This paper presents a system that provides code generation at load-time and continuous program optimization at run-time. First, the architecture of the system is presented. Then, two optimization techniques are discussed that were developed specifically in the context of continuous optimization. The ..."
Abstract
-
Cited by 38 (7 self)
- Add to MetaCart
This paper presents a system that provides code generation at load-time and continuous program optimization at run-time. First, the architecture of the system is presented. Then, two optimization techniques are discussed that were developed specifically in the context of continuous optimization. The first of these optimizations continually adjusts the storage layouts of dynamic data structures to maximize data cache locality, while the second performs profile-driven instruction re-scheduling to increase instruction-level parallelism. These two optimizations have very di#erent cost/benefit ratios, presented in a series of benchmarks. The paper concludes with an outlook to future research directions and an enumeration of some remaining research problems. The empirical results presented in this paper make a case in favor of continuous optimization, but indicate that it needs to be applied judiciously. In many situations, the costs of dynamic optimizations outweigh their benefit, so that no break-even point is ever reached. In favorable circumstances, on the other hand, speed-ups of over 120% have been observed. It appears as if the main beneficiaries of continuous optimization are shared libraries, which at di#erent times can be optimized in the context of the currently dominant client application.
Push vs. Pull: Data Movement for Linked Data Structures
- In International Conference on Supercomputing
, 2000
"... As the performance gap between the CPU and main memory continues to grow, techniques to hide memory latency are essential to deliver a high performance computer system. Prefetching can often overlap memory latency with computation for array-based numeric applications. However, prefetching for pointe ..."
Abstract
-
Cited by 37 (2 self)
- Add to MetaCart
As the performance gap between the CPU and main memory continues to grow, techniques to hide memory latency are essential to deliver a high performance computer system. Prefetching can often overlap memory latency with computation for array-based numeric applications. However, prefetching for pointer-intensive applications still remains a challenging problem. Prefetching linked data structures (LDS) is difficult because the address sequence of LDS traversal does not present the same arithmetic regularity as array-based applications and the data dependence of pointer dereferences can serialize the address generation process. In this paper, we propose a cooperative hardware/software mechanism to reduce memory access latencies for linked data structures. Instead of relying on the past address history to predict future accesses, we identify the load instructions that traverse the LDS, and execute them ahead of the actual computation. To overcome the serial nature of the LDS address genera...
Is it a Tree, a DAG, or a Cyclic Graph?
, 1996
"... This paper reports on the design and implementation of a practical shape analysis for C. The purpose of the analysis is to aid in the disambiguation of heap-allocated data structures by estimating the shape (Tree, DAG, or Cyclic Graph) of the data structure accessible from each heap-directed pointer ..."
Abstract
-
Cited by 37 (0 self)
- Add to MetaCart
This paper reports on the design and implementation of a practical shape analysis for C. The purpose of the analysis is to aid in the disambiguation of heap-allocated data structures by estimating the shape (Tree, DAG, or Cyclic Graph) of the data structure accessible from each heap-directed pointer. This shape information can be used to improve dependence testing and in parallelization, and to guide the choice of more complex heap analyses. The method has been implemented as a contextsensitive interprocedural analysis in the McCAT compiler. Experimental results and observations are given for 16 benchmark programs. These results show that the analysis gives accurate and useful results for an important group of applications. 1 Introduction and Related Work Pointer analyses are of critical importance for optimizing /parallelizing compilers that support languages like C, C++ and FORTRAN90. The pointer analysis problem can be divided into two distinct subproblems: (i) analyzing pointers t...
Instruction Path Coprocessors
- In Proceedings of the 27th Annual International Symposium on Computer Architecture
, 2000
"... This paper presents the concept of an Instruction Path Coprocessor (I-COP), which is a programmable on-chip coprocessor, with its own instruction set, that operates on the core processor's instructions to transform them into an internal format that can be more efficiently executed. It is located ..."
Abstract
-
Cited by 34 (2 self)
- Add to MetaCart
This paper presents the concept of an Instruction Path Coprocessor (I-COP), which is a programmable on-chip coprocessor, with its own instruction set, that operates on the core processor's instructions to transform them into an internal format that can be more efficiently executed. It is located off the critical path of the core processor to ensure that it does not negatively impact the core processor's cycle time. An I-COP is highly versatile and can be used to implement different types of instruction transformations to enhance the IPC of the core processor. We study four potential applications of the I-COP to demonstrate the feasibility of this concept and investigate the design issues of such a coprocessor. A prototype instruction set for the I-COP is presented along with an implementation framework that facilitates achieving high I-COP performance. Initial results indicate that the I-COP is able to efficiently implement the trace cache fill unit, register move optimizatio...
Predicting Data Cache Misses in Non-Numeric Applications Through Correlation Profiling
- In MICRO-30
, 1997
"... To maximize the benefit and minimize the overhead of software-based latency tolerance techniques, we would like to apply them precisely to the set of dynamic references that suffer cache misses. Unfortunately, the information provided by the state-of-theart cache miss profiling technique (summary pr ..."
Abstract
-
Cited by 34 (3 self)
- Add to MetaCart
To maximize the benefit and minimize the overhead of software-based latency tolerance techniques, we would like to apply them precisely to the set of dynamic references that suffer cache misses. Unfortunately, the information provided by the state-of-theart cache miss profiling technique (summary profiling) is inadequate for references with intermediate miss ratios---it results in either failing to hide latency, or else inserting unnecessary overhead. To overcome this problem, we propose and evaluate a new technique--- correlation profiling---which improves predictability by correlating the caching behavior with the associated dynamic context. Our experimental results demonstrate that roughly half of the 22 non-numeric applications we study can potentially enjoy significant reductions in memory stall time by exploiting at least one of the three forms of correlation profiling we consider. 1 Introduction As the disparity between processor and memory speeds continues to grow, memory l...
Executing Multiple Pipelined Data Analysis Operations in the Grid
, 2002
"... Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We ..."
Abstract
-
Cited by 32 (7 self)
- Add to MetaCart
Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We define the scheduling problem for effectively placing components onto Grid resources, and propose two scheduling algorithms. Experimental results are presented using a visualization application.

