Results 1 - 10
of
54
Dynamic hot data stream prefetching for general-purpose programs
- InACM SIGPLANConference on Programming Language Designand Implementation
, 2002
"... Prefetching data ahead of use has the potential to tolerate the growing processor-memory performance gap by overlapping long latency memory accesses with useful computation. While sophisticated prefetching techniques have been automated for limited domains, such as scientific codes that access dense ..."
Abstract
-
Cited by 87 (1 self)
- Add to MetaCart
Prefetching data ahead of use has the potential to tolerate the growing processor-memory performance gap by overlapping long latency memory accesses with useful computation. While sophisticated prefetching techniques have been automated for limited domains, such as scientific codes that access dense arrays in loop nests, a similar level of success has eluded general-purpose programs, especially pointer-chasing codes written in languages such as C and C++. We address this problem by describing, implementing and evaluating a dynamic prefetching scheme. Our technique runs on stock hardware, is completely automatic, and works for generalpurpose programs, including pointer-chasing codes written in weakly-typed languages, such as C and C++. It operates in three phases. First, the profiling phase gathers a temporal data reference profile from a running program with low-overhead. Next, the profiling is turned off and a fast analysis algorithm extracts hot data streams, which are data reference sequences that frequently repeat in the same order, from the temporal profile. Then, the system dynamically injects code at appropriate program points to detect and prefetch these hot data streams. Finally, the process enters the hibernation phase where no profiling or analysis is performed, and the program continues to execute with the added prefetch instructions. At the end of the hibernation phase, the program is deoptimized to remove the inserted checks and prefetch instructions, and control returns to the profiling phase. For long-running programs, this profile, analyze and optimize, hibernate, cycle will repeat multiple times. Our initial results from applying dynamic prefetching are promising, indicating overall execution time improvements of 5–19 % for several memory-performance-limited SPECint2000 benchmarks running their largest (ref) inputs.
Dynamic metrics for Java
- In Proceedings of the 18th ACM SIGPLAN conference on Object-oriented programing, systems, languages, and applications
, 2003
"... ..."
Reconsidering custom memory allocation
- In Proceedings of the Conference on Object-Oriented Programming: Systems, Languages, and Applications (OOPSLA) 2002
, 2002
"... Programmers hoping to achieve performance improvements often use custom memory allocators. This in-depth study examines eight applications that use custom allocators. Surprisingly, for six of these applications, a state-of-the-art general-purpose allocator (the Lea allocator) performs as well as or ..."
Abstract
-
Cited by 58 (13 self)
- Add to MetaCart
Programmers hoping to achieve performance improvements often use custom memory allocators. This in-depth study examines eight applications that use custom allocators. Surprisingly, for six of these applications, a state-of-the-art general-purpose allocator (the Lea allocator) performs as well as or better than the custom allocators. The two exceptions use regions, which deliver higher performance (improvements of up to 44%). Regions also reduce programmer burden and eliminate a source of memory leaks. However, we show that the inability of programmers to free individual objects within regions can lead to a substantial increase in memory consumption. Worse, this limitation precludes the use of regions for common programming idioms, reducing their usefulness. We present a generalization of general-purpose and region-based allocators that we call reaps. Reaps are a combination of regions and heaps, providing a full range of region semantics with the addition of individual object deletion. We show that our implementation of reaps provides high performance, outperforming other allocators with region-like semantics. We then use a case study to demonstrate the space advantages and software engineering benefits of reaps in practice. Our results indicate that programmers needing fast regions should use reaps, and that most programmers considering custom allocators should instead use the Lea allocator.
Bursty Tracing: A Framework for Low-Overhead Temporal Profiling
- In 4th ACM Workshop on Feedback-Directed and Dynamic Optimization
, 2001
"... With processor speed increasing much more rapidly than memory access speed, memory system optimizations have the potential to significantly improve program performance. Unfortunately, cache-level optimizations often require detailed temporal information about a program's references to be effective. ..."
Abstract
-
Cited by 46 (9 self)
- Add to MetaCart
With processor speed increasing much more rapidly than memory access speed, memory system optimizations have the potential to significantly improve program performance. Unfortunately, cache-level optimizations often require detailed temporal information about a program's references to be effective. Traditional techniques for obtaining this information are too expensive to be practical in an on-line setting. We address this problem by describing and evaluating a framework for low-overhead temporal profiling. Our framework extends the Arnold-Ryder framework that uses instrumentation and counter-based sampling to collect frequency profiles with low overhead. Our framework samples bursts (sub-sequences) of the trace of all runtime events to construct a temporal program profile. Our bursty tracing profiler is built using Vulcan, an executable-editing tool for x86, and we evaluate it on optimized x86 binaries. Like the Arnold-Ryder framework, we have the advantages of not requiring operating system or hardware support and being deterministic. Unlike them, we are not limited to capturing temporal relationships on intraprocedural acyclic paths since our trace bursts can span procedure boundaries. In addition, our framework does not require access to program source or recompilation. A direct implementation of our extensions to the Arnold-Ryder framework results in profiling overhead of 6-35%. We describe techniques that reduce this overhead to 3-18%, making it suitable for use in an on-line setting.
Online Feedback-Directed Optimization of Java
, 2002
"... This paper describes the implementation of an online feedback-directed optimization system. The system is fully automatic; it requires no prior... ..."
Abstract
-
Cited by 45 (3 self)
- Add to MetaCart
This paper describes the implementation of an online feedback-directed optimization system. The system is fully automatic; it requires no prior...
Predicting Whole-Program Locality Through Reuse Distance Analysis
, 2003
"... Profiling can accurately analyze program behavior for select data inputs. We show that profiling can also predict program locality for inputs other than profiled ones. Here locality is defined by the distance of data reuse. Studying whole-program data reuse may reveal global patterns not apparent in ..."
Abstract
-
Cited by 44 (0 self)
- Add to MetaCart
Profiling can accurately analyze program behavior for select data inputs. We show that profiling can also predict program locality for inputs other than profiled ones. Here locality is defined by the distance of data reuse. Studying whole-program data reuse may reveal global patterns not apparent in short-distance reuses or local control flow. However, the analysis must meet two requirements to be useful. The first is efficiency. It needs to analyze all accesses to all data elements in full-size benchmarks and to measure distance of any length and in any required precision. The second is predication. Based on a few training runs, it needs to classify patterns as regular and irregular and, for regular ones, it should predict their (changing) behavior for other inputs. In this paper, we show that these goals are attainable through three techniques: approximate analysis of reuse distance (originally called LRU stack distance), pattern recognition, and distance-based sampling. When tested on 15 integer and floating-point programs from SPEC and other benchmark suites, our techniques predict with on average 94% accuracy for data inputs up to hundreds times larger than the training inputs. Based on these results, the paper discusses possible uses of this analysis.
Cost Effective Dynamic Program Slicing
, 2004
"... Although dynamic program slicing was first introduced to aid in user level debugging, applications aimed at improving software quality, reliability, security, and performance have since been identified as candidates for using dynamic slicing. However, the dynamic dependence graph constructed to comp ..."
Abstract
-
Cited by 43 (12 self)
- Add to MetaCart
Although dynamic program slicing was first introduced to aid in user level debugging, applications aimed at improving software quality, reliability, security, and performance have since been identified as candidates for using dynamic slicing. However, the dynamic dependence graph constructed to compute dynamic slices can easily cause slicing algorithms to run out of memory for realistic program runs. In this paper we present the design and evaluation of a cost effective dynamic program slicing algorithm. This algorithm is based upon a dynamic dependence graph representation that is highly compact and rapidly traversable. Thus, the graph can be held in memory and dynamic slices can be quickly computed. A compact representation is derived by recognizing that all dynamic dependences (data and control) need not be individually represented. We identify sets of dynamic dependence edges between a pair of statements that can share a single representative edge. We further show that the dependence graph can be transformed in a manner that increases sharing and sharing can be performed even in the presence of aliasing. Experiments show that transformed dynamic dependence graphs explicitly represent only 6% of the dependence edges present in the full dynamic dependence graph. When the full graph sizes range from 0.84 to 1.95 Gigabytes in size, our compacted graphs range from 20 to 210 Megabytes in size. Average slicing times for our algorithm range from 1.74 to 36.25 seconds across several benchmarks from SPECInt2000/95.
Efficient, Transparent and Comprehensive Runtime Code Manipulation
, 2004
"... This thesis addresses the challenges of building a software system for general-purpose runtime code manipulation. Modern applications, with dynamically-loaded modules and dynamicallygenerated code, are assembled at runtime. While it was once feasible at compile time to observe and manipulate every i ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
This thesis addresses the challenges of building a software system for general-purpose runtime code manipulation. Modern applications, with dynamically-loaded modules and dynamicallygenerated code, are assembled at runtime. While it was once feasible at compile time to observe and manipulate every instruction — which is critical for program analysis, instrumentation, trace gathering, optimization, and similar tools — it can now only be done at runtime. Existing runtime tools are successful at inserting instrumentation calls, but no general framework has been developed for fine-grained and comprehensive code observation and modification without high overheads. This thesis demonstrates the feasibility of building such a system in software. We present DynamoRIO, a fully-implemented runtime code manipulation system that supports code transformations on any part of a program, while it executes. DynamoRIO uses code caching technology to provide efficient, transparent, and comprehensive manipulation of an unmodified application running on a stock operating system and commodity hardware. DynamoRIO executes large, complex, modern applications with dynamically-loaded, generated, or even modified code. Despite the
VPC3: A Fast and Effective Trace-Compression Algorithm
, 2004
"... Trace files are widely used in research and academia to study the behavior of programs. They are simple to process and guarantee repeatability. Unfortunately, they tend to be very large. This paper describes vpc3, a fundamentally new approach to compressing program traces. Vpc3 employs value predict ..."
Abstract
-
Cited by 27 (8 self)
- Add to MetaCart
Trace files are widely used in research and academia to study the behavior of programs. They are simple to process and guarantee repeatability. Unfortunately, they tend to be very large. This paper describes vpc3, a fundamentally new approach to compressing program traces. Vpc3 employs value predictors to bring out and amplify patterns in the traces so that conventional compressors can compress them more effectively. In fact, our approach not only results in much higher compression rates but also provides faster compression and decompression. For example, compared to bzip2, vpc3's geometric mean compression rate on SPECcpu2000 store address traces is 18.4 times higher, compression is ten times faster, and decompression is three times faster.
Whole Execution Traces
, 2004
"... Different types of program profiles (control flow, value, address, and dependence) have been collected and extensively studied by researchers to identify program characteristics that can then be exploited to develop more effective compilers and architectures. Due to the large amounts of profile data ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
Different types of program profiles (control flow, value, address, and dependence) have been collected and extensively studied by researchers to identify program characteristics that can then be exploited to develop more effective compilers and architectures. Due to the large amounts of profile data produced by realistic program runs, most work has focused on separately collecting and compressing different types of profiles. In this paper we present a unified representation of profiles called Whole Execution Trace (WET) which includes the complete information contained in each of the above types of traces. Thus WETs provide a basis for a next generation software tool that will enable mining of program profiles to identify program characteristics that require understanding of relationships among various types of profiles. The key features of our WET representation are: WET is constructed by labeling a static program representation with profile information such that relavent and related profile information can be directly accessed by analysis algorithms as they traverse the representation; a highly effective two tier strategy is used to significantly compress the WET; and compression techniques are designed such that they do not adversely affect the ability to rapidly traverse WET for extracting subsets of information corresponding to individual profile types as well as a combination of profile types (e.g., in form of dynamic slices of WETs). Our experimentation shows that on an average execution traces resulting from execution of 647 Million statements can be stored in 331 Megabytes of storage after compression. The compression factors range from 16 to 83. Moreover the rates at which different types of profiles can be individually or simultaneously extracted are high.

