A Survey of Adaptive Optimization in Virtual Machines
Proceedings of the IEEE, 93(2), 2005. Special Issue on Program Generation, Optimization, and Adaptation, 2004
Cited by 26 (5 self)
Virtual machines face significant performance challenges beyond those confronted by traditional static optimizers. First, portable program representations and dynamic language features, such as dynamic class loading, force the deferral of most optimizations until runtime, inducing runtime optimization overhead. Second, modular
Online performance auditing: using hot optimizations without getting burned
In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, 2006
Cited by 24 (2 self)
As hardware complexity increases and virtualization is added at more layers of the execution stack, predicting the performance impact of optimizations becomes increasingly difficult. Production compilers and virtual machines invest substantial development effort in performance tuning to achieve good performance for a range of benchmarks. Although optimizations typically perform well on average, they often have unpredictable impact on running time, sometimes degrading performance significantly. Today’s VMs perform sophisticated feedback-directed optimizations, but these techniques do not address performance degradations, and they actually make the situation worse by making the system more unpredictable. This paper presents an online framework for evaluating the effectiveness of optimizations, enabling an online system to automatically identify and correct performance anomalies that occur at runtime. This work opens the door for a fundamental shift in the way optimizations are developed and tuned for online systems, and may allow the body of work in offline empirical optimization search to be applied automatically at runtime. We present our implementation and evaluation of this system in a product Java VM.
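The idea of evaluating optimizations online can be illustrated with a minimal sketch. This is not the paper's framework: the `Auditor` class, the `timed` helper, and the `min_samples` threshold are hypothetical names chosen for illustration, and the policy of comparing mean cost is an assumption made to keep the example short.

```python
import random
import time
from collections import defaultdict


def timed(fn, *args):
    """Run fn and return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


class Auditor:
    """Hypothetical online auditor: runs competing versions of the same
    code, records a cost for each run, and reports the cheaper version."""

    def __init__(self, versions, min_samples=5, rng=None):
        self.versions = dict(versions)      # name -> callable
        self.min_samples = min_samples      # assumed to be at least 1
        self.samples = defaultdict(list)    # name -> list of measured costs
        self.rng = rng or random.Random(0)

    def run_once(self, measure, *args):
        # Prefer a version that still lacks samples; otherwise pick at random.
        pending = [n for n in self.versions
                   if len(self.samples[n]) < self.min_samples]
        name = pending[0] if pending else self.rng.choice(list(self.versions))
        cost, result = measure(self.versions[name], *args)
        self.samples[name].append(cost)
        return result

    def winner(self):
        # No verdict until every version has enough samples.
        if any(len(self.samples[n]) < self.min_samples for n in self.versions):
            return None
        return min(self.versions,
                   key=lambda n: sum(self.samples[n]) / len(self.samples[n]))


def baseline(n):
    return sum(range(n))


def optimized(n):
    return n * (n - 1) // 2
```

A caller would pass `timed` as the `measure` argument in normal use, for example `auditor.run_once(timed, 100_000)`, and consult `winner()` to decide which version to keep. A real system would also have to control for measurement noise and differing inputs, which this sketch ignores.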
Design Space Optimization of Embedded Memory Systems via Data Remapping
In Proceedings of the Languages, Compilers, and Tools for Embedded Systems and Software and Compilers for Embedded Systems, 2002
Cited by 15 (3 self)
In this paper, we provide a novel compile-time data remapping algorithm that runs in linear time. This remapping algorithm is the first fully automatic approach applicable to pointer-intensive dynamic applications. We show that data remapping can be used to significantly reduce the energy consumed as well as the memory size needed to meet a user-specified performance goal (i.e., execution time) -- relative to the same application executing without being remapped. These twin advantages afforded by a remapped program -- reduced cache size and energy needs -- constitute a key step in a framework for design space exploration: for any given performance goal, remapping allows the user to reduce the primary and secondary cache size by 50%, yielding a concomitant energy savings of 57%. Additionally, viewed as a compiler optimization for a fixed processor, we show that remapping improves the energy consumed by the cache subsystem by 25%. All of the above savings are in the context of the cache subsystem in isolation. We also show that remapping yields an average 20% energy saving for an ARM-like processor and cache subsystem. All of our improvements are achieved in the context of DIS, OLDEN and SPEC2000 pointer-centric benchmarks.
Trimaran: An Infrastructure for Research in Instruction-Level Parallelism
In Instruction-Level Parallelism. Lecture Notes in Computer Science, 2004
Cited by 8 (1 self)
Trimaran is an integrated compilation and performance monitoring infrastructure. The architecture space that Trimaran covers is characterized by HPL-PD, a parameterized processor architecture supporting novel features such as predication, control and data speculation and compiler controlled management of the memory hierarchy. Trimaran also
Dynamic memory optimization using pool allocation and prefetching
SIGARCH Comput. Archit. News
Cited by 5 (0 self)
Heap memory allocation plays an important role in modern applications. Conventional heap allocators, however, generally ignore the underlying memory hierarchy of the system, favoring instead a low runtime overhead and fast response times. Unfortunately, with little concern for the memory hierarchy, the data layout may exhibit poor spatial locality, and degrade cache performance. In this paper, we describe a dynamic heap allocation scheme called pool allocation. The strategy aims to improve cache performance by inspecting memory allocation requests, and allocating memory from appropriate heap pools as dictated by the requesting context. The advantages are twofold. First, by pooling together data with a common context, we expect to improve spatial locality, as data fetched to the caches will contain fewer items from different contexts. If the allocation patterns are closely matched to the traversal patterns, the end result is faster memory performance. Second, by pooling heap objects, we expect access patterns to exhibit more regularity, thus creating more opportunities for data prefetching. Our dynamic memory optimizer exploits the increased regularity to insert prefetch instructions at runtime. The optimizations are implemented in DynamoRIO, a dynamic optimization framework. We evaluate the work using various benchmarks, and measure a 17% speedup over gcc -O3 on an Athlon MP, and a 13% speedup on a Pentium 4.
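The pooling idea in this abstract can be sketched as a toy bump allocator that keeps one contiguous address range per requesting context. The `PoolAllocator` class, the string context keys, and the fixed `pool_size` are assumptions for illustration only; the paper's implementation works on real heap requests inside DynamoRIO and is not shown here.

```python
class PoolAllocator:
    """Hypothetical bump allocator with one contiguous pool per context.

    Addresses are plain integers standing in for heap addresses.
    """

    def __init__(self, pool_size=4096):
        self.pool_size = pool_size
        self.pools = {}        # context -> [base address, bytes used]
        self.next_base = 0     # base address for the next new pool

    def alloc(self, context, size):
        # Create a pool the first time a context requests memory.
        if context not in self.pools:
            self.pools[context] = [self.next_base, 0]
            self.next_base += self.pool_size
        pool = self.pools[context]
        base, used = pool
        if used + size > self.pool_size:
            raise MemoryError(f"pool for {context!r} is full")
        pool[1] = used + size
        return base + used


allocator = PoolAllocator(pool_size=4096)
# Interleaved requests from two contexts still land in separate ranges.
list_a = allocator.alloc("list", 16)
tree_a = allocator.alloc("tree", 24)
list_b = allocator.alloc("list", 16)
tree_b = allocator.alloc("tree", 24)
```

Even though the requests are interleaved, objects from the same context end up adjacent, which is the property the abstract relies on for better spatial locality and more regular access patterns.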
Restructuring field layouts for embedded memory systems
In DATE ’06, 2006
Cited by 1 (1 self)
In many computer systems with large data computations, the delay of memory access is one of the major performance bottlenecks. In this paper, we propose an enhanced field remapping scheme for dynamically allocated structures in order to provide better locality than conventional field layouts. Our proposed scheme reduces cache miss rates drastically by aggregating and grouping fields from multiple instances of the same structure, which in turn improves performance and reduces power consumption. Our methodology will become more important in design space exploration, especially as embedded systems for data-oriented applications become prevalent. Experimental results show that average L1 and L2 data cache misses are reduced by 23% and 17%, respectively. Due to the enhanced locality, our remapping achieves 13% faster execution time on average than the original programs. It also reduces power consumption by 18% for the data cache.
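Why grouping one field across many instances helps locality can be shown with a small address calculation. The sketch below is an illustration under stated assumptions, not the paper's remapping scheme: the 64-byte cache line, the 32-byte structure with four 8-byte fields, and the helper names are all hypothetical.

```python
def lines_touched(addresses, line_size=64):
    """Count the distinct cache lines covered by a sequence of addresses."""
    return len({addr // line_size for addr in addresses})


def conventional_addresses(n, field_offset, struct_size):
    """Address of one field in each of n structures laid out back to back."""
    return [i * struct_size + field_offset for i in range(n)]


def remapped_addresses(n, field_index, field_size):
    """Address of one field when all n copies of each field are grouped."""
    return [field_index * n * field_size + i * field_size for i in range(n)]


# Traverse field 0 of 64 instances; each instance has four 8-byte fields.
conventional = lines_touched(conventional_addresses(64, 0, 32))
remapped = lines_touched(remapped_addresses(64, 0, 8))
```

Under these assumptions the conventional layout touches 32 cache lines and the grouped layout touches 8, because each fetched line now holds eight useful values instead of two.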
Combining Data Remapping and Voltage/Frequency Scaling of Second Level Memory for Energy Reduction in Embedded Systems
Microelectronics Journal, 2002
Cited by 1 (0 self)
In this paper we show that the energy reductions obtained from using two techniques, data remapping and voltage/frequency scaling of off-chip bus and memory, combine to provide interesting trade-offs between energy, execution time and power. Both methods aim to reduce the energy consumed by the memory subsystem. Data remapping is a fully automatic compile-time technique applicable to pointer-intensive dynamic applications. Voltage/frequency scaling of off-chip memory is a technique applied at the hardware level. When combined, energy reductions can be as high as 46%. The improvements are verified in the context of two OLDEN pointer-centric benchmarks, namely Perimeter and Health.
Adaptive compiler directed prefetching for EPIC processors
In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA)
Cited by 1 (1 self)
The widely acknowledged performance gap between processors and memory has been the subject of much research. In the Explicitly Parallel Instruction Computing (EPIC) paradigm, the combination of in-order issue and the presence of a large number of parallel functional units exacerbates the problem. Prefetching, by hardware, software, or a combination of both, is one of the primary mechanisms advocated to alleviate this problem. In this paper, we propose a new software-based data prefetching mechanism, the Adaptive Markovian Predictor (AMP). AMP is suitable for implementation in EPIC processors without significant hardware overhead. Specifically, we introduce a predicated prefetch operation which leverages the concept of an informing load to dynamically adapt to runtime memory behaviors. Furthermore, we employ predicated prefetching in a new optimization framework that also consists of data remapping and off-line learning of Markovian predictors. This distinguishes our approach from early software prefetching techniques that only involve static program analysis. Our experiments show that the proposed framework can effectively remove 10%-30% of the stall cycles due to cache misses for benchmarks from the well-known SPEC and OLDEN suites. The results also show that the framework performs better than pure stride predictors and has lower bandwidth and instruction overheads.
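The general notion of a Markovian predictor can be sketched as a first-order transition table learned from a trace of miss addresses. This is only a generic illustration, not AMP itself: the predicated prefetch operation and informing loads described above are not modeled, and the `MarkovPredictor` class and the example trace are hypothetical.

```python
from collections import Counter, defaultdict


class MarkovPredictor:
    """Hypothetical first-order Markov predictor over miss addresses."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # addr -> Counter of followers
        self.last = None

    def observe(self, addr):
        # Record that addr followed the previously observed address.
        if self.last is not None:
            self.transitions[self.last][addr] += 1
        self.last = addr

    def predict(self, addr):
        # Return the most frequent follower of addr, or None if unseen.
        followers = self.transitions.get(addr)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


# Off-line learning pass over an assumed trace of miss addresses.
trace = [0x100, 0x240, 0x180, 0x100, 0x240, 0x180, 0x100, 0x300]
predictor = MarkovPredictor()
for address in trace:
    predictor.observe(address)

next_after_100 = predictor.predict(0x100)
```

Here `next_after_100` is `0x240`, since that address followed `0x100` twice in the trace while `0x300` followed it once. A prefetcher built on such a table would issue a prefetch for the predicted address whenever the current address misses.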
A framework for compiler driven design space exploration for embedded system customization
In Proceedings of the 9th Asian Computing Science Conference, 2004
Cited by 1 (1 self)
Designing custom solutions has been central to meeting a range of stringent and specialized needs of embedded computing, along such dimensions as physical size, power consumption, and performance that includes real-time behavior. For this trend to continue, we must find ways to overcome the twin hurdles of rising non-recurring engineering (NRE) costs and decreasing time-to-market windows by providing major improvements in designer productivity. This paper presents compiler directed design space exploration as a framework for articulating, formulating, and implementing global optimizations for embedded systems customization, where the design space is spanned by parametric representations of both candidate compiler optimizations and architecture parameters, and the navigation of the design space is driven by quantifiable, machine independent metrics. This paper describes the elements of such a framework and an example of its application.
Layout Transformations for Heap Objects Using Static Access Patterns
Cited by 1 (0 self)
As the amount of data used by programs increases due to the growth of hardware storage capacity and computing power, efficient memory usage becomes a key factor for performance. Since modern applications heavily use structures allocated in the heap, this paper proposes an efficient structure layout based on static analyses. Unlike most of the previous work, our approach is an entirely static transformation of programs. We extract access patterns from source programs and represent them with regular expressions. Repetitive accesses are usually important pieces of information for locality optimizations. The expressive power of regular expressions is appropriate to represent those repetitive accesses along with various access patterns according to the control flow of programs. By interpreting statically obtained access patterns, we choose suitable structures for pool allocation and reorganize field layouts of the chosen structures. To verify the effect of our static optimization, we implement our analyses and optimizations with the CIL compiler. Our experiments with the Olden benchmarks demonstrate that layout transformations for heap objects based on our static access pattern analysis improve cache locality by 38% and performance by 24%.

