Results 1 - 10
of
20
Counting integer points in parametric polytopes using Barvinok’s rational functions
- Algorithmica
, 2007
"... Abstract Many compiler optimization techniques depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric ..."
Abstract
-
Cited by 15 (6 self)
- Add to MetaCart
Abstract Many compiler optimization techniques depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that the enumerator of such a set can be represented by an explicit function consisting of a set of quasi-polynomials each associated with a chamber in the parameter space. Previously, interpolation was used to obtain these quasi-polynomials, but this technique has several disadvantages. Its worstcase computation time for a single quasi-polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such a quasi-polynomial (measured in bits needed to represent the quasi-polynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution. Our main contribution is a novel method for calculating the required quasipolynomials analytically. It extends an existing method, based on Barvinok’s decomposition,
Multicore-aware reuse distance analysis
, 2009
"... Abstract—This paper presents and validates methods to extend reuse distance analysis of application locality characteristics to shared-memory multicore platforms by accounting for invalidation-based cache-coherence and inter-core cache sharing. Existing reuse distance analysis methods track the numb ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Abstract—This paper presents and validates methods to extend reuse distance analysis of application locality characteristics to shared-memory multicore platforms by accounting for invalidation-based cache-coherence and inter-core cache sharing. Existing reuse distance analysis methods track the number of distinct addresses referenced between reuses of the same address by a given thread, but do not model the effects of data references by other threads. This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points. These methods are evaluated against a Simics-based coherent cache simulator running several OpenMP and transactionbased benchmarks. The results show that adding multicoreawareness substantially improves the ability of reuse distance analysis to model cache behavior, reducing the error in miss ratio prediction (relative to cache simulation for a specific cache size) by an average of 70 % for per-core caches and an average of 90 % for shared caches. I.
Discovery of locality-improving refactoring by reuse path analysis
- In Proceedings of HPCC. Springer. Lecture Notes in Computer Science
, 2006
"... Abstract. Due to the huge speed gaps in the memory hierarchy of modern computer architectures, it is important that programs maintain a good data locality. Improving temporal locality implies reducing the distance of data reuses that are far apart. The best existing tools indicate locality bottlenec ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Abstract. Due to the huge speed gaps in the memory hierarchy of modern computer architectures, it is important that programs maintain a good data locality. Improving temporal locality implies reducing the distance of data reuses that are far apart. The best existing tools indicate locality bottlenecks by highlighting both the source locations generating the use and the subsequent cache-missing reuse. Even with this knowledge of the bottleneck locations in the source code, it often remains hard to find an effective code refactoring that improves temporal locality, due to the unclear interaction of function calls and loop iterations occurring between use and reuse. The contributions in this paper are two-fold. First, the locality analysis is enhanced to not only pinpoint the cache bottlenecks, but to also suggest code refactorings that may resolve them. The refactorings are found by analyzing the dynamic hierarchy of function calls and loops on the code path between reuses, called reuse paths. Secondly, reservoir sampling of the reuse paths results in a significant reduction of the execution time and memory requirements during profiling, enabling the analysis of realistic programs. An interactive GUI, called SLO (Suggestions for Locality Optimizations), has been used to explore the most appropriate refactorings in a number of SPEC2000 programs. After refactoring, the execution time of the selected programs was halved, on the average. 1
Program Locality Analysis Using Reuse Distance
, 2009
"... On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance—the number of distinct locations accessed between consecutive acc ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance—the number of distinct locations accessed between consecutive accesses to a given location. This article addresses the analysis problem at the program level, where the size of data and the locality of execution may change significantly depending on the input. The article presents two techniques that predict how the locality of a program changes with its input. The first is approximate reuse-distance measurement, which is asymptotically faster than exact methods while providing a guaranteed precision. The second is statistical prediction of locality in all executions of a program based on the analysis of a few executions. The prediction process has three steps: dividing data accesses into groups, finding the access patterns in each group, and building parameterized models. The resulting prediction may be used on-line with the help of distance-based sampling. When evaluated on fifteen benchmark applications, the new techniques predicted program locality with good accuracy, even for test executions that are orders of magnitude larger than the training executions. The two techniques are among the first to enable quantitative analysis of whole-program locality and
Accelerating Multicore Reuse Distance Analysis with Sampling and Parallelization
"... Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe per ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.
Symbolic polynomial maximization over convex sets and its application to memory requirement estimation
, 2009
"... Memory requirement estimation is an important issue in the development of embedded systems, since memory directly influences performance, cost and power consumption. It is therefore crucial to have tools that automatically compute accurate estimates of the memory requirements of programs to better ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
Memory requirement estimation is an important issue in the development of embedded systems, since memory directly influences performance, cost and power consumption. It is therefore crucial to have tools that automatically compute accurate estimates of the memory requirements of programs to better control the development process and avoid some catastrophic execution exceptions. Many important memory issues can be expressed as the problem of maximizing a parametric polynomial defined over a parametric convex domain. Bernstein expansion is a technique that has been used to compute upper bounds on polynomials defined over intervals and parametric “boxes”. In this paper, we propose an extension of this theory to more general parametric convex domains and illustrate its applicability to the resolution of memory issues with several application examples.
Improved derivation of process networks
- in Proceedings of the 4th International Workshop on Optimization for DSP and Embedded Systems (ODES ’06
, 2006
"... Abstract — Current emerging embedded System-on-Chip platforms are increasingly becoming multiprocessor architectures. System designers experience significant difficulties in programming these platforms. The applications are typically specified as sequential programs that do not reveal the available ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Abstract — Current emerging embedded System-on-Chip platforms are increasingly becoming multiprocessor architectures. System designers experience significant difficulties in programming these platforms. The applications are typically specified as sequential programs that do not reveal the available parallelism in an application, thereby hindering the efficient mapping of an application onto a parallel multiprocessor platform. In this paper we present our compiler techniques that facilitate the migration from a sequential application specification to a parallel application specification using the Process Network model of computation. Our work is inspired by a previous research project called Compaan. With our techniques we address optimization issues such as the generation of Process Networks with simplified topology and communication without sacrificing the Process Networks performance. Moreover, we describe a technique for compile-time memory requirement estimation which we consider as an important contribution of this paper. We demonstrate the usefulness of our techniques on several examples. I.
On the theory and potential of LRU-MRU collaborative cache management
- In ISMM, 2011
"... The goal of cache management is to maximize data reuse. Collaborative caching provides an interface for software to communicate access information to hardware. In theory, it can obtain optimal cache performance. In this paper, we study a collaborative caching system that allows a program to choose d ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
The goal of cache management is to maximize data reuse. Collaborative caching provides an interface for software to communicate access information to hardware. In theory, it can obtain optimal cache performance. In this paper, we study a collaborative caching system that allows a program to choose different caching methods for its data. As an interface, it may be used in arbitrary ways, sometimes optimal but probably suboptimal most times and even counter productive. We develop a theoretical foundation for collaborative caches to show the inclusion principle and the existence of a distance metric we call LRU-MRU stack distance. The new stack distance is important for program analysis and transformation to target a hierarchical collaborative cache system rather than a single cache configuration. We use 10 benchmark programs to show that optimal caching may reduce the average miss ratio by 24%, and a simple feedback-driven compilation technique can utilize collaborative cache to realize 50 % of the optimal improvement.
P-OPT: Program-Directed Optimal Cache Management ⋆
"... Abstract. As the amount of on-chip cache increases as a result of Moore’s law, cache utilization is increasingly important as the number of processor cores multiply and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and b ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Abstract. As the amount of on-chip cache increases as a result of Moore’s law, cache utilization is increasingly important as the number of processor cores multiply and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and being able to communicate this information to hardware. The paper addresses the communication problem with two new optimal algorithms for Program-directed OPTimal cache management (P-OPT), in which a program designates certain accesses as bypasses and trespasses through an extended hardware interface to effect optimal cache utilization. The paper proves the optimality of the new methods, examines their theoretical properties, and shows the potential benefit using a simulation study and a simple test on a multi-core, multi-processor PC. 1
P-OPT: Program-directed Optimal Cache Management ⋆
"... Abstract. As the amount of on-chip cache increases as a result of Moore’s law, cache utilization is increasingly important as the number of processor cores multiply and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and b ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Abstract. As the amount of on-chip cache increases as a result of Moore’s law, cache utilization is increasingly important as the number of processor cores multiply and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and being able to communicate this information to hardware. The paper addresses the communication problem with two new optimal algorithms for Program-directed OPTimal cache management (P-OPT), in which a program designates certain accesses as bypasses and trespasses through an extended hardware interface to effect optimal cache utilization. The paper proves the optimality of the new methods, examines their theoretical properties, and shows the potential benefit using a simulation study and a simple test on a multi-core, multi-processor PC. 1

