Improving effective bandwidth through compiler enhancement of global cache reuse. JPDC (2004)

by C Ding, K Kennedy

Results 1 - 10 of 41

Predicting Whole-Program Locality Through Reuse Distance Analysis

by Chen Ding, Yutao Zhong, 2003
Cited by 44 (0 self)
Profiling can accurately analyze program behavior for select data inputs. We show that profiling can also predict program locality for inputs other than profiled ones. Here locality is defined by the distance of data reuse. Studying whole-program data reuse may reveal global patterns not apparent in short-distance reuses or local control flow. However, the analysis must meet two requirements to be useful. The first is efficiency. It needs to analyze all accesses to all data elements in full-size benchmarks and to measure distance of any length and to any required precision. The second is prediction. Based on a few training runs, it needs to classify patterns as regular and irregular and, for regular ones, it should predict their (changing) behavior for other inputs. In this paper, we show that these goals are attainable through three techniques: approximate analysis of reuse distance (originally called LRU stack distance), pattern recognition, and distance-based sampling. When tested on 15 integer and floating-point programs from SPEC and other benchmark suites, our techniques predict with 94% accuracy on average for data inputs up to hundreds of times larger than the training inputs. Based on these results, the paper discusses possible uses of this analysis.
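To make the central quantity in this abstract concrete, here is a minimal sketch of exact reuse distance (LRU stack distance) measurement. It takes the usual definition: the number of distinct elements accessed between two consecutive accesses to the same element. The function name `reuse_distances` and the list-based stack are illustrative assumptions. The abstract describes an approximate analysis, which this simple exact version is not.

```python
def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The distance is the number of distinct elements accessed since the
    previous access to the same element. First-time accesses get None.
    This exact, list-based version costs O(N * M) for N accesses to M
    distinct elements; it is an illustration, not the paper's algorithm.
    """
    stack = []      # LRU stack: the most recently used element is last
    distances = []
    for x in trace:
        if x in stack:
            pos = stack.index(x)
            # Every element above x was touched since x was last used.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(x)  # x becomes the most recently used element
    return distances
```

For example, in the trace `a b c a b` the second access to `a` has distance 2, because `b` and `c` were accessed in between.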

Single Assignment C -- efficient support for high-level array operations in a functional setting

by Sven-Bodo Scholz, 2003
Cited by 28 (11 self)
Abstract not found

Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations

by Daniel Cociorva, Gerald Baumgartner, J. Ramanujam, Marcel Nooijen, Chi-chung Lam, 2002
Cited by 26 (19 self)
The accurate modeling of the electronic structure of atoms and molecules is very computationally intensive. Many models of electronic structure, such as the Coupled Cluster approach, involve collections of tensor contractions. There are usually a large number of alternative ways of implementing the tensor contractions, representing different trade-offs between the space required for temporary intermediates and the total number of arithmetic operations. In this paper, we present an algorithm that starts with an operation-minimal form of the computation and systematically explores the possible space-time trade-offs to identify the form with lowest cost that fits within a specified memory limit. Its utility is demonstrated by applying it to a computation representative of a component in the CCSD(T) formulation in the NWChem quantum chemistry suite from Pacific Northwest National Laboratory.

Miss rate prediction across program inputs and cache configurations

by Yutao Zhong, Steven G. Dropsho, Xipeng Shen, Ahren Studer, Chen Ding - IEEE Transactions on Computers, 2007
Cited by 17 (12 self)
Improving cache performance requires understanding cache behavior. However, measuring cache performance for one or two data input sets provides little insight into how cache behavior varies across all data input sets and all cache configurations. This paper uses locality analysis to generate a parameterized model of program cache behavior. Given a cache size and associativity, this model predicts the miss rate for arbitrary data input set sizes. This model also identifies critical data input sizes where cache behavior exhibits marked changes. Experiments show this technique is within 2 percent of the hit rate for set associative caches on a set of floating-point and integer programs using array and pointer-based data structures. Building on the new model, this paper presents an interactive visualization tool that uses a three-dimensional plot to show miss rate changes across program data sizes and cache sizes and its use in evaluating compiler transformations. Other uses of this visualization tool include assisting machine and benchmark-set design. The tool can be accessed on the Web at
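The link between locality analysis and miss rate can be sketched for the simplest case, a fully associative LRU cache. There, an access hits exactly when its reuse distance is smaller than the cache capacity. The function name `lru_miss_rate` and the input format (one distance per access, `None` for a first access) are assumptions for illustration. The paper's parameterized model, which also covers set-associative caches and varying input sizes, is not reproduced here.

```python
def lru_miss_rate(distances, cache_size):
    """Miss rate of a fully associative LRU cache of `cache_size` elements.

    `distances` holds one reuse distance per access, with None marking a
    first access (a cold miss). An access hits only if its distance is
    strictly smaller than the cache size.
    """
    if not distances:
        return 0.0
    misses = sum(1 for d in distances if d is None or d >= cache_size)
    return misses / len(distances)
```

With the distances `[None, None, None, 2, 2]`, a cache of two elements misses on every access, while a cache of three elements misses only on the three cold accesses.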

A Tuning Framework for Software-Managed Memory Hierarchies

by Manman Ren, Alex Aiken, Ji Young Park, William J. Dally, Mike Houston
Cited by 10 (1 self)
Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine’s particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of interdependent parameters that require tuning, and manually searching the resulting large, non-linear space of program parameters is a tedious process of trial and error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony PlayStation 3s.

Lightweight reference affinity analysis

by Xipeng Shen, Yaoqing Gao, Chen Ding, Roch Archambault - In Proceedings of the 19th ACM International Conference on Supercomputing, 2005
Cited by 10 (5 self)
Previous studies have shown that array regrouping and structure splitting significantly improve data locality. The most effective technique relies on profiling every access to every data element. The high overhead impedes its adoption in a general compiler. In this paper, we show that for array regrouping in scientific programs, the overhead is not needed since the same benefit can be obtained by pure program analysis. We present an interprocedural analysis technique for array regrouping. For each global array, the analysis summarizes the access pattern by access-frequency vectors and then groups arrays with similar vectors. The analysis is context sensitive, so it tracks the exact array access. For each loop or function call, it uses two methods to estimate the frequency of the execution. The first is symbolic analysis in the compiler. The second is lightweight profiling of the code. The same interprocedural analysis is used to cumulate the overall execution frequency by considering the calling context. We implemented a prototype of both the compiler and the profiling analysis in the IBM® compiler, evaluated array regrouping on the entire set of SPEC CPU2000 FORTRAN benchmarks, and compared different analysis methods. The pure compiler-based array regrouping improves the performance for the majority of programs, leaving little room for improvement by code or data profiling.
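The grouping step described in this abstract (summarize each array by an access-frequency vector, then group arrays with similar vectors) might look roughly like the sketch below. The function name `group_arrays`, the per-loop count vectors, and the similarity test (largest per-loop difference within a `tolerance`) are all assumptions made for illustration. The abstract does not specify the paper's actual similarity measure.

```python
def group_arrays(freq_vectors, tolerance=0.0):
    """Greedily group arrays whose access-frequency vectors are similar.

    `freq_vectors` maps an array name to a tuple of access counts, one
    per loop, all of the same length. Two vectors count as similar when
    their largest per-loop difference is at most `tolerance` (an assumed
    measure, not taken from the paper).
    """
    groups = []  # each group is a list of names; the first is its representative
    for name, vec in freq_vectors.items():
        for group in groups:
            rep = freq_vectors[group[0]]
            gap = max((abs(a - b) for a, b in zip(vec, rep)), default=0)
            if gap <= tolerance:
                group.append(name)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([name])
    return groups
```

Arrays that land in the same group are candidates for regrouping (for instance, interleaving their elements), since they tend to be accessed together.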

Profitable loop fusion and tiling using model-driven empirical search

by Apan Qasem, Ken Kennedy - In ICS, 2006
Cited by 8 (0 self)
Loop fusion and tiling are both recognized as effective transformations for improving memory performance of scientific applications. However, because of their sensitivity to the underlying cache architecture and their interaction with each other, it is difficult to determine a good heuristic for applying these transformations profitably across architectures. In this paper, we present a model-guided empirical tuning strategy for profitable application of loop fusion and tiling. Our strategy consists of a detailed cost model that characterizes the interaction between the two transformations at different levels of the memory hierarchy. The novelty of our approach is in exposing key architectural parameters within the model for automatic tuning through empirical search. Preliminary experiments with a set of applications on four different platforms show that our strategy achieves significant performance improvement over fully optimized code generated by state-of-the-art commercial compilers. The time spent in searching for the best parameters is considerably less than with other search strategies.
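For readers unfamiliar with the two transformations named here, the following sketch shows each in isolation. The function names and the `tile` parameter are illustrative assumptions, and square matrices are assumed. In a tuning setting like the one described, the tile size is the kind of parameter a search would select, but the paper's cost model and search strategy are not shown.

```python
def unfused(a):
    b = [x * 2 for x in a]       # first loop: one pass over the data
    return [x + 1 for x in b]    # second loop: re-reads every element

def fused(a):
    # Loop fusion: both operations applied in a single pass, so each
    # element is reused while it is still close to the processor.
    return [x * 2 + 1 for x in a]

def tiled_matmul(A, B, tile=2):
    """Multiply square matrices A and B, visiting them tile by tile."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Work on one tile-sized block, so the data touched here
                # can stay in cache while it is reused.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Both transformations leave the result unchanged and alter only the order in which data is touched, which is why their benefit depends on the cache architecture.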

Program Locality Analysis Using Reuse Distance

by Yutao Zhong, Xipeng Shen, Chen Ding, 2009
Cited by 8 (4 self)
On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance—the number of distinct locations accessed between consecutive accesses to a given location. This article addresses the analysis problem at the program level, where the size of data and the locality of execution may change significantly depending on the input. The article presents two techniques that predict how the locality of a program changes with its input. The first is approximate reuse-distance measurement, which is asymptotically faster than exact methods while providing a guaranteed precision. The second is statistical prediction of locality in all executions of a program based on the analysis of a few executions. The prediction process has three steps: dividing data accesses into groups, finding the access patterns in each group, and building parameterized models. The resulting prediction may be used on-line with the help of distance-based sampling. When evaluated on fifteen benchmark applications, the new techniques predicted program locality with good accuracy, even for test executions that are orders of magnitude larger than the training executions. The two techniques are among the first to enable quantitative analysis of whole-program locality and

Restructuring computations for temporal data cache locality

by Venkata K. Pingali, Sally A. McKee, Wilson C. Hsieh, John B. Carter - International Journal of Parallel Programming, 2003
Cited by 7 (0 self)
... with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it is useful to trade increased computation costs for reduced memory costs. The contributions of this paper are ...

Computation regrouping: Restructuring programs for temporal data cache locality

by Venkata K. Pingali, Wilson C. Hsieh - In Intl. Conf. on Supercomputing, 2002
Cited by 7 (1 self)
Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it may be reasonable to trade increased computation costs for reduced memory costs. The contributions of this paper are three-fold: we provide a detailed analysis of the memory performance of a set of seven memory-intensive benchmarks; we describe Computation Regrouping, a general, source-level approach to improving the overall performance of these applications by improving temporal locality to reduce cache and TLB miss ratios (and thus memory stall times); and we demonstrate significant performance improvements from applying Computation Regrouping to our suite of seven benchmarks.