Cross-Architecture Performance Predictions for Scientific Applications Using Parameterized Models
, 2004
Cited by 74 (2 self)
This paper describes a toolkit for semi-automatically measuring and modeling static and dynamic characteristics of applications in an architecture-neutral fashion. For predictable applications, models of dynamic characteristics have a convex and differentiable profile. Our toolkit operates on application binaries and succeeds in modeling key application characteristics that determine program performance. We use these characterizations to explore the interactions between an application and a target architecture. We apply our toolkit to SPARC binaries to develop architecture-neutral models of computation and memory access patterns of the ASCI Sweep3D and the NAS SP, BT and LU benchmarks. From our models, we predict the L1, L2 and TLB cache miss counts as well as the overall execution time of these applications on an Origin 2000 system. We evaluate our predictions by comparing them against measurements collected using hardware performance counters.
Efficient Simulation of Caches under Optimal Replacement with Applications to Miss Characterization
- In Proceedings of the ACM SIGMETRICS Conference on Measurement & Modeling Computer Systems
, 1993
Cited by 72 (2 self)
Cache miss characterization models such as the three Cs model are useful in developing schemes to reduce cache misses and their penalty. In this paper we propose the OPT model that uses cache simulation under optimal (OPT) replacement to obtain a finer and more accurate characterization of misses than the three Cs model. However, current methods for optimal cache simulation are slow and difficult to use. We present three new techniques for optimal cache simulation. First, we propose a limited lookahead strategy with error fixing, which allows one pass simulation of multiple optimal caches. Second, we propose a scheme to group entries in the OPT stack, which allows efficient tree-based fully-associative cache simulation under OPT. Third, we propose a scheme for exploiting partial inclusion in set-associative cache simulation under OPT. Simulators based on these algorithms were used to obtain cache miss characterizations using the OPT model for nine SPEC benchmarks. The results indicate ...
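For readers unfamiliar with optimal (OPT) replacement, the following is a minimal sketch of the basic policy for a single fully associative cache: on a miss with a full cache, evict the block whose next reference lies farthest in the future. It is not the paper's limited-lookahead, OPT-stack, or partial-inclusion technique; it assumes the whole trace is available in advance, and the function name `simulate_opt` is hypothetical.

```python
def simulate_opt(trace, capacity):
    """Count misses for one fully associative cache under OPT replacement.

    Assumption: the complete reference trace is known up front, so the
    next use of every block can be precomputed in a backward pass.
    """
    never = float("inf")
    next_use = [never] * len(trace)
    last_seen = {}
    # Backward pass: next_use[i] is the index of the next reference to trace[i].
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], never)
        last_seen[trace[i]] = i

    cache = {}  # block -> index of its next reference
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            cache[block] = next_use[i]  # hit: refresh the next-use index
            continue
        misses += 1
        if len(cache) >= capacity:
            # Evict the block that is referenced farthest in the future.
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[block] = next_use[i]
    return misses


if __name__ == "__main__":
    trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    for size in (3, 4, 5):
        print(size, simulate_opt(trace, size))
```

This sketch simulates one cache size per pass and scans the cache on every eviction, which is the kind of cost the abstract's one-pass, tree-based techniques are meant to avoid.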
Choosing Representative Slices of Program Execution for Microarchitecture Simulations: A Preliminary Application to the Data Stream
- In Workload Characterization of Emerging Applications
, 2000
Cited by 65 (1 self)
Microarchitecture simulations aim to provide results representative of the behavior of a processor running an application. Due to CPU time constraints, only a few execution slices of a large application can be simulated. This paper proposes a technique for choosing a few program execution slices that are representative of the entire execution. We characterize the behavior of each consecutive slice executed, then use a statistical classification method to discriminate between the execution slices and select the representative ones. In this paper, we detail this approach and apply it to the data stream. Using data cache simulations on the SPEC95 programs, we show that slices representing 1.46% of the overall program activity (averaged over all SPEC95 programs but one) are as representative as trace sampling using a 10% sampling ratio. Keywords: micro-architecture simulation, trace-driven simulation, on-the-fly simulation, trace sampling, clustering, classification, data stre...
Predicting Whole-Program Locality Through Reuse Distance Analysis
, 2003
Cited by 44 (0 self)
Profiling can accurately analyze program behavior for select data inputs. We show that profiling can also predict program locality for inputs other than profiled ones. Here locality is defined by the distance of data reuse. Studying whole-program data reuse may reveal global patterns not apparent in short-distance reuses or local control flow. However, the analysis must meet two requirements to be useful. The first is efficiency. It needs to analyze all accesses to all data elements in full-size benchmarks and to measure distance of any length and in any required precision. The second is prediction. Based on a few training runs, it needs to classify patterns as regular and irregular and, for regular ones, it should predict their (changing) behavior for other inputs. In this paper, we show that these goals are attainable through three techniques: approximate analysis of reuse distance (originally called LRU stack distance), pattern recognition, and distance-based sampling. When tested on 15 integer and floating-point programs from SPEC and other benchmark suites, our techniques predict with 94% accuracy on average for data inputs up to hundreds of times larger than the training inputs. Based on these results, the paper discusses possible uses of this analysis.
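To make the term concrete, here is a minimal sketch of exact reuse distance (LRU stack distance): for each access, the number of distinct data elements touched since the previous access to the same element. This is a naive list-based version written for illustration, not the approximate analysis the abstract describes; the names `reuse_distances` and `lru_misses` are hypothetical.

```python
def reuse_distances(trace):
    """Return the reuse distance of every access (None for a first access).

    Naive sketch: the LRU stack is a Python list, so each access costs time
    proportional to the number of distinct elements seen so far.
    """
    stack = []  # least recently used first, most recently used last
    distances = []
    for item in trace:
        if item in stack:
            pos = stack.index(item)
            # Elements above `item` in the stack are the distinct elements
            # accessed since its previous use.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(item)
    return distances


def lru_misses(distances, capacity):
    """Misses of a fully associative LRU cache holding `capacity` elements."""
    return sum(1 for d in distances if d is None or d >= capacity)


if __name__ == "__main__":
    dists = reuse_distances(list("abcaab"))
    print(dists)
    print(lru_misses(dists, 2), lru_misses(dists, 3))
```

Because an access hits in a fully associative LRU cache exactly when its reuse distance is smaller than the cache capacity, one list of distances yields miss counts for every cache size.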
CRAMM: Virtual memory support for garbage-collected applications
- In USENIX Symposium on Operating Systems Design and Implementation
, 2006
Cited by 31 (4 self)
Existing virtual memory systems usually work well with applications written in C and C++, but they do not provide adequate support for garbage-collected applications. The performance of garbage-collected applications is sensitive to heap size. Larger heaps reduce the frequency of garbage collections, making these applications run several times faster. However, if the heap is too large to fit in the available RAM, garbage collection can trigger thrashing. Existing Java virtual machines attempt to adapt their application heap sizes to fit in RAM, but suffer performance degradations of up to 94% when subjected to bursts of memory pressure. We present CRAMM (Cooperative Robust Automatic Memory Management), a system that solves these problems. CRAMM consists of two parts: (1) a new virtual memory system that collects detailed reference information for (2) an analytical model tailored to the underlying garbage collection algorithm. The CRAMM virtual memory system tracks recent reference behavior with low overhead. The CRAMM heap sizing model uses this information to compute a heap size that maximizes throughput while minimizing paging. We present extensive empirical results demonstrating CRAMM’s ability to maintain high performance in the face of changing application and system load.
Calculating Stack Distances Efficiently
, 2001
Cited by 29 (1 self)
This paper describes our experience using the stack processing algorithm [6] for estimating the number of cache misses in scientific programs. By using a new data structure and various optimization techniques, we obtain instrumented runtimes within 50 to 100 times the original optimized runtimes of our benchmarks.
Thin-Client Web Access Patterns: Measurements from a Cache-Busting Proxy
, 2001
Cited by 24 (3 self)
This paper describes a new technique for measuring Web client request patterns and analyzes a large client trace collected using the new method. In this approach a modified proxy intercepts requests and serves all responses to clients marked uncacheable, effectively disabling browser caches and allowing the proxy to record requests that would otherwise result in silent browser cache hits. WebTV Networks used a "cache-busting proxy" to collect an unusually large and detailed anonymized Web client trace in September 2000. It contains over 347 million requests for over 36 million documents by over 37,000 clients and spans 16 days. By most measures it is two orders of magnitude larger than existing Web client traces.
Optimal Web Cache Sizing: Scalable Methods for Exact Solutions
- Computer Communications
, 2000
Cited by 21 (4 self)
This paper describes two approaches to the problem of determining exact optimal storage capacity for Web caches based on expected workload and the monetary costs of memory and bandwidth. The first approach considers memory/bandwidth tradeoffs in an idealized model. It assumes that the workload consists of independent references drawn from a known distribution (e.g., Zipf) and that caches employ a "Perfect LFU" removal policy. We derive conditions under which a shared higher-level "parent" cache serving several lower-level "child" caches is economically viable. We also characterize circumstances under which globally optimal storage capacities in such a hierarchy can be determined through a decentralized computation in which caches individually minimize local monetary expenditures. The second approach is applicable if the workload at a single cache is represented by an explicit request sequence and the cache employs any one of a large family of removal policies that includes LRU. The mis...
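The idealized model in the first approach can be illustrated with a small sketch for a single cache. Under independent references, a cache that always holds the most popular documents has a hit rate equal to the summed probability of the documents it stores. The cost model below (a fixed memory price per cached document plus a fixed price per miss) and all names and parameters are assumptions made for illustration, with unit-sized documents; they are not taken from the paper.

```python
def zipf_probabilities(num_docs, alpha=1.0):
    """Zipf-like popularity: document at rank r has weight 1 / r**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_docs + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def perfect_lfu_hit_rate(probs, capacity):
    """Hit rate when the cache holds the `capacity` most popular documents."""
    return sum(sorted(probs, reverse=True)[:capacity])


def cheapest_capacity(probs, mem_cost_per_doc, miss_cost, num_requests):
    """Capacity minimizing memory cost plus miss (bandwidth) cost.

    Assumed cost model: mem_cost_per_doc * k + miss_cost * expected misses.
    """
    ranked = sorted(probs, reverse=True)
    best_k = 0
    best_cost = miss_cost * num_requests  # empty cache: every request misses
    hit_rate = 0.0
    for k, p in enumerate(ranked, start=1):
        hit_rate += p
        cost = mem_cost_per_doc * k + miss_cost * num_requests * (1.0 - hit_rate)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


if __name__ == "__main__":
    probs = zipf_probabilities(1000)
    print(perfect_lfu_hit_rate(probs, 100))
    print(cheapest_capacity(probs, mem_cost_per_doc=2.0, miss_cost=1.0,
                            num_requests=10_000))
```

The exhaustive scan over capacities is only practical for small catalogs; it is meant to show the memory/bandwidth tradeoff, not a scalable method.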
Random Sampling from Databases - A Survey
- Statistics and Computing
, 1994
Cited by 20 (0 self)
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g., acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+ trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g., single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for estimation of aggregates (e.g., the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed. Draft of March 22, 1994.
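Reservoir sampling, one of the basic techniques named above, draws a uniform random sample of fixed size from a stream whose length is not known in advance. The following is a minimal sketch of the classic single-pass scheme (often called Algorithm R); the function name and the optional `rng` parameter are illustrative choices, not something specified by the survey.

```python
import random


def reservoir_sample(rows, k, rng=None):
    """Return k items drawn uniformly at random from the iterable `rows`.

    Makes a single pass and keeps only k items in memory. If `rows` has
    fewer than k items, all of them are returned.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)  # fill the reservoir with the first k rows
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, rng=random.Random(42)))
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when testing code that consumes the sample.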
Multiple page size modeling and optimization
- In Proc. of the 14th International Conference on Parallel Architectures and Compilation Techniques
, 2005
Cited by 15 (3 self)
With the growing awareness that individual hardware cores will not continue to produce the same level of performance improvement, there is a need to develop an integrated approach to performance optimization. In this paper we present a paradigm for Continuous Program Optimization (CPO), whereby automatic agents monitor and optimize application and system performance. The monitoring data is used to analyze and create models of application and system behavior. Using this analysis, we describe how CPO agents can improve the performance of both the application and the underlying system. Using the CPO paradigm, we implemented cooperating page size optimization agents that automatically optimize large page usage. An offline agent uses vertically integrated performance data to produce a page size benefit analysis for different categories of data structures within an application. We show how an online CPO agent can use the results of the predictive analysis to automatically improve application performance. We validate that the predictions made by the CPO agent reflect the actual performance gains of up to 60% across a range of scientific applications including the SPECcpu2000 floating point benchmarks and two large high performance computing (HPC) applications.

