Results 1 - 10
of
51
A Dollar from 15 Cents: Cross-Platform Management for Internet Services
- In USENIX
, 2008
"... As Internet services become ubiquitous, the selection and management of diverse server platforms now affects the bottom line of almost every firm in every industry. Ideally, such cross-platform management would yield high performance at low cost, but in practice, the performance consequences of such ..."
Abstract
-
Cited by 19 (7 self)
- Add to MetaCart
As Internet services become ubiquitous, the selection and management of diverse server platforms now affects the bottom line of almost every firm in every industry. Ideally, such cross-platform management would yield high performance at low cost, but in practice, the performance consequences of such decisions are often hard to predict. In this paper, we present an approach to guide cross-platform management for real-world Internet services. Our approach is driven by a novel performance model that predicts application-level performance across changes in platform parameters, such as processor cache sizes, processor speeds, etc., and can be calibrated with data commonly available in today’s production environments. Our model is structured as a composition of several empirically observed, parsimonious sub-models. These sub-models have few free parameters and can be calibrated with lightweight passive observations on a current production platform. We demonstrate the usefulness of our cross-platform model in two management problems. First, our model provides accurate performance predictions when selecting the next generation of processors to enter a server farm. Second, our model can guide platform-aware load balancing across heterogeneous server farms. 1
An approach to performance prediction for parallel applications
- Euro-Par, Springer LNCS
, 2005
"... Abstract. Accurately modeling and predicting performance for largescale applications becomes increasingly difficult as system complexity scales dramatically. Analytic predictive models are useful, but are difficult to construct, usually limited in scope, and often fail to capture subtle interactions ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
Abstract. Accurately modeling and predicting performance for largescale applications becomes increasingly difficult as system complexity scales dramatically. Analytic predictive models are useful, but are difficult to construct, usually limited in scope, and often fail to capture subtle interactions between architecture and software. In contrast, we employ multilayer neural networks trained on input data from executions on the target platform. This approach is useful for predicting many aspects of performance, and it captures full system complexity. Our models are developed automatically from the training input set, avoiding the difficult and potentially error-prone process required to develop analytic models. This study focuses on the high-performance, parallel application SMG2000, a much studied code whose variations in execution times are still not well understood. Our model predicts performance on two large-scale parallel platforms within 5%-7 % error across a large, multidimensional parameter space. 1
A Statistical Multiprocessor Cache Model
, 2005
"... The introduction of general purpose microprocessors running multiple threads will put a focus on methods and tools helping a programmer to write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also be accurate and ..."
Abstract
-
Cited by 16 (2 self)
- Add to MetaCart
The introduction of general purpose microprocessors running multiple threads will put a focus on methods and tools helping a programmer to write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also be accurate and flexible enough to provide trendcorrect and intuitive feedback. This paper describes an efficient and flexible approach for modeling the memory system of a multiprocessor, such as those of chip multiprocessors (CMPs). Sparse data is sampled during a multithreaded execution. The data collected consist of the reuse distance and invalidation distribution for a small subset of the memory accesses. Based on the sampled data from a single run, a new mathematical formula is used to estimate the miss rate for a multiprocessor memory hierarchy built from caches of arbitrarily size, cache-line size and degree of sharing. The formula further divides the misses into six categories to aid the software developer. The method is evaluated using a large number of commercial and technical multithreaded applications. The result produced by our algorithm fed with sparse sampling data is shown to be consistent with results gathered during traditional architecture simulation.
Methods of inference and learning for performance modeling of parallel applications
- in Proc. of the International Symposium on Principles and Practices of Parallel Programming
, 2007
"... Increasing system and algorithmic complexity combined with a growing number of tunable application parameters pose significant challenges for analytical performance modeling. We propose a series of robust techniques to address these challenges. In particular, we apply statistical techniques such as ..."
Abstract
-
Cited by 13 (4 self)
- Add to MetaCart
Increasing system and algorithmic complexity combined with a growing number of tunable application parameters pose significant challenges for analytical performance modeling. We propose a series of robust techniques to address these challenges. In particular, we apply statistical techniques such as clustering, association, and correlation analysis, to understand the application parameter space better. We construct and compare two classes of effective predictive models: piecewise polynomial regression and artifical neural networks. We compare these techniques with theoretical analyses and experimental results. Overall, both regression and neural networks are accurate with median error rates ranging from 2.2 to 10.5 percent. The comparable accuracy of these models suggest differentiating features will arise from ease of use, transparency, and computational efficiency.
A predictive performance model for superscalar processors
- In International Symposium on Microarchitecture
, 2006
"... Designing and optimizing high performance microprocessors is an increasingly difficult task due to the size and complexity of the processor design space, high cost of detailed simulation and several constraints that a processor design must satisfy. In this paper, we propose the use of empirical non- ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Designing and optimizing high performance microprocessors is an increasingly difficult task due to the size and complexity of the processor design space, high cost of detailed simulation and several constraints that a processor design must satisfy. In this paper, we propose the use of empirical non-linear modeling techniques to assist processor architects in making design decisions and resolving complex trade-offs. We propose a procedure for building accurate non-linear models that consists of the following steps: (i) selection of a small set of representative design points spread across processor design space using latin hypercube sampling, (ii) obtaining performance measures at the selected design points using detailed simulation, (iii) building non-linear models for performance using the function approximation capabilities of radial basis function networks, and (iv) validating the models using an independently and randomly generated set of design points. We evaluate our model building procedure by constructing non-linear performance models for programs from the SPEC CPU2000 benchmark suite with a microarchitectural design space that consists of 9 key parameters. Our results show that the models, built using a relatively small number of simulations, achieve high prediction accuracy (only 2.8 % error in CPI estimates on average) across a large processor design space. Our models can potentially replace detailed simulation for common tasks such as the analysis of key microarchitectural trends or searches for optimal processor design points. 1.
Scalable cross-architecture predictions of memory hierarchy response for scientific applications
- In Proceedings of the Symposium of the Las Alamo s Computer Science Institute, Sante Fe
, 2005
"... The gap between processor and memory speeds has been growing with each new generation of microprocessors. As a result, memory hierarchy response has become a critical factor limiting application performance. For this reason, we have been working to model the impact of memory access latency on progra ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
The gap between processor and memory speeds has been growing with each new generation of microprocessors. As a result, memory hierarchy response has become a critical factor limiting application performance. For this reason, we have been working to model the impact of memory access latency on program performance. We build upon our prior work on constructing machine-independent characterizations of application behavior [9] by improving instructionlevel modeling of the structure and scaling of data access patterns. This paper makes two contributions. First, it describes static analysis techniques that help us build accurate reference-level characterizations of memory reuse patterns in the presence of complex interactions between loop unrolling and multi-word memory blocks. Second, it describes a strategy for combining memory hierarchy response characterizations suitable for predicting behavior for fully associative caches with a probabilistic technique that enables us to predict misses for set-associative caches. We validate our approach by comparing our predictions (at loop, routine and program level) against measurements using hardware performance counters for several benchmarks on two platforms with different memory hierarchy characteristics over a large range of problem sizes.
A Genetic Algorithms Approach to Modeling the Performance of Memory-bound Computations
, 2007
"... Benchmarks that measure memory bandwidth, such as STREAM, Apex-MAPS and MultiMAPS, are increasingly popular due to the "Von Neumann" bottleneck of modern processors which causes many calculations to be memory-bound. We present a scheme for predicting the performance of HPC applications based on the ..."
Abstract
-
Cited by 10 (3 self)
- Add to MetaCart
Benchmarks that measure memory bandwidth, such as STREAM, Apex-MAPS and MultiMAPS, are increasingly popular due to the "Von Neumann" bottleneck of modern processors which causes many calculations to be memory-bound. We present a scheme for predicting the performance of HPC applications based on the results of such benchmarks. A Genetic Algorithm approach is used to "learn" bandwidth as a function of cache hit rates per machine with MultiMAPS as the fitness test. The specific results are 56 individual performance predictions including 3 full-scale parallel applications run on 5 different modern HPC architectures, with various CPU counts and inputs, predicted within 10 % average difference with respect to independently verified runtimes.
An adaptive performance modeling tool for gpu architectures
- In PPoPP
, 2010
"... This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify interpretation of a GPU kernel, work flow graph, based on which we estimate the execution time of a GPU kernel. We validated our performance model on the NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be accurately modeled but represent challenges to the analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from a kernel code.
Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?
"... Abstract. On Chip Multiprocessors (CMP), it is common that multiple cores share certain levels of cache. The sharing increases the contention in cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Abstract. On Chip Multiprocessors (CMP), it is common that multiple cores share certain levels of cache. The sharing increases the contention in cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served for a variety of locality analysis, program transformations, and performance prediction. However, previous studies have concentrated on sequential programs running on unicore processors. On CMP, accesses by different threads (or jobs) interact in the shared cache. How reuse distance applies to the new architecture remains an open question—particularly, how the interactions in shared cache affect the collection and application of reuse distance, and how reuse-distance–based locality analysis should adapt to such architecture changes. This paper presents our explorations towards answering those questions. It first introduces the concept of concurrent reuse distance, a direct extension of the traditional concept of reuse distance with data references by all co-running threads (or jobs) considered. It then discusses the properties of concurrent reuse distance, revealing the special challenges facing the collection and application of concurrent reuse distance on CMP platforms. Finally, it presents the solutions to those challenges for a class of multithreading applications. The solutions center on a probabilistic model that connects concurrent reuse distance with the data locality of each individual thread. Experiments demonstrate the effectiveness of the proposed techniques in facilitating the uses of concurrent reuse distance for CMP computing. 1

