Results 1 - 10
of
18
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
, 2009
"... GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architect ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4 % and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.
Rapid development of a flexible validated processor model
- In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation
, 2005
"... For a variety of reasons, most architectural evaluations use simulation models. An accurate baseline model validated against existing hardware provides confidence in the results of these evaluations. Meanwhile, a meaningful exploration of the design space requires a wide range of quickly-obtainable ..."
Abstract
-
Cited by 22 (11 self)
- Add to MetaCart
For a variety of reasons, most architectural evaluations use simulation models. An accurate baseline model validated against existing hardware provides confidence in the results of these evaluations. Meanwhile, a meaningful exploration of the design space requires a wide range of quickly-obtainable variations of the baseline. Unfortunately, these two goals are generally considered to be at odds; the set of validated models is considered exclusive of the set of easily malleable models. Vachharajani et al. challenge this belief and propose a modeling methodology they claim allows rapid construction of flexible validated models. Unfortunately, they only present anecdotal and secondary evidence to support their claims. In this paper, we present our experience using this methodology to construct a validated flexible model of Intel’s Itanium 2 processor. Our practical experience lends support to the above claims. Our initial model was constructed by a single researcher in only 11 weeks and predicts processor cycles-per-instruction (CPI) to within 7.9 % on average for the entire SPEC CINT2000 benchmark suite. Our experience with this model showed us that aggregate accuracy for a metric like CPI is not sufficient. Aggregate measures like CPI may conceal remaining internal “offsetting errors ” which can adversely affect conclusions drawn from the model. Using this as our motivation, we explore the flexibility of the model by modifying it to target specific error constituents, such as front-end stall errors. In 2 1 2 person-weeks, average CPI error was reduced to 5.4%. The targeted error constituents were reduced more dramatically; front-end stall errors were reduced from 5.6 % to 1.6%. The swift implementation of significant new architectural features on this model further demonstrated its flexibility. 1
An analytical model of the working-set sizes in decision-support systems
- In SIGMETRICS
, 2000
"... This paper presents an analytical model to study how working sets scale with database size and other applications parameters in decision-support systems (DSS). The model uses application parameters, that are measured on down-scaled database executions, to predict cache miss ratios for executions of ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
This paper presents an analytical model to study how working sets scale with database size and other applications parameters in decision-support systems (DSS). The model uses application parameters, that are measured on down-scaled database executions, to predict cache miss ratios for executions of large databases. By applying the model to two database engines and typical DSS queries we find that, even for large databases, the most performance-critical working set is small and is caused by the instructions and private data that are required to access a single tuple. Consequently, its size is not affected by the database size. Surprisingly, database data may also exhibit temporal locality but the size of its working set critically depends on the structure of the query, the method of scanning, and the size and the content of the database. 1.
Microarchitecture Modeling for Design-Space Exploration Design-Space Exploration
, 2004
"... To identify the best processor designs, designers explore a vast design space. To assess the quality of candidate designs, designers construct and use simulators. Unfortunately, simulator construction is a bottleneck in this design-space exploration because existing simulator construction methodolog ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
To identify the best processor designs, designers explore a vast design space. To assess the quality of candidate designs, designers construct and use simulators. Unfortunately, simulator construction is a bottleneck in this design-space exploration because existing simulator construction methodologies lead to long simulator development times. This bottleneck limits exploration to a small set of designs, potentially diminishing quality of the final design.
AMVA Techniques for High Service Time Variability
- IN PROC. ACM SIGMETRICS
, 2000
"... Motivated by experience gained during the validation of a recent Approximate Mean Value Analysis (AMVA) model of modern shared memory architectures, this paper re-examines the "standard" AMVA approximation for non-exponential FCFS queues. We find that this approximation is often inaccurate for FCFS ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Motivated by experience gained during the validation of a recent Approximate Mean Value Analysis (AMVA) model of modern shared memory architectures, this paper re-examines the "standard" AMVA approximation for non-exponential FCFS queues. We find that this approximation is often inaccurate for FCFS queues with high service time variability. For such queues, we propose and evaluate: (1) AMVA estimates of the mean residual service time at an arrival instant that are much more accurate than the standard AMVA estimate, (2) a new AMVA technique that provides a much more accurate estimate of mean center residence time than the standard AMVA estimate, and (3) a new AMVA technique for computing the mean residence time at a "downstream" queue which has a more bursty arrival process than is assumed in the standard AMVA equations. Together, these new techniques increase the range of applications to which AMVA may be fruitfully applied, so that for example, the memory system architecture of shared memory systems with complex modern processors can be analyzed with these computationally efficient methods.
Performance Prediction for Random Write Reductions: A Case Study in Modeling Shared Memory Programs
- ACM SIGMETRICS intern.conf. on Measur. and Model. of computer systems, p117 – 128, 2002
, 2002
"... In this paper, we revisit the problem of performance prediction on shared memory parallel machines, motivated by the need for selecting parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms. ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
In this paper, we revisit the problem of performance prediction on shared memory parallel machines, motivated by the need for selecting parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms.
Techniques for Accurate, Accelerated Processor Simulation: An Analysis of Reduced Inputs and Sampling
, 2002
"... Detailed execution-driven simulation is an important tool for computer architecture research. It is desirable to drive these simulations with standard benchmark programs that are commonly used to evaluate existing computer systems, such as the SPEC2000 suite. Unfortunately, simulating these benchm ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Detailed execution-driven simulation is an important tool for computer architecture research. It is desirable to drive these simulations with standard benchmark programs that are commonly used to evaluate existing computer systems, such as the SPEC2000 suite. Unfortunately, simulating these benchmark programs to completion using full-detail, cycle-accurate simulation on the designated reference input sets results in intractably long simulation durations. This study evaluates and compares two techniques for combating long simulation times: reduced inputs and sampling. Our objective is to assess the ability of each to reduce simulation running times, while simultaneously minimizing the difference in the results generated by using these techniques relative to the results generated by simulating the benchmark programs to completion using the reference inputs. With the reduced input technique, new input sets are carefully generated by hand to produce run-time characteristics of the benchmark programs that are comparable to the overall characteristics produced when the programs are run with the standard inputs.
Interval simulation: Raising the level of abstraction in architectural simulation
- In High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on
, 2010
"... Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multicore processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by map ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multicore processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycleaccurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multi-core simulator, show good accuracy up to eight cores (average error of 4.6 % and max error of 11 % for the multi-threaded fullsystem workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect’s toolbox for exploring system-level and high-level micro-architecture trade-offs. 1
AUTOMATED DESIGN OF APPLICATION-SPECIFIC SUPERSCALAR PROCESSORS
, 2006
"... Automated design of superscalar processors can provide future system-on-chip (SOC) designers with a key-turn method of generating superscalar processors that are Pareto-optimal in terms of performance, energy consumption, and area for the target application program(s). Unfortunately, current optimiz ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Automated design of superscalar processors can provide future system-on-chip (SOC) designers with a key-turn method of generating superscalar processors that are Pareto-optimal in terms of performance, energy consumption, and area for the target application program(s). Unfortunately, current optimization methods are based on time-consuming cycle-accurate simulation, unsuitable for analysis of hundreds of thousands of design options that is required to arrive at Pareto-optimal designs. This dissertation bridges the gap between a large design space of superscalar processors and the inability of cycle-accurate simulation to analyze a large design space, by providing a computationally and conceptually simple analytical method for generating Pareto-optimal superscalar processor designs. The proposed and evaluated analytical method consists of three parts: (1) a method for analytically estimating the performance in terms a cycles-per-instruction (CPI) using the application program statistics and the superscalar processor parameters, (2) a method of analytically estimating various energy consuming activities using the application program statistics and the superscalar processor parameters, and (3) a method of finding the Pareto-

