CPR: Composable Performance Regression for Scalable Multiprocessor Models
Cited by 7 (3 self)
Uniprocessor simulators track resource utilization cycle by cycle to estimate performance. Multiprocessor simulators, however, must account for synchronization events that increase the cost of every cycle simulated and shared resource contention that increases the total number of cycles simulated. These effects cause multiprocessor simulation times to scale superlinearly with the number of cores. Composable performance regression (CPR) fundamentally addresses these intractable multiprocessor simulation times, estimating multiprocessor performance with a combination of uniprocessor, contention, and penalty models. The uniprocessor model predicts baseline performance of each core while the contention models predict interfering accesses from other cores. Uniprocessor and contention model outputs are composed by a penalty model to produce the final multiprocessor performance estimate. Trained with a production-quality simulator, CPR is accurate, with median errors of 6.63 and 4.83 percent for dual- and quad-core multiprocessors, respectively. Furthermore, composable regression is scalable, requiring 0.33x the simulations required by prior regression strategies.
Microarchitectural Design Space Exploration Using An Architecture-Centric Approach
Cited by 7 (1 self)
The microarchitectural design space of a new processor is too large for an architect to evaluate in its entirety. Even with the use of statistical simulation, evaluation of a single configuration can take excessive time due to the need to run a set of benchmarks with realistic workloads. This paper proposes a novel machine learning model that can quickly and accurately predict the performance and energy consumption of any set of programs on any microarchitectural configuration. This architecture-centric approach uses prior knowledge from off-line training and applies it across benchmarks. This allows our model to predict the performance of any new program across the entire microarchitecture configuration space with just 32 further simulations. We compare our approach to a state-of-the-art program-specific predictor and show that we significantly reduce prediction error. We reduce the average error when predicting performance from 24% to just 7% and increase the correlation coefficient from 0.55 to 0.95. We then show that this predictor can be used to guide the search of the design space, selecting the best configuration for energy-delay in just 3 further simulations, reducing it to 0.85. We also evaluate the cost of off-line learning and show that we can still achieve a high level of accuracy when using just 5 benchmarks to train. Finally, we analyse our design space and show how different microarchitectural parameters can affect the cycles, energy and energy-delay of the architectural configurations.
Efficiency Trends and Limits from Comprehensive Microarchitectural Adaptivity
- ASPLOS'08, 2008
Cited by 7 (3 self)
Increasing demand for power-efficient, high-performance computing requires tuning applications and/or the underlying hardware to improve the mapping between workload heterogeneity and computational resources. To assess the potential benefits of hardware tuning, we propose a framework that leverages synergistic interactions between recent advances in (a) sampling, (b) predictive modeling, and (c) optimization heuristics. This framework enables qualitatively new capabilities in analyzing the performance and power characteristics of adaptive microarchitectures. For the first time, we are able to simultaneously consider high temporal and comprehensive spatial adaptivity. In particular, we optimize efficiency for many short adaptive intervals and identify the best configuration of 15 parameters, which define a space of 240B points. With frequent sub-application reconfiguration and a fully reconfigurable hardware substrate, adaptive microarchitectures achieve bips^3/w efficiency gains of up to 5.3x (median 2.4x) relative to their static counterparts already optimized for a given application. This 5.3x efficiency gain is derived from a 1.6x performance gain and 0.8x power reduction. Although several applications achieve a significant fraction of their potential efficiency with as few as three adaptive parameters, the three most significant parameters differ across applications. These differences motivate a hardware substrate capable of comprehensive adaptivity to meet these diverse application requirements.
Efficient Architectural Design Space Exploration via Predictive Modeling
Cited by 6 (1 self)
Efficiently exploring exponential-size architectural design spaces with many interacting parameters remains an open problem: the sheer number of experiments required renders detailed simulation intractable. We attack this via an automated approach that builds accurate predictive models. We simulate sampled points, using results to teach our models the function describing relationships among design parameters. The models can be queried and are very fast, enabling efficient design tradeoff discovery. We validate our approach via two uniprocessor sensitivity studies, predicting IPC with only 1-2% error. In an experimental study using the approach, training on 1% of a 250K-point CMP design space allows our models to predict performance with only 4-5% error. Our predictive modeling combines well with techniques that reduce the time taken by each simulation experiment, achieving net time savings of three to four orders of magnitude.
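The sample-then-predict workflow this abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: `toy_simulator` is a hypothetical stand-in for a detailed simulator, the three design parameters and their ranges are invented, and ordinary least squares is swapped in for whatever model family the paper uses.

```python
import itertools
import math
import random

def toy_simulator(depth, width, cache_kb):
    """Hypothetical stand-in for a detailed simulator; returns a synthetic IPC."""
    return 0.5 + 0.3 * math.log2(width) + 0.1 * math.log2(cache_kb) - 0.02 * depth

def features(depth, width, cache_kb):
    """Regression features: intercept, depth, log2(width), log2(cache size)."""
    return [1.0, float(depth), math.log2(width), math.log2(cache_kb)]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_least_squares(rows, targets):
    """Fit coefficients via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

# Illustrative design space: pipeline depth x issue width x cache size (400 points).
space = list(itertools.product(range(10, 30), [1, 2, 4, 8], [16, 32, 64, 128, 256]))

# "Simulate" only a 5% uniform random sample of the space (an assumed sampling rate).
random.seed(0)
sample = random.sample(space, len(space) // 20)
coef = fit_least_squares([features(*d) for d in sample],
                         [toy_simulator(*d) for d in sample])

def predict(design):
    """Query the fitted model for a design that was never simulated."""
    return sum(w * f for w, f in zip(coef, features(*design)))

# Relative prediction error over the whole space, including unsampled points.
errors = sorted(abs(predict(d) - toy_simulator(*d)) / toy_simulator(*d) for d in space)
median_error = errors[len(errors) // 2]
```

Because the toy response is exactly linear in the chosen features, the fit here is essentially perfect; a real simulator would leave residual error, which is what the error figures quoted in the abstract measure.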
Roughness of Microarchitectural Design Topologies and its Implications for Optimization
Cited by 6 (3 self)
Recent advances in statistical inference and machine learning close the divide between simulation and classical optimization, thereby enabling more rigorous and robust microarchitectural studies. To most effectively utilize these now computationally tractable techniques, we characterize design topology roughness and leverage this characterization to guide our usage of analysis and optimization methods. In particular, we compute roughness metrics that require high-order derivatives and multi-dimensional integrals of design metrics, such as performance and power. These roughness metrics exhibit noteworthy correlations (1) against regression model error, (2) against non-linearities and non-monotonicities of contour maps, and (3) against the effectiveness of optimization heuristics such as gradient ascent. Thus, this work quantifies the implications of design topology roughness for commonly used methods and practices in microarchitectural analysis.
A Tutorial in Spatial Sampling and Regression Strategies for Microarchitectural Analysis
Cited by 1 (1 self)
We present a new simulation paradigm for microarchitectural design evaluation and optimization. This paradigm counters increasing simulation costs attributed to the exponentially increasing size of design spaces and the need for more thorough, comprehensive studies when evaluating increasingly diverse design options. We present a tutorial for (1) obtaining a more comprehensive understanding of the design space by (2) selectively simulating a modest number of designs from that space and then (3) more effectively leveraging that simulation data using techniques in statistical inference. We survey techniques in spatial sampling to obtain designs for simulation. We also detail the statistical techniques required to derive efficient and robust models, interleaving code segments from scripts performing these analyses. The predictive ability and computational efficiency of these regression models enable new capabilities in microarchitectural design space studies. Collectively, our experiences with this paradigm suggest significant potential for accurate, efficient statistical inference in the microarchitectural domain.
Characterizing Performance in Virtualized Execution
Workload execution in virtualized machines is rapidly becoming commonplace while support for self-virtualization is already available in mass market processors. From a performance analysis and prediction standpoint, virtualization introduces new sources of uncertainty in reasoning about the factors that impact overall system performance and efficiency. This paper describes a general methodology for profiling the vertical software and hardware stack of a virtualized platform. A key contribution of this paper is an instrumentation approach that enforces alignment between measurements of potentially correlated factors, so that statistical analysis techniques such as regression modeling can be used effectively against the data. The paper shows, through the use of preliminary examples, how one can use such instrumentation to obtain decompositions of the CPI (cycles-per-instruction) metric for two workloads when each is executed first on native hardware and then in an identically configured virtual machine.
NC State University
Although the best processor design for executing a specific workload does depend on the characteristics of the workload, it cannot be determined without factoring in the effect of the interdependencies between different architectural subcomponents. Consequently, workload characteristics alone do not provide an accurate indication of which workloads can perform close to optimal on the same architectural configuration. The primary goal of this paper is to demonstrate that, in the design of a heterogeneous CMP, reducing the set of essential benchmarks based on relative similarity in raw workload behavior may direct the design process towards options that result in sub-optimality of the ultimate design. It is shown that the design parameters of the customized processor configurations, which we refer to as the configurational characteristics, can yield a more accurate indication of the best way to partition the workload space into the groups for which the cores of a heterogeneous system are customized. In order to automate the extraction of the configurational characteristics of workloads, a design exploration tool based on the SimpleScalar timing simulator and the CACTI modeling tool is presented. Results from this tool are used to show how a systematic methodology can be employed to determine the optimal set of core configurations for a heterogeneous CMP under different design objectives. In addition, it is shown that reducing the set of workloads based on even a single widely documented benchmark similarity (between bzip and gzip) can lead to a slowdown in the overall performance of a heterogeneous-CMP design.
Applied Inference: Case Studies in Microarchitectural Design
We propose and apply a new simulation paradigm for microarchitectural design evaluation and optimization. This paradigm enables more comprehensive design studies by combining spatial sampling and statistical inference. Specifically, this paradigm (1) defines a large, comprehensive design space, (2) samples points from the space for simulation, and (3) constructs regression models based on sparse simulations. This approach greatly improves the computational efficiency of microarchitectural simulation and enables new capabilities in design space exploration. We illustrate new capabilities in three case studies for a large design space of approximately 260,000 points: (1) Pareto frontier, (2) pipeline depth, and (3) multiprocessor heterogeneity analyses. In particular, regression models are exhaustively evaluated to identify Pareto optimal designs that maximize performance for given power budgets. These models enable pipeline depth studies in which all parameters vary simultaneously with depth, thereby more effectively revealing interactions with non-depth parameters. Heterogeneity analysis combines regression-based optimization with clustering heuristics to identify efficient design compromises between similar optimal architectures. These compromises are potential core designs in a heterogeneous multicore architecture. Increasing heterogeneity can improve bips^3/w efficiency by as much as 2.4x, a theoretical upper bound on heterogeneity benefits that neglects contention between shared resources as well as design complexity. Collectively, these studies demonstrate regression models' ability to expose trends and identify optima in diverse design regions, motivating the application of such models in statistical inference for more effective use of modern simulator infrastructure.
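The Pareto frontier case study mentioned in this abstract amounts to filtering model predictions for designs that no other design beats on both power and performance. A minimal sketch follows, assuming each design is a dictionary with hypothetical `name`, `power`, and `perf` fields holding model predictions; the five designs and their values are invented for illustration.

```python
def pareto_frontier(designs):
    """Return the designs that are not dominated in (power, performance).

    A design is dominated when another design has power no higher and
    performance no lower, and is strictly better in at least one of the two.
    """
    # Sort by ascending power; among equal power, highest performance first.
    ordered = sorted(designs, key=lambda d: (d["power"], -d["perf"]))
    frontier = []
    best_perf = float("-inf")
    for design in ordered:
        # Keep a design only if it outperforms every cheaper-or-equal design.
        if design["perf"] > best_perf:
            frontier.append(design)
            best_perf = design["perf"]
    return frontier

def best_under_budget(frontier, power_budget):
    """Pick the highest-performance frontier design within a power budget."""
    feasible = [d for d in frontier if d["power"] <= power_budget]
    return max(feasible, key=lambda d: d["perf"]) if feasible else None

# Hypothetical predictions (arbitrary units) for five candidate designs.
predicted = [
    {"name": "A", "power": 10.0, "perf": 1.0},
    {"name": "B", "power": 12.0, "perf": 0.9},  # dominated by A
    {"name": "C", "power": 15.0, "perf": 1.4},
    {"name": "D", "power": 15.0, "perf": 1.2},  # dominated by C
    {"name": "E", "power": 20.0, "perf": 1.5},
]
frontier = pareto_frontier(predicted)
```

The sort makes this a single pass after O(n log n) ordering, which matters when the predictions cover every point of a large space rather than a handful of designs.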
Exploring and Predicting the Architecture/Optimising Compiler Co-Design Space
Embedded processor performance is dependent on both the underlying architecture and the compiler optimisations applied. However, designing both simultaneously is extremely difficult to achieve due to the time constraints designers must work under. Therefore, current methodology involves designing compiler and architecture in isolation, leading to sub-optimal performance of the final product. This paper develops a novel approach to this co-design space problem. For any microarchitectural configuration we automatically predict the performance that an optimising compiler would achieve without actually building it. Once trained, a single run of -O1 on the new architecture is enough to make a prediction with just a 1.6% error rate. This allows the designer to accurately choose an architectural configuration with knowledge of how an optimising compiler will perform on it. We use this to find the best optimising compiler/architectural configuration in our co-design space and demonstrate that it achieves an average 13% performance improvement and energy savings of 23% compared to the baseline, leading to an energy-delay (ED) value of 0.67.
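One reading that reproduces the quoted ED figure is sketched below. It assumes, without confirmation from the abstract, that ED is the energy-delay product normalised to the baseline and that the 13% performance improvement is a 13% reduction in execution time; the symbols $E$, $D$, $E_{\text{base}}$, and $D_{\text{base}}$ are introduced here for illustration only.

```latex
\[
\mathrm{ED}_{\text{rel}}
  = \frac{E}{E_{\text{base}}} \cdot \frac{D}{D_{\text{base}}}
  = (1 - 0.23)\,(1 - 0.13)
  = 0.77 \times 0.87
  \approx 0.67
\]
```

Under the alternative reading of a 1.13x speedup, the same product would be $0.77 / 1.13 \approx 0.68$, so the quoted 0.67 fits the time-reduction interpretation slightly better.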

