Results 1 - 10
of
42
ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems
"... Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracybyallowingeventreorderingandusingsimplisticcontention mo ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
(Show Context)
Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracybyallowingeventreorderingandusingsimplisticcontention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial. We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instructiondriven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores. We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system. 1.
ESESC: A Fast Multicore Simulator Using Time-Based Sampling
- in International Symposium on High Performance Computer Architecture
, 2013
"... Architects rely on simulation in their exploration of the design space. However, slow simulation speed caps their productivity and limits the depth of their exploration. Sampling has been a commonly used remedy. While sampling is shown to be an effective technique for single core processors, its app ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
(Show Context)
Architects rely on simulation in their exploration of the design space. However, slow simulation speed caps their productivity and limits the depth of their exploration. Sampling has been a commonly used remedy. While sampling is shown to be an effective technique for single core processors, its application has been limited to simulation of multiprogram, throughput applications only. This work presents Time-Based Sampling (TBS), a framework that is the first to enable sampling in simulation of multicore processors with virtually no limitation in terms of application type (multiprogrammed or multithreaded), number of cores, homogeneity or heterogeneity of the simulated configuration (4.99 % error averaged across all the evaluated configurations). TBS also is the first to enable integrated power and temperature evaluation in statistically sampled simulation of multicore systems (with 5.5 % and 2.4 % error on average, respectively). We implement an architectural simulator based on TBS, called ESESC, that provides a holistic set of tools for a fair evaluation of different architectures. 1.
Using Cycle Stacks to Understand Scaling Bottlenecks in Multi-Threaded Workloads
"... This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cycle stack quantifies where the cycles have gone, and provides hints towards optimization opportunities. We make the case that this is particularly interesting for analyzing parallel performance: unders ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
(Show Context)
This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cycle stack quantifies where the cycles have gone, and provides hints towards optimization opportunities. We make the case that this is particularly interesting for analyzing parallel performance: understanding how cycle components scale with increasing core counts and/or input data set sizes leads to insight with respect to scaling bottlenecks due to synchronization, load imbalance, poor memory performance, etc. We present several case studies illustrating the use of cycle stacks. As a subsequent step, we further extend the methodology to analyze sets of parallel workloads using statistical data analysis, and perform a workload characterization to understand behavioral differences across benchmark suites. We analyze the SPLASH-2, PARSEC and Rodinia benchmark suites and conclude that the three benchmark suites cover similar areas in the workload space. However, scaling behavior of these benchmarks towards larger input sets and/or higher core counts is highly dependent on the benchmark, the way in which the inputs have been scaled, and on the machine configuration. 1
Sampled Simulation of Multi-Threaded Applications
"... Sampling is a well-known workload reduction technique that allows one to speed up architectural simulation while accurately predicting performance. Previous sampling methods have been shown to accurately predict single-threaded application runtime based on its overall IPC. However, these previous a ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
Sampling is a well-known workload reduction technique that allows one to speed up architectural simulation while accurately predicting performance. Previous sampling methods have been shown to accurately predict single-threaded application runtime based on its overall IPC. However, these previous approaches are unsuitable for general multi-threaded applications, for which IPC is not a good proxy for runtime. Additionally, we find that issues such as application periodicity and inter-thread synchronization play a significant role in determining how best to sample these applications. The proposed multi-threaded application sampling methodology is able to derive an effective sampling strategy for candidate applications using architecture-independent metrics. Using this methodology, large input sets can now be simulated which would otherwise be infeasible, allowing for more accurate conclusions to be made than from studies using scaled-down input sets. Through the use of the proposed methodology, we can simulate less than 10 % of the total application runtime in detail. On the SPEComp, NPB and PARSEC benchmarks, running on an 8-core simulated system, we achieve an average absolute error of 3.5%.
Selecting benchmark combinations for the evaluation of multicore throughput
- In The IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS
, 2013
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et a ̀ la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
An evaluation of high-level mechanistic core models
- ACM Trans. Archit. Code Optim
, 2014
"... Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be ac ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results. In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation to improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these are important for memory subsystem studies, and that simple, naivemodels, like a one-IPC coremodel, can lead tomisleading and incorrect results and conclusions in practical design studies. Validation against real hardware shows good accuracy, with an average single-core error of 11.1 % and a maximum of 18.8 % for the IW-centric model with a 1.5 × slowdown
SynFull: Synthetic traffic models capturing a full range of cache coherent behaviour
- in ISCA
, 2014
"... Modern and future many-core systems represent complex ar-chitectures. The communication fabrics of these large systems heavily influence their performance and power consumption. Current simulation methodologies for evaluating networks-on-chip (NoCs) are not keeping pace with the increased com-plexit ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
(Show Context)
Modern and future many-core systems represent complex ar-chitectures. The communication fabrics of these large systems heavily influence their performance and power consumption. Current simulation methodologies for evaluating networks-on-chip (NoCs) are not keeping pace with the increased com-plexity of our systems; architects often want to explore many different design knobs quickly. Methodologies that capture workload trends with faster simulation times are highly ben-eficial at early stages of architectural exploration. We pro-pose SynFull, a synthetic traffic generation methodology that captures both application and cache coherence behaviour to rapidly evaluate NoCs. SynFull allows designers to quickly indulge in detailed performance simulations without the cost of long-running full-system simulation. By capturing a full range of application and coherence behaviour, architects can avoid the over or underdesign of the network as may occur when using traditional synthetic traffic patterns such as uni-form random. SynFull has errors as low as 0.3 % and provides 50 × speedup on average over full-system simulation. 1.
A Cache Reconfiguration Approach for Saving Leakage and Refresh Energy
- Iowa State University
, 2013
"... ar ..."
(Show Context)
Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware
- IEEE HPEC 2013
"... Abstract—This paper introduces a 3D-stacked logic-in-memory (LiM) system to accelerate the processing of sparse matrix data that is held in a 3D DRAM system. We build a customized content addressable memory (CAM) hardware structure to exploit the inherent sparse data patterns and model the LiM based ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Abstract—This paper introduces a 3D-stacked logic-in-memory (LiM) system to accelerate the processing of sparse matrix data that is held in a 3D DRAM system. We build a customized content addressable memory (CAM) hardware structure to exploit the inherent sparse data patterns and model the LiM based hardware accelerator layers that are stacked in between DRAM dies for the efficient sparse matrix operations. Through silicon vias (TSVs) are used to provide the required high inter-layer bandwidth. Furthermore, we adapt the algorithm and data structure to fully leverage the underlying hardware capabilities, and develop the necessary design framework to facilitate the design space evalu-ation and LiM hardware synthesis. Our simulation demonstrates more than two orders of magnitude of performance and energy efficiency improvements compared with the traditional multi-threaded software implementation on modern processors.
EFetch: Optimizing Instruction Fetch for Event-Driven Web Applications
"... Web 2.0 applications written in JavaScript are increasingly popular as they are easy to use, easy to update and maintain, and portable across a wide variety of computing platforms. Web applications receive frequent input from a rich array of sensors, network, and user input modalities. To handle the ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Web 2.0 applications written in JavaScript are increasingly popular as they are easy to use, easy to update and maintain, and portable across a wide variety of computing platforms. Web applications receive frequent input from a rich array of sensors, network, and user input modalities. To handle the resulting asynchrony due to these inputs, web appli-cations are developed using an event-driven programming model. These event-driven web applications have dramati-cally different characteristics, which provides an opportunity to create a customized processor core to improve the respon-siveness of web applications. In this paper, we take one step towards creating a core customized to event-driven applications. We observe that instruction cache misses of web applications are substan-tially higher than conventional server and desktop workloads due to large working sets caused by distant re-use. To mit-igate this bottleneck, we propose an instruction prefetcher (EFetch) that is tuned to exploit the characteristics of web applications. We find that an event signature, which cap-tures the current event and function calling context, is a good predictor of the control flow inside a function of an event-driven program. It allows us to accurately predict a function’s callees and their function bodies and prefetch them in a timely manner. For a set of real-world web appli-cations, we show that the proposed prefetcher outperforms commonly implemented next-2-line prefetcher by 17%. Also, it consumes 5.2 times less area than a recently proposed prefetcher, while outperforming it.