Accelerating multiprocessor simulation with a memory timestamp record
- In ISPASS-2005
, 2005
Cited by 14 (4 self)
We introduce a fast and accurate technique for initializing the directory and cache state of a multiprocessor system based on a novel software structure called the memory timestamp record (MTR). The MTR is a versatile, compressed snapshot of memory reference patterns which can be rapidly updated during fast-forwarded simulation, or stored as part of a checkpoint. We evaluate MTR using a full-system simulation of a directory-based cache-coherent multiprocessor running a range of multithreaded workloads. Both MTR and a multiprocessor version of functional fast-forwarding (FFW) make similar performance estimates, usually within 15% of our detailed model. In addition to other benefits, we show that MTR has up to a 1.45× speedup over FFW, and a 7.7× speedup over our detailed baseline.
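The idea of warming cache state from a record of reference timestamps can be illustrated with a small sketch. The abstract does not describe the record's layout, so everything below is an assumption made for illustration: the class name `TimestampRecord`, its fields, the LRU replacement policy, and the rule that a later write by another CPU invalidates a block. It is not the authors' data structure.

```python
from collections import defaultdict


class TimestampRecord:
    """Hypothetical per-block record of reference timestamps (illustrative only)."""

    def __init__(self):
        self.last_access = defaultdict(dict)  # cpu -> {block: time of last access}
        self.last_writer = {}                 # block -> (cpu, time of last write)

    def record(self, time, cpu, block, is_write):
        # Cheap update performed for every reference during fast-forwarding.
        self.last_access[cpu][block] = time
        if is_write:
            self.last_writer[block] = (cpu, time)

    def warm_cache(self, cpu, num_sets, ways):
        """Rebuild an LRU set-associative cache for one CPU.

        Returns {set_index: [blocks, most recently used first]}.
        """
        sets = defaultdict(list)
        for block, t in self.last_access[cpu].items():
            writer = self.last_writer.get(block)
            # Assumed invalidation rule: a later write by another CPU evicts the block.
            if writer and writer[0] != cpu and writer[1] > t:
                continue
            sets[block % num_sets].append((t, block))
        return {
            s: [b for _, b in sorted(entries, reverse=True)[:ways]]
            for s, entries in sets.items()
        }
```

The point of the sketch is that the per-reference update is a constant-time write, while the more expensive reconstruction runs only when detailed simulation is about to start.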
Verifying Causality Between Distant Performance Phenomena in Large-Scale MPI Applications
- In Proc. of the 17th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP
, 2009
Cited by 8 (4 self)
In message-passing applications, the temporal or spatial distance between cause and symptom of a performance problem constitutes a major difficulty in deriving helpful conclusions from performance data. Just knowing the locations of wait states in the program is often insufficient to understand the reason for their occurrence. We present a method for verifying hypotheses on causality between temporally or spatially distant performance phenomena in message-passing applications without altering the application itself. The verification is accomplished by modifying MPI event traces and using them to simulate the hypothetical message-passing behavior. By performing a parallel real-time reenactment of the communication to be simulated using the original execution configuration, we can achieve high scalability and good predictive accuracy in relation to the measured behavior. Not relying on a potentially complex model of the message-passing subsystem, our method is also platform independent.
Scaling applications to massively parallel machines using projections performance analysis tool
- In Future Generation Computer Systems Special Issue on: Large-Scale System Performance Modeling and Analysis
, 2005
Cited by 7 (4 self)
Some of the most challenging applications to parallelize scalably are the ones that present a relatively small amount of computation per iteration. Multiple interacting performance challenges must be identified and solved to attain high parallel efficiency in such cases. We present case studies involving NAMD, a parallel classic molecular dynamics application for large biomolecular systems, and CPAIMD, a Car-Parrinello ab initio molecular dynamics application, and efforts to scale them to large numbers of processors. Both applications are implemented in Charm++, and the performance analysis was carried out using Projections, the performance visualization/analysis tool associated with Charm++. We will showcase a series of optimizations facilitated by Projections. The resultant performance of NAMD led to a Gordon Bell award at SC2002 with unprecedented speedup on 3,000 processors with teraflops level peak performance. We also explore the techniques for applying the performance visualization/analysis tool on future generation extreme-scale parallel machines and discuss the scalability issues with Projections.
Performance modeling and programming environments for petaflops computers and the blue gene machine
- in Computer Science in
, 2004
Cited by 5 (5 self)
We present a performance modeling and programming environment for petaflops computers and the Blue Gene machine. It consists of a parallel simulator, BigSim, for predicting performance of machines with a very large number of processors, and BigNetSim, an ongoing effort to incorporate a pluggable module of a detailed contention-based network model. It provides the ability to make performance predictions for machines such as BlueGene/L. We also explore the programming environments for several planned applications on the machines including Finite Element Method (FEM) simulation.
Behavioral Simulations in MapReduce
Cited by 3 (2 self)
In many scientific domains, researchers are turning to large-scale behavioral simulations to better understand real-world phenomena. While there has been a great deal of work on simulation tools from the high-performance computing community, behavioral simulations remain challenging to program and automatically scale in parallel environments. In this paper we present BRACE (Big Red Agent-based Computation Engine), which extends the MapReduce framework to process these simulations efficiently across a cluster. We can leverage spatial locality to treat behavioral simulations as iterated spatial joins and greatly reduce the communication between nodes. In our experiments we achieve nearly linear scale-up on several realistic simulations. Though processing behavioral simulations in parallel as iterated spatial joins can be very efficient, it can be much simpler for the domain scientists to program the behavior of a single agent. Furthermore, many simulations include a considerable amount of complex computation and message passing between agents, which makes it important to optimize the performance of a single node and the communication across nodes. To address both of these challenges, BRACE includes a high-level language called BRASIL (the Big Red Agent SImulation Language). BRASIL has object-oriented features for programming simulations, but can be compiled to a dataflow representation for automatic parallelization and optimization. We show that by using various optimization techniques, we can achieve both scalability and single-node performance similar to that of a hand-coded simulation.
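The phrase "iterated spatial joins" can be made concrete with a single-node sketch: each tick joins every agent with the agents inside its visibility radius, then updates all agents from the old state. This is not BRACE or BRASIL code and it does not use MapReduce; the function names, the uniform grid, and the `update` callback are assumptions chosen for illustration.

```python
from collections import defaultdict
from math import floor, hypot


def neighbors_within(agents, radius):
    """Spatial self-join: agents is {id: (x, y)}; returns {id: sorted ids within radius}."""
    # Bucket agents into square cells of side `radius`, so any neighbor within
    # `radius` lies in the same cell or one of the 8 adjacent cells.
    grid = defaultdict(list)
    for aid, (x, y) in agents.items():
        grid[(floor(x / radius), floor(y / radius))].append(aid)

    result = {aid: [] for aid in agents}
    for aid, (x, y) in agents.items():
        cx, cy = floor(x / radius), floor(y / radius)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for other in grid.get((cx + dx, cy + dy), ()):
                    if other == aid:
                        continue
                    ox, oy = agents[other]
                    if hypot(ox - x, oy - y) <= radius:
                        result[aid].append(other)
        result[aid].sort()
    return result


def step(agents, radius, update):
    """One simulation tick: join, then update every agent from the old state."""
    nbrs = neighbors_within(agents, radius)
    return {
        aid: update(agents[aid], [agents[n] for n in nbrs[aid]])
        for aid in agents
    }
```

Because an agent only reads state from nearby cells, a distributed version would need to exchange only the agents near partition boundaries, which is the locality argument the abstract makes.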
Fast Functional Simulation with Parallel Embra
Cited by 1 (0 self)
A shift towards chip multiprocessor (CMP) designs has rekindled interest in full-system simulation of multiprocessors. Recent work has attempted to address the problem of linear (or worse) slowdown incurred during sequential simulation of a multiprocessor. Methods such as exploitation of the speed/detail trade-off, division (decimation) in simulation time, and statistical simulation share the common requirement of a fast functional simulator, which executes a workload at high speed with functional correctness but relaxed requirements for detailed microarchitectural or timing simulation. Although functional simulators execute at increased speed relative to detailed simulators, they also incur significant slowdown when simulating multiprocessor workloads, a problem which motivates this work. This paper introduces Parallel Embra, a fast functional simulator for shared-memory multiprocessors which is part of the Parallel SimOS complete machine simulator. Parallel Embra takes an aggressive approach to parallel simulation; while it runs at user level and does not make use of the MMU hardware, it combines binary translation with loose timing constraints and relies on the underlying shared memory system for event ordering, time synchronization, and memory synchronization. Although this approach results in non-deterministic execution, it does not compromise functional correctness. Workload tests using the SPLASH-2 shared-memory applications show that Parallel Embra’s approach to fast functional simulation scales up to 64-way parallel simulation without appreciable overhead compared to sequential simulation, and supports complete machine simulation of up to 1024-processor systems with practical performance.
LogGOPSim – Simulating Large-Scale Applications in the LogGOPS Model
Cited by 1 (0 self)
We introduce LogGOPSim, a fast simulation framework for parallel algorithms at large scale. LogGOPSim utilizes a slightly extended version of the well-known LogGPS model in combination with full MPI message matching semantics and detailed simulation of collective operations. In addition, it enables simulation in the traditional LogP, LogGP, and LogGPS models. Its simple and fast single-queue design computes more than 1 million events per second on a single processor and enables large-scale simulations of more than 8 million processes. LogGOPSim also supports the simulation of full MPI applications by reading and simulating MPI profiling traces. We analyze the accuracy and the performance of the simulation and propose a simple extrapolation scheme for parallel applications. Our scheme extrapolates collective operations with high accuracy by rebuilding the communication pattern. Point-to-point operation patterns can be copied in the extrapolation and thus retain the main characteristics of scalable parallel applications.
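For readers unfamiliar with the model family, the sketch below shows the standard LogGP cost of one point-to-point message, o + (k-1)G + L + o, and a toy loop that drains a single priority queue of completion events. It is not LogGOPSim's code: the function names, the event tuple layout, and the simplifications (no gap g, no contention, no message matching) are assumptions made for illustration.

```python
import heapq


def loggp_message_time(k_bytes, L, o, G):
    """End-to-end time of one k-byte message under LogGP.

    o: per-message CPU overhead (paid once by the sender, once by the receiver)
    G: gap per byte for long messages
    L: network latency
    """
    return o + (k_bytes - 1) * G + L + o


def simulate(sends, L, o, G):
    """Toy single-queue simulation (illustrative only).

    sends: list of (start_time, src, dst, k_bytes).
    Returns {dst: time at which its last message has been fully received}.
    """
    queue = [
        (t + loggp_message_time(k, L, o, G), dst)
        for t, _src, dst, k in sends
    ]
    heapq.heapify(queue)

    finished = {}
    while queue:
        # Events leave the single queue in global time order.
        t, dst = heapq.heappop(queue)
        finished[dst] = max(finished.get(dst, 0.0), t)
    return finished
```

Keeping all events in one time-ordered queue avoids any synchronization between simulated processes, at the cost of running the simulation itself sequentially.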
COREMU: A Scalable and Portable Parallel Full-system Emulator
, 2010
NOTES: This report has been submitted for early dissemination of its contents. It will thus be subject to change without prior notice. It will also probably be copyrighted if accepted for publication in a refereed conference or journal. Parallel Processing Institute makes no guarantee on the consequences of using the viewpoints and results in the technical report. It
NIH Resource for Macromolecular Modeling and Bioinformatics Theoretical Biophysics Group
, 2008
is an expert in theoretical and computational biophysics directing for 18 years the NIH Resource for Macromolecular Modeling and Bioinformatics. The Resource’s molecular analysis and dynamics programs VMD and NAMD are used by over 100,000 registered users and are considered among the fastest in the field [13]. His research in photobiology, published in Nature, Science, PNAS,
Full System Simulation of Many-Core Heterogeneous SoCs using GPU and QEMU Semihosting
Modern system-on-chips are evolving towards complex and heterogeneous platforms with general purpose processors coupled with massively parallel manycore accelerator fabrics (e.g. embedded GPUs). Platform developers are looking for efficient full-system simulators capable of simulating complex applications, middleware and operating systems on these heterogeneous targets. Unfortunately, current virtual platforms are not able to tackle the complexity and heterogeneity of state-of-the-art SoCs. Software emulators, such as the open-source QEMU project, cope quite well in terms of simulation speed and functional accuracy with homogeneous coarse-grained multi-cores. The main contribution of this paper is the introduction of a novel virtual prototyping technique which exploits the heterogeneous accelerators available in commodity PCs to tackle the heterogeneity challenge in full-SoC system simulation. In a nutshell, our approach makes it possible to partition simulation between the host CPU and GPU. More specifically, QEMU runs on the host CPU and the simulation of manycore accelerators is offloaded, through semi-hosting, to the host GPU. Our experimental results confirm the flexibility and efficiency of our enhanced QEMU environment.

