
Performance Analysis of a Multithreaded PDES Simulator on Multicore Clusters

by Jingjing Wang, Dmitry Ponomarev, Nael Abu-Ghazaleh
Results 1 - 2 of 2

"... pdes scale in environments with heterogeneous delays?" in Proc. of the 2013 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2013. Transactions on Parallel and Distributed Systems.

by Jingjing Wang, Ketan Bahulkar, Dmitry Ponomarev, Nael Abu-Ghazaleh
Abstract - Cited by 3 (3 self)
The performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by communication latencies and overheads. The emergence of multi-core processors and their expected evolution into many-cores offers the promise of low-latency communication and tight memory integration between cores; these properties should significantly improve the performance of PDES in such environments. However, on clusters of multi-cores (CMs), the latency and processing overheads incurred when communicating between different machines (nodes) far outweigh those between cores on the same chip, especially when commodity networking fabrics and communication software are used. It is unclear whether the low latency among cores on the same node brings any benefit, given that communication links across nodes are significantly worse. In this study, we examine the performance of a multi-threaded implementation of PDES on CMs. We demonstrate that inter-node communication costs impose a substantial bottleneck on PDES and that, without optimizations addressing these long latencies, multi-threaded PDES does not significantly outperform the multi-process version despite direct communication through shared memory on the individual nodes. We then propose three optimizations: message consolidation and routing, infrequent polling, and latency-sensitive model partitioning. We show that with these optimizations in place, the threaded implementation of PDES significantly outperforms the process-based implementation even on ...
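
The abstract names message consolidation as one of its optimizations but this listing does not show how it is implemented. The following is a minimal sketch of the general idea only, assuming that events bound for the same remote node are buffered and shipped as one batch so that the per-message network overhead is paid less often. The class name `ConsolidatingSender`, the `send_fn` callback, and the `max_batch` threshold are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class ConsolidatingSender:
    """Buffers outgoing events per destination node and sends them in batches.

    Illustrative only: send_fn(node, events) is a hypothetical transport
    callback standing in for a real inter-node send.
    """

    def __init__(self, send_fn, max_batch=4):
        self.send_fn = send_fn
        self.max_batch = max_batch
        self.buffers = defaultdict(list)  # destination node -> pending events

    def post(self, dest_node, event):
        """Queue one event; ship the batch once it reaches max_batch."""
        buf = self.buffers[dest_node]
        buf.append(event)
        if len(buf) >= self.max_batch:
            self.flush(dest_node)

    def flush(self, dest_node):
        """Send whatever is pending for one destination as a single message."""
        buf = self.buffers.pop(dest_node, [])
        if buf:
            self.send_fn(dest_node, buf)

    def flush_all(self):
        """Send all pending batches, e.g. before a synchronization point."""
        for node in list(self.buffers):
            self.flush(node)


# Usage: seven events to node 1 become three network messages instead of seven.
sent = []
sender = ConsolidatingSender(lambda node, batch: sent.append((node, batch)), max_batch=3)
for ts in range(7):
    sender.post(dest_node=1, event=("event", ts))
sender.flush_all()
```

A larger `max_batch` cuts the number of sends further but delays delivery of buffered events, which is the trade-off any such scheme would have to tune.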

Citation Context

... message copying, forming, and other MPI overheads. ROSS-MT is limited to a single node. To reach higher scales, an extended version of ROSS-MT, called ROSS-CMT, was recently developed to support CMs [33]. In ROSS-CMT, in order to avoid the overhead when multiple threads invoke MPI functions simultaneously, only one communication thread on each node performs communication across the network. The commu...
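
The single-communication-thread arrangement described in this citation context can be sketched as a producer/consumer pattern. This is a minimal illustration and not ROSS-CMT code: worker threads only enqueue outgoing messages, and one dedicated thread is the only caller of the network layer. The names `outbox`, `network_send`, and `STOP` are hypothetical, and `network_send` is a placeholder for where a real MPI send would go.

```python
import queue
import threading

outbox = queue.Queue()   # thread-safe queue shared by all worker threads on a node
delivered = []           # stands in for the network; only the comm thread touches it
STOP = object()          # sentinel telling the communication thread to exit


def network_send(msg):
    # Placeholder for the single place where a real inter-node send would occur.
    delivered.append(msg)


def comm_thread():
    # The only thread that performs "network" communication.
    while True:
        msg = outbox.get()
        if msg is STOP:
            break
        network_send(msg)


def worker(worker_id, n_events):
    # Simulation threads never call the network layer; they only enqueue.
    for i in range(n_events):
        outbox.put((worker_id, i))


comm = threading.Thread(target=comm_thread)
comm.start()
workers = [threading.Thread(target=worker, args=(w, 5)) for w in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
outbox.put(STOP)  # enqueued after all workers finish, so it is processed last
comm.join()
```

Because the queue serializes access, the workers need no locking around the send path, at the cost of one extra hand-off per message.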

Parallel Discrete Event Simulation for Multi-core Systems: Analysis and Optimization

by Jingjing Wang, Deepak Jagtap, Nael Abu-Ghazaleh, Dmitry Ponomarev
Abstract - Cited by 2 (2 self)
Parallel Discrete Event Simulation (PDES) can substantially improve the performance and capacity of simulation, allowing the study of larger, more detailed models in less time. PDES is a fine-grained parallel application whose performance and scalability are limited by communication latencies. Traditionally, PDES simulation kernels use message passing; often these simulators are written for distributed environments, and shared memory is used to optimize message passing among processes on the same machine. In this paper, we develop, characterize, and optimize a thread-based version of a PDES simulator on three representative multi-core platforms. The multi-threaded implementation eliminates multiple message copying and significantly reduces synchronization delays. We study the performance of the simulator on three hardware platforms: an Intel Core i7 machine, a 48-core AMD Opteron Magny-Cours system, and a 64-core Tilera TilePro64. We discover that the three platforms encounter substantially different bottlenecks because of their different architectures. We identify these bottlenecks and propose mechanisms to overcome them. Our results show that the multi-threaded implementation improves performance over an MPI-based version by up to a factor of 3 on the Core i7, 1.4 on the AMD Magny-Cours, and 2.8 on the Tilera TilePro64.

Citation Context

...ores on the same machine (intra-node communication). To study the impact of such heterogeneous delays on CMs, we extended ROSS-MT to support CMs. We call this version CM Multi-threaded ROSS (ROSS-CMT) [86]. We first show that the inter-node communication seriously impacts the performance of ROSS-CMT [84]. We then propose several optimizations to reduce the cost of inter-node communication. The first te...
