Optimizing a Parallel Runtime System for Multicore Clusters: A Case Study
In TeraGrid '10, 2010
"... Clusters of multicore nodes have become the most popular option for new HPC systems due to their scalability and performance/cost ratio. The complexity of programming multicore systems under-scores the need for powerful and efficient runtime systems that manage resources such as threads and communic ..."
Cited by 10 (9 self)
Clusters of multicore nodes have become the most popular option for new HPC systems due to their scalability and performance/cost ratio. The complexity of programming multicore systems underscores the need for powerful and efficient runtime systems that manage resources such as threads and communication sub-systems on behalf of the applications. In this paper, we study several multicore performance issues on clusters using Intel, AMD and IBM processors in the context of the CHARM++ runtime system. We then present the optimization techniques that overcome these performance issues. The techniques presented are general enough to apply to other runtime systems as well. We demonstrate the benefits of these optimizations through both synthetic benchmarks and production-quality applications, including NAMD and ChaNGa, on several popular multicore platforms. We demonstrate performance improvements for NAMD and ChaNGa of about 20% and 10%, respectively.
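The abstract does not name the individual optimizations, but one recurring multicore issue it alludes to is how a runtime's worker threads are mapped onto cores. The following is a minimal, hypothetical sketch (not taken from the paper, and not CHARM++ code) of one such technique: pinning each worker thread to its own core with Linux CPU affinity so the scheduler cannot migrate it across caches or NUMA domains. The worker loop and the one-thread-per-core policy are assumptions for illustration.

// Hypothetical sketch: pin each runtime worker thread to a dedicated core.
// Illustrates a common multicore runtime optimization, not the paper's
// implementation. Build with g++ on Linux (pthread_setaffinity_np is a
// glibc extension).
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

static void* worker(void* arg) {
    long id = reinterpret_cast<long>(arg);
    // ... the runtime's work loop (message processing, task execution) would go here ...
    std::printf("worker %ld running on core %d\n", id, sched_getcpu());
    return nullptr;
}

int main() {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    std::vector<pthread_t> threads(ncores);
    for (long i = 0; i < ncores; ++i) {
        pthread_create(&threads[i], nullptr, worker, reinterpret_cast<void*>(i));
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);                                   // bind worker i to core i
        pthread_setaffinity_np(threads[i], sizeof(set), &set);
    }
    for (pthread_t& t : threads) pthread_join(t, nullptr);
    return 0;
}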
Exploring Cross-Layer Power Management for PGAS Applications on the SCC Platform
In Proc. of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12), 2012
"... ABSTRACT High-performance parallel computing architectures are increasingly based on multi-core processors. While current commercially available processors are at 8 and 16 cores, technological and power constraints are limiting the performance growth of the cores and are resulting in architectures ..."
Cited by 2 (1 self)
High-performance parallel computing architectures are increasingly based on multi-core processors. While current commercially available processors have 8 and 16 cores, technological and power constraints are limiting the performance growth of individual cores and are resulting in architectures with much higher core counts, such as the experimental many-core Intel Single-chip Cloud Computer (SCC) platform. These trends present new challenges to HPC applications, including programming complexity and the need for extreme energy efficiency. In this paper, we first investigate the power behavior of scientific Partitioned Global Address Space (PGAS) application kernels on the SCC platform, and explore opportunities and challenges for power management within the PGAS framework. Results obtained via empirical evaluation of Unified Parallel C (UPC) applications on the SCC platform under different constraints show that, for specific operations, the potential for energy savings in PGAS is large, and that power/performance trade-offs can be effectively managed using a cross-layer approach. We investigate cross-layer power management using PGAS language extensions and runtime mechanisms that manipulate power/performance trade-offs. Specifically, we present the design, implementation and evaluation of such a middleware for application-aware cross-layer power management of UPC applications on the SCC platform. Finally, based on our observations, we provide a set of insights that can be used to support similar power management for PGAS applications on other many-core platforms.
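To illustrate the cross-layer idea of application-level annotations driving runtime power decisions, here is a minimal, hypothetical sketch. The names (CommBoundRegion, set_power_state) and the mechanism are assumptions for illustration, not the paper's UPC extensions or the SCC power interface: the application marks a communication-bound region, and a stand-in runtime layer lowers the power state for its duration.

// Hypothetical sketch of a cross-layer power hint; names and mechanism are
// assumptions, not the paper's API. The application annotates a
// communication-bound region; the "runtime" lowers the power state for it.
#include <cstdio>

enum class PowerState { kHigh, kLow };

// Stub: a real middleware would talk to the platform's frequency/voltage
// control interface; here it only logs the requested state.
static void set_power_state(PowerState s) {
    std::printf("runtime: switching to %s power state\n",
                s == PowerState::kLow ? "low" : "high");
}

// RAII guard: marks a region where the application knows it is mostly waiting
// on communication, so the performance loss from a lower frequency is small.
struct CommBoundRegion {
    CommBoundRegion()  { set_power_state(PowerState::kLow);  }
    ~CommBoundRegion() { set_power_state(PowerState::kHigh); }
};

int main() {
    {
        CommBoundRegion hint;   // application-level annotation
        // ... bulk remote gets / barrier-style communication would run here ...
    }
    // ... compute-bound phase continues at full frequency ...
    return 0;
}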
Performance Evaluation of Parallel Computing on Agent-Based Models: The Minority Game Case
"... Abstract: Simulations of agent-based models developed for topics of learning and inductive reasoning in artificial intelligence, social behavior, decision making, etc., are progressively requiring higher power processes while they increase their participation as management and political decisions su ..."
Simulations of agent-based models developed for topics such as learning and inductive reasoning in artificial intelligence, social behavior, and decision making increasingly demand greater computing power as they play a growing role in supporting management and policy decisions. In this work we implement the Minority Game model on HPC platforms in order to analyze the performance of large-scale agent-based simulations. We compare parallel and sequential execution times for several problem instances and compute the corresponding speedup. We use MPI with a master-worker configuration on a cluster of up to 10 worker processors. To improve efficiency, we evaluate performance for several cluster sizes while varying the size of the problem instances, and identify optimal configurations for some simulation instances.
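As a rough illustration of the kind of setup the abstract describes, here is a minimal MPI sketch of a parallel Minority Game round, an assumption about the structure rather than the authors' code: the master (rank 0) publishes the shared history, each rank decides for its block of agents (a random choice stands in for the agents' strategies), and the summed attendance determines the minority side. Timing with MPI_Wtime gives the parallel time used for the speedup comparison.

// Minimal sketch of a master-coordinated Minority Game round in MPI;
// agent strategies are replaced by random choices for brevity.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::srand(rank + 1);

    const int n_agents = 1001;           // odd, so a strict minority always exists
    const int rounds   = 100;
    int history = 0;                     // public information shared by all agents

    double t0 = MPI_Wtime();
    for (int r = 0; r < rounds; ++r) {
        // Master (rank 0) publishes the current history to all workers.
        MPI_Bcast(&history, 1, MPI_INT, 0, MPI_COMM_WORLD);

        // Each rank decides for its block of agents.
        int local_ones = 0;
        for (int a = rank; a < n_agents; a += size)
            local_ones += (std::rand() ^ history) & 1;

        // Master aggregates attendance and records which side was the minority.
        int total_ones = 0;
        MPI_Reduce(&local_ones, &total_ones, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            history = (total_ones < n_agents - total_ones) ? 1 : 0;
    }
    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)
        std::printf("%d ranks: %.3f s for %d rounds (speedup = T_seq / T_par)\n",
                    size, elapsed, rounds);
    MPI_Finalize();
    return 0;
}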
TOD-Tree: Task-Overlapped Direct Send Tree Image Compositing for Hybrid MPI Parallelism
"... Modern supercomputers have very powerful multi-core CPUs. The programming model on these supercomputer is switching from pure MPI to MPI for inter-node communication, and shared memory and threads for intra-node communication. Consequently the bottleneck in most systems is no longer computation but ..."
Modern supercomputers have very powerful multi-core CPUs. The programming model on these supercomputers is switching from pure MPI to MPI for inter-node communication, with shared memory and threads for intra-node communication. Consequently, the bottleneck in most systems is no longer computation but communication between nodes. In this paper, we present a new compositing algorithm for hybrid MPI parallelism that focuses on communication avoidance and on overlapping communication with computation, at the expense of evenly balancing the workload. The algorithm has three stages: a direct send stage in which nodes are arranged in groups and exchange regions of an image, followed by a tree compositing stage and a gather stage. We compare our algorithm with radix-k and binary-swap from the IceT library in a hybrid OpenMP/MPI setting, show strong scaling results, and explain how we generally achieve better performance than these two algorithms.
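The three-stage structure can be sketched with plain MPI collectives, under strong simplifying assumptions: a 1-D "image", MPI_MAX over per-pixel intensities as a commutative stand-in for the order-dependent "over" operator, no task overlap or OpenMP, and a rank count that is a multiple of the group size. This only shows the communication pattern (group direct send, cross-group tree reduction, final gather), not the paper's algorithm.

// Structural sketch of the three stages: direct send within groups,
// tree compositing across groups, then a gather onto rank 0.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int GROUP = 4;                  // ranks per direct-send group (assumption)
    const int WIDTH = 1024;               // 1-D "image" for brevity
    if (size % GROUP != 0 || WIDTH % GROUP != 0) {
        if (rank == 0) std::fprintf(stderr, "run with a multiple of %d ranks\n", GROUP);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    const int region = WIDTH / GROUP;

    // Each rank "renders" a full image; values are arbitrary stand-ins for pixels.
    std::vector<float> image(WIDTH);
    for (int i = 0; i < WIDTH; ++i) image[i] = static_cast<float>((rank * 31 + i) % 97);

    MPI_Comm grp, reg;
    MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP, rank, &grp);  // direct-send group
    MPI_Comm_split(MPI_COMM_WORLD, rank % GROUP, rank, &reg);  // same region across groups

    // Stage 1: direct send within the group; each rank ends up owning one
    // region composited over all members of its group.
    std::vector<float> mine(region), full(region);
    MPI_Reduce_scatter_block(image.data(), mine.data(), region, MPI_FLOAT, MPI_MAX, grp);

    // Stage 2: tree compositing across groups; owners of the same region
    // combine, with the result landing in group 0.
    MPI_Reduce(mine.data(), full.data(), region, MPI_FLOAT, MPI_MAX, 0, reg);

    // Stage 3: gather the final regions from group 0 onto world rank 0.
    std::vector<float> final_image(rank == 0 ? WIDTH : 0);
    if (rank / GROUP == 0)
        MPI_Gather(full.data(), region, MPI_FLOAT,
                   final_image.data(), region, MPI_FLOAT, 0, grp);

    if (rank == 0) std::printf("composited %d pixels on %d ranks\n", WIDTH, size);
    MPI_Comm_free(&grp);
    MPI_Comm_free(&reg);
    MPI_Finalize();
    return 0;
}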