Results 1 - 10 of 19
Entering the Petaflop Era: The Architecture and Performance of Roadrunner
"... precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a de ..."
Abstract
-
Cited by 75 (8 self)
- Add to MetaCart
(Show Context)
precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed architectural description of Roadrunner and a detailed performance analysis of the system. A case study of optimizing the MPI-based application Sweep3D to exploit Roadrunner’s hybrid architecture is also included. The performance of Sweep3D is compared to that of the code on a previous implementation of the Cell Broadband Engine architecture (the Cell BE) and on multicore processors. Using validated performance models combined with Roadrunner-specific microbenchmarks, we identify performance issues in the early pre-delivery system and infer how well the final Roadrunner configuration will perform once the system software stack has matured.
Keywords: Petascale computing, heterogeneous, accelerators, performance analysis, Roadrunner.
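The abstract mentions Roadrunner-specific microbenchmarks feeding validated performance models. As a rough illustration only, and not the authors' benchmark suite, the mpi4py sketch below measures point-to-point latency and bandwidth between two ranks, the kind of measurement such models are typically calibrated against; the message sizes and iteration count are arbitrary assumptions.

```python
# Hypothetical ping-pong microbenchmark sketch (not the Roadrunner suite).
# Run with e.g.: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for size in (8, 1024, 1 << 20):          # message sizes in bytes (assumed values)
    buf = np.zeros(size, dtype='b')
    iters = 100
    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    t = (time.perf_counter() - t0) / iters / 2.0   # one-way time per message
    if rank == 0:
        print(f"{size:>8} B  latency {t*1e6:8.2f} us  bandwidth {size/t/1e6:8.2f} MB/s")
```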
High-Performance Multi-Rail Support with the NewMadeleine Communication Library
- In: The Sixteenth International Heterogeneity in Computing Workshop (HCW 2007), held in conjunction with IPDPS, 2007
"... This paper focuses on message transfers across multiple heterogeneous high-performance networks in the NEW-MADELEINE Communication Library. NEWMADELEINE features a modular design that allows the user to easily implement load-balancing strategies efficiently exploiting the underlying network but with ..."
Abstract
-
Cited by 13 (4 self)
- Add to MetaCart
This paper focuses on message transfers across multiple heterogeneous high-performance networks in the NEWMADELEINE communication library. NEWMADELEINE features a modular design that allows the user to easily implement load-balancing strategies that efficiently exploit the underlying network without being aware of the low-level interface. Several strategies are studied and preliminary results are given. They show that the performance of network transfers can be improved by carefully designed strategies that take NIC activity into account.
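As a purely illustrative sketch, and not the NewMadeleine API, the Python function below splits a message across multiple rails in proportion to the bandwidth each NIC has recently delivered, the same idea as a load-balancing strategy that takes NIC activity into account; the rail names and rate figures are invented for the example.

```python
# Hypothetical multi-rail split: chunk sizes proportional to each NIC's
# recently observed bandwidth (illustrative only, not NewMadeleine code).
def split_message(length, observed_bw):
    """observed_bw: dict mapping rail name -> recent bandwidth estimate (MB/s)."""
    total = sum(observed_bw.values())
    chunks, offset = [], 0
    rails = list(observed_bw.items())
    for i, (rail, bw) in enumerate(rails):
        # The last rail takes the remainder so the chunks cover the whole message.
        size = length - offset if i == len(rails) - 1 else int(length * bw / total)
        chunks.append((rail, offset, size))
        offset += size
    return chunks

# Example: a 10 MB message over a fast rail and a busy rail (made-up rates).
print(split_message(10 * 1024 * 1024, {"mx0": 900.0, "ib0": 300.0}))
```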
Efficient Large-Scale Model Checking
2009
"... Model checking is a popular technique to systematically and automatically verify system properties. Unfortunately, the well-known state explosion problem often limits the extent to which it can be applied to realistic specifications, due to the huge resulting memory requirements. Distributedmemory m ..."
Abstract
-
Cited by 12 (6 self)
- Add to MetaCart
(Show Context)
Model checking is a popular technique to systematically and automatically verify system properties. Unfortunately, the well-known state explosion problem often limits the extent to which it can be applied to realistic specifications, due to the huge resulting memory requirements. Distributed-memory model checkers exist, but have thus far only been evaluated on small-scale clusters, with mixed results. We examine one well-known distributed model checker in detail, and show how a number of additional optimizations in its runtime system enable it to efficiently check very demanding problem instances on a large-scale, multi-core compute cluster. We analyze the impact of the distributed algorithms employed, the problem instance characteristics and network overhead. Finally, we show that the model checker can even obtain good performance in a high-bandwidth computational grid environment.
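To make the distribution idea concrete, here is a minimal sketch, not taken from the model checker discussed in the paper, of the usual scheme in distributed-memory model checking: each newly generated state is hashed to an owning partition, which keeps the visited set for that partition and explores successors locally. The state representation and transition relation below are placeholders, and the round-robin loop only simulates the parallel workers.

```python
# Conceptual sketch of hash-partitioned state exploration (sequential
# simulation of the distributed scheme; illustrative only).
from collections import deque

NUM_WORKERS = 4                       # stands in for compute nodes

def owner(state):
    # Each state is owned by exactly one worker, chosen by hashing.
    return hash(state) % NUM_WORKERS

def successors(state):
    # Placeholder transition relation: a tiny counter system with two actions.
    x, y = state
    return [s for s in ((x + 1, y), (x, y + 1)) if s[0] <= 3 and s[1] <= 3]

visited = [set() for _ in range(NUM_WORKERS)]   # one visited set per worker
queues = [deque() for _ in range(NUM_WORKERS)]  # one work queue per worker

init = (0, 0)
queues[owner(init)].append(init)

while any(queues):
    for w in range(NUM_WORKERS):                # round-robin stands in for parallelism
        while queues[w]:
            state = queues[w].popleft()
            if state in visited[w]:
                continue
            visited[w].add(state)
            for succ in successors(state):
                # Cross-owner successors would be sent over the network.
                queues[owner(succ)].append(succ)

print("states explored:", sum(len(v) for v in visited))
```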
The PlayStation 3 for High Performance Scientific Computing
"... The heart of the Sony PlayStation 3, the STI CELL processor, was not originally intended for scientific number crunching, and the PlayStation 3 itself was not meant primarily to serve such purposes. Yet, both these items may impact the High Performance Computing world. This introductory article take ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
The heart of the Sony PlayStation 3, the STI CELL processor, was not originally intended for scientific number crunching, and the PlayStation 3 itself was not meant primarily to serve such purposes. Yet, both these items may impact the High Performance Computing world. This introductory article takes a closer look at the cause of this potential disturbance.
Programming heterogeneous clusters with accelerators using object-based programming
- Scientific Programming, 2011
"... Abstract. Heterogeneous clusters that include accelerators have become more common in the realm of high performance computing because of the high GFlop/s rates such clusters are capable of achieving. However, heterogeneous clusters are typically considered hard to program as they usually require pr ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Heterogeneous clusters that include accelerators have become more common in the realm of high performance computing because of the high GFlop/s rates such clusters are capable of achieving. However, heterogeneous clusters are typically considered hard to program, as they usually require programmers to interleave architecture-specific code within application code. We have extended the Charm++ programming model and runtime system to support heterogeneous clusters (with host cores that differ in their architecture) that include accelerators. We are currently focusing on clusters that include commodity processors, Cell processors, and Larrabee devices. When our extensions are used to develop code, the resulting code is portable between various homogeneous and heterogeneous clusters that may or may not include accelerators. Using a simple example molecular dynamics (MD) code, we demonstrate our programming model extensions and runtime system modifications on a heterogeneous cluster composed of Xeon and Cell processors. Even though there is no architecture-specific code in the example MD program, it is able to successfully make use of three core types, each with a different ISA (Xeon, PPE, SPE), three SIMD instruction extensions (SSE, AltiVec/VMX, and the SPE's SIMD instructions), and two memory models (cache hierarchies and scratchpad memories) in a single execution. Our programming model extensions abstract away hardware complexities, while our runtime system modifications automatically adjust application data to account for architectural differences between the various cores.
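The Python fragment below is only a conceptual analogy for the object-based abstraction described above, not Charm++ code: application objects expose one architecture-neutral entry point, and a tiny registry standing in for the runtime picks whichever kernel implementation matches the core type it runs on. The core-type names and kernels are invented for the example.

```python
# Conceptual analogy (not Charm++): one architecture-neutral entry point,
# with a "runtime" registry choosing a core-specific kernel implementation.
KERNELS = {}

def kernel(core_type):
    """Register a compute-kernel implementation for a given core type."""
    def register(fn):
        KERNELS[core_type] = fn
        return fn
    return register

@kernel("generic")
def compute_forces_generic(positions):
    # Portable fallback used when no specialized kernel is available.
    return [(-x, -y, -z) for (x, y, z) in positions]

@kernel("spe")
def compute_forces_spe(positions):
    # Stand-in for a scratchpad/SIMD-tuned variant; same result, different path.
    return [tuple(-c for c in p) for p in positions]

class Patch:
    """Application object: calls the entry method without naming an architecture."""
    def __init__(self, positions):
        self.positions = positions

    def compute_forces(self, core_type="generic"):
        impl = KERNELS.get(core_type, KERNELS["generic"])
        return impl(self.positions)

patch = Patch([(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)])
print(patch.compute_forces(core_type="spe"))
```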
Workflow for Performance Evaluation and Tuning
- IEEE Cluster, 2008
"... Abstract — We report our experiences with using highthroughput techniques to run large sets of performance experiments on collections of grid accessible parallel computer systems for the purpose of deploying optimally compiled and configured scientific applications. In these environments, the set of ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
We report our experiences with using high-throughput techniques to run large sets of performance experiments on collections of grid-accessible parallel computer systems for the purpose of deploying optimally compiled and configured scientific applications. In these environments, the set of variable parameters (compiler, link, and runtime flags; application and library options; partition size) can be very large, so running the performance ensembles is labor intensive, tedious, and prone to errors. Automating this process improves productivity, reduces barriers to deploying and maintaining multi-platform codes, and facilitates the tracking of application and system performance over time. We describe the design and implementation of our system for running performance ensembles, and we use two case studies as the basis for evaluating the long-term potential of this approach. The architecture of a prototype benchmarking system is presented along with results on the efficacy of the workflow approach.
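As an illustration only, and not the authors' system, the sketch below enumerates the cartesian product of a few build-time and run-time parameters and would launch one experiment per configuration; the flag values, build command, and run command are placeholder assumptions.

```python
# Hypothetical ensemble driver: one experiment per parameter combination.
# The commands below are placeholders, not the system described in the paper.
import itertools
import subprocess  # used only by the commented-out build/run steps below

parameters = {
    "opt_flag": ["-O2", "-O3"],          # compiler flags (assumed)
    "blas":     ["refblas", "atlas"],    # library options (assumed)
    "ranks":    [16, 32, 64],            # partition sizes (assumed)
}

names = list(parameters)
for values in itertools.product(*(parameters[n] for n in names)):
    config = dict(zip(names, values))
    tag = "_".join(str(v).lstrip("-") for v in values)
    print("running configuration:", config)
    # Build and run steps would go here, for example:
    # subprocess.run(["make", f"CFLAGS={config['opt_flag']}", f"BLAS={config['blas']}"], check=True)
    # subprocess.run(["mpirun", "-np", str(config["ranks"]), "./app", "-o", f"result_{tag}.out"], check=True)
```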
Efficient message passing on multiclusters: An IPv6 extension to Open MPI
- In: Proceedings of KiCC’07, Chemnitzer Informatik Berichte, 2007
"... At our university, different institutes have installed their own cluster computers. Connecting several of these clusters to perform distributed high-performance computing requires message passing spanning heterogeneous network structures. One problem is that private IPv4 addresses inside clusters, a ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
At our university, different institutes have installed their own cluster computers. Connecting several of these clusters to perform distributed high-performance computing requires message passing spanning heterogeneous network structures. One problem is that private IPv4 addresses inside clusters, although common and suitable for internal communication, preclude end-to-end connectivity. To establish multi-cluster message passing in such a context, we propose to use MPI over IPv6. In this article, we present our IPv6 extension to Open MPI, which is able to cope with mixed IPv4/IPv6 environments and delivers high performance levels.
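To illustrate the dual-stack issue at the socket level (this is plain Python, not the Open MPI code the paper extends), the sketch below resolves a host with getaddrinfo and tries IPv6 addresses before IPv4 ones, which is the basic behavior a mixed IPv4/IPv6 message-passing layer needs; the host name and port are placeholders.

```python
# Dual-stack connection helper (illustrative only; not Open MPI internals).
import socket

def connect_prefer_ipv6(host, port, timeout=5.0):
    """Resolve host and try IPv6 candidates first, falling back to IPv4."""
    candidates = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    # Sort so AF_INET6 entries come before AF_INET ones.
    candidates.sort(key=lambda ai: 0 if ai[0] == socket.AF_INET6 else 1)
    last_error = None
    for family, socktype, proto, _canon, sockaddr in candidates:
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            last_error = exc
    raise last_error or OSError(f"no usable address for {host}")

# Placeholder endpoint; a real multi-cluster setup would use a head-node address.
# sock = connect_prefer_ipv6("head-node.example.org", 10000)
```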
A software system for scalable parameter estimation on clusters
"... Advancements in data collection and high performance computing are making sophisticated model calibration possible throughout the modeling and simulation com-munity. The model calibration process, in which the appropriate input values are estimated for unknown parameters, is typically a computationa ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Advancements in data collection and high performance computing are making sophisticated model calibration possible throughout the modeling and simulation community. The model calibration process, in which the appropriate input values are estimated for unknown parameters, is typically a computationally intensive task and necessitates the use of distributed software components. These components are often heterogeneous due to the combination of the model and the optimization software, making scalability difficult to achieve. We have developed a hybrid software system for parameter estimation consisting of an optimization algorithm implemented in a mathematical scripting language, a legacy Fortran model, and an MPI client program. Through a series of optimizations, we achieved near-linear speedup when the model is executed as a standalone process, and achieved super-linear speedup when the model is executed as a subroutine. We report on our optimization techniques and performance results of an estimation problem within the context of an ongoing modeling study.
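A common structure for such a system is a master rank that hands candidate parameter sets to worker ranks, each of which runs the model and returns a fitness value. The mpi4py sketch below shows only that dispatch pattern under assumed names (evaluate_model stands in for the legacy model); it is not the authors' software.

```python
# Master-worker dispatch of parameter sets (illustrative sketch, not the
# authors' system). Run with at least two ranks, e.g.: mpirun -np 4 python estimate.py
from mpi4py import MPI

def evaluate_model(params):
    # Placeholder for invoking the legacy model (as a subroutine or subprocess).
    return sum((p - 1.0) ** 2 for p in params)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    candidates = [[0.5 * i, 0.25 * i] for i in range(8)]   # assumed parameter sets
    # Hand out parameter sets round-robin (small pickled messages assumed),
    # then collect one result per set.
    for i, params in enumerate(candidates):
        comm.send((i, params), dest=1 + i % (size - 1), tag=1)
    results = [None] * len(candidates)
    for _ in candidates:
        i, fitness = comm.recv(source=MPI.ANY_SOURCE, tag=2)
        results[i] = fitness
    for dest in range(1, size):
        comm.send(None, dest=dest, tag=1)                  # shutdown signal
    best = min(range(len(candidates)), key=lambda i: results[i])
    print("best parameters:", candidates[best], "fitness:", results[best])
else:
    while True:
        task = comm.recv(source=0, tag=1)
        if task is None:
            break
        i, params = task
        comm.send((i, evaluate_model(params)), dest=0, tag=2)
```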
Calibration of a crop model to irrigated water use using a genetic algorithm
- www.hydrol-earth-syst-sci.net/13/1467/2009/, 2009
Abstract
Calibration of a crop model to irrigated water use using a genetic algorithm
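Since the entry gives only the title, the following is a generic sketch of calibrating model parameters with a genetic algorithm, not the method, crop model, or data from the paper: candidate parameter vectors are scored against observed water use, and the best candidates are recombined and mutated. The toy model, observations, and GA settings are all invented.

```python
# Generic GA calibration sketch (not the paper's crop model or data).
import random

observed = [3.2, 4.1, 5.0, 5.8]                   # made-up seasonal water-use values

def model(params, t):
    a, b = params
    return a + b * t                              # toy stand-in for a crop model

def fitness(params):
    # Sum of squared errors between simulated and observed water use.
    return sum((model(params, t) - obs) ** 2 for t, obs in enumerate(observed))

def evolve(pop_size=30, generations=50):
    population = [(random.uniform(0, 10), random.uniform(0, 2)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]     # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            # Average the parents, then add small Gaussian mutation.
            child = tuple((x + y) / 2 + random.gauss(0, 0.1) for x, y in zip(p1, p2))
            children.append(child)
        population = parents + children
    return min(population, key=fitness)

best = evolve()
print("calibrated parameters:", best, "error:", fitness(best))
```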
NewMadeleine: An Efficient Support for High-Performance Networks in
- In: International Parallel and Distributed Processing Symposium, Rome, Italy, 2009 (author manuscript)
"... for implementations to be able to take advantage as much as possible of such hardware’s evolutions. One of the major difficulties is that current MPI implementations have to take into account multiple hardware features while maintaining a strict compliance to the actual standard. Implementations now ..."
Abstract
- Add to MetaCart
for implementations to be able to take advantage of such hardware evolutions as much as possible. One of the major difficulties is that current MPI implementations have to take multiple hardware features into account while maintaining strict compliance with the standard. Implementations now have to take into consideration the increasing number of CPUs and cores available in a compute node. They also need to consider the memory hierarchy as well as the NUMA factor when accessing data. As far as the network is concerned, exploiting multiple and possibly heterogeneous interconnects raises several issues: How can an MPI implementation efficiently utilize all NIC resources despite their different natures? How can we avoid contention on the NICs when all the MPI processes on a given node are sending messages? Could some cores be dedicated to driving communication progress instead of executing regular application code? Building a complete MPI stack is a complex task, and such sophisticated optimizations are often overlooked. We believe that specialized software tailored to efficiently exploit complex and hierarchical architectures is one of the keys to an efficient MPI implementation in such environments. Indeed, a low-level runtime system upon which the MPI stack is ported offers both portability and performance to the application using MPI. All optimization mechanisms developed in such a low-level system can benefit the upper, more generic layers of the resulting MPI implementation. The PM2 software suite [11] developed in the Runtime team is able to provide such services.
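The question of dedicating cores to communication progress can be illustrated with a small, purely conceptual Python sketch (threads standing in for dedicated cores, a queue standing in for a NIC send queue); it is not related to the PM2 or NewMadeleine code bases.

```python
# Conceptual sketch: a dedicated "progress" thread drains a send queue while
# compute threads keep working (illustrative only; not PM2/NewMadeleine code).
import queue
import threading
import time

send_queue = queue.Queue()
done = threading.Event()

def progress_engine():
    # Stands in for a core dedicated to driving communication progress.
    while not (done.is_set() and send_queue.empty()):
        try:
            msg = send_queue.get(timeout=0.05)
        except queue.Empty:
            continue
        time.sleep(0.01)                     # pretend to push the message on the wire
        print("sent:", msg)

def worker(rank):
    # Stands in for application code that only enqueues outgoing messages.
    for i in range(3):
        send_queue.put(f"rank {rank} message {i}")
        time.sleep(0.005)                    # pretend to compute

engine = threading.Thread(target=progress_engine)
engine.start()
workers = [threading.Thread(target=worker, args=(r,)) for r in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
done.set()
engine.join()
```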