Results 1 - 10 of 126
High Performance VMM-Bypass I/O in Virtual Machines
, 2006
"... Currently, I/O device virtualization models in virtual machine (VM) environments require involvement of a virtual machine monitor (VMM) and/or a privileged VM for each I/O operation, which may turn out to be a performance bottleneck for systems with high I/O demands, especially those equipped with m ..."
Abstract
-
Cited by 71 (2 self)
- Add to MetaCart
Currently, I/O device virtualization models in virtual machine (VM) environments require the involvement of a virtual machine monitor (VMM) and/or a privileged VM for each I/O operation, which may turn out to be a performance bottleneck for systems with high I/O demands, especially those equipped with modern high speed interconnects such as InfiniBand. In this paper, we propose a new device virtualization model called VMM-bypass I/O, which extends the idea of OS-bypass originally developed for user-level communication. Essentially, VMM-bypass allows time-critical I/O operations to be carried out directly in guest VMs without involvement of the VMM and/or a privileged VM. By exploiting the intelligence found in modern high speed network interfaces, VMM-bypass can significantly improve I/O and communication performance for VMs without sacrificing safety or isolation. To demonstrate the idea of VMM-bypass, we have developed a prototype called Xen-IB, which offers InfiniBand virtualization support in the Xen 3.0 VM environment. Xen-IB runs with current InfiniBand hardware and does not require modifications to existing user-level applications or kernel-level drivers that use InfiniBand. Our performance measurements show that Xen-IB is able to achieve nearly the same raw performance as the original InfiniBand driver running in a non-virtualized environment.
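As context for how the bypassed fast path looks to software: OS-bypass devices such as InfiniBand HCAs let a process post work requests directly to the hardware, and VMM-bypass extends the same path to guest VMs. A minimal sketch using the libibverbs API (the queue pair qp, memory region mr, and buffer are assumed to have been set up earlier through the privileged control path; error handling elided):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a send directly to the HCA's work queue -- no VMM or
 * privileged-VM involvement on this path.  qp, mr, and buf are
 * assumed to have been created through the (virtualized) setup path. */
static int post_send(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,          /* key from memory registration */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    /* The doorbell write inside ibv_post_send goes straight to the
     * device; in Xen-IB this is the step that bypasses the hypervisor. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```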
Scalable algorithms for molecular dynamics simulations on commodity clusters
- In SC ’06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing
, 2006
"... Although molecular dynamics (MD) simulations of biomolecular systems often run for days to months, many events of great scientific interest and pharmaceutical relevance occur on long time scales that remain beyond reach. We present several new algorithms and implementation techniques that significan ..."
Abstract
-
Cited by 68 (5 self)
- Add to MetaCart
(Show Context)
Although molecular dynamics (MD) simulations of biomolecular systems often run for days to months, many events of great scientific interest and pharmaceutical relevance occur on long time scales that remain beyond reach. We present several new algorithms and implementation techniques that significantly accelerate parallel MD simulations compared with current state-of-the-art codes. These include a novel parallel decomposition method and message-passing techniques that reduce communication requirements, as well as novel communication primitives that further reduce communication time. We have also developed numerical techniques that maintain high accuracy while using single-precision computation in order to exploit processor-level vector instructions. These methods are embodied in a newly developed MD code called Desmond that achieves unprecedented simulation throughput and parallel scalability on commodity clusters. Our results suggest that Desmond’s parallel performance substantially surpasses that of any previously described code. For example, on a standard benchmark, Desmond’s performance on a conventional Opteron cluster with 2K processors slightly exceeded the reported performance of IBM’s Blue Gene/L machine with 32K processors running its Blue Matter MD code.
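The abstract does not detail the numerical techniques; as a generic illustration of one way to keep single-precision accumulation accurate (compensated summation, a standard technique shown only for illustration, not necessarily Desmond's actual method):

```c
/* Kahan (compensated) summation: accumulates float values while
 * carrying the lost low-order bits in a correction term, so the
 * result stays close to double-precision accuracy.  Illustrative
 * only; Desmond's specific numerical techniques are not given here. */
static float kahan_sum(const float *x, int n)
{
    float sum = 0.0f, c = 0.0f;      /* c carries the lost low bits */
    for (int i = 0; i < n; i++) {
        float y = x[i] - c;
        float t = sum + y;
        c = (t - sum) - y;           /* recovered rounding error */
        sum = t;
    }
    return sum;
}
```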
Optimizing bandwidth limited problems using one-sided communication and overlap
- In 20th International Parallel and Distributed Processing Symposium (IPDPS)
, 2006
"... Abstract ..."
(Show Context)
Performance Evaluation of Adaptive MPI
, 2006
"... Processor virtualization via migratable objects is a powerful technique that enables the runtime system to carry out intelligent adaptive optimizations like dynamic resource management. CHARM++ is an early language/system that supports migratable objects. This paper describes Adaptive MPI (or AMPI), ..."
Abstract
-
Cited by 51 (19 self)
- Add to MetaCart
Processor virtualization via migratable objects is a powerful technique that enables the runtime system to carry out intelligent adaptive optimizations like dynamic resource management. CHARM++ is an early language/system that supports migratable objects. This paper describes Adaptive MPI (AMPI), an MPI implementation and extension that supports processor virtualization. AMPI implements virtual MPI processes (VPs), several of which may be mapped to a single physical processor. AMPI includes a powerful runtime support system that takes advantage of the degrees of freedom afforded by the ability to assign VPs to processors. With this runtime system, AMPI supports features such as automatic adaptive overlapping of communication and computation, automatic load balancing, the flexibility of running on an arbitrary number of processors, and checkpoint/restart support. It also inherits communication optimizations from the CHARM++ framework. This paper describes AMPI, illustrates its performance benefits through a series of benchmarks, and shows that AMPI is a portable and mature MPI implementation that offers various performance benefits to dynamic applications.
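A plain MPI program runs unmodified under AMPI; a minimal sketch, with the build/launch commands (per AMPI's documented ampicc/charmrun usage, where +vp selects the number of virtual processes) given as comments:

```c
#include <mpi.h>
#include <stdio.h>

/* An ordinary MPI program: under AMPI each "rank" is a migratable
 * virtual process (VP), several of which may share one physical
 * processor.  Per AMPI's documentation, build and launch roughly as:
 *   ampicc -o hello hello.c
 *   charmrun +p4 ./hello +vp16    (16 VPs on 4 processors)
 */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* reports VPs, e.g. 16 */
    printf("virtual process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```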
Implementation and performance analysis of non-blocking collective operations for MPI
- SC07
, 2007
"... Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance lib ..."
Abstract
-
Cited by 49 (24 self)
- Add to MetaCart
(Show Context)
Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces very low overhead between the application and the underlying MPI and thus, in conjunction with the potential to overlap communication with computation, offers the potential for optimizing real-world applications.
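The overlap pattern LibNBC enables looks like the following sketch, written here with the MPI-3 nonblocking collectives (MPI_Iallreduce/MPI_Wait) that later standardized this style of interface; LibNBC's own NBC_ calls follow the same pattern. do_independent_work is a placeholder:

```c
#include <mpi.h>

/* Placeholder for computation that does not touch local/global. */
extern void do_independent_work(void);

/* Overlap an allreduce with independent computation; shown with
 * MPI-3's nonblocking collectives, which follow the same pattern
 * as LibNBC's NBC_Iallreduce/NBC_Wait. */
void reduce_with_overlap(double *local, double *global, int n)
{
    MPI_Request req;
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    do_independent_work();              /* overlapped computation */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* collective now complete */
}
```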
Open MPI: A Flexible High Performance MPI
- In The 6th Annual International Conference on Parallel Processing and Applied Mathematics
, 2005
"... Abstract. A large number of MPI implementations are currently available, each of which emphasize different aspects of high-performance computing or are intended to solve a specific research problem. The result is a myriad of incompatible MPI implementations, all of which require separate installatio ..."
Abstract
-
Cited by 33 (0 self)
- Add to MetaCart
(Show Context)
A large number of MPI implementations are currently available, each of which emphasizes different aspects of high-performance computing or is intended to solve a specific research problem. The result is a myriad of incompatible MPI implementations, all of which require separate installation, and the combination of which presents significant logistical challenges for end users. Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, FT-MPI, and PACX-MPI projects, Open MPI is an all-new, production-quality MPI-2 implementation that is fundamentally centered around component concepts. Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture provides a stable platform for third-party research as well as enabling the run-time composition of independent software add-ons. This paper presents a high-level overview of the goals, design, and implementation of Open MPI, as well as performance results for its point-to-point implementation.
Application-transparent checkpoint/restart for MPI programs over InfiniBand
- ICPP
"... Ultra-scale computer clusters with high speed intercon-nects, such as InfiniBand, are being widely deployed for their excellent performance and cost effectiveness. However, the failure rate on these clusters also increases along with their augmented number of components. Thus, it becomes criti-cal f ..."
Abstract
-
Cited by 33 (3 self)
- Add to MetaCart
(Show Context)
Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely deployed for their excellent performance and cost effectiveness. However, the failure rate on these clusters also increases along with their augmented number of components. Thus, it becomes critical for such systems to be equipped with fault tolerance support. In this paper, we present our design and implementation of a checkpoint/restart framework for MPI programs running over InfiniBand clusters. Our design enables low-overhead, application-transparent checkpointing. It uses a coordinated protocol to save the current state of the whole MPI job to reliable storage, which allows users to perform rollback recovery if the system later runs into a faulty state. Our solution has been incorporated into MVAPICH2, an open-source high performance MPI-2 implementation over InfiniBand. Performance evaluation of this implementation has been carried out using the NAS benchmarks, the HPL benchmark, and a real-world application called GROMACS. Experimental results indicate that in our design, the overhead to take checkpoints is low, and the performance impact of checkpointing applications periodically is insignificant. For example, the time for checkpointing GROMACS is less than 0.3% of the execution time, and its performance decreases by only 4% with checkpoints taken every minute. To the best of our knowledge, this work is the first report of checkpoint/restart support for MPI over InfiniBand clusters in the literature.
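A hypothetical skeleton of such a coordinated protocol is sketched below; drain_channels, save_local_state, and resume_channels are placeholder names standing in for MVAPICH2's internal steps, not real API calls:

```c
#include <mpi.h>

/* Placeholder steps -- stand-ins for the implementation's internal
 * machinery, not real MVAPICH2 API calls. */
static void drain_channels(void)   { /* flush/suspend in-flight IB traffic */ }
static void save_local_state(void) { /* e.g. invoke a kernel-level checkpointer */ }
static void resume_channels(void)  { /* re-arm connections and continue */ }

/* Hypothetical skeleton of a coordinated checkpoint: all processes
 * agree on a cut, quiesce the network, save state, then resume. */
void coordinated_checkpoint(MPI_Comm comm)
{
    MPI_Barrier(comm);      /* agree that a checkpoint starts */
    drain_channels();
    save_local_state();
    MPI_Barrier(comm);      /* everyone saved: the cut is consistent */
    resume_channels();
}
```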
High Performance Virtual Machine Migration with RDMA over Modern Interconnects
"... Abstract — One of the most useful features provided by virtual machine (VM) technologies is the ability to migrate running OS instances across distinct physical nodes. As a basis for many administration tools in modern clusters and data-centers, VM migration is desired to be extremely efficient to r ..."
Abstract
-
Cited by 31 (1 self)
- Add to MetaCart
(Show Context)
One of the most useful features provided by virtual machine (VM) technologies is the ability to migrate running OS instances across distinct physical nodes. As a basis for many administration tools in modern clusters and data-centers, VM migration needs to be extremely efficient in order to reduce both migration time and the performance impact on hosted applications. Currently, most VM environments use the socket interface and the TCP/IP protocol to transfer VM migration traffic. In this paper, we propose a high performance VM migration design based on RDMA (Remote Direct Memory Access). RDMA is a feature provided by many modern high speed interconnects that are currently being widely deployed in data-centers and clusters. By taking advantage of the low software overhead and the one-sided nature of RDMA, our design significantly improves the efficiency of VM migration. We also contribute a set of micro-benchmarks and application-level benchmark evaluations aimed at evaluating important metrics of VM migration. Evaluations using our prototype implementation over Xen and InfiniBand show that RDMA can drastically reduce the migration overhead: by up to 80% on total migration time and up to 77% on application-observed downtime.
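The one-sided operation the design exploits corresponds to an RDMA write in the libibverbs API: the source host places data directly into the destination host's memory with no receiver-side CPU involvement. A minimal sketch (the queue pair, memory registration, and the remote address/rkey exchange are assumed to have happened during migration setup):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* One-sided RDMA write: push a guest-memory page to a pre-registered
 * buffer on the destination host.  remote_addr/rkey must have been
 * exchanged out of band during setup (assumed here). */
static int rdma_write_page(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *page, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)page,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* no remote CPU work */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```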
Building multirail InfiniBand clusters: MPI-level design and performance evaluation
- In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing
, 2004
"... In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today’s most demanding applications. In this paper, we study the probl ..."
Abstract
-
Cited by 29 (6 self)
- Add to MetaCart
(Show Context)
In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today’s most demanding applications. In this paper, we study the problem of how to overcome the bandwidth bottleneck by using multirail networks. We present different ways of setting up multirail networks with InfiniBand and propose a unified MPI design that can support all of these approaches. We also discuss various important design issues, including different policies for using multirail networks and an adaptive striping scheme that can dynamically change the striping parameters based on current system conditions. We have implemented our design and evaluated it using both microbenchmarks and applications. Our performance results show that multirail networks can significantly improve MPI communication performance. With a two-rail InfiniBand cluster, we have achieved almost twice the bandwidth and half the latency for large messages compared with the original MPI. At the application level, the multirail MPI can significantly reduce communication time as well as running time, depending on the communication pattern. We have also shown that the adaptive striping scheme can achieve excellent performance without a priori knowledge of the bandwidth of each rail.
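At its simplest, striping splits a large message across rails in proportion to per-rail weights, which an adaptive scheme would update at run time. A simplified sketch (post_on_rail is a hypothetical per-rail send primitive, e.g. one queue pair per rail):

```c
#include <stddef.h>

#define NRAILS 2

/* Hypothetical per-rail send primitive (e.g. one QP per rail). */
extern void post_on_rail(int rail, const char *buf, size_t len);

/* Stripe one large message across NRAILS rails according to
 * weights[] (fractions summing to 1).  An adaptive striping scheme
 * would update weights[] from observed per-rail bandwidth. */
void stripe_send(const char *buf, size_t len, const double weights[NRAILS])
{
    size_t off = 0;
    for (int r = 0; r < NRAILS; r++) {
        size_t chunk = (r == NRAILS - 1)
                     ? len - off                   /* remainder */
                     : (size_t)(len * weights[r]);
        post_on_rail(r, buf + off, chunk);
        off += chunk;
    }
}
```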
Parallel genomic sequence-searching on an ad-hoc grid: Experiences, lessons learned, and implications
- In SC ’06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing
, 2006
"... bioinformaticists to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence. mpiBLAST, our parallel BLAST, decreases the s ..."
Abstract
-
Cited by 29 (12 self)
- Add to MetaCart
(Show Context)
Sequence-search tools such as BLAST allow bioinformaticists to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence. mpiBLAST, our parallel BLAST, decreases the search time of a 300 KB query on the current NT database from over two full days to under 10 minutes on a 128-processor cluster and allows larger query files to be compared. Consequently, we propose to compare the largest query available, the entire NT database, against the largest database available, the entire NT database. The result of this comparison will provide critical information to the biology community, including insightful evolutionary, structural, and functional relationships between every sequence and family in the NT database. Preliminary projections indicated that completing this task in a reasonable length of time would require more processors than were available to us at a single site. Hence, we assembled GreenGene, an ad-hoc grid constructed “on the fly” from donated computational, network, and storage resources during last year’s SC|05. GreenGene consisted of 3048 processors from machines distributed across the United States. This paper presents a case study of mpiBLAST on GreenGene: specifically, a pre-run characterization of the computation, the hardware and software architectural design, experimental results, and future directions.