Results 1 - 10 of 148
NAMD: biomolecular simulation on thousands of processors
- In Supercomputing, 2002
"... Abstract NAMD is a fully featured, production molecular dynamics program for high performance simulation of large biomolecular systems. We have previously, at SC2000, presented scaling results for simulations with cutoff electrostatics on up to 2048 processors of the ASCI Red machine, achieved with ..."
Abstract
-
Cited by 113 (33 self)
- Add to MetaCart
(Show Context)
Abstract NAMD is a fully featured, production molecular dynamics program for high performance simulation of large biomolecular systems. We have previously, at SC2000, presented scaling results for simulations with cutoff electrostatics on up to 2048 processors of the ASCI Red machine, achieved with an object-based hybrid force and spatial decomposition scheme and an aggressive measurement-based predictive load balancing framework. We extend this work by demonstrating similar scaling on the much faster processors of the PSC Lemieux Alpha cluster, and for simulations employing efficient (order N log N) particle mesh Ewald full electrostatics. This unprecedented scalability in a biomolecular simulation code has been attained through latency tolerance, adaptation to multiprocessor nodes, and the direct use of the Quadrics Elan library in place of MPI by the Charm++/Converse parallel runtime system.
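The measurement-based load balancing mentioned above can be illustrated with a deliberately simple greedy sketch in C: assign each migratable object, using its measured cost, to the currently least-loaded processor. This is only a toy stand-in for the Charm++/NAMD balancer; the object counts and timings below are made up.

/* Hypothetical greedy, measurement-based load balancer (illustration only). */
#include <stdio.h>

#define NOBJ 8
#define NPROC 3

int main(void) {
    /* made-up measured execution times (ms) for each migratable object */
    double obj_time[NOBJ] = {4.0, 9.5, 1.2, 7.3, 3.3, 6.1, 2.2, 5.0};
    double proc_load[NPROC] = {0.0, 0.0, 0.0};

    for (int i = 0; i < NOBJ; i++) {
        int best = 0;                    /* find the currently least-loaded processor */
        for (int p = 1; p < NPROC; p++)
            if (proc_load[p] < proc_load[best]) best = p;
        proc_load[best] += obj_time[i];  /* charge the object's measured cost to it */
        printf("object %d (%.1f ms) -> processor %d\n", i, obj_time[i], best);
    }

    for (int p = 0; p < NPROC; p++)
        printf("processor %d: total modelled load %.1f ms\n", p, proc_load[p]);
    return 0;
}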
Microarchitecture of a high-radix router
- In ISCA ’05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005
"... Evolving semiconductor and circuit technology has greatly increased the pin bandwidth available to a router chip. In the early 90s, routers were limited to 10Gb/s of pin bandwidth. Today 1Tb/s is feasible, and we expect 20Tb/s of I/O bandwidth by 2010. A high-radix router that provides many narrow p ..."
Abstract
-
Cited by 65 (14 self)
- Add to MetaCart
(Show Context)
Evolving semiconductor and circuit technology has greatly increased the pin bandwidth available to a router chip. In the early 90s, routers were limited to 10 Gb/s of pin bandwidth. Today 1 Tb/s is feasible, and we expect 20 Tb/s of I/O bandwidth by 2010. A high-radix router that provides many narrow ports is more effective in converting pin bandwidth to reduced latency and reduced cost than the alternative of building a router with a few wide ports. However, increasing the radix (or degree) of a router raises several challenges, as internal switches and allocators scale as the square of the radix. This paper addresses these challenges by proposing and evaluating alternative microarchitectures for high-radix routers. We show that the use of a hierarchical switch organization with per-virtual-channel buffers in each subswitch enables an area savings of 40% compared to a fully buffered crossbar and a throughput increase of 20-60% compared to a conventional crossbar implementation.
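To make the square-of-the-radix remark concrete, the toy C program below simply tabulates the k*k crosspoints that a flat radix-k crossbar implies (a fully buffered crossbar also keeps a buffer at each crosspoint). The paper's 40% area saving comes from its hierarchical, per-virtual-channel-buffered design, which this sketch does not model.

#include <stdio.h>

int main(void) {
    int radices[] = {16, 32, 64, 128};
    int n = sizeof(radices) / sizeof(radices[0]);
    for (int i = 0; i < n; i++) {
        int k = radices[i];
        /* a flat k x k crossbar needs k*k crosspoints, so cost grows quadratically */
        printf("radix %3d: %6d crosspoints\n", k, k * k);
    }
    return 0;
}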
Concurrent Direct Network Access for Virtual Machine Monitors
- In HPCA, 2007
"... This paper presents hardware and software mechanisms to enable concurrent direct network access (CDNA) by operating systems running within a virtual machine monitor. In a conventional virtual machine monitor, each operating system running within a virtual machine must access the network through a so ..."
Abstract
-
Cited by 61 (9 self)
- Add to MetaCart
(Show Context)
This paper presents hardware and software mechanisms to enable concurrent direct network access (CDNA) by operating systems running within a virtual machine monitor. In a conventional virtual machine monitor, each operating system running within a virtual machine must access the network through a software-virtualized network interface. These virtual network interfaces are multiplexed in software onto a physical network interface, incurring significant performance overheads. The CDNA architecture improves networking efficiency and performance by dividing the tasks of traffic multiplexing, interrupt delivery, and memory protection between hardware and software in a novel way. The virtual machine monitor delivers interrupts and provides protection between virtual machines, while the network interface performs multiplexing of the network data. In effect, the CDNA architecture provides the abstraction that each virtual machine is connected directly to its own network interface. Through the use of CDNA, many of the bottlenecks imposed by software multiplexing can be eliminated without sacrificing protection, producing substantial efficiency improvements.
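As a purely illustrative sketch of the abstraction described above (not the paper's hardware or software interface; every name below is hypothetical), one can picture a per-VM context exposed by the NIC alongside hypervisor-owned state used for interrupt delivery and memory protection.

/* Hypothetical data-structure sketch of the CDNA idea; not the actual design. */
#include <stdint.h>
#include <stdio.h>

#define MAX_VMS 16

struct nic_vm_context {          /* one per guest, mapped directly into that guest */
    uint64_t tx_ring_base;       /* guest-physical address of its TX descriptor ring */
    uint64_t rx_ring_base;       /* guest-physical address of its RX descriptor ring */
    uint32_t doorbell;           /* written by the guest to post new descriptors */
    uint32_t interrupt_pending;  /* set by the NIC, observed by the hypervisor */
};

struct hypervisor_state {        /* owned by the VMM, never mapped into guests */
    int context_owner[MAX_VMS];  /* which VM owns each NIC context */
    /* per-VM DMA-validation state (e.g., protection tables) would also live here */
};

int main(void) {
    printf("per-VM NIC context size: %zu bytes\n", sizeof(struct nic_vm_context));
    return 0;
}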
Optimizing 10-Gigabit Ethernet for networks of workstations, clusters and grids: A case study
- In Supercomputing Conference, 2003
"... This paper presents a case study of the 10-Gigabit Ether-net (10GbE) adapter from Intel R. Specifically, with appropri-ate optimizations to the configurations of the 10GbE adapter and TCP, we demonstrate that the 10GbE adapter can perform well in local-area, storage-area, system-area, and wide-area ..."
Abstract
-
Cited by 37 (7 self)
- Add to MetaCart
(Show Context)
This paper presents a case study of the 10-Gigabit Ethernet (10GbE) adapter from Intel®. Specifically, with appropriate optimizations to the configurations of the 10GbE adapter and TCP, we demonstrate that the 10GbE adapter can perform well in local-area, storage-area, system-area, and wide-area networks. For local-area, storage-area, and system-area networks in support of networks of workstations, network-attached storage, and clusters, respectively, we can achieve over 7-Gb/s end-to-end throughput and 12-µs end-to-end latency between applications running on Linux-based PCs. For the wide-area network in support of grids, we broke the recently-set Internet2 Land Speed Record by 2.5 times by sustaining an end-to-end TCP/IP throughput of 2.38 Gb/s between Sunnyvale, …
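One piece of end-host tuning that studies like this rely on is enlarging the TCP socket buffers so that a high bandwidth-delay-product path can stay full. The C sketch below uses the standard setsockopt() interface; the 16 MB size is an arbitrary illustrative value, and adapter-side settings (MTU, interrupt coalescing) are not shown and are not taken from the paper.

/* Minimal sketch: enlarge TCP socket buffers for a high-BDP path. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    int bufsize = 16 * 1024 * 1024;   /* illustrative 16 MB send/receive buffers */
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_RCVBUF");

    int actual; socklen_t len = sizeof(actual);
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &actual, &len);
    printf("effective send buffer: %d bytes\n", actual);

    close(sock);
    return 0;
}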
Interviewed by Author
- Interview, 1998
"... In this work we describe a performance model of the Parallel Ocean Program (POP). In particular the latest version of POP (v2.0) is considered which has similarities and differences to the earlier version (v1.4.3) as commonly used in climate simulations. The performance model encapsulates an underst ..."
Abstract
-
Cited by 35 (1 self)
- Add to MetaCart
(Show Context)
In this work we describe a performance model of the Parallel Ocean Program (POP). In particular, the latest version of POP (v2.0) is considered, which has similarities to and differences from the earlier version (v1.4.3) commonly used in climate simulations. The performance model encapsulates an understanding of POP’s data decomposition, processing flow, and scaling characteristics. The model is parameterized in terms of many of the main input parameters to POP as well as characteristics of a processing system such as network latency and bandwidth. The performance model has been validated to date on a medium-sized (128-processor) AlphaServer ES40 system with the QsNet-1 interconnection network, and also on a larger-scale (2,048-processor) Blue Gene/L system. The accuracy of the performance model is high when using two standard benchmark configurations, one of which represents a realistic configuration similar to that used in Community Climate System Model coupled climate simulations. The performance model is also used to explore the performance of POP after possible optimizations to the code, and with different task-to-processor assignment strategies, whose performance cannot currently be measured.
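The flavor of such a model can be conveyed by a deliberately crude latency/bandwidth sketch in C: per-step time as parallel compute plus a boundary exchange plus a tree reduction. The formula and every constant below are illustrative stand-ins, not the validated POP model or its measured parameters.

/* Toy latency/bandwidth model of one timestep (illustration only). */
#include <math.h>
#include <stdio.h>

double step_time(int P, double grid_points, double flops_per_point,
                 double flop_rate, double latency, double bandwidth) {
    double compute = grid_points * flops_per_point / (P * flop_rate);
    double halo_bytes = 8.0 * 4.0 * sqrt(grid_points / P);   /* 4 faces of doubles */
    double halo = latency + halo_bytes / bandwidth;           /* boundary exchange */
    double reduce = ceil(log2((double)P)) * latency;          /* tree-style reduction */
    return compute + halo + reduce;
}

int main(void) {
    for (int P = 64; P <= 2048; P *= 2)
        printf("P=%5d  modelled step time: %.6f s\n", P,
               step_time(P, 1.0e7, 500.0, 1.0e9, 5.0e-6, 3.0e8));
    return 0;
}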
Transformations to parallel codes for communication-computation overlap
- In Supercomputing, 2005
"... This paper presents program transformations directed toward improving communication-computation overlap in parallel programs that use MPI’s collective operations. Our transformations target a wide variety of applications focusing on scientific codes with computation loops that exhibit limited depend ..."
Abstract
-
Cited by 34 (4 self)
- Add to MetaCart
(Show Context)
This paper presents program transformations directed toward improving communication-computation overlap in parallel programs that use MPI’s collective operations. Our transformations target a wide variety of applications, focusing on scientific codes with computation loops that exhibit limited dependence among iterations. We include guidance for developers for transforming an application code in order to exploit the communication-computation overlap available in the underlying cluster, as well as a discussion of the performance improvements achieved by our transformations. We present results from a detailed study of the effect of the problem and message size, level of communication-computation overlap, and amount of communication aggregation on runtime performance in a cluster environment based on an RDMA-enabled network. The targets of our study are two scientific codes written by domain scientists, but the applicability of our work extends far beyond the scope of these two applications.
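The general shape of the transformation (shown here with nonblocking point-to-point MPI as a stand-in and placeholder computation, not code from the two studied applications) is: post the communication early, compute whatever does not depend on it, then wait.

/* Sketch of communication-computation overlap with nonblocking MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size, left = (rank + size - 1) % size;
    double send_halo = (double)rank, recv_halo = 0.0, interior = 0.0;
    MPI_Request reqs[2];

    /* 1. start the halo exchange */
    MPI_Irecv(&recv_halo, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_halo, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 2. overlap: update "interior" work that needs no remote data */
    for (int i = 0; i < 1000000; i++) interior += 1e-6;

    /* 3. wait, then do the part that depends on the received halo */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    double boundary = interior + recv_halo;

    if (rank == 0) printf("boundary value on rank 0: %f\n", boundary);
    MPI_Finalize();
    return 0;
}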
A framework for collective personalized communication
- In Proceedings of IPDPS’03, 2003
"... This paper explores collective personalized communication. For example, in all-to-all personalized communication (AAPC), each processor sends a distinct message to every other processor. However, for many applications, the collective communication pattern is many-to-many, where each processor sends ..."
Abstract
-
Cited by 33 (8 self)
- Add to MetaCart
(Show Context)
This paper explores collective personalized communication. For example, in all-to-all personalized communication (AAPC), each processor sends a distinct message to every other processor. However, for many applications, the collective communication pattern is many-to-many, where each processor sends a distinct message to a subset of processors. In this paper we first present strategies that reduce per-message cost to optimize AAPC. We then present performance results of these strategies in both all-to-all and many-to-many scenarios. These strategies are implemented in a flexible, asynchronous library with a non-blocking interface, and a message-driven runtime system. This allows the collective communication to run concurrently with the application, if desired. As a result, the computational overhead of the communication is substantially reduced, at least on machines such as PSC Lemieux, which sport a coprocessor capable of remote DMA. We demonstrate the advantages of our framework with performance results on several benchmarks and applications.
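For orientation, the standard MPI primitive for all-to-all personalized exchange is MPI_Alltoallv; the asynchronous, message-driven library described above is not reproduced here. In the minimal C example below every rank sends one distinct int to every other rank; zeroing some entries of the count arrays would turn it into a many-to-many pattern.

/* Minimal MPI_Alltoallv example: one distinct int per destination. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    int *counts  = malloc(size * sizeof(int));
    int *displs  = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        sendbuf[i] = rank * 100 + i;   /* a distinct message per destination */
        counts[i] = 1;                 /* set to 0 for a many-to-many pattern */
        displs[i] = i;
    }

    MPI_Alltoallv(sendbuf, counts, displs, MPI_INT,
                  recvbuf, counts, displs, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0) printf("rank 0 received %d from the last rank\n", recvbuf[size - 1]);

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}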
Building multirail InfiniBand clusters: MPI-level design and performance evaluation
- In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, 2004
"... In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today’s most demanding applications. In this paper, we study the probl ..."
Abstract
-
Cited by 29 (6 self)
- Add to MetaCart
(Show Context)
In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today’s most demanding applications. In this paper, we study the problem of how to overcome the bandwidth bottleneck by using multirail networks. We present different ways of setting up multirail networks with InfiniBand and propose a unified MPI design that can support all these approaches. We also discuss various important design issues and different policies of using multirail networks in depth, including an adaptive striping scheme that can dynamically change the striping parameters based on the current system condition. We have implemented our design and evaluated it using both microbenchmarks and applications. Our performance results show that multirail networks can significantly improve MPI communication performance. With a two-rail InfiniBand cluster, we have achieved almost twice the bandwidth and half the latency for large messages compared with the original MPI. At the application level, the multirail MPI can significantly reduce communication time as well as running time depending on the communication pattern. We have also shown that the adaptive striping scheme can achieve excellent performance without a priori knowledge of the bandwidth of each rail.
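Independently of the actual MPI-level design, the striping idea can be sketched as splitting each large message across rails in proportion to per-rail weights, which an adaptive scheme might derive from observed bandwidth. The C toy below uses made-up weights and simply prints the byte ranges each rail would carry.

/* Toy illustration of message striping across rails (not the paper's design). */
#include <stdio.h>

#define NRAILS 2

int main(void) {
    double weight[NRAILS] = {0.55, 0.45};   /* e.g., adapted from measured bandwidth */
    long msg_bytes = 1 << 20;               /* a 1 MB message */
    long offset = 0;

    for (int r = 0; r < NRAILS; r++) {
        long chunk = (r == NRAILS - 1) ? (msg_bytes - offset)
                                       : (long)(weight[r] * msg_bytes);
        printf("rail %d: send bytes [%ld, %ld)\n", r, offset, offset + chunk);
        offset += chunk;   /* each chunk would be posted on its own rail/queue pair */
    }
    return 0;
}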
Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics
- In Proceedings of the Int'l Conference on Supercomputing (SC'03), 2003
"... comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. Our performance evaluation consists of two major parts. The first part consists of a set of MPI level micro-benchmarks that characterize different aspects of MPI implementations. The second part of the performance evaluation co ..."
Abstract
-
Cited by 28 (0 self)
- Add to MetaCart
This paper presents a performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. Our performance evaluation consists of two major parts. The first part consists of a set of MPI-level micro-benchmarks that characterize different aspects of MPI implementations. The second part of the performance evaluation consists of application-level benchmarks. We have used the NAS Parallel Benchmarks and the sweep3D benchmark. We not only present the overall performance results, but also relate application communication characteristics to the information we acquired from the micro-benchmarks. Our results show that the three MPI implementations all have their advantages and disadvantages. For our 8-node cluster, InfiniBand can offer significant performance improvements for a number of applications compared with Myrinet and Quadrics when using the PCI-X bus. Even with just the PCI bus, InfiniBand can still perform better if the applications are bandwidth-bound.
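The latency figures in micro-benchmark suites of this kind typically come from a ping-pong test; the generic C/MPI sketch below (not the authors' benchmark code) is meant to be run on exactly two ranks.

/* Generic ping-pong latency micro-benchmark sketch; run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000, bytes = 4096;
    char *buf = calloc(bytes, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d-byte round trip: %.2f us (half is the one-way latency)\n",
               bytes, 1e6 * (t1 - t0) / iters);

    free(buf);
    MPI_Finalize();
    return 0;
}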
Overview of recent supercomputers
1997
"... In this report we give an overview of parallel and vector computers which are currently available or will become available within a short time frame from vendors; no attempt is made to list all machines that are still in the research phase. The machines are described according to their architectura ..."
Abstract
-
Cited by 28 (2 self)
- Add to MetaCart
(Show Context)
In this report we give an overview of parallel and vector computers which are currently available or will become available within a short time frame from vendors; no attempt is made to list all machines that are still in the research phase. The machines are described according to their architectural class. Shared and distributed memory SIMD and MIMD machines are discerned. The information about each machine is kept as compact as possible. Moreover, no attempt is made to quote prices as these are often even more elusive than the performance of a system.