"Portals 3.0: Protocol Building Blocks for Low Overhead Communication," presented at 2002 Workshop on Communication Architecture for Clusters (2002)

by R Brightwell, R Riesen, B Lawry, A B Maccabe

Results 1 - 10 of 33

Software architecture of the light weight kernel, Catamount

by Suzanne M. Kelly, Ron Brightwell - In Cray User Group, 2005
Cited by 21 (8 self)
Catamount is designed to be a low overhead operating system for a parallel computing environment. Functionality is limited to the minimum set needed to run a scientific computation. The design choices and implementations will be presented.

COMB: A portable benchmark suite for assessing MPI overlap

by William Lawry, Christopher Wilson, Arthur B. Maccabe, Ron Brightwell - IEEE Cluster, 2002
Cited by 18 (6 self)
This paper describes a portable benchmark suite that assesses the ability of cluster networking hardware and software to overlap MPI communication and computation. The Communication Offload MPI-based Benchmark, or COMB, uses two different methods to characterize the ability of messages to make progress concurrently with computational processing on the host processor(s). COMB measures the relationship between overall MPI communication bandwidth and host CPU availability. In this paper, we describe the two different approaches used by the benchmark suite, and we present results from several systems. We demonstrate the utility of the suite by examining the results and comparing and contrasting different systems.

Improving Effective Bandwidth of Networks on Clusters using Load Balancing for Communication-Intensive Applications

by Xiao Qin, Hong Jiang - Proc. 24th IEEE Int’l Performance, Computing, and Communications Conf., 2005
Cited by 10 (4 self)
Clusters have emerged as a primary and cost-effective infrastructure for parallel applications, including communication-intensive applications that transfer a large amount of data among nodes of a cluster via interconnection networks. Conventional load balancers have been proven effective in increasing utilization of CPU, memory, and disk I/O resources in a cluster. However, most of the existing load-balancing schemes ignore network resources, leaving open an opportunity for improving effective bandwidth of networks on clusters running parallel applications. For this reason, we propose a communication-aware load balancing technique that is capable of improving performance of communication-intensive applications by increasing effective utilization of networks in cluster environments. Our load-balancing scheme can make use of an application model to quickly and accurately determine the load induced by a variety of parallel applications. Simulation results on executing a wide range of parallel applications on a cluster show that the proposed scheme can significantly improve the performance in slowdown and turn-around time over three existing schemes by up to 206% (with an average of 74%) and 235% (with an average of 82%), respectively.

Design, Implementation, and Performance of MPI on Portals 3.0

by Ron Brightwell, Rolf Riesen, Arthur B. Maccabe
Cited by 10 (4 self)
Abstract not found

Application-Bypass Broadcast in MPICH over GM

by Darius Buntinas, Dhabaleswar K. Panda, Ron Brightwell, 2003
Cited by 8 (4 self)
Processes of a parallel program can become unsynchronized, or skewed, during the course of running an application. Processes can become skewed as a result of unbalanced or asymmetric code, or through the use of heterogeneous systems, where nodes in the system have different performance characteristics, as well as random, unpredictable effects such as the processes not being started at exactly the same time, or processors receiving interrupts during computation. Geographically distributed systems may have more severe skew because of variable communication times. Such skew can have a significant impact on the performance of collective communication operations which impose an implicit synchronization. The broadcast operation in MPICH is one such operation. An application-bypass broadcast operation is one which does not depend on the application running at a process to make progress. Such an operation would not be as sensitive to process skew. This paper describes the design and implementation of an application-bypass broadcast operation. We evaluated the implementation and found a factor of improvement of up to 16 for application-bypass broadcast compared to non-application-bypass broadcast when processes are skewed. Furthermore, we see that as the system size increases, the effects of skew on non-application-bypass broadcast also increase. The application-bypass broadcast is much less sensitive to process skew, which makes it more scalable than the non-application-bypass broadcast operation.

A hardware acceleration unit for MPI queue processing

by Keith D. Underwood, K. Scott Hemmert, Arun Rodrigues, Richard Murphy, Ron Brightwell - In Proceedings of IPDPS ’05, 2005
Cited by 7 (3 self)
With the heavy reliance of modern scientific applications upon the MPI Standard, it has become critical for the implementation of MPI to be as capable and as fast as possible. This has led some of the fastest modern networks to introduce the capability to offload aspects of MPI processing to an embedded processor on the network interface. With this important capability have come significant performance implications. Most notably, the time to process long queues of posted receives or unexpected messages is substantially longer on embedded processors. This paper presents an associative list matching structure to accelerate the processing of moderate length queues in MPI. Simulations are used to compare the performance of an embedded processor augmented with this capability to a baseline implementation. The proposed enhancement significantly reduces latency for moderate length queues while adding virtually no overhead for extremely short queues.

Analyzing the impact of overlap, offload, and independent progress for message passing interface applications

by Ron Brightwell, Rolf Riesen, Keith D. Underwood - Int. J. High Perform. Comput. Appl., 2005
Cited by 6 (0 self)
The overlap of computation and communication has long been considered to be a significant performance benefit for applications. Similarly, the ability of the Message Passing Interface (MPI) to make independent progress (that is, to make progress on outstanding communication operations while not in the MPI library) is also believed to yield performance benefits. Using an intelligent network interface to offload the work required to support overlap and independent progress is thought to be an ideal solution, but the benefits of this approach have not been studied in depth at the application level. This lack of analysis is complicated by the fact that most MPI implementations do not sufficiently support overlap or independent progress. Recent work has demonstrated a quantifiable

Designing a common communication subsystem

by Darius Buntinas - In Proceedings of the 12th European Parallel Virtual Machine and Message Passing Interface Conference (Euro PVM MPI), 2005
Cited by 6 (1 self)
Communication subsystems are used in high-performance parallel computing systems to abstract the lower network layer. By using a communication subsystem, an upper middleware library or runtime system can be more easily ported to different interconnects. However, by abstracting the network layer, the designer will typically make the communication subsystem more specialized for that particular middleware library, and less general, making it ineffective for supporting middleware for other programming models. In previous work we analyzed the requirements of various programming model middleware and the communication subsystems that support them. We found that although there are no mutually exclusive requirements, none of the existing communication subsystems could efficiently support the programming model middleware we considered. In this paper, we describe our design of a common communication subsystem, called CCS, that can efficiently support various programming model middleware.

Implementing Efficient and Scalable Flow Control Schemes in MPI over Infiniband

by Jiuxing Liu, Dhabaleswar K. Panda
Cited by 6 (1 self)
In this paper, we present a detailed study of how to design efficient and scalable flow control mechanisms in MPI over the InfiniBand Architecture. Two of the central issues in flow control are performance and scalability in terms of buffer usage. We propose three different flow control schemes (hardware-based, user-level static, and user-level dynamic) and describe their respective design issues. We have implemented all three schemes in our MPI implementation over InfiniBand and conducted performance evaluation using both micro-benchmarks and the NAS Parallel Benchmarks. Our performance analysis shows that in our testbed, most NAS applications only require a very small number of pre-posted buffers for every connection to achieve good performance. We also show that the user-level dynamic scheme can achieve both performance and buffer efficiency by adapting itself according to the application communication pattern. These results have significant impact on designing large-scale clusters (on the order of 1,000 to 10,000 nodes) with InfiniBand.

DAChe: Direct access cache system for parallel I/O

by Kenin Coloma, Alok Choudhary, Wei-keng Liao - To appear in Proceedings of the 2005 International Supercomputer Conference, 2005
Cited by 5 (3 self)
One of the largest challenges in client-side caching in extremely large-scale environments is consistency and coherency. By handling a user-space cache, we can offer applications much closer control over our client-side cache and scale the cache with the size of the compute resources (i.e., compute nodes). Cache data is shared among the compute nodes, analogous to a traditional shared memory machine. Our approach to maintaining the integrity of the distributed cache turns out to be quite scalable and offers potentially sizable performance gains.