Results 1 - 10 of 21
Efficient Reliable Multicast on Myrinet
- In Proceedings of the International Conference on Parallel Processing, 1996
Cited by 62 (14 self)
Although multicast is an important communication primitive for parallel programming, many modern networks do not support it in hardware. Multicast can be implemented in software on such networks, using some spanning tree protocol. Making multicast reliable, however, is a difficult problem, even if the hardware point-to-point communication is reliable. The key issue is that a flow control mechanism is needed to prevent overflow of software buffers. Without flow control, messages may have to be dropped, resulting in unreliable communication. Flow control for multicast communication is hard, because buffer space at many processors is involved. This paper describes a reliable multicast algorithm, using a flow control method based on a credit scheme. It also describes the implementation of the algorithm on Myrinet, which supports reliable point-to-point communication but no multicast. Our multicast algorithm has been implemented by extending the Illinois Fast Messages software. To ...
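The credit idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm (the abstract is truncated and gives no protocol details); it only assumes a single sender, one credit per free buffer slot at each receiver, and hypothetical names (`Receiver`, `CreditSender`, `return_credit`).

```python
from collections import deque


class Receiver:
    """Hypothetical receiver with a fixed number of software buffer slots."""

    def __init__(self, slots):
        self.slots = slots
        self.buffer = deque()

    def deliver(self, msg):
        # If the sender respects its credits, this branch is never taken.
        if len(self.buffer) >= self.slots:
            raise OverflowError("receive buffer overflow")
        self.buffer.append(msg)

    def consume(self):
        # The application removes one message, which frees one slot.
        return self.buffer.popleft()


class CreditSender:
    """Sender that multicasts only while every receiver has a free slot."""

    def __init__(self, receivers):
        self.receivers = list(receivers)
        # Assumption: one credit per buffer slot at each receiver.
        self.credits = [r.slots for r in self.receivers]

    def multicast(self, msg):
        # Refuse to send (rather than drop) when any receiver is out of credits.
        if min(self.credits) == 0:
            return False
        for i, receiver in enumerate(self.receivers):
            self.credits[i] -= 1
            receiver.deliver(msg)
        return True

    def return_credit(self, index):
        # Called when receiver `index` reports that it freed one slot.
        self.credits[index] += 1
```

The point of the sketch is that the sender stalls on the slowest receiver: a multicast is allowed only when the minimum credit count is non-zero, so no message ever has to be dropped.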
Bandwidth-efficient Collective Communication for Clustered Wide Area Systems
- In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun, 1999
Cited by 24 (3 self)
Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clusters. Latency and bandwidth of WANs often are orders of magnitude worse than those of local networks. Our MagPIe library eases wide-area parallel programming by providing an efficient implementation of MPI's collective communication operations. MagPIe exploits the hierarchical structure of clustered wide-area systems and minimizes the communication overhead over the WAN links. In this paper, we present improved algorithms for collective communication that achieve shorter completion times by simultaneously using the aggregate bandwidth of the available wide-area links. Our new algorithms split messages into multiple segments that are sent in parallel over different WAN links, thus resulting ...
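The segmentation idea in this abstract can be sketched as follows. This is an illustration only, not MagPIe's implementation: the function names and the simple cost model (a fixed latency plus size divided by per-link bandwidth, with all links identical and fully parallel) are assumptions.

```python
def split_segments(message: bytes, n_links: int) -> list:
    """Split a message into n_links nearly equal contiguous segments."""
    base, extra = divmod(len(message), n_links)
    segments = []
    start = 0
    for i in range(n_links):
        # The first `extra` segments carry one additional byte.
        size = base + (1 if i < extra else 0)
        segments.append(message[start:start + size])
        start += size
    return segments


def transfer_time(size_bytes: int, latency_s: float,
                  bandwidth_bytes_per_s: float, n_links: int = 1) -> float:
    """Assumed model: completion time is set by the largest segment."""
    largest = -(-size_bytes // n_links)  # ceiling division
    return latency_s + largest / bandwidth_bytes_per_s
```

Under this assumed model, a 1,000,000-byte message over links with 10 ms latency and 1,000,000 bytes/s bandwidth takes about 1.01 s on one link and about 0.26 s when split over four links; the latency term is not reduced by splitting.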
Performance of a High-Level Parallel Language on a High-Speed Network
- Journal of Parallel and Distributed Computing, 1997
Cited by 21 (12 self)
Clusters of workstations are often claimed to be a good platform for parallel processing, especially if a fast network is used to interconnect the workstations. Indeed, high performance can be obtained for low-level message passing primitives on modern networks like ATM and Myrinet. Most application programmers, however, want to use higher-level communication primitives. Unfortunately, implementing such primitives efficiently on a modern network is a difficult task, because their software overhead is relatively much higher than on a traditional, slow network (such as Ethernet). In this paper we investigate the issues involved in implementing a high-level programming environment on a fast network. We have implemented a portable runtime system for an object-based language (Orca) on a collection of processors connected by a Myrinet network. Many performance optimizations were required in order to let application programmers benefit sufficiently from the faster network. In particul...
Schematic: A Concurrent Object-Oriented Extension to Scheme
- In Proceedings of Workshop on Object-Based Parallel and Distributed Computation, number 1107 in Lecture Notes in Computer Science, 1996
Cited by 18 (12 self)
A concurrent object-oriented extension to the programming language Scheme, called Schematic, is described. Schematic supports familiar constructs often used in typical parallel programs (future and higher-level macros such as plet and pbegin), which are actually defined atop a very small number of fundamental primitives. In this way, Schematic achieves both the convenience for typical concurrent programming and simplicity and flexibility of the language kernel. Schematic also supports concurrent objects which exhibit more natural and intuitive behavior than the "bare" (unprotected) shared memory, and permit intra-object concurrency. Schematic will be useful for intensive parallel applications on parallel machines or networks of workstations, concurrent graphical user interface programming, distributed programming over network, and even concurrent shell programming.
Efficient Replicated Method Invocation in Java
- In ACM 2000 Java Grande Conference, 2000
Cited by 18 (9 self)
We describe a new approach to object replication in Java, aimed at improving the performance of parallel programs. Our programming model allows the programmer to define groups of objects that can be replicated and updated as a whole, using totally-ordered broadcast to send update methods to all machines containing a copy. The model has been implemented in the Manta high-performance Java system. Performance measurements on a Myrinet cluster show that the replication mechanism is efficient (e.g., updating 16 replicas of a simple object takes 68 microseconds, only slightly longer than the Manta RMI latency). Example applications that use object replication perform as fast as manually optimized versions based on RMI.
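Why a totally-ordered broadcast keeps replicas consistent can be shown with a small sketch. The abstract does not say how Manta orders its broadcasts; the sequencer-based scheme below is one common way to obtain a total order and is an assumption, as are the names `Sequencer` and `Replica`.

```python
class Sequencer:
    """Hypothetical central sequencer that stamps each update with a number."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, update):
        seq = self.next_seq
        self.next_seq += 1
        return seq, update


class Replica:
    """Replica that applies updates strictly in sequence-number order."""

    def __init__(self):
        self.value = 0
        self.expected = 0   # next sequence number to apply
        self.pending = {}   # updates that arrived ahead of their turn

    def receive(self, seq, update):
        self.pending[seq] = update
        # Apply every update that is now in order; keep the rest buffered.
        while self.expected in self.pending:
            apply_update = self.pending.pop(self.expected)
            self.value = apply_update(self.value)
            self.expected += 1


def add_three(x):
    return x + 3


def double(x):
    return x * 2
```

Usage: stamp `add_three` and then `double` with one `Sequencer`, and deliver them to two replicas in different arrival orders. Both replicas end with the value 6, because each applies the updates in stamp order even though the two operations do not commute.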
Buffered coscheduling: A new methodology for multitasking parallel jobs on distributed systems
- In Proceedings of the International Parallel and Distributed Processing Symposium 2000, IPDPS2000, Cancun, MX, 2000
Cited by 18 (11 self)
Buffered coscheduling is a scheduling methodology for time-sharing communicating processes in parallel and distributed systems. The methodology has two primary features: communication buffering and strobing. With communication buffering, communication generated by each processor is buffered and performed at the end of regular intervals to amortize communication and scheduling overhead. This infrastructure is then leveraged by a strobing mechanism to perform a total exchange of information at the end of each interval, thus providing global information to more efficiently schedule communicating processes. This paper describes how buffered coscheduling can optimize resource utilization by analyzing workloads with varying computational granularities, load imbalances, and communication patterns. The experimental results, performed using a detailed simulation model, show that buffered coscheduling is very effective on fast SANs such as Myrinet as well as slower switch-based LANs.
Parallel Application Experience with Replicated Method Invocation
- 2001
Cited by 17 (11 self)
We describe and evaluate a new approach to object replication in Java, aimed at improving the performance of parallel programs. Our programming model allows the programmer to define groups of objects that can be replicated and updated as a whole, using totally-ordered broadcast to send update methods to all machines containing a copy. The model has been implemented in the Manta high-performance Java system. We evaluate system performance both with micro benchmarks and with a set of five parallel applications. For the applications, we also evaluate ease of programming, compared to RMI implementations. We present performance results for a Myrinet-based workstation cluster as well as for a wide-area distributed system consisting of four such clusters. The micro benchmarks show that updating a replicated object on 64 machines only takes about three times the RMI latency in Manta. Applications using Manta’s object replication mechanism perform at least as fast as manually optimized versions based on RMI, while keeping the application code as simple as with naive versions that use shared objects without taking locality into account. Using a replication mechanism in Manta’s runtime system enables several unmodified applications to run efficiently even on the wide-area system.
Wide-Area Parallel Programming using the Remote Method Invocation Model
- 1999
Cited by 16 (10 self)
Java’s support for parallel and distributed processing makes the language attractive for metacomputing applications, such as parallel applications that run on geographically distributed (wide-area) systems. To obtain actual experience with a Java-centric approach to metacomputing, we have built and used a high-performance wide-area Java system, called Manta. Manta implements the Java Remote Method Invocation (RMI) model using different communication protocols (active messages and TCP/IP) for different networks. The paper shows how wide-area parallel applications can be expressed and optimized using Java RMI. Also, it presents performance results of several applications on a wide-area system consisting of four Myrinet-based clusters connected by ATM WANs. We finally discuss alternative programming models, namely object replication, JavaSpaces, and MPI for Java.
Dynamic Power Management for Power Optimization of Interconnection Networks Using On/Off Links
Cited by 15 (0 self)
Power consumption in interconnection networks has become an increasingly important architectural issue. The links which interconnect network node routers are a major consumer of power and will devour an ever-increasing portion of total available power as network bandwidth and operating frequencies upscale. In this paper we propose a dynamic power management policy where network links are turned off and switched back on depending on network utilization in a distributed fashion. We have devised a systematic approach based on the derivation of a connectivity graph that balances power and performance for a 2D mesh topology. This, coupled with a deadlock-free, fully-adaptive routing algorithm, guarantees packet delivery. Our approach realizes up to ##### reduction in overall network link power for an 8-ary 2-mesh topology with a moderate network latency increase.
Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines
- In Proc. of the 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996
Cited by 14 (4 self)
Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our implementations of a global address language on the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, Cray T3D, and Berkeley NOW. This evaluation includes a range of compilation strategies that make varying use of the network processor; each is optimized for the target architecture and the particular strategy. We analyze a family of interacting issues that determine the performance tradeoffs in each implementation, quantify the resulting latency, overhead, and bandwidth of the global access operations, and demonstrate the effects on application performance.

