GLUnix: a Global Layer Unix for a Network of Workstations
, 1997
Cited by 77 (0 self)
... To provide remote execution of both parallel and sequential jobs, GLUnix extends some existing UNIX abstractions and introduces new abstractions, borrowing heavily from MPP environments such as that of the CM-5. The new abstractions include network programs and globally unique network program identifiers (NPIDs) for GLUnix jobs and virtual node numbers (VNNs) to name the nodes running a network program. The existing abstractions of signal delivery to remote applications and I/O redirection were extended to support parallel and remote jobs. GLUnix provides both programming and command-line interfaces to access these abstractions.
Network Programs. A network program is an executing parallel or sequential job that is controlled by GLUnix. Network programs can be located anywhere in the cluster and are identified using a 32-bit, cluster-unique network program identifier (NPID) which is assigned and tracked by GLUnix. Using a cluster-wide, location-independent identifier provides th...
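To make the NPID and VNN abstractions described above more concrete, here is a minimal sketch in Python. It is not GLUnix's API: the class `NetworkProgramRegistry`, its methods, the sequential allocation of NPIDs, and the numbering of VNNs from 0 are all assumptions made for illustration. The abstract only states that NPIDs are 32-bit and cluster-unique and that VNNs name the nodes running a network program.

```python
import itertools


class NetworkProgramRegistry:
    """Hypothetical registry of network programs (not the GLUnix interface)."""

    MAX_NPID = 2**32 - 1  # NPIDs are 32-bit identifiers per the abstract

    def __init__(self):
        # Assumption: NPIDs are handed out sequentially, starting at 1.
        self._next_npid = itertools.count(1)
        # npid -> {vnn: hostname}
        self._programs = {}

    def launch(self, hosts):
        """Register a job running on `hosts` and return its cluster-unique NPID."""
        npid = next(self._next_npid)
        if npid > self.MAX_NPID:
            raise OverflowError("NPID space exhausted")
        # Assumption: VNNs are 0..n-1 in the order the hosts were given.
        self._programs[npid] = dict(enumerate(hosts))
        return npid

    def host_of(self, npid, vnn):
        """Resolve a virtual node number to the physical host running it."""
        return self._programs[npid][vnn]

    def signal(self, npid, signame):
        """Return the (host, signal) deliveries needed to signal the whole job."""
        nodes = self._programs[npid]
        return [(nodes[vnn], signame) for vnn in sorted(nodes)]
```

The caller only ever holds the NPID, so a signal can be addressed to the whole job without knowing which hosts it runs on; this is the location independence the abstract refers to.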
VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication
- In Proceedings of Hot Interconnects
, 1997
Cited by 71 (18 self)
The basic virtual memory-mapped communication (VMMC) model provides protected, direct communication between the sender's and receiver's virtual address spaces, but it does not support high-level connection-oriented communication APIs well. This paper presents VMMC-2, an extension to the basic VMMC. We describe the design and implementation, and evaluate the performance, of three mechanisms in VMMC-2: (1) a user-managed TLB mechanism for address translation which enables user libraries to dynamically manage the amount of pinned space and requires only driver support from many operating systems; (2) a transfer redirection mechanism which avoids copying on the receiver's side; (3) a reliable communication protocol at the data link layer which avoids copying on the sender's side. To validate our extensions we implemented stream sockets on top of VMMC-2 running on a Myrinet network of Pentium PCs. This zero-copy sockets implementation provides a maximum bandwidth of over 84 Mbytes/s and a one-way latency of 20 µs.
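The idea of a user-managed TLB that bounds the amount of pinned space can be sketched as a small translation cache. The following Python sketch is an illustration only, not the VMMC-2 implementation: the class `UserManagedTLB`, the 4096-byte page size, the LRU eviction policy, and the `_pin` stand-in for the driver call are all assumptions, since the abstract does not specify them.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, not taken from the abstract


class UserManagedTLB:
    """Hypothetical user-level translation cache with a pinned-page budget."""

    def __init__(self, max_pinned_pages):
        if max_pinned_pages < 1:
            raise ValueError("need room for at least one pinned page")
        self.max_pinned_pages = max_pinned_pages
        self._entries = OrderedDict()  # virtual page number -> physical frame
        self._next_frame = 0
        self.misses = 0

    def _pin(self, vpn):
        # Stand-in for a driver request that pins the page and returns
        # its physical frame number; here frames are simply counted up.
        frame = self._next_frame
        self._next_frame += 1
        return frame

    def translate(self, vaddr):
        """Translate a virtual address, pinning its page on a miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self._entries:
            self._entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            if len(self._entries) >= self.max_pinned_pages:
                # Assumed policy: unpin the least recently used page.
                self._entries.popitem(last=False)
            self._entries[vpn] = self._pin(vpn)
        return self._entries[vpn] * PAGE_SIZE + offset
```

Only a miss needs the driver in this sketch; repeated transfers from the same buffers are translated entirely in user space, and `max_pinned_pages` lets the library cap the pinned footprint.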
SOVIA: A User-level Sockets Layer over Virtual Interface Architecture
- In Cluster Computing
, 2001
Cited by 45 (0 self)
The Virtual Interface Architecture (VIA) is an industry standard user-level communication architecture for system area networks. The VIA provides a protected, directly accessible interface to network hardware, removing the operating system from the critical communication path. In this paper, we design and implement a user-level Sockets layer over VIA, named SOVIA (Sockets Over VIA). Our objective is to use the SOVIA layer to accelerate existing Sockets-based applications with reasonable effort and to provide application developers with a portable, high-performance communication library based on VIA. SOVIA realizes performance comparable to native VIA, with a minimum latency of 10.5 µsec and a peak bandwidth of 814 Mbps on Giganet's cLAN. We have verified functional compatibility with the existing Sockets API by porting FTP (File Transfer Protocol) and RPC (Remote Procedure Call) applications over the SOVIA layer. Compared to Giganet's LANE driver, which emulates TCP/IP inside the kernel, SOVIA easily doubles the file transfer bandwidth in FTP and reduces the latency of calling an empty remote procedure by 77% in RPC applications.
User-Space Communication: A Quantitative Study
, 1998
Cited by 31 (5 self)
Powerful commodity systems and networks offer a promising direction for high performance computing because they are inexpensive and they closely track technology progress. However, high raw-hardware performance is rarely delivered to the end user. Previous work has shown that the bottleneck in these architectures is the overhead imposed by the software communication layer. To reduce this overhead, researchers have proposed a number of user-space communication models. The common feature of these models is that applications have direct access to the network, bypassing the operating system in the common case and thus avoiding the cost of send/receive system calls. In this paper we examine five user-space communication layers that represent different points in the configuration space: Generic AM, BIP-0.92, FM-2.02, PM-1.2, and VMMC-2. Although these systems support different communication paradigms and employ a variety of different implementation tradeoffs, we are able to quantitatively...
Fine-Grain Distributed Shared Memory on Clusters of Workstations
, 1997
Cited by 30 (8 self)
Shared memory, one of the most popular models for programming parallel platforms, is becoming ubiquitous both in low-end workstations and high-end servers. With the advent of low-latency networking hardware, clusters of workstations strive to offer the same processing power as high-end servers for a fraction of the cost. In such environments, shared memory has been limited to page-based systems that control access to shared memory using the memory's page protection to implement shared memory coherence protocols. Unfortunately, false sharing and fragmentation problems force such systems to resort to weak consistency shared memory models that complicate the shared memory programming model.
Realizing the Performance Potential of the Virtual Interface Architecture
, 1999
Cited by 30 (1 self)
The Virtual Interface (VI) Architecture provides protected user-level communication with high delivered bandwidth and low per-message latency, particularly for small messages. The VI Architecture attempts to reduce latency by eliminating user/kernel transitions on routine data transfers and by allowing direct use of user memory for network buffering. This results in significantly lower latencies than those achieved by network protocols such as TCP/IP and UDP. In this paper we examine the low-level performance of two VI implementations, one implemented in hardware, the other implemented in device driver software. Using a set of low-level benchmarks, we measure bandwidth, latency, and processor utilization as a function of message size for the GigaNet cLAN and Tandem ServerNet VI implementations. We report that both VI implementations offer a significant performance advantage relative to the corresponding UDP implementation on the same hardware. We also investigate the problems associated wi...
Experiences in Design and Implementation of a High Performance Transport
- In SC
, 2004
Cited by 24 (6 self)
This paper describes our experiences in the development of the UDP-based Data Transport (UDT) protocol, an application level transport protocol used in distributed data intensive applications. The new protocol is motivated by the emergence of wide area high-speed optical networks, in which TCP is often found to fail to utilize the abundant bandwidth. UDT demonstrates good efficiency and fairness (including RTT fairness and TCP friendliness) characteristics in high performance computing applications where a small number of bulk sources share the abundant bandwidth. It combines both rate and window control and uses bandwidth estimation to determine the control parameters automatically. This paper presents the rationale behind UDT: how UDT integrates these schemes to support high performance data transfer, why these schemes are used, and what the main issues are in the design and implementation of this high performance transport protocol.
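As a rough illustration of what combining rate and window control means for a sender, the sketch below gates each packet on two conditions at once. It is not UDT's algorithm: the class `PacedWindowSender` and its fixed `window` and `interval` parameters are assumptions for illustration, whereas the abstract says UDT determines its control parameters automatically through bandwidth estimation.

```python
class PacedWindowSender:
    """Hypothetical sender gated by both a window and a sending rate."""

    def __init__(self, window, interval):
        self.window = window        # max unacknowledged packets (window control)
        self.interval = interval    # min seconds between sends (rate control)
        self.in_flight = 0
        self.next_send_time = 0.0

    def try_send(self, now):
        """Return True if a packet may be sent at time `now` (in seconds)."""
        if self.in_flight >= self.window:
            return False            # blocked by window control
        if now < self.next_send_time:
            return False            # blocked by rate control (pacing)
        self.in_flight += 1
        self.next_send_time = now + self.interval
        return True

    def on_ack(self, count=1):
        """Record that `count` packets were acknowledged."""
        self.in_flight = max(0, self.in_flight - count)
```

In this sketch the interval spaces packets out to limit bursts, while the window bounds how much unacknowledged data can be outstanding; a real protocol would adjust both values over time.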
Transformations to parallel codes for communication-computation overlap
- In Supercomputing 2005
, 2005
Cited by 19 (3 self)
This paper presents program transformations directed toward improving communication-computation overlap in parallel programs that use MPI’s collective operations. Our transformations target a wide variety of applications focusing on scientific codes with computation loops that exhibit limited dependence among iterations. We include guidance for developers for transforming an application code in order to exploit the communication-computation overlap available in the underlying cluster, as well as a discussion of the performance improvements achieved by our transformations. We present results from a detailed study of the effect of the problem and message size, level of communication-computation overlap, and amount of communication aggregation on runtime performance in a cluster environment based on an RDMA-enabled network. The targets of our study are two scientific codes written by domain scientists, but the applicability of our work extends far beyond the scope of these two applications.
Address Translation Mechanisms in Network Interfaces
, 1998
Cited by 14 (1 self)
Good network hardware performance is often squandered by overheads for accessing the network interface (NI) within a host. NIs that support user-level messaging avoid frequent operating system (OS) action, yet unnecessary copying can still result in low performance. We explore improving application messaging performance by eliminating all unnecessary copies (minimal messaging). For minimal messaging, NIs must support address translation and must do so more richly than has been done in the past. NI address translation should flexibly support higher-level abstractions, map all user space, exploit translation locality, and degrade gracefully when locality is poor. We classify NI address translation implementations based on where the lookup and the miss handling are performed (CPU or NI). We present alternative designs and we consider how they interact with the OS. We provide simulation results that evaluate the alternative design points and we demonstrate feasibility with a real implement...

