Using Network Interface Support to Avoid Asynchronous Protocol Processing in Shared Virtual Memory Systems
In Proceedings of the 26th International Symposium on Computer Architecture, 1999
Cited by 40 (7 self)
The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardware-coherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechan...
Experiences with VI Communication for Database Storage
In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002
Cited by 33 (9 self)
This paper examines how VI-based interconnects can be used to improve I/O path performance between a database server and the storage subsystem. We design and implement a software layer, DSA, that is layered between the application and VI. DSA takes advantage of specific VI features and deals with many of its shortcomings. We provide and evaluate one kernel-level and two user-level implementations of DSA. These implementations trade transparency and generality for performance to different degrees and, unlike research prototypes, are designed to be suitable for real-world deployment. We present detailed measurements using a commercial database management system with both micro-benchmarks and industrial database workloads on a mid-size, 4-CPU database server and a large, 32-CPU database server. Our results show that VI-based interconnects and user-level communication can improve all aspects of the I/O path between the database system and the storage back-end. We also find that to make effective use of VI in I/O-intensive environments we need to provide substantially more functionality than VI currently provides. Finally, new storage APIs that help minimize kernel involvement in the I/O path are needed to fully exploit the benefits of VI-based communication.
Fine-Grain Distributed Shared Memory on Clusters of Workstations
1997
Cited by 30 (8 self)
Shared memory, one of the most popular models for programming parallel platforms, is becoming ubiquitous both in low-end workstations and high-end servers. With the advent of low-latency networking hardware, clusters of workstations strive to offer the same processing power as high-end servers for a fraction of the cost. In such environments, shared memory has been limited to page-based systems that control access to shared memory using the memory's page protection to implement shared memory coherence protocols. Unfortunately, false sharing and fragmentation problems force such systems to resort to weak consistency shared memory models that complicate the shared memory programming model.
High Performance VMM-Bypass I/O in Virtual Machines
2006
Cited by 30 (1 self)
Currently, I/O device virtualization models in virtual machine (VM) environments require involvement of a virtual machine monitor (VMM) and/or a privileged VM for each I/O operation, which may turn out to be a performance bottleneck for systems with high I/O demands, especially those equipped with modern high-speed interconnects such as InfiniBand. In this paper, we propose a new device virtualization model called VMM-bypass I/O, which extends the idea of OS-bypass that originated in user-level communication. Essentially, VMM-bypass allows time-critical I/O operations to be carried out directly in guest VMs without involvement of the VMM and/or a privileged VM. By exploiting the intelligence found in modern high-speed network interfaces, VMM-bypass can significantly improve I/O and communication performance for VMs without sacrificing safety or isolation. To demonstrate the idea of VMM-bypass, we have developed a prototype called Xen-IB, which offers InfiniBand virtualization support in the Xen 3.0 VM environment. Xen-IB runs with current InfiniBand hardware and does not require modifications to existing user-level applications or kernel-level drivers that use InfiniBand. Our performance measurements show that Xen-IB is able to achieve nearly the same raw performance as the original InfiniBand driver running in a non-virtualized environment.
User-Level Communication in Cluster-Based Servers
In Proceedings of the 8th IEEE International Symposium on High-Performance Computer Architecture (HPCA 8), 2002
Cited by 29 (11 self)
Clusters of commodity computers are currently being used to provide the scalability required by several popular Internet services. In this paper we evaluate an efficient cluster-based WWW server as a function of the characteristics of the intra-cluster communication architecture. More specifically, we evaluate the impact of processor overhead, network bandwidth, remote memory writes, and zero-copy data transfers on the performance of our server. Our experimental results with an 8-node cluster and four real WWW traces show that network bandwidth affects the performance of our server by only 6%. In contrast, user-level communication can improve performance by as much as 29%. Low processor overhead, remote memory writes, and zero-copy all make small contributions towards this overall gain. To be able to extrapolate from our experimental results, we use an analytical model to assess the performance of our server under different workload characteristics, different numbers of cluster nodes, and higher performance systems. Our modeling results show that higher gains (of up to 55%) can be accrued for workloads with large working sets and next-generation servers running on large clusters.
ETA: Experience with an Intel Xeon Processor as a Packet Processing Engine
IEEE Micro, 2004
Cited by 28 (3 self)
Server-based networks have well-documented performance limitations [1,2]. These limitations outline a major goal of Intel’s Embedded Transport Acceleration (ETA) project: the ability to deliver high-performance server communication and I/O over standard Ethernet and Transmission Control Protocol/Internet Protocol (TCP/IP) networks. By developing this capability, Intel hopes to take advantage of the large knowledge base and ubiquity of these standard technologies. With the advent of 10 gigabit Ethernet, these standards promise to provide the bandwidth required by the most demanding server applications. In addition, a substantial
Collective Operations in an Application-level Fault Tolerant MPI System
In International Conference on Supercomputing (ICS), 2003
Cited by 25 (11 self)
The running times of many computational science programs are now significantly greater than the mean-time-between-failures (MTBF) of the hardware they run on. Therefore, fault-tolerance is becoming a critical issue on high-performance platforms.
Concurrent Direct Network Access for Virtual Machine Monitors
HPCA 2007
Cited by 22 (3 self)
This paper presents hardware and software mechanisms to enable concurrent direct network access (CDNA) by operating systems running within a virtual machine monitor. In a conventional virtual machine monitor, each operating system running within a virtual machine must access the network through a software-virtualized network interface. These virtual network interfaces are multiplexed in software onto a physical network interface, incurring significant performance overheads. The CDNA architecture improves networking efficiency and performance by dividing the tasks of traffic multiplexing, interrupt delivery, and memory protection between hardware and software in a novel way. The virtual machine monitor delivers interrupts and provides protection between virtual machines, while the network interface performs multiplexing of the network data. In effect, the CDNA architecture provides the abstraction that each virtual machine is connected directly to its own network interface. Through the use of CDNA, many of the bottlenecks imposed by software multiplexing can be eliminated without sacrificing protection, producing substantial efficiency improvements.
The design for a high performance MPI implementation on the Myrinet network
1999
Cited by 21 (3 self)
We present our MPI-BIP implementation, designed for Myrinet networks and based on MPICH. By using our Basic Interface for Parallelism (BIP) software layer, we obtain in this implementation of the MPI protocols results close to the peak hardware performance of the high-speed Myrinet network. We present the protocols we used to implement the MPI semantics, and the overall design of the implementation. We then present benchmarks and application results to show that this design leads to parallel multicomputer-like throughput and latency on a cluster of PC workstations.
1 Introduction
In the last decade, researchers have tried to use COWs (Clusters Of Workstations) as parallel computers. These clusters are typically connected by Ethernet networks and are often programmed with communication libraries like PVM (Parallel Virtual Machine [6]), or MPI over IP (Internet Protocol). There are two bottlenecks in these solutions that can restrict application programmers to coarse grain paral...
Transformations to parallel codes for communication-computation overlap
In Supercomputing 2005
Cited by 19 (3 self)
This paper presents program transformations directed toward improving communication-computation overlap in parallel programs that use MPI’s collective operations. Our transformations target a wide variety of applications, focusing on scientific codes with computation loops that exhibit limited dependence among iterations. We include guidance for developers for transforming an application code in order to exploit the communication-computation overlap available in the underlying cluster, as well as a discussion of the performance improvements achieved by our transformations. We present results from a detailed study of the effect of the problem and message size, level of communication-computation overlap, and amount of communication aggregation on runtime performance in a cluster environment based on an RDMA-enabled network. The targets of our study are two scientific codes written by domain scientists, but the applicability of our work extends far beyond the scope of these two applications.
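As a rough illustration of the kind of overlap this abstract describes, the sketch below restructures a loop so that the communication for one block is in flight while the next block is being computed. It is not the paper's transformation and does not use MPI: `compute`, `communicate`, `blocking_version`, and `overlapped_version` are hypothetical names, and a background thread stands in for a non-blocking collective. It assumes the blocks have no dependence on each other.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def compute(chunk):
    """Stand-in for one block of loop iterations (assumed independent of other blocks)."""
    return [x * x for x in chunk]


def communicate(block):
    """Stand-in for a collective operation; here it only sums the block."""
    time.sleep(0.01)  # pretend network latency
    return sum(block)


def blocking_version(chunks):
    # Original shape: compute a block, then wait for its communication to finish.
    return [communicate(compute(c)) for c in chunks]


def overlapped_version(chunks):
    # Transformed shape: while block i is being communicated in the background,
    # block i+1 is computed on the main thread.
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for c in chunks:
            block = compute(c)  # overlaps with the previous block's communication
            if pending is not None:
                results.append(pending.result())  # complete the previous communication
            pending = pool.submit(communicate, block)
        if pending is not None:
            results.append(pending.result())
    return results
```

Both versions return the same values in the same order; only the point at which the program waits for communication changes.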

