Efficiency vs. Portability in Cluster-Based Network Servers
Cited by 47 (21 self)
Efficiency and portability are usually conflicting objectives for cluster-based network servers that distribute the clients' requests across the cluster based on the actual content requested. Our work is based on the observation that this efficiency vs. portability tradeoff has not been discussed before in the literature. To fill this gap, in this paper we study this tradeoff in the context of an interesting class of content-based network servers, the locality-conscious servers, using modeling and experimentation. Our analytical model gauges the potential performance benefits of portable and non-portable locality-conscious request distribution with respect to a traditional, locality-oblivious server, as a function of multiple parameters. Based on our experience with the model, we design and evaluate a portable, locality-conscious server. Experiments with our server, a non-portable server, and a traditional server validate our modeling results under several real workloads. Based on our modeling and experimental results, our main conclusion is that portability should be promoted in cluster-based network servers with low processor overhead communication, given its relatively low cost (15%) in terms of efficiency. For clusters with high processor overhead communication, efficiency should be the overriding concern, as the cost of portability can be very high (as high as 98% on 32 nodes). We also conclude that user-level communication can be useful even for non-scientific applications such as network servers.
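To make the contrast between locality-oblivious and locality-conscious distribution concrete, here is a minimal sketch. It is not the server described in the abstract: the hash-based partitioning, the function names, and the synthetic request trace are all assumptions chosen for illustration. It only shows why routing by requested content shrinks the set of documents each node has to keep cached.

```python
import zlib
from itertools import cycle


def oblivious_dispatch(requests, num_nodes):
    """Locality-oblivious: round-robin, ignoring which content is requested."""
    nodes = cycle(range(num_nodes))
    return [(next(nodes), url) for url in requests]


def locality_conscious_dispatch(requests, num_nodes):
    """Locality-conscious (assumed policy): a given URL always maps to one node."""
    return [(zlib.crc32(url.encode("utf-8")) % num_nodes, url) for url in requests]


def working_set_sizes(assignment, num_nodes):
    """Count the distinct URLs each node is asked to serve."""
    seen = [set() for _ in range(num_nodes)]
    for node, url in assignment:
        seen[node].add(url)
    return [len(urls) for urls in seen]


# Hypothetical trace: 56 requests cycling over 7 documents, on a 4-node cluster.
requests = [f"/doc{i % 7}.html" for i in range(56)]
oblivious = working_set_sizes(oblivious_dispatch(requests, 4), 4)
conscious = working_set_sizes(locality_conscious_dispatch(requests, 4), 4)
print("oblivious per-node working sets:", oblivious)
print("conscious per-node working sets:", conscious)
```

With round-robin every node ends up serving every document, whereas content-based routing splits the documents among the nodes, so the per-node working sets sum to the number of distinct documents. A real server would also have to balance load, which this sketch ignores.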
User-Level Communication in Cluster-Based Servers
In Proceedings of the 8th IEEE International Symposium on High-Performance Computer Architecture (HPCA 8), 2002
Cited by 29 (11 self)
Clusters of commodity computers are currently being used to provide the scalability required by several popular Internet services. In this paper we evaluate an efficient cluster-based WWW server as a function of the characteristics of the intra-cluster communication architecture. More specifically, we evaluate the impact of processor overhead, network bandwidth, remote memory writes, and zero-copy data transfers on the performance of our server. Our experimental results with an 8-node cluster and four real WWW traces show that network bandwidth affects the performance of our server by only 6%. In contrast, user-level communication can improve performance by as much as 29%. Low processor overhead, remote memory writes, and zero-copy all make small contributions towards this overall gain. To be able to extrapolate from our experimental results, we use an analytical model to assess the performance of our server under different workload characteristics, different numbers of cluster nodes, and higher performance systems. Our modeling results show that higher gains (of up to 55%) can be accrued for workloads with large working sets and next-generation servers running on large clusters.
In-network cache coherence
In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006
Cited by 23 (4 self)
We propose implementing cache coherence protocols within the network, demonstrating how an in-network implementation of the MSI directory-based protocol allows for in-transit optimizations of read and write delay. Our results show 15% and 24% savings on average in memory access latency for SPLASH-2 parallel benchmarks running on a 4x4 and a 16x16 multiprocessor, respectively.
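For readers unfamiliar with the MSI directory-based protocol named above, the sketch below models the directory entry for a single cache line in the textbook form of the protocol. It is an assumption-laden illustration, not the in-network implementation the paper proposes: the `Directory` class and its method names are hypothetical, and write-backs, message latencies, and in-transit optimizations are left out.

```python
class Directory:
    """Directory entry for one cache line in a textbook MSI protocol."""

    def __init__(self):
        self.state = "I"      # I (invalid), S (shared), or M (modified)
        self.holders = set()  # IDs of caches that currently hold a copy

    def read(self, cache_id):
        """Handle a read; return the caches that must downgrade from M to S."""
        if self.state == "M":
            if cache_id in self.holders:
                return set()  # the owner reads its own modified copy
            downgraded = set(self.holders)  # owner supplies data, keeps an S copy
        else:
            downgraded = set()
        self.state = "S"
        self.holders.add(cache_id)
        return downgraded

    def write(self, cache_id):
        """Handle a write; return the caches whose copies must be invalidated."""
        invalidated = self.holders - {cache_id}
        self.state = "M"
        self.holders = {cache_id}
        return invalidated


line = Directory()
line.read(0)                 # cache 0 obtains a shared copy
line.read(1)                 # cache 1 obtains a shared copy
invalidated = line.write(2)  # cache 2 writes; caches 0 and 1 are invalidated
downgraded = line.read(0)    # cache 0 reads again; cache 2 downgrades to shared
```

Each `read` or `write` here stands for a request that would normally travel to the directory and trigger further messages to the affected caches.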
Multi-level Shared State for Distributed Systems
In Proc. of the 2002 Intl. Conf. on Parallel Processing, 2002
Cited by 12 (8 self)
As a result of advances in processor and network speeds, more and more applications can productively be spread across geographically distributed machines. In this paper we present a transparent system for memory sharing, InterWeave, developed with such applications in mind. InterWeave can accommodate hardware coherence and consistency within multiprocessors (level-1 sharing), software distributed shared memory (S-DSM) within tightly coupled clusters (level-2 sharing), and version-based coherence and consistency across the Internet (level-3 sharing). InterWeave allows processes written in multiple languages, running on heterogeneous machines, to share arbitrary typed data structures as if they resided in local memory. Application-specific knowledge of minimal coherence requirements is used to minimize communication. Consistency information is maintained in a manner that allows scaling to large amounts of shared data. In C, operations on shared data, including pointers, take precisely the same form as operations on non-shared data. We demonstrate the ease of use and efficiency of the system through an evaluation of several applications. In particular, we demonstrate that InterWeave's support for sharing at higher (more distributed) levels does not reduce the performance of sharing at lower (more tightly coupled) levels.
Cables: Thread control and memory management extensions for shared virtual memory clusters
In Proc. of the 8th IEEE Symposium on High-Performance Computer Architecture (HPCA8), 2002
Cited by 7 (3 self)
Clusters of high-end workstations and PCs are currently used in many application domains to perform large-scale computations or as scalable servers for I/O-bound tasks. Although clusters have many advantages, their applicability in emerging application areas has been limited. One of the main reasons is that clusters do not provide a single system image and thus are hard to program. In this work we address this problem by providing a single cluster image with respect to thread and memory management. We implement our system, CableS (Cluster enabled threadS), on a 32-processor cluster interconnected with a low-latency, high-bandwidth system area network and conduct an early exploration of the costs involved in providing the extra functionality. We demonstrate the versatility of CableS with a wide range of applications and show that clusters can be used to support applications that have been written for more expensive tightly coupled systems, with very little effort on the programmer side: (a) We run legacy pthreads applications without any major modifications. (b) We use a public domain OpenMP compiler (OdinMP [8]) to translate OpenMP programs to pthreads and execute them on our system, with no or few modifications to the translated pthreads source code. (c) We provide an implementation of the M4 macros for our pthreads system and run the SPLASH-2 applications. We also show that the overhead introduced by the extra functionality of CableS affects the parallel section of applications that have been tuned for the shared memory abstraction only in cases where the data placement is affected by operating system (Windows NT) limitations in the granularity of virtual memory mappings.
Multithreaded Home-Based Lazy Release Consistency over VIA
In Proc. 19th IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS’04), 2004
The Implementation of Cashmere
Cited by 1 (1 self)
Cashmere is a software distributed shared memory (SDSM) system designed for today’s high performance cluster architectures. These clusters typically consist of symmetric multiprocessors (SMPs) connected by a low-latency system area network. Cashmere introduces several novel techniques for delegating intra-node sharing to the hardware coherence mechanism available within the SMPs, and also for leveraging advanced network features such as remote memory access. The efficacy of the Cashmere design has been borne out through head-to-head comparisons with other well-known, mature SDSMs and with Cashmere variants that do not take advantage of the various hardware features. In this paper, we describe the implementation of the Cashmere SDSM. Our discussion is organized around the core components that comprise Cashmere. We discuss both component interactions and low-level implementation details. We hope this paper provides researchers with the background needed to
Implementing TreadMarks over GM on Myrinet: Challenges, Design
Software-based DSM systems like TreadMarks have traditionally not performed well compared to message passing applications because of the high communication overhead associated with traditional stack-based protocols like UDP. Modern interconnects like Myrinet offer reliable message delivery with very low communication overhead through user-level protocols. This paper examines the viability of implementing a thin communication substrate between TreadMarks and Myrinet GM, the rationale being that a layer tuned to the needs of the application would offer better performance and scalability than a generic UDP layer. Trade-offs for various design alternatives for buffer management, connection setup, advance posting of descriptors, and asynchronous messages are discussed. We have implemented the best of these strategies in a layer that is bound to TreadMarks at compile time. Results from micro-benchmarks and applications show that the specialized implementation not only performs better but also exhibits better parallel speedup and scalability. A reduction in total application execution time of up to a factor of 6.3 for a 16-node system is demonstrated in comparison with the original implementation. The implementation also exhibits superior scaling properties as the application size is increased.

