Architecture and implementation of memory channel 2
Digital Technical Journal, 1997
Cited by 18 (0 self)
The MEMORY CHANNEL network is a dedicated cluster interconnect that provides virtual shared memory among nodes by means of internodal address space mapping. The interconnect implements direct user-level messaging and guarantees strict message ordering under all conditions, including transmission errors. These characteristics allow industry-standard communication interfaces and parallel programming paradigms to achieve much higher efficiency than on conventional networks. This paper presents an overview of the MEMORY CHANNEL network architecture and describes DIGITAL's crossbar-based implementation of the second-generation MEMORY CHANNEL network, MEMORY CHANNEL 2. This network provides bisection bandwidths of 1,000 to 2,000 megabytes per second and a sustained process-to-process bandwidth of 88 megabytes per second. One-way, process-to-process message latency is less than 2.2 microseconds.
High-Performance Cluster Computing Using SCI
1997
Cited by 18 (2 self)
The Scalable Coherent Interface (SCI) is a recent communication standard for cluster interconnects. We study the use of SCI in a high-performance parallel computing setting, using a cluster of UltraSparcs connected via Dolphin SCI SBus-2 adapters. We chose SCI as network fabric since it offers very low latencies and high bandwidth. In this paper, we study how to map a variety of programming models efficiently onto the SCI hardware. We focus on message passing and global address space support, implementing Active Messages and Split-C. We present implementation trade-offs and present performance measurements. We found that the user-level load/store programming interface of SCI is very convenient to use, achieves low latencies, and is fully virtualized, simultaneously supporting multiple parallel programs and communication channels. On the other hand, neither of the programming models studied maps directly to SCI. Issues such as notification, atomic operations, and virtual address space l...
Performance of Low-Cost UltraSparc Multiprocessors Connected by SCI
In Proceedings of Communication Networks and Distributed Systems Modeling and Simulation, 1996
Cited by 17 (5 self)
In bringing high-end performance to low-cost workstation hardware, the level of efficiency in the interconnection of many individual nodes will be a key factor for the overall system performance. In this paper, basic performance characteristics such as throughput and latency are measured on state-of-the-art workstations connected by SCI (Scalable Coherent Interface) based I/O adapters. The flexibility of the SCI interconnect gives options for customizable internal bandwidths and systems with up to 64K nodes. Thus point-to-point throughput is limited by different parts of the interface towards each node, and not by the interconnect itself. Point-to-point performance is investigated and compared to similar measurements done for other interconnects. Even though options for improvement are pointed out, results demonstrate clearly that although this interconnect has latency characteristics inferior to common SMPs, this solution is currently providing one of the best I/O adapter based solutions and ...
Cluster communication using a PCI to SCI interface
In IASTED Eighth International Conference on Parallel and Distributed Computing and Systems, 1996
Cited by 9 (1 self)
Communication latency and throughput are important parameters when building a cluster of interconnected computers. We investigate the mechanisms that ensure low latency and high throughput in a new cluster adapter based on the Scalable Coherent Interface (SCI) standard, and compare it to a more traditional communication technology (Fast Ethernet).
Keywords: Clustering, Shared memory, Performance evaluation, Message passing
1 Introduction
Computer clusters are an attractive alternative to large SMP systems due to lower cost and improved reliability [1, 2]. We discuss the performance of a cluster interconnect based on the "Scalable Coherent Interface" (SCI) standard [3]. A bridge has been developed [4] that connects SCI to the industry standard PCI bus. This PCI-SCI bridge offers low latency message passing through non-cached shared memory. Remote memory read, write, lock, interrupt and DMA operations are available. The performance of DMA operations will not be investigated here. Instea...
Synchronization Support in I/O Adapter Based SCI Clusters
In Proceedings of Workshop on Communication and Architectural Support for Network-based Parallel Computing, CANPC'97, 1997
Cited by 7 (3 self)
This paper examines synchronization support of two generations of SCI adapters from Dolphin Interconnect Solutions and compares the functionality to similar support on Digital's Memory Channel. Memory Channel enforces sequential consistency across the interconnect, while SCI allows store reordering. This gives SCI a potential performance payoff by allowing more flexible pipelining of data through the interconnect. The lower number of ordering constraints also reduces hardware complexity, but moves the complexity to software. For a straightforward implementation of message passing this overhead is significant. A new software algorithm, the valid flag algorithm, is introduced to improve this situation. A new hardware lock support mechanism is proposed to facilitate efficient locks in the absence of lock support on the I/O bus. Performance of the simple message passing protocol is compared to the suggested valid flag protocol. The valid flag protocol reduces latency of a small message by 50...
SCI Clustering through the I/O bus: A Performance and Functionality Analysis
1998
Cited by 5 (1 self)
A number of high-bandwidth, low-latency interconnect technologies based on I/O adapters have recently emerged. SCI (Scalable Coherent Interface) is the underlying protocol employed in one of these technologies. This dissertation studies aspects of several generations of SCI-based clusters of workstations with respect to performance and functionality, and analyses how the hardware support affects performance, what the limits are and how performance can be increased. Much of the performance work is accomplished through microbenchmarks specifically tuned for the particular hardware in order to demonstrate optimal performance of the hardware. Hardware functionality is explored and investigated to reveal weaknesses of the semantics as seen from software and thus provide understanding of how the situation could be improved in the next generation. Apart from the detailed understanding of four generations of SCI adapters, the thesis as a whole provides the historical line of the development...
Performance of a Cluster of PCI Based UltraSPARC Workstations Interconnected with SCI
In Proceedings of Network-Based Parallel Computing, Communication, Architecture, and Applications, CANPC'98, Las Vegas, Nevada, Jan/Feb 1998, Lecture Notes in Computer Science no. 1362, 1998
Cited by 5 (0 self)
SCI is based on unidirectional point-to-point links forming ringlets that can be connected with switches to allow further scaling. This paper presents performance results from running on a number of differently configured SCI clusters. The SCI technology used is Dolphin's second generation PCI/SCI adapter based on the LC-2 LinkController chip as well as a new 4 port, LC-2 based switch. Nodes are UltraSparcs running Solaris 2.5.1 as the operating system. Results show latencies down to 2.9µs for remote stores, and bandwidths up to 80 Mbytes/s into a single system. Network throughput of more than 270 Mbytes/s (8 node system) is demonstrated. Results indicate that the new LC-2 eliminates a number of problems with the earlier LC-1 chip in addition to increasing peak performance. With its flexible building blocks this technology should also make it possible to construct systems with a large number of nodes, pushing I/O adapter based SCI interconnects forward as a promising system area netw...
Eliminating the protocol stack for socket based communication in shared memory interconnects
In Proceedings of International Workshop on Personal Computer based Networks Of Workstations (at IPPS'98), 1998
Cited by 3 (0 self)
We show how the traditional protocol stack, such as TCP/IP, can be eliminated for socket-based high-speed communication within a cluster. The SCI shared memory interconnect is used as an example, and we demonstrate how existing applications can utilize the new technology without relinking. This is done by dynamically remapping the TCP/IP socket implementation to our high performance SCILAN sockets. We describe a novel mechanism for synchronization of communication through shared memory, aimed at minimizing the interrupt load on the receiving system. We discuss the implementation and present an evaluation with comparison to alternative technologies, such as 100baseT and ATM. Significant improvements over current solutions are shown in terms of both throughput and latency.
SCI for Local Area Networks
1998
Cited by 3 (1 self)
We show how the traditional protocol stack, such as TCP/IP, can be eliminated for socket-based high-speed communication within a cluster. The SCI shared memory interconnect is used as an example, and we demonstrate how existing applications can utilize the new technology without relinking. This is done by dynamically remapping the TCP/IP socket implementation to our high performance SCILAN sockets. We describe a novel mechanism for synchronization of communication through shared memory, aimed at minimizing the interrupt load on the receiving system. We discuss the implementation and present an evaluation with comparison to alternative technologies, such as 100baseT and ATM. Significant improvements over current solutions are shown in terms of both throughput and latency.
1 Introduction
Recent network technologies give us the ability to transfer multi-gigabytes per second over interconnects with low latency at the physical layer. When fast network technologies such as MemoryChannel [GK9...
Evaluating the Trade-Offs in the Parallelization of Probabilistic Search Algorithms
1997
Cited by 2 (1 self)
In this work, we propose a speculative parallelization strategy for Probabilistic Search algorithms. We design a parallel version of one such algorithm for a real-time computer vision problem with many practical applications. The implementation is performed on a cluster of eight DEC AlphaServer 2100 4/233 machines connected via a DEC Memory Channel network. Four types of run-time systems are tested: Hardware-coherent Shared Memory (HSM), Distributed Shared Memory (DSM) with software coherence (Cashmere-2L), Reflective Shared Memory (RSM), and message passing (Digital PVM). A run-time system that is normally not the most efficient among these four (RSM) is found to be the best choice for our combination of algorithm and architecture. This shows that algorithmic parallelization and the selection of an implementation environment mutually interfere with each other and must be performed in a synergistic manner.

