Results 1  10
of
181
A Scalable, Commodity Data Center Network Architecture
, 2008
"... Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfo ..."
Abstract

Cited by 205 (13 self)
 Add to MetaCart
Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highestend IP switches/routers, resulting topologies may only support 50 % of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Nonuniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today’s higherend solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
Programming Parallel Algorithms
, 1996
"... In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a th ..."
Abstract

Cited by 193 (9 self)
 Add to MetaCart
In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding ofparallelism but in several cases has led to improvements in sequential algorithms. Unf:ortunately there has been less success in developing good languages f:or prograftlftling parallel algorithftls, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
BCube: A High Performance, Servercentric Network Architecture for Modular Data Centers
 In SIGCOMM
, 2009
"... This paper presents BCube, a new network architecture specifically designed for shippingcontainer based, modular data centers. At the core of the BCube architecture is its servercentric network structure, where servers with multiple network ports connect to multiple layers of COTS (commodity offt ..."
Abstract

Cited by 121 (15 self)
 Add to MetaCart
This paper presents BCube, a new network architecture specifically designed for shippingcontainer based, modular data centers. At the core of the BCube architecture is its servercentric network structure, where servers with multiple network ports connect to multiple layers of COTS (commodity offtheshelf) miniswitches. Servers act as not only end hosts, but also relay nodes for each other. BCube supports various bandwidthintensive applications by speedingup onetoone, onetoseveral, and onetoall traffic patterns, and by providing high network capacity for alltoall traffic. BCube exhibits graceful performance degradation as the server and/or switch failure rate increases. This property is of special importance for shippingcontainer data centers, since once the container is sealed and operational, it becomes very difficult to repair or replace its components. Our implementation experiences show that BCube can be seamlessly integrated with the TCP/IP protocol stack and BCube packet forwarding can be efficiently implemented in both hardware and software. Experiments in our testbed demonstrate that BCube is fault tolerant and load balancing and it significantly accelerates representative bandwidthintensive applications.
Randomized routing and sorting on fixedconnection networks
 Journal of Algorithms
, 1994
"... This paper presents a general paradigm for the design of packet routing algorithms for fixedconnection networks. Its basis is a randomized online algorithm for scheduling any set of N packets whose paths have congestion c on any boundeddegree leveled network with depth L in O(c + L + log N) steps ..."
Abstract

Cited by 88 (13 self)
 Add to MetaCart
This paper presents a general paradigm for the design of packet routing algorithms for fixedconnection networks. Its basis is a randomized online algorithm for scheduling any set of N packets whose paths have congestion c on any boundeddegree leveled network with depth L in O(c + L + log N) steps, using constantsize queues. In this paradigm, the design of a routing algorithm is broken into three parts: (1) showing that the underlying network can emulate a leveled network, (2) designing a path selection strategy for the leveled network, and (3) applying the scheduling algorithm. This strategy yields randomized algorithms for routing and sorting in time proportional to the diameter for meshes, butterflies, shuffleexchange graphs, multidimensional arrays, and hypercubes. It also leads to the construction of an areauniversal network: an Nnode network with area Θ(N) that can simulate any other network of area O(N) with slowdown O(log N).
Special Purpose Parallel Computing
 Lectures on Parallel Computation
, 1993
"... A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing ..."
Abstract

Cited by 77 (5 self)
 Add to MetaCart
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose storedprogram sequential computer which captured the fundamental principles of...
The Network Architecture of the Connection Machine CM5
 Journal of Parallel and Distributed Computing
, 1992
"... The Connection Machine Model CM5 Supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10 12 floatingpoint operations per second). The CM5 obtains its high performance while offering ease of programming, flexibility, and reliability. Th ..."
Abstract

Cited by 77 (2 self)
 Add to MetaCart
The Connection Machine Model CM5 Supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10 12 floatingpoint operations per second). The CM5 obtains its high performance while offering ease of programming, flexibility, and reliability. The machine contains three communication networks: a data network, a control network, and a diagnostic network. This paper describes the organization of these three networks and how they contribute to the design goals of the CM5. 1 Introduction In the design of a parallel computer, the engineering principle of economy of mechanism suggests that the machine should employ only a single communication network to convey information among the processors in the system. Indeed, many parallel computers contain only a single network: typically, a hypercube or a mesh. The Connection Machine Model CM5 Supercomputer has three networks, however, and none is a hypercube or a mesh. This paper describes the...
Online algorithms for path selection in a nonblocking network
 SIAM Journal on Computing
, 1996
"... This paper presents the first optimaltime algorithms for path selection in an optimalsize nonblocking network. In particular, we describe an Ninput, Noutput, nonblocking network with O(N log N) boundeddegree nodes, and an algorithm that can satisfy any request for a connection or disconnection ..."
Abstract

Cited by 63 (14 self)
 Add to MetaCart
This paper presents the first optimaltime algorithms for path selection in an optimalsize nonblocking network. In particular, we describe an Ninput, Noutput, nonblocking network with O(N log N) boundeddegree nodes, and an algorithm that can satisfy any request for a connection or disconnection between an input and an output in O(log N) bit steps, even if many requests are made at once. Viewed in a telephone switching context, the algorithm can put through any set of calls among N parties in O(log N) bit steps, even if many calls are placed simultaneously. Parties can hang up and call again whenever they like; every call is still put through O(log N) bit steps after being placed. Viewed in a distributed memory machine context, our algorithm allows any processor to access any idle block of memory within O(log N) bit steps, no matter what other connections have been made previously or are being made simultaneously.
Randomized Routing on FatTrees
 Advances in Computing Research
, 1996
"... Fattrees are a class of routing networks for hardwareefficient parallel computation. This paper presents a randomized algorithm for routing messages on a fattree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the ..."
Abstract

Cited by 51 (11 self)
 Add to MetaCart
Fattrees are a class of routing networks for hardwareefficient parallel computation. This paper presents a randomized algorithm for routing messages on a fattree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the time required to deliver the messages. We show that if a set of messages has load factor on a fattree with n processors, the number of delivery cycles (routing attempts) that the algorithm requires is O(+lg n lg lg n) with probability 1 \Gamma O(1=n). The best previous bound was O( lg n) for the offline problem in which the set of messages is known in advance. In the context of a VLSI model that equates hardware cost with physical volume, the routing algorithm can be used to demonstrate that fattrees are universal routing networks. Specifically, we prove that any routing network can be efficiently simulated by a fattree of comparable hardware cost. 1 Introduction Fattrees constitute...
Modeling Parallel Computers as Memory Hierarchies
 In Proc. Programming Models for Massively Parallel Computers
, 1993
"... A parameterized generic model that captures the features of diverse computer architectures would facilitate the development of portable programs. Specific models appropriate to particular computers are obtained by specifying parameters of the generic model. A generic model should be simple, and for ..."
Abstract

Cited by 45 (6 self)
 Add to MetaCart
A parameterized generic model that captures the features of diverse computer architectures would facilitate the development of portable programs. Specific models appropriate to particular computers are obtained by specifying parameters of the generic model. A generic model should be simple, and for each machine that it is intended to represent, it should have a reasonably accurate specific model. The Parallel Memory Hierarchy (PMH) model of computation uses a single mechanism to model the costs of both interprocessor communication and memory hierarchy traffic. A computer is modeled as a tree of memory modules with processors at the leaves. All data movement takes the form of block transfers between children and their parents. This paper assesses the strengths and weaknesses of the PMH model as a generic model. 1 Introduction The raw computing power of multiprocessor computers is exploding. The challenge is to create software that can take advantage of this computing power. The diversit...
On the Fault Tolerance of Some Popular BoundedDegree Networks
 SIAM Journal on Computing
, 1992
"... In this paper, we analyze the ability of several boundeddegree networks that are commonly used for parallel computation to tolerate faults. Among other things, we show that an Nnode butterfly containing N 1\Gammaffl worstcase faults (for any constant ffl ? 0) can emulate a faultfree butterfly ..."
Abstract

Cited by 44 (7 self)
 Add to MetaCart
In this paper, we analyze the ability of several boundeddegree networks that are commonly used for parallel computation to tolerate faults. Among other things, we show that an Nnode butterfly containing N 1\Gammaffl worstcase faults (for any constant ffl ? 0) can emulate a faultfree butterfly of the same size with only constant slowdown. Similar results are proved for the shuffleexchange graph. Hence, these networks become the first connected boundeddegree networks known to be able to sustain more than a constant number of worstcase faults without suffering more than a constantfactor slowdown in performance. We also show that an Nnode butterfly whose nodes fail with some constant probability p can emulate a faultfree version of itself with a slowdown of 2 O(log N) , which is a very slowly increasing function of N . The proofs of these results combine the technique of redundant computation with new algorithms for (packet) routing around faults in hypercubic networks. Tech...