Programming Parallel Algorithms
, 1996
Cited by 163 (7 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
A Scalable, Commodity Data Center Network Architecture
, 2008
Cited by 91 (9 self)
Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Nonuniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today’s higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
Randomized routing and sorting on fixed-connection networks
- Journal of Algorithms
, 1994
Cited by 84 (13 self)
This paper presents a general paradigm for the design of packet routing algorithms for fixed-connection networks. Its basis is a randomized on-line algorithm for scheduling any set of N packets whose paths have congestion c on any bounded-degree leveled network with depth L in O(c + L + log N) steps, using constant-size queues. In this paradigm, the design of a routing algorithm is broken into three parts: (1) showing that the underlying network can emulate a leveled network, (2) designing a path selection strategy for the leveled network, and (3) applying the scheduling algorithm. This strategy yields randomized algorithms for routing and sorting in time proportional to the diameter for meshes, butterflies, shuffle-exchange graphs, multidimensional arrays, and hypercubes. It also leads to the construction of an area-universal network: an N-node network with area Θ(N) that can simulate any other network of area O(N) with slowdown O(log N).
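The quantities in the O(c + L + log N) bound can be made concrete with a minimal sketch. It assumes each packet's path is given as a list of nodes, that congestion c means the largest number of paths sharing one directed edge, and that the logarithm is base 2; the function names `congestion` and `bound_term` are illustrative and do not come from the paper.

```python
import math
from collections import Counter

def congestion(paths):
    """Largest number of paths that cross any single directed edge."""
    use = Counter()
    for path in paths:
        # Consecutive node pairs along a path are the edges it occupies.
        for edge in zip(path, path[1:]):
            use[edge] += 1
    return max(use.values(), default=0)

def bound_term(paths, depth):
    """c + L + log2(N), the quantity the O(c + L + log N) bound grows with.

    paths: one node list per packet (N = number of packets).
    depth: L, the depth of the leveled network (supplied by the caller).
    """
    n = len(paths)
    return congestion(paths) + depth + (math.log2(n) if n else 0)

# Four packets on a small leveled network; nodes are (level, index) pairs.
paths = [
    [(0, 0), (1, 0), (2, 1)],
    [(0, 1), (1, 0), (2, 1)],
    [(0, 2), (1, 1), (2, 1)],
    [(0, 3), (1, 1), (2, 0)],
]
```

The sketch only evaluates the bound's parameters; it does not implement the randomized scheduling algorithm itself.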
Special Purpose Parallel Computing
- Lectures on Parallel Computation
, 1993
Cited by 77 (5 self)
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
The Network Architecture of the Connection Machine CM-5
- Journal of Parallel and Distributed Computing
, 1992
Cited by 75 (2 self)
The Connection Machine Model CM-5 Supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10^12 floating-point operations per second). The CM-5 obtains its high performance while offering ease of programming, flexibility, and reliability. The machine contains three communication networks: a data network, a control network, and a diagnostic network. This paper describes the organization of these three networks and how they contribute to the design goals of the CM-5. 1 Introduction In the design of a parallel computer, the engineering principle of economy of mechanism suggests that the machine should employ only a single communication network to convey information among the processors in the system. Indeed, many parallel computers contain only a single network: typically, a hypercube or a mesh. The Connection Machine Model CM-5 Supercomputer has three networks, however, and none is a hypercube or a mesh. This paper describes the...
On-line algorithms for path selection in a nonblocking network
- SIAM Journal on Computing
, 1996
Cited by 62 (14 self)
This paper presents the first optimal-time algorithms for path selection in an optimal-size nonblocking network. In particular, we describe an N-input, N-output, nonblocking network with O(N log N) bounded-degree nodes, and an algorithm that can satisfy any request for a connection or disconnection between an input and an output in O(log N) bit steps, even if many requests are made at once. Viewed in a telephone switching context, the algorithm can put through any set of calls among N parties in O(log N) bit steps, even if many calls are placed simultaneously. Parties can hang up and call again whenever they like; every call is still put through O(log N) bit steps after being placed. Viewed in a distributed memory machine context, our algorithm allows any processor to access any idle block of memory within O(log N) bit steps, no matter what other connections have been made previously or are being made simultaneously.
BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers
- In SIGCOMM
, 2009
Cited by 54 (10 self)
This paper presents BCube, a new network architecture specifically designed for shipping-container based, modular data centers. At the core of the BCube architecture is its server-centric network structure, where servers with multiple network ports connect to multiple layers of COTS (commodity off-the-shelf) mini-switches. Servers act as not only end hosts, but also relay nodes for each other. BCube supports various bandwidth-intensive applications by speeding up one-to-one, one-to-several, and one-to-all traffic patterns, and by providing high network capacity for all-to-all traffic. BCube exhibits graceful performance degradation as the server and/or switch failure rate increases. This property is of special importance for shipping-container data centers, since once the container is sealed and operational, it becomes very difficult to repair or replace its components. Our implementation experiences show that BCube can be seamlessly integrated with the TCP/IP protocol stack and BCube packet forwarding can be efficiently implemented in both hardware and software. Experiments in our testbed demonstrate that BCube is fault tolerant and load balancing and it significantly accelerates representative bandwidth-intensive applications.
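A server-centric, multi-layer structure of this kind can be illustrated with a small sketch. The abstract does not spell out the addressing scheme, so the layout below is an assumption: each server is labeled by a tuple of digits, one per switch layer, and a switch at a given layer joins the servers whose labels differ only in that layer's digit. The function name `bcube_neighbors` and the parameter `n` are hypothetical.

```python
def bcube_neighbors(addr, n):
    """Servers one switch hop away from `addr` in the assumed layout.

    addr: tuple of digits, each in range(n), one digit per switch layer.
    n:    ports per mini-switch (assumed to be the same at every layer).
    """
    result = []
    for level, digit in enumerate(addr):
        for other in range(n):
            if other != digit:
                # Changing exactly one digit stays on the same layer's switch.
                result.append(addr[:level] + (other,) + addr[level + 1:])
    return result

# With 4-port switches and two layers, each server reaches 3 peers per layer.
neighbors = bcube_neighbors((0, 0), 4)
```

Under this assumption, reaching a server whose label differs in several digits requires other servers to relay, which matches the abstract's statement that servers act as relay nodes for each other.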
A polynomial-time tree decomposition to minimize congestion
- in Proceedings of the 15th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)
, 2003
Cited by 47 (0 self)
Räcke recently gave a remarkable proof showing that any undirected multicommodity flow problem can be routed in an oblivious fashion with congestion that is within a factor of O(log^3 n) of the best off-line solution to the problem. He also presented interesting applications of this result to distributed computing. Maggs, Miller, Parekh, Ravi and Wu have shown that such a decomposition also has an application to speeding up iterative solvers of linear systems. Räcke's construction finds a decomposition tree of the underlying graph, along with a method to obliviously route in a hierarchical fashion on the tree. The construction, however, uses exponential-time procedures to build the decomposition. The non-constructive nature of his result was remedied, in part, by Azar, Cohen, Fiat, Kaplan, and Räcke, who gave a polynomial time method for building an oblivious routing strategy. Their construction was not based on finding a hierarchical decomposition, and this precludes its application to iterative methods for solving linear systems. In this paper, we show how to compute a hierarchical decomposition and a corresponding oblivious routing strategy in polynomial time. In addition, our decomposition gives an improved competitive ratio for congestion of O(log^2 n log log n). In an independent result in this conference, Bienkowski, Korzeniowski, and Räcke give a polynomial-time method for constructing a decomposition tree with competitive ratio O(log^4 n). We note that our original submission used essentially the same algorithm, and we appreciate them allowing us to present this improved version.
Randomized Routing on Fat-Trees
- Advances in Computing Research
, 1996
Cited by 47 (10 self)
Fat-trees are a class of routing networks for hardware-efficient parallel computation. This paper presents a randomized algorithm for routing messages on a fat-tree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the time required to deliver the messages. We show that if a set of messages has load factor λ on a fat-tree with n processors, the number of delivery cycles (routing attempts) that the algorithm requires is O(λ + lg n lg lg n) with probability 1 − O(1/n). The best previous bound was O(λ lg n) for the off-line problem in which the set of messages is known in advance. In the context of a VLSI model that equates hardware cost with physical volume, the routing algorithm can be used to demonstrate that fat-trees are universal routing networks. Specifically, we prove that any routing network can be efficiently simulated by a fat-tree of comparable hardware cost. 1 Introduction Fat-trees constitute...
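The load factor can be illustrated with a minimal sketch. The abstract does not define it, so the code assumes the usual reading: the maximum, over all channels, of the number of messages that must cross a channel divided by that channel's capacity. The tree is assumed to be a complete binary fat-tree with leaves numbered from 0, and the `capacity` function is a hypothetical parameter supplied by the caller.

```python
from collections import Counter

def load_factor(messages, capacity):
    """Load factor of `messages` on a complete binary fat-tree (assumed model).

    messages: iterable of (src, dst) leaf indices.
    capacity: hypothetical function mapping a channel's height h
              (1 = channel directly above a leaf) to the number of
              messages it can carry per delivery cycle.
    """
    load = Counter()
    for src, dst in messages:
        h = 1
        # Climb from both endpoints until they meet at a common ancestor;
        # every channel passed on the way carries this message.
        while src != dst:
            load[("up", h, src)] += 1
            load[("down", h, dst)] += 1
            src //= 2
            dst //= 2
            h += 1
    return max((count / capacity(h) for (_, h, _), count in load.items()),
               default=0.0)

msgs = [(0, 7), (1, 6), (2, 5)]
thin = load_factor(msgs, lambda h: 1)            # ordinary tree, unit channels
fat = load_factor(msgs, lambda h: 2 ** (h - 1))  # capacity doubles per level
```

With unit-capacity channels the three messages all squeeze through one channel below the root, so the load factor is 3; when capacity doubles at each level it drops to 1, which is the sense in which fatter channels near the root relieve the bottleneck.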
On the Fault Tolerance of Some Popular Bounded-Degree Networks
- SIAM Journal on Computing
, 1992
Cited by 43 (6 self)
In this paper, we analyze the ability of several bounded-degree networks that are commonly used for parallel computation to tolerate faults. Among other things, we show that an N-node butterfly containing N^(1−ε) worst-case faults (for any constant ε > 0) can emulate a fault-free butterfly of the same size with only constant slowdown. Similar results are proved for the shuffle-exchange graph. Hence, these networks become the first connected bounded-degree networks known to be able to sustain more than a constant number of worst-case faults without suffering more than a constant-factor slowdown in performance. We also show that an N-node butterfly whose nodes fail with some constant probability p can emulate a fault-free version of itself with a slowdown of 2^O(log* N), which is a very slowly increasing function of N. The proofs of these results combine the technique of redundant computation with new algorithms for (packet) routing around faults in hypercubic networks. Tech...

