Results 1-10 of 12
On the Fault Tolerance of Some Popular Bounded-Degree Networks
 SIAM Journal on Computing
, 1992
Cited by 44 (7 self)
In this paper, we analyze the ability of several bounded-degree networks that are commonly used for parallel computation to tolerate faults. Among other things, we show that an N-node butterfly containing $N^{1-\epsilon}$ worst-case faults (for any constant $\epsilon > 0$) can emulate a fault-free butterfly of the same size with only constant slowdown. Similar results are proved for the shuffle-exchange graph. Hence, these networks become the first connected bounded-degree networks known to be able to sustain more than a constant number of worst-case faults without suffering more than a constant-factor slowdown in performance. We also show that an N-node butterfly whose nodes fail with some constant probability p can emulate a fault-free version of itself with a slowdown of $2^{O(\log^* N)}$, which is a very slowly increasing function of N. The proofs of these results combine the technique of redundant computation with new algorithms for (packet) routing around faults in hypercubic networks. Tech...
The architecture and performance of security protocols in the Ensemble group communication system
 ACM Transactions on Information and System Security
, 2001
Cited by 33 (1 self)
Ensemble is a Group Communication System built at Cornell and the Hebrew universities. It allows processes to create process groups within which scalable, reliable, FIFO-ordered multicast and point-to-point communication are supported. The system also supports other communication properties, such as causal and total multicast ordering, flow control, etc. This paper describes the security protocols and infrastructure of Ensemble. Applications using Ensemble with the extensions described here benefit from strong security properties. Under the assumption that trusted processes will not be corrupted, all communication is secured from tampering by outsiders. Our work extends previous work performed in the Horus system (Ensemble’s predecessor) by adding support for multiple partitions, efficient rekeying, and application-defined security policies. Unlike Horus, which used its own security infrastructure with nonstandard key distribution and timing services, Ensemble’s security mechanism is based on off-the-shelf authentication systems, such as PGP and Kerberos. We extend previous results on group rekeying with a novel protocol that makes use of diamond-like data structures. Our Diamond protocol allows the removal of untrusted members within milliseconds.
Fault Tolerant Networks With Small Degree
 In Proceedings of the 12th ACM Symposium on Parallel Algorithms and Architectures (SPAA)
, 2000
Cited by 13 (0 self)
In this paper, we study the design of fault-tolerant networks for arrays and meshes by adding redundant nodes and edges. For a target graph G (linear array or mesh in this paper), a graph G# is called a k-fault-tolerant graph of G if, when we remove any k nodes from G#, it still contains a subgraph isomorphic to G. The major quality measures for a fault-tolerant graph are the number of spare nodes it uses and the maximum degree it has. The degree is particularly important in practice as it poses constraints on the scalability of the system. In this paper, we aim at designing fault-tolerant graphs with both small degree and a small number of spare nodes. The graphs we obtain have degree $O(1)$ for arrays and $O(\log^3 k)$ for meshes. The number of spare nodes used is $O(k \log^2 k)$ and $O(k^2 / \log k)$, respectively. Compared to the previous results, the number of spare nodes used in our construction has one fewer linear factor in k. 1 Introduction. In many parallel computer ...
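The definition of a k-fault-tolerant graph in this abstract can be checked by brute force on very small instances. The following is a minimal sketch for the linear-array case only; it is not the construction from the paper, and the helper names (`cycle`, `path`, `contains_array`, `is_k_fault_tolerant_array`) are hypothetical. It runs in exponential time and is meant purely to illustrate the definition.

```python
from itertools import combinations

def cycle(m):
    """Adjacency sets of a cycle on vertices 0..m-1 (m >= 3)."""
    return {i: {(i - 1) % m, (i + 1) % m} for i in range(m)}

def path(m):
    """Adjacency sets of a linear array on vertices 0..m-1."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < m} for i in range(m)}

def contains_array(adj, alive, n):
    """True if the surviving vertices contain a simple path on n vertices,
    i.e. a subgraph isomorphic to an n-node linear array."""
    def extend(v, visited):
        if len(visited) == n:
            return True
        return any(extend(w, visited | {w})
                   for w in adj[v] if w in alive and w not in visited)
    return any(extend(v, {v}) for v in alive)

def is_k_fault_tolerant_array(adj, n, k):
    """Check the definition directly: remove every possible set of k nodes
    and require that an n-node linear array still survives."""
    vertices = set(adj)
    return all(contains_array(adj, vertices - set(faults), n)
               for faults in combinations(sorted(vertices), k))
```

For example, `is_k_fault_tolerant_array(cycle(5), 4, 1)` holds because deleting any node of a 5-cycle leaves a 4-node array, whereas `path(5)` fails the same check: removing its middle node splits it into two 2-node pieces.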
Immunet: A Cheap and Robust Fault-Tolerant Packet Routing Mechanism
 31st Annual International Symposium on Computer Architecture
, 2004
Cited by 13 (4 self)
A new and efficient mechanism to tolerate failures in interconnection networks for parallel and distributed computers, denoted as Immunet, is presented in this work. In the presence of failures, Immunet automatically reacts with a hardware reconfiguration of the surviving network resources. Immunet has four important advantages over previous fault-tolerant switching mechanisms. Its low hardware costs minimize the overhead that the network must support in the absence of faults. As long as the network remains connected, Immunet can tolerate any number of failures regardless of their spatial and temporal combinations. The resulting communication infrastructure provides optimized adaptive minimal routing over the surviving topology. The system behavior under successive failures exhibits graceful performance degradation. Immunet reconfiguration can be totally transparent to the applications running on the parallel system, as they will only be affected by the loss of those data packets circulating through the broken components. The rest of the packets will suffer only a tolerable delay induced by the time employed to perform the automatic network reconfiguration. Descriptions of the hardware network architecture and detailed synthetic and execution-driven simulations will demonstrate the benefits of Immunet.
Reconfiguring Arrays with Faults Part I: Worst-Case Faults
 SIAM Journal on Computing
, 1997
Cited by 11 (1 self)
In this paper we study the ability of array-based networks to tolerate worst-case faults. We show that an $N \times N$ two-dimensional array can sustain $N^{1-\epsilon}$ worst-case faults, for any fixed $\epsilon > 0$, and still emulate T steps of a fully functioning $N \times N$ array in $O(T + N)$ steps, i.e., with only constant slowdown. Previously it was known only that an array could tolerate a constant number of faults with constant slowdown. We also show that if faulty nodes are allowed to communicate, but not compute, then an N-node one-dimensional array can tolerate $\log^k N$ worst-case faults, for any constant $k > 0$, and still emulate a fault-free array with constant slowdown, and this bound is tight. Key words. fault tolerance, array-based network, mesh network, network emulation AMS subject classifications. 68M07, 68M10, 68M15, 68Q68 1. Introduction. In a truly large parallel computer, some components are bound to fail. Knowing this, a programmer can write software that explicitly cope...
Node-Covering, Error-Correcting Codes and Multiprocessors with Very High Average Fault Tolerance
 IEEE Transactions on Computers
, 1997
Cited by 6 (1 self)
Structural fault tolerance (SFT) is the ability of a multiprocessor to reconfigure around faulty processors or links in order to preserve its original processor interconnection structure. In this paper, we focus on the design of SFT multiprocessors that have low switch and link overheads, but can tolerate a very large number of processor faults on average. Most previous work has concentrated on deterministic k-fault-tolerant (k-FT) designs in which exactly k spare processors and some spare switches and links are added to construct multiprocessors that can tolerate any k processor faults. However, after k faults are reconfigured around, many of the extra links and switches can remain unutilized. It is possible within the basic node-covering framework, which was introduced by Dutt and Hayes as an efficient k-FT design method, to design FT multiprocessors that have the same number of switches and links as, say, a two-FT deterministic design, but have s spare processors, where $s \gg 2$, so that, on average, $k = \Theta(s)$ ($k \le s$) processor failures can be reconfigured around. Such designs utilize the spare link and switch capacity very efficiently, and are called probabilistic FT designs. An elegant and powerful method to construct covering graphs (CGs), which are key to obtaining the probabilistic FT designs, is to use linear error-correcting codes (ECCs). We show how to construct probabilistic designs with very high average fault tolerance but low wiring and switch overhead using ECCs like the 2D-parity, full-two, 3D-parity, and full-three codes. This design methodology is applicable to any multiprocessor interconnection topology, and the resulting FT designs have the same node degree as the non-FT target topology. We also analyze the deterministic fault tolerance for these designs and develop efficient layout strategies for them. Finally, we compare the proposed probabilistic designs to some of the best
Enhanced Cluster k-Ary n-Cube, A Fault-Tolerant Multiprocessor
 IEEE Transactions on Computers
, 2003
Cited by 5 (0 self)
In this paper, we present a strongly fault-tolerant design for the k-ary n-cube multiprocessor and examine its reconfigurability. Our design augments the k-ary n-cube with $(k/j)^n$ spare nodes. Each set of $j^n$ regular nodes is connected to a spare node, and the spare nodes are interconnected as either a $(k/j)$-ary n-cube if $j \ne k/2$ or a hypercube of dimension n if $j = k/2$. Our approach utilizes the capabilities of the wave-switching communication modules of the spare nodes to tolerate a large number of faulty nodes. Both theoretical and experimental results are examined. Compared with other proposed schemes, our approach can tolerate significantly more faulty nodes with a low overhead and no performance degradation.
Immunet: Dependable Routing for Interconnection Networks with Arbitrary Topology
 IEEE Transactions on Computers, TC2007070304
, 2007
Cited by 4 (0 self)
A complete mechanism for tolerating multiple failures in parallel computer systems, denoted as Immunet, is described in this paper. Immunet can be applied to arbitrary topologies, either regular or irregular, exhibiting in both cases graceful performance degradation. Provided that the network remains connected, Immunet is able to deal with any number of failures regardless of their spatial and temporal distribution. Our mechanism operates on the basis of a dynamic network reconfiguration in response to failures. The network reconfiguration only employs local information recorded at the router nodes, which leads to a highly scalable system. In addition, its low cost and overhead permit a practicable hardware implementation. Finally, as Immunet does not require in-flight traffic to be discarded, the parallel applications running in the system can transparently circumvent network failures. Only packets stored in or traveling through a broken component need to be recovered by higher system levels.
Networks with Small Stretch Number
, 1928
Cited by 3 (1 self)
In a previous work, the authors introduced the class of graphs with bounded induced distance of order k (BID(k) for short) to model non-reliable interconnection networks. A network modeled as a graph in BID(k) can be characterized as follows: if some nodes have failed, as long as two nodes remain connected, the distance between these nodes in the faulty graph is at most k times the distance in the non-faulty graph. The smallest k such that $G \in \mathrm{BID}(k)$ is called the stretch number of G. We show an odd characteristic of the stretch numbers: every rational number greater than or equal to 2 is a stretch number, but only discrete values are admissible for smaller stretch numbers. Moreover, we give a new characterization of the classes $\mathrm{BID}(2 - 1/i)$, $i \ge 1$, based on forbidden induced subgraphs. By using this characterization, we provide a polynomial-time recognition algorithm for graphs belonging to these classes, while the general recognition problem is co-NP-complete.
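The stretch number described in this abstract can be computed exhaustively for tiny graphs by trying every set of failed nodes and comparing surviving distances with fault-free distances. The sketch below is a brute-force illustration under that reading of the definition; it is not the polynomial-time recognition algorithm from the paper, and the names `cycle`, `bfs_dist`, and `stretch_number` are hypothetical.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

def cycle(m):
    """Adjacency sets of a cycle on vertices 0..m-1 (m >= 3)."""
    return {i: {(i - 1) % m, (i + 1) % m} for i in range(m)}

def bfs_dist(adj, alive, src):
    """Hop distances from src using only vertices in `alive`."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w in alive and w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def stretch_number(adj):
    """Largest ratio (distance after faults) / (fault-free distance) over all
    fault sets and all pairs of nodes that remain connected."""
    vertices = sorted(adj)
    full = set(vertices)
    base = {v: bfs_dist(adj, full, v) for v in vertices}
    best = Fraction(1)
    # Leave at least two nodes alive so that a pair exists to compare.
    for r in range(len(vertices) - 1):
        for faults in combinations(vertices, r):
            alive = full - set(faults)
            for u in alive:
                for v, d in bfs_dist(adj, alive, u).items():
                    if v != u:
                        best = max(best, Fraction(d, base[u][v]))
    return best
```

Using exact `Fraction` arithmetic keeps rational stretch values such as 3/2 free of floating-point error. For instance, a 5-cycle has stretch number 3/2: two nodes at distance 2 are forced onto the 3-hop detour when the node between them fails.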
Explicit Constructions of Fault Tolerant Open Linear Arrays
, 2005
Cited by 2 (0 self)
Two graph models for a K-fault-tolerant linear array with external inputs and outputs are proposed: one uses the minimum number of spare nodes and incurs low spare resource overhead, and the other has small internal degree and incurs small runtime reconfiguration hardware overhead. Both graph models can be applied to fault-tolerant VLSI designs while maintaining low hardware cost.