Tolerating network failures in system area networks
In Proc. of the 2002 International Conference on Parallel Processing (ICPP02), 2002
Cited by 6 (3 self)
Abstract
In this paper, we investigate how system area networks can deal with transient and permanent network failures. We design and implement a firmware-level retransmission scheme to tolerate transient failures and an on-demand network mapping scheme to deal with permanent failures. Both schemes are transparent to applications and are conceptually simple and suitable for low-level implementations, e.g., in firmware. We then examine how the retransmission scheme affects system performance and how various protocol parameters impact system behavior. We analyze and evaluate system performance using a real implementation on a state-of-the-art cluster, with both microbenchmarks and real applications from the SPLASH-2 suite.
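The abstract does not spell out the retransmission protocol, so the following is only a generic sketch of timeout-based retransmission with cumulative acknowledgments, not the authors' firmware design. The class `RetransmitSender`, its methods, the `link` callback, and the `timeout` parameter are all hypothetical names introduced for illustration.

```python
from collections import OrderedDict


class RetransmitSender:
    """Hypothetical sender-side buffer (an assumption, not the paper's scheme).

    Every packet stays buffered until a cumulative ACK covers it, and is
    resent if it has been outstanding for at least `timeout` time units.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.next_seq = 0
        self.pending = OrderedDict()  # seq -> (payload, time of last send)

    def send(self, payload, now, link):
        """Assign a sequence number, buffer the packet, and put it on the link."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = (payload, now)
        link(seq, payload)
        return seq

    def on_ack(self, acked_seq):
        """Cumulative ACK: drop every buffered packet with seq <= acked_seq."""
        for seq in [s for s in self.pending if s <= acked_seq]:
            del self.pending[seq]

    def on_tick(self, now, link):
        """Resend each packet whose timer has expired; return the resent seqs."""
        resent = []
        for seq, (payload, sent_at) in list(self.pending.items()):
            if now - sent_at >= self.timeout:
                self.pending[seq] = (payload, now)  # restart the timer
                link(seq, payload)
                resent.append(seq)
        return resent
```

In a sketch like this, the timeout value is one of the protocol parameters whose effect on behavior could be studied: a short timeout recovers from a transient loss quickly but may resend packets that were merely delayed.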
DBAR: An Efficient Routing Algorithm to Support Multiple Concurrent Applications in Networks-on-Chip
Abstract
With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive routing algorithms limits performance due to poor network congestion avoidance. Globally adaptive routing algorithms attack this issue by introducing a congestion propagation network to obtain network status information beyond neighboring nodes. However, they may suffer from intra- and inter-application interference during output port selection for consolidated workloads, coupling the behavior of otherwise independent applications and negatively affecting performance. To address these two issues, we propose Destination-Based Adaptive Routing (DBAR). We design a novel low-cost congestion propagation network that leverages both local and non-local network information for more accurate congestion estimates. Thus, DBAR offers effective adaptivity for congestion beyond neighboring nodes. More importantly, by integrating the destination into the selection function, DBAR mitigates intra- and inter-application interference and offers dynamic isolation among regions. Experimental results show that DBAR can offer better performance than the best baseline algorithm for all measured configurations; it is well suited for workload consolidation. The wiring overhead of DBAR is low and DBAR provides improvement in the energy-delay product for medium and high injection rates.
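The idea of integrating the destination into the selection function can be illustrated with a small sketch. It assumes a 2D mesh with minimal adaptive routing and a per-node congestion count; the function `select_port`, the port names, and the cost metric are hypothetical and are not taken from the paper. The point it shows is that only nodes between the current position and the destination contribute to a port's cost, so congestion beyond the destination cannot influence the choice.

```python
def select_port(cur, dst, congestion):
    """Pick an output port for a packet at `cur` heading to `dst`.

    Assumptions (not from the paper): nodes are (x, y) tuples on a 2D mesh,
    `congestion` maps a node to a non-negative load estimate, and increasing
    y is called "N". Only minimal (productive) directions are considered.
    """
    (x, y), (dx, dy) = cur, dst
    step_x = (dx > x) - (dx < x)  # +1, -1, or 0
    step_y = (dy > y) - (dy < y)

    costs = {}
    if step_x:
        # Sum congestion only over columns up to the destination column.
        cols = range(x + step_x, dx + step_x, step_x)
        port = "E" if step_x > 0 else "W"
        costs[port] = sum(congestion.get((c, y), 0) for c in cols)
    if step_y:
        # Sum congestion only over rows up to the destination row.
        rows = range(y + step_y, dy + step_y, step_y)
        port = "N" if step_y > 0 else "S"
        costs[port] = sum(congestion.get((x, r), 0) for r in rows)

    if not costs:
        return "LOCAL"  # already at the destination
    # Ties go to the X dimension because it is inserted first.
    return min(costs, key=costs.get)
```

For example, with a heavily loaded node at (3, 0), a packet travelling from (0, 0) to (2, 2) ignores that load because column 3 lies past its destination column; a selection function that summed the whole row would be steered away from the X direction by traffic the packet never meets.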

