Results 1 - 3 of 3
On Failure Detection Algorithms in Overlay Networks
- In IEEE INFOCOM, 2003
Abstract - Cited by 19 (0 self)
One of the key reasons overlay networks are seen as an excellent platform for large-scale distributed systems is their resilience in the presence of node failures. This resilience relies on accurate and timely detection of node failures. Despite the prevalent use of keep-alive algorithms in overlay networks to detect node failures, their tradeoffs and the circumstances in which they might best be suited are not well understood. In this paper, we study how the design of various keep-alive approaches affects their performance in node failure detection time, probability of false positives, control overhead, and packet loss rate via analysis, simulation, and implementation. We find that among the class of keep-alive algorithms that share information, the maintenance of backpointer state substantially improves detection time and packet loss rate. The improvement in detection time between baseline and sharing algorithms becomes more pronounced as the size of the neighbor set increases. Finally, sharing of information allows a network to tolerate a higher churn rate than the baseline.
Reverse Engineering the Internet
- In ACM HotNets-II, 2003
Abstract - Cited by 6 (1 self)
To provide insight into Internet operation and performance, recent efforts have measured various aspects of the Internet, developing and improving measurement tools in the process. In this paper, we argue that these independent advances present the community with a startling opportunity: the collaborative reverse-engineering of the Internet. By this, we mean annotating a map of the Internet with properties such as: client populations, features and workloads; network ownership, capacity, connectivity, geography and routing policies; patterns of loss, congestion, failure and growth; and so forth. This combination of properties is greater than the sum of its parts, and exposes the attributes of network design easily overlooked by simpler, uncorrelated models. We argue that reverse engineering the Internet is feasible based on continuing improvements in measurement techniques, the potential to infer new properties from external measurements, and an accounting of the resources required to complete the process.
Exploring tradeoffs in failure detection in routing overlays
, 2003
Abstract - Cited by 3 (0 self)
One of the key reasons overlay networks are seen as an excellent platform for large-scale distributed systems is their resilience in the presence of node failures. This resilience relies on accurate and timely detection of node failures. Despite the prevalent use of keep-alive algorithms in overlay networks to detect node failures, their tradeoffs and the circumstances in which they might best be suited are not well understood. In this paper, we study how the design of various keep-alive approaches affects their performance in node failure detection time, probability of false positives, control overhead, and packet loss rate via analysis, simulation, and implementation. We find that among the class of keep-alive algorithms that share information, the maintenance of backpointer state substantially improves detection time and packet loss rate. The improvement in detection time between baseline and sharing algorithms becomes more pronounced as the size of the neighbor set increases. Finally, sharing of information allows a network to tolerate a higher churn rate than the baseline algorithm.

