Fault Tolerant Scheduling of Precedence Task Graphs on Heterogeneous Platforms
, 2007
Cited by 9 (4 self)
Fault tolerance and latency are important requirements in several time-critical applications: such applications require latency guarantees even when processors are subject to failures. In this paper, we propose a fault-tolerant scheduling heuristic for mapping precedence task graphs onto heterogeneous systems. Our approach is based on an active replication scheme that supports ε arbitrary fail-silent (fail-stop) processor failures, so valid results are provided even if ε processors fail. We focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Major achievements include low complexity and a drastic reduction in the number of additional communications induced by the replication mechanism. Experimental results demonstrate that our heuristics, despite their lower complexity, outperform their direct competitor, the FTBAR scheduling algorithm [8].
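The idea behind active replication can be illustrated with a minimal sketch: placing every task on ε + 1 distinct processors means that at least one copy survives any ε fail-stop failures. The code below is not the heuristic from the paper. It is a simplified greedy mapping that ignores precedence constraints and communication costs, and the names `replicate_tasks`, `survives`, and `exec_time` are hypothetical.

```python
from typing import Dict, Iterable, List


def replicate_tasks(exec_time: Dict[str, List[float]], epsilon: int) -> Dict[str, List[int]]:
    """Greedily place each task on epsilon + 1 distinct processors.

    exec_time[task][p] is the (assumed known) run time of task on processor p.
    Tasks are processed in dictionary order; precedence and communication
    costs are deliberately ignored in this sketch.
    """
    num_procs = len(next(iter(exec_time.values())))
    if epsilon + 1 > num_procs:
        raise ValueError("need at least epsilon + 1 processors")
    ready = [0.0] * num_procs  # time at which each processor becomes free
    mapping: Dict[str, List[int]] = {}
    for task, costs in exec_time.items():
        # Rank processors by the finish time this task would reach on each.
        ranked = sorted(range(num_procs), key=lambda p: ready[p] + costs[p])
        chosen = ranked[: epsilon + 1]  # epsilon + 1 distinct replicas
        for p in chosen:
            ready[p] += costs[p]
        mapping[task] = chosen
    return mapping


def survives(mapping: Dict[str, List[int]], failed: Iterable[int]) -> bool:
    """True if every task keeps at least one replica on a non-failed processor."""
    failed = set(failed)
    return all(any(p not in failed for p in procs) for procs in mapping.values())
```

With ε = 1 and three processors, `survives(mapping, {p})` holds for every single failed processor `p`, while two simultaneous failures may remove both replicas of some task.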
Fault Tolerance in Grid Computing: State of the Art and Open Issues
Cited by 8 (0 self)
Fault tolerance is an important property of large-scale computational grid systems, where geographically distributed nodes cooperate to execute a task. To achieve a high level of reliability and availability, the grid infrastructure should be foolproof and fault tolerant. Since the failure of resources can be fatal to job execution, a fault tolerance service is essential to satisfy QoS requirements in grid computing. Commonly used techniques for providing fault tolerance are job checkpointing and replication. Both techniques reduce the amount of work lost due to changing system availability but can introduce significant runtime overhead. This overhead largely depends on the length of the checkpointing interval and the chosen number of replicas, respectively. For complex scientific workflows, where tasks execute in a well-defined order, reliability is another major challenge because of the unreliable nature of grid resources.
QoS-aware fault-tolerant scheduling for real-time tasks on heterogeneous clusters
- IEEE Trans. Comput
, 2011
Cited by 8 (0 self)
Fault-tolerant scheduling plays a significant role in improving the system reliability of clusters. Although numerous fault-tolerant scheduling algorithms have been proposed for real-time tasks in parallel and distributed systems, the quality of service (QoS) requirements of tasks have not been taken into account. This paper presents a fault-tolerant scheduling algorithm called QAFT that can tolerate the permanent failure of one node at any one time instant for real-time tasks with QoS needs on heterogeneous clusters. To improve system flexibility, reliability, schedulability, and resource utilization, QAFT strives either to advance the start time of primary copies and delay the start time of backup copies, so that backup copies can adopt the passive execution scheme, or to reduce the simultaneous execution time of the primary and backup copies of a task as much as possible. QAFT is capable of adaptively adjusting the QoS levels of tasks and the execution schemes of backup copies to attain high system flexibility. Furthermore, we employ a backup-copy overlapping technique. The latest start time of backup copies and their constraints are analyzed and discussed. We conduct extensive experiments to compare QAFT with two existing schemes, NOQAFT and DYFARS. Experimental results show that QAFT significantly improves on the scheduling quality of NOQAFT and DYFARS. Index Terms: heterogeneous clusters, real-time, scheduling, fault tolerance, quality of service (QoS), heuristic.
Fault-tolerant partitioning scheduling algorithms in real-time multiprocessor systems
- Pacific Rim International Symposium on Dependable Computing, IEEE
Realistic Models and Efficient Algorithms for Fault Tolerant Scheduling on Heterogeneous Platforms
, 2008
Cited by 4 (4 self)
Most list scheduling heuristics rely on a simple platform model where communication contention is not taken into account. In addition, it is generally assumed that processors in the systems are completely safe. To schedule precedence graphs in a more realistic framework, we introduce an efficient fault tolerant scheduling algorithm that is both contention-aware and capable of supporting ε arbitrary fail-silent (fail-stop) processor failures. We focus on a bi-criteria approach, where we aim at minimizing the total execution time, or latency, given a fixed number of failures supported in the system. Our algorithm has a low time complexity, and drastically reduces the number of additional communications induced by the replication mechanism. Experimental results fully demonstrate the usefulness of the proposed algorithm, which leads to efficient execution schemes while guaranteeing a prescribed level of fault tolerance.
Fault-tolerant earliest-deadline-first scheduling algorithm
- In IEEE International Parallel and Distributed Processing Symposium
, 2007
Cited by 3 (0 self)
The general approach to fault tolerance in uniprocessor systems is to maintain enough time redundancy in the schedule so that any task instance can be re-executed in the presence of faults during execution. In this paper, a scheme is presented that adds sufficient and efficient time redundancy to the Earliest-Deadline-First (EDF) scheduling policy for periodic real-time tasks. This scheme can be used to tolerate transient faults during the execution of tasks. We describe a recovery scheme that can be used to re-execute tasks in the event of transient faults, and discuss conditions that must be met by any such recovery scheme. A tool was developed to evaluate the performance of this idea.
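To make the notion of time redundancy concrete, the following sketch simulates preemptive EDF on one processor over a hyperperiod and re-executes any job hit by a transient fault. It is not the scheme or the tool described in the paper. It assumes integer time units, deadlines equal to periods, faults detected at job completion, and full re-execution from scratch; the name `edf_with_reexecution` and its parameters are hypothetical.

```python
from functools import reduce
from math import gcd


def edf_with_reexecution(tasks, faulty_jobs=frozenset()):
    """Simulate preemptive EDF over one hyperperiod in unit time slots.

    tasks: list of (wcet, period) integer pairs, deadline == period (assumed).
    faulty_jobs: set of (task_index, job_index) pairs hit by one transient
    fault each; the fault is detected at completion and the job is re-run.
    Returns True if every job meets its deadline.
    """
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (p for _, p in tasks))
    jobs = []  # entries: [deadline, task_index, job_index, remaining, fault_pending]
    for t in range(hyper):
        # Release a new job of every task whose period starts now.
        for i, (wcet, period) in enumerate(tasks):
            if t % period == 0:
                k = t // period
                jobs.append([t + period, i, k, wcet, (i, k) in faulty_jobs])
        # A job still active at or past its deadline has missed it.
        if any(job[0] <= t for job in jobs):
            return False
        if jobs:
            job = min(jobs)  # earliest deadline first
            job[3] -= 1      # run it for one time unit
            if job[3] == 0:
                if job[4]:
                    # Fault detected at completion: re-execute from scratch.
                    job[3] = tasks[job[1]][0]
                    job[4] = False
                else:
                    jobs.remove(job)
    return not jobs  # leftover work at the hyperperiod boundary is a miss
```

For the task set `[(2, 4), (3, 6)]` the processor is fully utilized, so the fault-free schedule is feasible but leaves no slack: a single fault on job `(1, 0)` causes a deadline miss. A lighter set such as `[(1, 4), (2, 6)]` has enough spare time to absorb the same re-execution.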
Multi-Criteria Scheduling of Precedence Task Graphs on Heterogeneous Platforms
- The Computer J
, 2010
Cited by 2 (0 self)
Latency, fault tolerance and reliability are important requirements for several applications that are time critical in nature: such applications require guarantees in terms of latency, even when processors are subject to failures. In this paper, we propose a fault-tolerant scheduling heuristic for mapping precedence task graphs on heterogeneous systems. Our approach is based on an active replication scheme, capable of supporting ε arbitrary fail-silent/fail-stop processor failures, and hence valid results will be provided even if ε processors fail. First we focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Next we derive a more complex algorithm in which we not only minimize latency and support a fixed number of failures, but also improve the overall reliability. Major achievements include low complexity of the new algorithms, and a drastic reduction of the number of additional communications induced by the replication mechanism. Experimental results demonstrate that our heuristics, despite their lower complexity, outperform their direct competitor, the fault-tolerance based active replication scheduling algorithm FTBAR.
Survivability in Wireless Networks:
A link scheduling model is presented that uses primary-backup scheduling for packet scheduling. The advantage of this scheduling paradigm is that overhead can be suppressed in the fault-free case and is incurred only when faults actually occur. The scheduling paradigm significantly increases survivability and can be used to reduce the overhead of redundancy-based approaches. The foundation for using primary-backup scheduling in networks is derived. The schemes presented are very effective for multi-path protocols and MIMO, and can be applied where watchdog-based algorithms fail or where geographic-centric disruptions render local approaches useless.
Abstract Fault-Models in Wireless Communication: Towards Survivable Wireless Networks
This research introduces a new approach to modeling wireless network reliability under diverse fault assumptions. It allows for quantifying reliability and offers potential for modeling survivability. The general model is presented as a join graph of cliques that allows for horizontal and orthogonal cross-monitoring. This allows for the determination of the maximal potential fault tolerance. The two-dimensional cross-monitoring approach is related to recent research addressing omission faults [5]. Finally, an example of its use is given in which we consider benign and omission faults and use primary-backup scheduling, specifically backup-backup link scheduling, as fault-tolerance mechanisms.
Fault Tolerant Scheduling of Precedence Task
, 2007