Article

Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

Journal

JOURNAL OF SUPERCOMPUTING
Volume 66, Issue 1, Pages 193-228

Publisher

SPRINGER
DOI: 10.1007/s11227-013-0898-7

Keywords

Fault tolerance; Checkpointing; Data replication; High serviceability; Cloud computing

Funding

  1. National Science Foundation for Distinguished Young Scholars of China [61225012]
  2. National Natural Science Foundation of China [61070162, 71071028, 70931001]
  3. Specialized Research Fund of the Doctoral Program of Higher Education for the Priority Development Areas [20120042130003]
  4. Specialized Research Fund for the Doctoral Program of Higher Education [20100042110025, 20110042110024]
  5. Ministry of Industry and Information Technology of the P.R. China
  6. Fundamental Research Funds for the Central Universities [N100604012, N110204003]

Failures are normal rather than exceptional in cloud computing environments, and achieving high fault tolerance is one of the major obstacles to opening up a new era of high-serviceability cloud computing, because fault tolerance plays a key role in ensuring cloud serviceability. Fault-tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet high-level cloud SLOs, a foolproof fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed by referring to the fault tolerance theories suitable for large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two fault tolerance strategies, namely checkpointing and data replication; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time. Theoretical as well as experimental results conclusively demonstrate that DAFT has high potential, as it provides efficient fault tolerance enhancements, significant cloud serviceability improvement, and high SLO satisfaction. It efficiently and effectively achieves a trade-off among fault tolerance objectives in cloud computing environments.
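
The abstract does not give the DAFT equations, so the following is only an illustrative sketch of the kind of relationship between failure rate and the two strategies that step (i) refers to. It is not the paper's model. It assumes two textbook approximations that are not taken from this paper: Young's first-order optimal checkpoint interval, T = sqrt(2C/λ), and the availability of n replicas with independent failures, 1 - (1 - a)^n. All function and parameter names are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float, failure_rate_per_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    Assumption (not from the paper): T = sqrt(2 * C / lambda), where C is the
    cost of writing one checkpoint and lambda is the failure rate.
    """
    if checkpoint_cost_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("checkpoint cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)


def checkpoint_overhead_fraction(checkpoint_cost_s: float, failure_rate_per_s: float) -> float:
    """Approximate fraction of run time lost when checkpointing at Young's interval.

    C / T accounts for writing checkpoints; lambda * T / 2 is the expected
    rework after a failure (on average half an interval is lost).
    """
    t = young_checkpoint_interval(checkpoint_cost_s, failure_rate_per_s)
    return checkpoint_cost_s / t + failure_rate_per_s * t / 2.0


def replicas_needed(node_availability: float, target_availability: float) -> int:
    """Smallest replica count n with 1 - (1 - a)^n >= target.

    Assumption (not from the paper): replicas fail independently.
    """
    if not 0.0 < node_availability <= 1.0:
        raise ValueError("node availability must be in (0, 1]")
    if not 0.0 <= target_availability < 1.0:
        raise ValueError("target availability must be in [0, 1)")
    n = 1
    while 1.0 - (1.0 - node_availability) ** n < target_availability:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical numbers: 50 s per checkpoint, one failure per 10,000 s.
    cost, rate = 50.0, 1.0 / 10_000.0
    print("checkpoint interval (s):", young_checkpoint_interval(cost, rate))
    print("overhead fraction:", checkpoint_overhead_fraction(cost, rate))
    print("replicas for 99.5% target:", replicas_needed(0.9, 0.995))
```

Under these assumptions, checkpoint overhead grows with the square root of the failure rate, while the replica count grows only logarithmically with the required availability. That contrast is one plausible reason an adaptive strategy might switch between the two as failure rates change.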
