Article

Cloud reliability and efficiency improvement via failure risk based proactive actions

Journal

JOURNAL OF SYSTEMS AND SOFTWARE
Volume 163

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.jss.2020.110524

Keywords

Cloud computing system; Reliability; Efficiency; Risk identification; Failure mitigation and fault tolerance

Funding

  1. National Basic Research Program (China) [2018YFB1003403]
  2. Natural Science Basic Research Plan in Shaanxi Province of China [2018JM6086]
  3. NSF Net-Centric Software and Systems IUCRC (U.S.)
  4. China Scholarship Council

Because of the scale and complexity of cloud computing systems (CCS), failures are inevitable, and they lead to losses in reliability and efficiency. Failure mitigation, fault tolerance, and recovery actions can be performed to improve CCS reliability and efficiency. Using data collected during CCS operation, failure prediction and risk identification techniques can anticipate such failures. In this paper, we develop a framework that combines risk identification with follow-up proactive actions to improve CCS reliability and efficiency. We start by analyzing cloud failures and the related operational data. A tree-based predictive model is then trained to identify high-risk cloud tasks. By proactively terminating these high-risk tasks, both the number of CCS failures and the resource consumption can be significantly reduced. The impact of these proactive actions can be simulated to quantify the improvement in both system reliability and efficiency. The new approach has been applied to the Google cluster dataset, which covers approximately 400 GB of operational data over 29 consecutive days, to demonstrate its viability and effectiveness. © 2020 Published by Elsevier Inc.
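
The abstract does not describe the predictive model's features or the simulation procedure, so the following is only a minimal sketch of the general idea of simulating proactive termination, not the authors' method. It assumes each task already carries a predicted risk score (for example, from some tree-based classifier). The names `Task`, `simulate_proactive_termination`, `threshold`, and `checkpoint_fraction` are hypothetical, as is the assumption that a terminated task has consumed a fixed fraction of its resources before being stopped.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; the field names are illustrative assumptions."""
    risk: float        # predicted failure probability in [0, 1]
    will_fail: bool    # true outcome if the task is left running
    cpu_hours: float   # resources consumed if the task runs to its natural end


def simulate_proactive_termination(
    tasks: Sequence[Task],
    threshold: float = 0.8,
    checkpoint_fraction: float = 0.1,
) -> Dict[str, float]:
    """Compare a baseline run with a run that kills high-risk tasks early.

    A task whose risk is at or above `threshold` is terminated after it has
    used `checkpoint_fraction` of its resources (an assumed simplification).
    """
    baseline_failures = sum(t.will_fail for t in tasks)
    baseline_cost = sum(t.cpu_hours for t in tasks)

    failures = 0          # failures that still occur despite proactive action
    cost = 0.0            # resources consumed with proactive action
    wrongly_killed = 0    # healthy tasks terminated by mistake

    for t in tasks:
        if t.risk >= threshold:
            # Terminated early: only part of the resources is spent.
            cost += t.cpu_hours * checkpoint_fraction
            if not t.will_fail:
                wrongly_killed += 1
        else:
            # Left running: full cost, and it may still fail.
            cost += t.cpu_hours
            failures += int(t.will_fail)

    return {
        "baseline_failures": baseline_failures,
        "remaining_failures": failures,
        "baseline_cost": baseline_cost,
        "proactive_cost": cost,
        "wrongly_killed": wrongly_killed,
    }


# Toy data, invented purely for illustration.
example_tasks = [
    Task(risk=0.90, will_fail=True, cpu_hours=10.0),
    Task(risk=0.20, will_fail=False, cpu_hours=5.0),
    Task(risk=0.85, will_fail=False, cpu_hours=4.0),
    Task(risk=0.30, will_fail=True, cpu_hours=2.0),
]
example_result = simulate_proactive_termination(example_tasks)
```

The `wrongly_killed` counter makes the trade-off explicit: lowering the threshold avoids more failures and saves more resources, but it also terminates more tasks that would have completed successfully.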
