Article

Cloud reliability and efficiency improvement via failure risk based proactive actions

JOURNAL OF SYSTEMS AND SOFTWARE (2020)

Journal

JOURNAL OF SYSTEMS AND SOFTWARE

Volume 163, Issue -, Pages -

Publisher

ELSEVIER SCIENCE INC

DOI: 10.1016/j.jss.2020.110524

Keywords

Cloud computing system; Reliability; Efficiency; Risk identification; Failure mitigation and fault tolerance

Categories

Computer Science, Software Engineering; Computer Science, Theory & Methods

Funding

National Basic Research Program (China) [2018YFB1003403]
Natural Science Basic Research Plan in Shaanxi Province of China [2018JM6086]
NSF Net-Centric Software and Systems IUCRC (U.S.)
China Scholarship Council

Abstract

Because of the scale and complexity of cloud computing systems (CCS), failures are inevitable and lead to losses in reliability and efficiency. Failure mitigation, fault tolerance, and recovery actions can be performed to improve CCS reliability and efficiency. Using data collected during CCS operation, failure prediction and risk identification techniques can anticipate such failures. In this paper, we develop a framework that combines risk identification with follow-up proactive actions to improve CCS reliability and efficiency. We start by analyzing cloud failures and the related operational data. Then a tree-based predictive model is trained to identify high-risk cloud tasks. By proactively terminating these high-risk tasks, both the number of CCS failures and the resource consumption can be significantly reduced. The impact of these proactive actions can be simulated to quantify the improvement in both system reliability and efficiency. The new approach has been applied to the Google cluster dataset, covering approximately 400GB of operational data over 29 consecutive days, to demonstrate its viability and effectiveness. (C) 2020 Published by Elsevier Inc.
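The pipeline described in the abstract (train a tree-based model, flag high-risk tasks, terminate them proactively, then simulate the effect) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the model, the features, or the simulation, so the single-split decision stump, the feature names `cpu_request` and `resubmissions`, the toy data, and the assumption that a terminated task consumes no resources are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Task:
    cpu_request: float  # hypothetical feature, not taken from the paper
    resubmissions: int  # hypothetical feature, not taken from the paper
    cpu_hours: float    # resources the task would consume if left running
    failed: bool        # observed outcome in the (toy) operational data


def features(task):
    return (task.cpu_request, task.resubmissions)


def gini(labels):
    """Gini impurity of a list of boolean failure labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(rows, labels):
    """Fit a depth-1 tree: choose the split with the lowest weighted Gini."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                # Each leaf stores the observed failure rate as its risk.
                best = (score, f, t,
                        sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def predict_risk(stump, row):
    f, t, risk_left, risk_right = stump
    return risk_left if row[f] <= t else risk_right


def simulate(tasks, stump, risk_threshold=0.5):
    """Estimate the effect of terminating every task flagged as high risk.

    Assumes termination happens at submission, so a terminated task
    consumes none of its cpu_hours (a simplification for illustration).
    """
    flagged = [t for t in tasks
               if predict_risk(stump, features(t)) >= risk_threshold]
    return {
        "failures_avoided": sum(t.failed for t in flagged),
        "wrongly_terminated": sum(not t.failed for t in flagged),
        "cpu_hours_saved": sum(t.cpu_hours for t in flagged if t.failed),
    }


# Toy data, invented for this example only.
tasks = [
    Task(0.1, 0, 2.0, False),
    Task(0.2, 0, 1.0, False),
    Task(0.3, 1, 3.0, False),
    Task(0.2, 3, 4.0, True),
    Task(0.5, 4, 6.0, True),
    Task(0.4, 5, 5.0, True),
]

stump = fit_stump([features(t) for t in tasks], [t.failed for t in tasks])
result = simulate(tasks, stump)
print(stump, result)
```

For brevity the sketch evaluates on the same tasks it was trained on; a realistic study would hold out later days of the trace for evaluation and would also account for the cost of terminating tasks that would have succeeded.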
