Article

Task Failure Prediction in Cloud Data Centers Using Deep Learning

IEEE TRANSACTIONS ON SERVICES COMPUTING (2022)

Journal

IEEE TRANSACTIONS ON SERVICES COMPUTING

Volume 15, Issue 3, Pages 1411-1422

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TSC.2020.2993728

Keywords

Cloud computing; Data centers; Task analysis; Predictive models; Deep learning; Software; Support vector machines; Task failure; cloud data center; deep learning

Categories

Computer Science, Information Systems; Computer Science, Software Engineering

Funding

U.S. NSF [NSF-1827674, CCF-1822965, OAC-1724845, CNS-1733596, ACI-1661378]
Microsoft Research Faculty Fellowship [8300751]
AWS Machine Learning Research Awards


Summary

Large-scale cloud data centers often face high failure rates due to hardware and software faults, which can greatly reduce service reliability and require significant resources for recovery. Predicting task and job failures accurately is crucial to avoid this waste. This article proposes a failure prediction algorithm based on multi-layer Bi-LSTM, which outperforms other methods, reaching 93% accuracy for task failures and 87% accuracy for job failures.

Abstract

A large-scale cloud data center needs to provide high service reliability and availability with a low probability of failure. However, current large-scale cloud data centers still face high failure rates for many reasons, such as hardware and software faults, which often result in task and job failures. Such failures can severely reduce the reliability of cloud services, and recovering the service from them consumes a large amount of resources. Therefore, it is important to predict task or job failures accurately before they occur in order to avoid unexpected waste. Many machine learning and deep learning based methods have been proposed for task or job failure prediction; they analyze past system message logs and identify the relationship between the data and the failures. To further improve on the prediction accuracy of these methods, in this article we propose a failure prediction algorithm based on multi-layer Bidirectional Long Short-Term Memory (Bi-LSTM) to identify task and job failures in the cloud. The goal of the Bi-LSTM failure prediction algorithm is to predict whether tasks and jobs will fail or complete. Trace-driven experiments show that our algorithm outperforms other state-of-the-art prediction methods, achieving 93 percent accuracy for task failures and 87 percent accuracy for job failures.
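To make the idea of a multi-layer Bi-LSTM classifier concrete, the following is a minimal sketch of its forward pass using only the Python standard library. It is not the authors' implementation: the weights are random and untrained, the three input features per time step are hypothetical, and averaging the hidden states over time before the output unit is an assumption made for brevity. All function names are illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm_params(input_size, hidden_size, rng):
    """Random (untrained) weights for one LSTM direction; gate order i, f, g, o."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "W": mat(4 * hidden_size, input_size),   # input-to-gate weights
        "U": mat(4 * hidden_size, hidden_size),  # hidden-to-gate weights
        "b": [0.0] * (4 * hidden_size),
        "n": hidden_size,
    }


def lstm_pass(params, sequence):
    """Run one LSTM direction over the sequence; return the hidden state per step."""
    n = params["n"]
    h, c = [0.0] * n, [0.0] * n
    outputs = []
    for x in sequence:
        # Pre-activations for all four gates, stacked in one vector.
        z = [
            sum(w * xi for w, xi in zip(params["W"][r], x))
            + sum(u * hi for u, hi in zip(params["U"][r], h))
            + params["b"][r]
            for r in range(4 * n)
        ]
        i = [sigmoid(v) for v in z[0:n]]             # input gate
        f = [sigmoid(v) for v in z[n:2 * n]]         # forget gate
        g = [math.tanh(v) for v in z[2 * n:3 * n]]   # candidate cell state
        o = [sigmoid(v) for v in z[3 * n:4 * n]]     # output gate
        c = [f[k] * c[k] + i[k] * g[k] for k in range(n)]
        h = [o[k] * math.tanh(c[k]) for k in range(n)]
        outputs.append(h)
    return outputs


def bilstm_layer(fwd, bwd, sequence):
    """Concatenate forward and backward hidden states at every time step."""
    forward = lstm_pass(fwd, sequence)
    backward = lstm_pass(bwd, sequence[::-1])[::-1]  # re-align to original order
    return [hf + hb for hf, hb in zip(forward, backward)]


def build_model(input_size, hidden_size, num_layers, seed=0):
    """Stack Bi-LSTM layers; each later layer reads the 2*hidden_size outputs below it."""
    rng = random.Random(seed)
    layers, size = [], input_size
    for _ in range(num_layers):
        layers.append((make_lstm_params(size, hidden_size, rng),
                       make_lstm_params(size, hidden_size, rng)))
        size = 2 * hidden_size
    head = [rng.uniform(-0.1, 0.1) for _ in range(size)]  # single output unit
    return layers, head


def predict_failure_probability(model, sequence):
    """Return a value in (0, 1); with trained weights it would estimate P(failure)."""
    layers, head = model
    for fwd, bwd in layers:
        sequence = bilstm_layer(fwd, bwd, sequence)
    # Assumption: mean-pool the top layer's states over time before the output unit.
    steps = len(sequence)
    pooled = [sum(step[k] for step in sequence) / steps for k in range(len(head))]
    return sigmoid(sum(w * v for w, v in zip(head, pooled)))


# Hypothetical trace: three time steps, each with three made-up resource features.
model = build_model(input_size=3, hidden_size=4, num_layers=2)
trace = [[0.2, 0.5, 0.1], [0.4, 0.6, 0.1], [0.9, 0.7, 0.3]]
probability = predict_failure_probability(model, trace)
label = "failed" if probability >= 0.5 else "completed"
```

Because the weights are untrained, the resulting label carries no meaning; the sketch only shows how each layer reads the sequence in both directions and how the stacked output is reduced to a single failed-or-completed decision.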
