Article

Task Failure Prediction in Cloud Data Centers Using Deep Learning

Journal

IEEE TRANSACTIONS ON SERVICES COMPUTING
Volume 15, Issue 3, Pages 1411-1422

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TSC.2020.2993728

Keywords

Cloud computing; Data centers; Task analysis; Predictive models; Deep learning; Software; Support vector machines; Task failure; cloud data center; deep learning

Funding

  1. U.S. NSF [NSF-1827674, CCF-1822965, OAC-1724845, CNS-1733596, ACI-1661378]
  2. Microsoft Research Faculty Fellowship [8300751]
  3. AWS Machine Learning Research Awards


Large-scale cloud data centers often face high failure rates due to hardware and software faults, which can greatly reduce service reliability and consume significant resources during recovery. Predicting task and job failures accurately, before they occur, is crucial to avoid wasting those resources. This article proposes a failure prediction algorithm based on multi-layer Bi-LSTM, which outperforms other methods with 93% accuracy for task failures and 87% accuracy for job failures.
A large-scale cloud data center needs to provide high service reliability and availability with a low probability of failure. However, current large-scale cloud data centers still face high failure rates for many reasons, such as hardware and software failures, which often result in task and job failures. Such failures can severely reduce the reliability of cloud services and consume a large amount of resources to recover the service. Therefore, it is important to predict task or job failures with high accuracy before they occur, to avoid unexpected waste. Many machine learning and deep learning based methods have been proposed for task or job failure prediction; they analyze past system message logs and identify the relationship between the data and the failures. To further improve on the failure prediction accuracy of these methods, in this article we propose a failure prediction algorithm based on multi-layer Bidirectional Long Short-Term Memory (Bi-LSTM) to identify task and job failures in the cloud. The goal of the Bi-LSTM failure prediction algorithm is to predict whether tasks and jobs will fail or complete. Trace-driven experiments show that our algorithm outperforms other state-of-the-art prediction methods, with 93 percent accuracy for task failures and 87 percent for job failures.
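
To make the idea of a multi-layer Bi-LSTM classifier more concrete, the following is a minimal, standard-library-only sketch of the forward pass of a stacked bidirectional LSTM that maps a sequence of per-task measurements to a failure probability. It is not the authors' implementation: the feature names (CPU, memory, disk), layer sizes, pooling choice, and the randomly initialized, untrained weights are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_size, hidden_size, rng):
    """Random (untrained) weights for one LSTM direction: four gates over [x; h]."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
    return {gate: (matrix(), [0.0] * hidden_size) for gate in ("i", "f", "o", "g")}


def lstm_step(params, x, h, c):
    """One LSTM time step; returns the new hidden and cell states."""
    z = x + h  # concatenate current input and previous hidden state

    def gate(name, act):
        weights, biases = params[name]
        return [act(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(weights, biases)]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell values
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def run_direction(params, seq, hidden_size):
    """Run one LSTM over the sequence and collect the hidden state at each step."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in seq:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs


def bilstm_layer(fwd, bwd, seq, hidden_size):
    """Concatenate forward and backward hidden states at every time step."""
    f_out = run_direction(fwd, seq, hidden_size)
    b_out = run_direction(bwd, seq[::-1], hidden_size)[::-1]
    return [f + b for f, b in zip(f_out, b_out)]


def predict_failure(seq, layers, w_out, b_out, hidden_size):
    """Stack Bi-LSTM layers, then apply a sigmoid output unit (assumed head)."""
    for fwd, bwd in layers:
        seq = bilstm_layer(fwd, bwd, seq, hidden_size)
    # Final forward state (last step) and final backward state (first step).
    final = seq[-1][:hidden_size] + seq[0][hidden_size:]
    return sigmoid(sum(w * v for w, v in zip(w_out, final)) + b_out)


rng = random.Random(0)
n_features, hidden_size, n_layers = 3, 4, 2  # illustrative sizes only
layers, in_size = [], n_features
for _ in range(n_layers):
    layers.append((init_lstm(in_size, hidden_size, rng),
                   init_lstm(in_size, hidden_size, rng)))
    in_size = 2 * hidden_size  # next layer sees both directions
w_out = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_size)]

# Hypothetical usage snapshots for one task: [cpu, memory, disk] per time step.
task_trace = [[0.2, 0.5, 0.1], [0.4, 0.6, 0.1], [0.9, 0.8, 0.3]]
p_fail = predict_failure(task_trace, layers, w_out, 0.0, hidden_size)
print("predicted failure probability (untrained):", p_fail)
```

Because the weights are random, the printed probability carries no meaning; in practice the weights would be learned from labeled traces, and a threshold on the output (0.5, say) would turn it into the failed-or-completed decision the abstract describes. The bidirectional structure lets each time step draw on both earlier and later measurements in the sequence.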
