Proceedings Paper

Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3295500.3356185

Keywords

High Performance Computing; Fault Prediction; Resilience; Exascale Computing

Funding

  1. German Federal Ministry of Education and Research (BMBF) [01|H16010D]


As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.
