Proceedings Paper

Aarohi: Making Real-Time Node Failure Prediction Feasible

Publisher

IEEE
DOI: 10.1109/IPDPS47924.2020.00115

Keywords

Online Prediction; HPC; Node Failures; Parsing

Funding

  1. DOE from Lawrence Berkeley National Lab
  2. NSF [1525609, 0958311]
  3. U.S. Department of Energy by Lawrence Livermore National Laboratory [DE-AC52-07NA27344]
  4. DOE from Lawrence Livermore National Lab

Large-scale production systems are well known to encounter node failures, which affect compute capacity and energy. In both HPC systems and enterprise data centers, combating failures is becoming more challenging as hardware and software complexity increases. Several log data mining solutions have been investigated for anomaly detection in such systems. However, existing log mining solutions are not fast enough for real-time anomaly detection followed by proactive failure mitigation. Machine learning (ML)-based training can produce high accuracy, but the inference scheme needs to be enhanced with rapid parsers to assess anomalies in real time. This work tackles online anomaly prediction in computing systems by exploiting context-free grammar-based rapid event analysis. We present our framework, Aarohi, which describes an effective way to predict failures online. Aarohi is designed to be generic and scalable, making it suitable as a real-time predictor. Aarohi obtains lead times of more than 3 minutes to node failures, with an average prediction time of 0.31 ms for a chain length of 18. The overall improvement over the existing state of the art is more than a factor of 27.4. Our compiler-based approach provides new research directions for lead time optimization, with the significant prediction speedup required to deploy proactive fault-tolerant solutions in practice.
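To make the idea of matching a chain of log events online more concrete, the following is a minimal sketch and not the authors' implementation. The abstract describes a context-free grammar-based, compiler-style parser; this sketch swaps in a much simpler per-node subsequence matcher purely for illustration. The log phrases, the token names, the three-event chain, and the names `tokenize` and `ChainMatcher` are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical log phrases mapped to event tokens (illustrative, not from the paper).
TOKEN_PATTERNS = [
    ("ECC_WARN", re.compile(r"correctable ECC error")),
    ("LINK_DOWN", re.compile(r"link .* down")),
    ("HEARTBEAT_MISS", re.compile(r"heartbeat (missed|timeout)")),
]

# Hypothetical failure chain: ordered tokens assumed to precede a node failure.
FAILURE_CHAIN = ["ECC_WARN", "LINK_DOWN", "HEARTBEAT_MISS"]


def tokenize(line):
    """Return the event token for a log line, or None if the line is irrelevant."""
    for token, pattern in TOKEN_PATTERNS:
        if pattern.search(line):
            return token
    return None


@dataclass
class ChainMatcher:
    """Tracks, per node, how far along the failure chain the log stream is."""
    chain: list
    position: dict = field(default_factory=dict)  # node -> index of next expected token

    def feed(self, node, line):
        """Consume one log line; return True when the full chain has been seen."""
        token = tokenize(line)
        pos = self.position.get(node, 0)
        # Lines that are irrelevant or out of order are skipped, so the chain
        # is matched as a subsequence of the node's log stream.
        if token is not None and token == self.chain[pos]:
            pos += 1
        if pos == len(self.chain):
            self.position[node] = 0  # reset so a later chain can be detected
            return True
        self.position[node] = pos
        return False


if __name__ == "__main__":
    matcher = ChainMatcher(FAILURE_CHAIN)
    stream = [
        ("n1", "kernel: correctable ECC error on DIMM 3"),
        ("n1", "sshd: session opened"),
        ("n1", "net: link eth0 down"),
        ("n1", "monitor: heartbeat timeout"),
    ]
    for node, line in stream:
        if matcher.feed(node, line):
            print(f"predicted failure on {node}")
```

Each call to `feed` does a constant amount of work per log line, which is the property that matters for online prediction; a grammar-based parser as described in the abstract would replace the simple position counter used here.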