Proceedings Paper

Aarohi: Making Real-Time Node Failure Prediction Feasible

Publisher

IEEE
DOI: 10.1109/IPDPS47924.2020.00115

Keywords

Online Prediction; HPC; Node Failures; Parsing

Funding

  1. DOE from Lawrence Berkeley National Lab
  2. NSF [1525609, 0958311]
  3. U.S. Department of Energy by Lawrence Livermore National Laboratory [DE-AC52-07NA27344]
  4. DOE from Lawrence Livermore National Lab

Large-scale production systems are well known to encounter node failures, which affect compute capacity and energy. In both HPC systems and enterprise data centers, combating failures is becoming more challenging as hardware and software complexity increases. Several log data mining solutions have been investigated for anomaly detection in such systems. However, existing log mining solutions are not fast enough for real-time anomaly detection when it must be followed by proactive failure mitigation. Machine learning (ML)-based training can produce high accuracy, but the inference scheme needs to be enhanced with rapid parsers to assess anomalies in real time. This work tackles online anomaly prediction in computing systems by exploiting context-free grammar-based rapid event analysis. We present our framework, Aarohi, which describes an effective way to predict failures online. Aarohi is designed to be generic and scalable, making it suitable as a real-time predictor. Aarohi obtains lead times of more than 3 minutes to node failures, with an average prediction time of 0.31 ms for a chain length of 18. The overall improvement over the existing state of the art is more than a factor of 27.4. Our compiler-based approach provides new research directions for lead time optimization, with the significant prediction speedup required to deploy proactive fault-tolerant solutions in practice.
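
To make the idea of matching a failure chain against a live event stream concrete, the following is a minimal sketch only. It is not Aarohi's implementation: a plain state machine stands in for the grammar-based parser the abstract describes, and the event names, the `max_gap` timeout, and the `ChainMatcher`, `predict`, and `lead_time` names are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass
class ChainMatcher:
    """Tracks progress through one ordered chain of log events (hypothetical)."""
    chain: Sequence[str]             # assumed: ordered event phrases that precede a failure
    max_gap: float = 60.0            # assumed: max seconds allowed between chain events
    pos: int = 0                     # index of the next expected event in the chain
    last_ts: Optional[float] = None  # timestamp of the last matched chain event

    def feed(self, ts: float, event: str) -> bool:
        """Consume one event; return True when the whole chain has been seen."""
        # Drop partial progress if the chain has gone stale.
        if self.pos > 0 and self.last_ts is not None and ts - self.last_ts > self.max_gap:
            self.pos = 0
        # Events that are not the next expected one are skipped as noise.
        if event == self.chain[self.pos]:
            self.pos += 1
            self.last_ts = ts
            if self.pos == len(self.chain):
                self.pos = 0
                return True
        return False


def predict(events: Iterable[Tuple[float, str]], matcher: ChainMatcher) -> Optional[float]:
    """Return the timestamp at which a failure is predicted, or None."""
    for ts, event in events:
        if matcher.feed(ts, event):
            return ts
    return None


def lead_time(predicted_ts: float, failure_ts: float) -> float:
    """Lead time is the interval between the prediction and the actual failure."""
    return failure_ts - predicted_ts


# Hypothetical event stream: (timestamp in seconds, event phrase).
stream = [(0.0, "mce"), (1.0, "noise"), (2.0, "link_down"), (5.0, "kernel_panic")]
matcher = ChainMatcher(chain=["mce", "link_down", "kernel_panic"])
predicted = predict(stream, matcher)
```

In this sketch each event costs one comparison, so the per-event work does not grow with the chain length. If the node in the hypothetical stream failed at 200.0 seconds, `lead_time(predicted, 200.0)` would give the window available for proactive mitigation.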

