Proceedings Paper

Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/CCGrid54584.2022.00081

Keywords

Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability

Funding

  1. Argonne Leadership Computing Facility (ALCF), a DOE Office of Science User Facility [DE-AC02-06CH11357]
  2. DOE [DE-SC0014917]

To maintain a robust and reliable supercomputing facility, monitoring and understanding hardware system events and behaviors is essential. In this work, we build an end-to-end error log analysis system that examines job logs and extracts insights from their correlation with hardware error logs and environment logs. Our machine learning pipeline is approximately 92% accurate in predicting the job exit status and provides sufficient lead time for preventive measures before the actual failure occurs.
To maintain a robust and reliable supercomputing facility, monitoring it and understanding its hardware system events and behaviors are essential tasks. Exascale systems will be increasingly heterogeneous, and the volume of system data, collected from multiple subsystems and components measured at multiple fidelity levels and temporal resolutions, will continue to grow. We aim to create an effective solution for analyzing the diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system such as a Cray XC40 supercomputer. To that end, we build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs despite their varying temporal and spatial resolutions. The machine learning pipeline in our system is approximately 92% accurate in predicting the job exit status and does so with sufficient lead time for evasive action to be taken before the actual failure event occurs.
