Publisher
IEEE COMPUTER SOC
DOI: 10.1109/CCGrid54584.2022.00081
Keywords
Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability
Funding
- Argonne Leadership Computing Facility (ALCF), a DOE Office of Science User Facility [DE-AC02-06CH11357]
- DOE [DE-SC0014917]
To maintain a robust and reliable supercomputing facility, monitoring and understanding hardware system events and behaviors is essential. In this work, we built an end-to-end error log analysis system that examines job logs and extracts insights from their correlation with hardware error logs and environment logs. Our machine learning pipeline achieved an accuracy of approximately 92% in predicting job exit status and provides sufficient lead time for preventive measures before the actual failure occurs.
Maintaining a robust and reliable supercomputing facility requires monitoring it and understanding its hardware system events and behaviors. Exascale systems will be increasingly heterogeneous, and the volume of system data, collected from multiple subsystems and components at multiple fidelity levels and temporal resolutions, will continue to grow. In this work, we aim to create an effective solution for analyzing the diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system such as a Cray XC40 supercomputer. We build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs, despite their varying temporal and spatial resolutions. The machine learning pipeline in our system is approximately 92% accurate in predicting job exit status, and it does so with sufficient lead time for evasive action to be taken before the actual failure event occurs.