Proceedings Paper

Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022) (2022)

Journal

2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022)

Volume -, Issue -, Pages 716-725

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/CCGrid54584.2022.00081

Keywords

Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability

Categories

Computer Science, Hardware & Architecture; Computer Science, Theory & Methods

Funding

Argonne Leadership Computing Facility (ALCF), a DOE Office of Science User Facility [DE-AC02-06CH11357]
DOE [DE-SC0014917]

Summary

To maintain a robust and reliable supercomputing facility, monitoring and understanding the hardware system events and behaviors are essential. In this work, we built an end-to-end error log analysis system that examines the job logs and extracts insights from their correlation with hardware error logs and environment logs. Our machine learning pipeline achieved approximately 92% accuracy in predicting the job exit status and provides sufficient lead time for preventive measures before the actual failure occurs.

Abstract

To maintain a robust and reliable supercomputing facility, monitoring the facility and understanding its hardware system events and behaviors are essential tasks. Exascale systems will be increasingly heterogeneous, and the volume of system data, collected from multiple subsystems and components and measured at multiple fidelity levels and temporal resolutions, will continue to grow. We aim to create an effective solution for analyzing the diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system such as a Cray XC40 supercomputer. To that end, we build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs, despite their varying temporal and spatial resolutions. The machine learning pipeline in our system is approximately 92% accurate in predicting the job exit status and does so with sufficient lead time for evasive actions to be taken before the actual failure event occurs.
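As a rough illustration of what relating job logs to hardware error logs can involve, the sketch below counts the error events recorded on each job's nodes during the job's run window, plus an optional look-back period before the job starts. It is not the authors' system: the record layouts, the function name `count_errors_per_job`, the `lookback` parameter, and the toy data are all assumptions made for this example, and the resulting counts are only one possible input feature for an exit-status classifier.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def count_errors_per_job(jobs, errors, lookback=0.0):
    """Count error events on each job's nodes within [start - lookback, end].

    jobs:   iterable of (job_id, node_list, start_time, end_time)  (hypothetical layout)
    errors: iterable of (node, timestamp)                          (hypothetical layout)
    """
    # Index error timestamps by node so each window lookup is a binary search.
    by_node = defaultdict(list)
    for node, ts in errors:
        by_node[node].append(ts)
    for stamps in by_node.values():
        stamps.sort()

    counts = {}
    for job_id, nodes, start, end in jobs:
        total = 0
        for node in nodes:
            stamps = by_node.get(node, [])
            lo = bisect_left(stamps, start - lookback)   # first event at or after window start
            hi = bisect_right(stamps, end)               # one past the last event at or before end
            total += hi - lo
        counts[job_id] = total
    return counts


# Hypothetical toy data: two jobs and five error events, times in seconds.
jobs = [
    ("job-a", ["n0", "n1"], 100.0, 200.0),
    ("job-b", ["n2"], 150.0, 300.0),
]
errors = [("n0", 120.0), ("n1", 199.5), ("n1", 250.0), ("n2", 140.0), ("n2", 300.0)]

features = count_errors_per_job(jobs, errors, lookback=15.0)
```

Joining on both node and time window is one simple way to reconcile logs that are recorded at different spatial and temporal granularity; environment logs sampled at fixed intervals could be aggregated over the same windows.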
