4.5 Article

Design and implementation of I/O performance prediction scheme on HPC systems through large-scale log analysis

Journal

JOURNAL OF BIG DATA
Volume 10, Issue 1

Publisher

SPRINGERNATURE
DOI: 10.1186/s40537-023-00741-4

Keywords

High performance computing; Distributed file system; Performance modeling

Large-scale high performance computing (HPC) systems with diverse user applications require a good understanding of their performance characteristics, including I/O performance. However, predicting I/O performance is challenging due to shared I/O systems and the complex software and hardware stack involved. To address this, we propose integrating information from multiple system logs and developing a regression-based approach for accurate I/O performance prediction on HPC systems. Our evaluation shows promising results with up to 90% accuracy for write performance and up to 99% accuracy for read performance using real logs from the Cori supercomputer system at NERSC.
Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units used by hundreds to thousands of users simultaneously. Applications from large numbers of users have diverse characteristics, such as varying computation, communication, memory, and I/O intensity. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these performance characteristics, I/O performance is becoming increasingly important as data sizes rapidly increase and large-scale applications, such as simulation and model training, are widely adopted. However, predicting I/O performance is difficult because I/O systems are shared among all users and involve many layers of the software and hardware stack, including the application, network interconnect, operating system, file system, and storage devices. Furthermore, updates to these layers and changes in system management policy can significantly alter the I/O behavior of applications and of the entire system. To improve I/O performance prediction on HPC systems, we propose integrating information from several different system logs and developing a regression-based approach to predict I/O performance. Our proposed scheme dynamically selects the most relevant features from the log entries using various feature selection algorithms and scoring functions, and automatically selects the regression algorithm with the best accuracy for the prediction task. The evaluation results show that our proposed scheme can predict write performance with up to 90% prediction accuracy and read performance with up to 99% prediction accuracy using real logs from the Cori supercomputer system at NERSC.
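
The two selection steps named in the abstract (score features and keep the most relevant ones, then pick the regressor that does best on held-out data) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single scoring function (absolute Pearson correlation), three toy candidate regressors (mean baseline, one-feature least squares, k-nearest neighbours), and synthetic data standing in for log-derived features. Every function name and the data generator are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def select_top_k(X, y, k):
    """Score each feature column by |r| with the target; keep the k best indices."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def fit_mean(X, y):
    """Baseline candidate: always predict the training mean."""
    m = mean(y)
    return lambda row: m


def fit_linear_1d(X, y):
    """Least-squares line on the first (highest-scoring) selected feature."""
    xs = [row[0] for row in X]
    mx, my = mean(xs), mean(y)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / var if var else 0.0
    return lambda row: my + slope * (row[0] - mx)


def fit_knn(X, y, k=5):
    """k-nearest-neighbour regression with squared Euclidean distance."""
    def predict(row):
        dist = lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], row))
        nearest = sorted(zip(X, y), key=dist)[:k]
        return mean(t for _, t in nearest)
    return predict


def r2(model, X, y):
    """Coefficient of determination on (X, y); 0.0 if the target is constant."""
    my = mean(y)
    ss_res = sum((t - model(row)) ** 2 for row, t in zip(X, y))
    ss_tot = sum((t - my) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def select_and_fit(X, y, k=2, seed=0):
    """Select k features on the training split, then pick the best candidate by validation R^2."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))
    train, valid = idx[:cut], idx[cut:]
    cols = select_top_k([X[i] for i in train], [y[i] for i in train], k)
    project = lambda row: [row[j] for j in cols]
    Xtr, ytr = [project(X[i]) for i in train], [y[i] for i in train]
    Xva, yva = [project(X[i]) for i in valid], [y[i] for i in valid]
    candidates = {"mean": fit_mean, "linear_1d": fit_linear_1d, "knn": fit_knn}
    scores = {name: r2(fit(Xtr, ytr), Xva, yva) for name, fit in candidates.items()}
    best = max(scores, key=scores.get)
    return cols, best, scores


def make_synthetic_jobs(n=200, seed=1):
    """Synthetic stand-in for log features (NOT real Cori data): column 0 drives the target."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.uniform(1, 64), rng.uniform(0, 1), rng.uniform(0, 1)]
        X.append(row)
        y.append(3.0 * row[0] + rng.gauss(0, 1.0))
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_jobs()
    print(select_and_fit(X, y))
```

Feature selection is performed on the training split only, so the validation score used to choose the regressor is not influenced by the held-out jobs. A fuller version would presumably iterate over several scoring functions and a larger pool of regressors, as the abstract describes.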
