Design and implementation of I/O performance prediction scheme on HPC systems through large-scale log analysis

Journal

JOURNAL OF BIG DATA
Volume 10, Issue 1

Publisher

SPRINGERNATURE
DOI: 10.1186/s40537-023-00741-4

Keywords

High performance computing; Distributed file system; Performance modeling

Large-scale high performance computing (HPC) systems with diverse user applications require a good understanding of their performance characteristics, including I/O performance. However, predicting I/O performance is challenging due to shared I/O systems and the complex software and hardware stack involved. To address this, we propose integrating information from multiple system logs and developing a regression-based approach for accurate I/O performance prediction on HPC systems. Our evaluation shows promising results with up to 90% accuracy for write performance and up to 99% accuracy for read performance using real logs from the Cori supercomputer system at NERSC.
Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units used by hundreds to thousands of users simultaneously. Applications from such a large user base have diverse characteristics, such as varying computation, communication, memory, and I/O intensity. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these characteristics, I/O performance is becoming increasingly important as data sizes grow rapidly and large-scale applications, such as simulation and model training, are widely adopted. However, predicting I/O performance is difficult because I/O systems are shared among all users and involve many layers of the software and hardware stack, including the application, network interconnect, operating system, file system, and storage devices. Furthermore, updates to these layers and changes in system management policy can significantly alter the I/O behavior of applications and of the entire system. To improve I/O performance prediction on HPC systems, we propose integrating information from several different system logs and developing a regression-based prediction approach. Our proposed scheme dynamically selects the most relevant features from the log entries using various feature selection algorithms and scoring functions, and automatically selects the regression algorithm with the best accuracy for the prediction task. The evaluation results show that our proposed scheme can predict write performance with up to 90% prediction accuracy and read performance with up to 99% prediction accuracy using real logs from the Cori supercomputer system at NERSC.
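
The two selection steps described in the abstract (score and keep the most relevant features, then pick the regressor with the best held-out accuracy) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the absolute Pearson correlation score, the three candidate regressors, the R² metric, the helper names, and the synthetic four-column data are all assumptions made for illustration, and no real log fields from Cori are used.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def select_features(X, y, k):
    """Assumed scoring function: keep the k columns with the largest |r| against y."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])

def fit_mean(X, y):
    """Baseline candidate: always predict the training mean."""
    m = sum(y) / len(y)
    return lambda row: m

def fit_knn(X, y, k=3):
    """Candidate: average the targets of the k nearest training rows."""
    def predict(row):
        nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], row))[:k]
        return sum(y[i] for i in nearest) / k
    return predict

def fit_linear(X, y):
    """Candidate: least squares via normal equations and Gauss-Jordan elimination."""
    A = [[1.0] + list(row) for row in X]  # prepend an intercept column
    d = len(A[0])
    M = [[sum(a[i] * a[j] for a in A) for j in range(d)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(d)]
    for i in range(d):
        M[i][i] += 1e-9  # tiny ridge term (an assumption) to avoid a singular system
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(d):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    w = [M[i][d] / M[i][i] for i in range(d)]
    return lambda row: w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))

def r2(model, X, y):
    """Coefficient of determination on the given rows."""
    m = sum(y) / len(y)
    ss_res = sum((t - model(row)) ** 2 for row, t in zip(X, y))
    ss_tot = sum((t - m) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def select_model(X, y, k_features=2, train_frac=0.75):
    """Select features on the training split, then pick the best-scoring regressor."""
    cut = int(len(X) * train_frac)
    cols = select_features(X[:cut], y[:cut], k_features)
    Xs = [[row[j] for j in cols] for row in X]
    candidates = {"mean": fit_mean, "knn": fit_knn, "linear": fit_linear}
    scores = {name: r2(fit(Xs[:cut], y[:cut]), Xs[cut:], y[cut:])
              for name, fit in candidates.items()}
    return cols, max(scores, key=scores.get), scores

# Synthetic stand-in for log-derived features: only columns 0 and 2 drive the target.
random.seed(0)
X = [[random.random() for _ in range(4)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[2] + random.gauss(0, 0.05) for row in X]
cols, best, scores = select_model(X, y)
print(cols, best, {name: round(s, 3) for name, s in scores.items()})
```

Features are scored on the training split only, so the held-out rows stay unseen when the regressors are compared. A fuller pipeline would likely try several scoring functions and use cross-validation rather than a single split.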
