☆ 3.8 Proceedings Paper

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters

2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020) (2020)

期刊

2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020)

卷 -, 期 -, 页码 130-141

出版社

IEEE COMPUTER SOC

DOI: 10.1109/CLUSTER49012.2020.00023

关键词

Datatype; GPU; MPI

类别

Computer Science, Hardware & Architecture Computer Science, Theory & Methods

资金

NSF [1931537, 1450440, 1664137, 1818253]
XRAC grant [NCR-130002]
Direct For Computer & Info Scie & Enginr
Office of Advanced Cyberinfrastructure (OAC) [1450440, 1931537] Funding Source: National Science Foundation
Direct For Computer & Info Scie & Enginr
Office of Advanced Cyberinfrastructure (OAC) [1664137] Funding Source: National Science Foundation

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

摘要

In the last decade, many scientific applications have been significantly accelerated by large-scale GPU systems. However, the movement of non-contiguous GPU-resident data is one of the most challenging components of scaling these applications using communication middleware like MPI. Although plenty of research has discussed improving non-contiguous data movement within communication middleware, the packing/unpacking operations on GPUs are still expensive. They cannot be hidden due to the limitation of MPI standard and the not-well-optimized designs in existing MPI implementations for GPU-resident data. Consequently, application developers tend to implement customized packing/unpacking kernels to improve GPU utilization by avoiding unnecessary synchronizations in MPI routines. However, this reduces productivity as well as performance as it cannot overlap the packing/unpacking operations with communication. In this paper, we propose a novel approach to achieve low-latency and high-bandwidth by dynamically fusing the packing/unpacking GPU kernels to reduce the expensive kernel launch overhead. The evaluation of the proposed designs shows up to 8X and 5X performance improvement for sparse and dense non-contiguous layout, respectively, compared to the state-of-the-art approaches on the Lassen system. Similarly, we observe up to 19X improvement over existing approaches on the ABCI system. Furthermore, the proposed design also outperforms the production libraries, such as SpectrumMPI, OpenMPI, and MVAPICH2, by many orders of magnitude.

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters

期刊

2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020)

出版社

IEEE COMPUTER SOC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters

期刊

2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020)

出版社

IEEE COMPUTER SOC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文