Proceedings Paper

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/CLUSTER49012.2020.00023

Keywords

Datatype; GPU; MPI

Funding

  1. NSF [1931537, 1450440, 1664137, 1818253]
  2. XRAC grant [NCR-130002]
  3. NSF Directorate for Computer & Information Science & Engineering, Office of Advanced Cyberinfrastructure (OAC) [1450440, 1931537]
  4. NSF Directorate for Computer & Information Science & Engineering, Office of Advanced Cyberinfrastructure (OAC) [1664137]

In the last decade, many scientific applications have been significantly accelerated by large-scale GPU systems. However, moving non-contiguous GPU-resident data remains one of the most challenging parts of scaling these applications with communication middleware such as MPI. Although much research has addressed non-contiguous data movement within communication middleware, the packing/unpacking operations on GPUs are still expensive. Their cost cannot be hidden because of limitations of the MPI standard and poorly optimized designs for GPU-resident data in existing MPI implementations. Consequently, application developers tend to implement customized packing/unpacking kernels, which improve GPU utilization by avoiding unnecessary synchronizations in MPI routines. However, this approach reduces productivity and also hurts performance, because the packing/unpacking operations cannot be overlapped with communication. In this paper, we propose a novel approach that achieves low latency and high bandwidth by dynamically fusing the packing/unpacking GPU kernels to reduce the expensive kernel launch overhead. The evaluation of the proposed designs shows up to 8X and 5X performance improvement for sparse and dense non-contiguous layouts, respectively, compared to state-of-the-art approaches on the Lassen system. Similarly, we observe up to 19X improvement over existing approaches on the ABCI system. Furthermore, the proposed design outperforms production libraries such as SpectrumMPI, OpenMPI, and MVAPICH2 by many orders of magnitude.
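The core idea in the abstract, batching several packing operations into one kernel launch, can be illustrated with a small CPU-side toy model. This is only a sketch under stated assumptions and not the paper's implementation: `PackRequest`, `LaunchCounter`, `pack_one`, and `pack_fused` are hypothetical names, the "GPU" is simulated by a counter that records how many launches occur, and no real launch latency is modelled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PackRequest:
    """Hypothetical descriptor for one strided (vector-like) layout."""
    src: Sequence[float]  # source buffer holding non-contiguous data
    offset: int           # index of the first block
    block_len: int        # contiguous elements per block
    stride: int           # distance between consecutive block starts
    count: int            # number of blocks


class LaunchCounter:
    """Stand-in for a GPU runtime: it only counts simulated kernel launches."""

    def __init__(self) -> None:
        self.launches = 0

    def launch(self, kernel, *args):
        self.launches += 1
        return kernel(*args)


def pack_one(req: PackRequest) -> List[float]:
    """Gather the strided blocks of one request into a contiguous list."""
    out: List[float] = []
    for b in range(req.count):
        start = req.offset + b * req.stride
        out.extend(req.src[start:start + req.block_len])
    return out


def pack_fused(reqs: List[PackRequest]) -> List[List[float]]:
    """A single 'kernel' that processes a whole batch of requests."""
    return [pack_one(r) for r in reqs]


def run_unfused(gpu: LaunchCounter, reqs: List[PackRequest]) -> List[List[float]]:
    """One simulated launch per request (the baseline)."""
    return [gpu.launch(pack_one, r) for r in reqs]


def run_fused(gpu: LaunchCounter, reqs: List[PackRequest]) -> List[List[float]]:
    """One simulated launch for the entire batch."""
    return gpu.launch(pack_fused, reqs)
```

In this model both paths produce the same packed buffers, but the fused path pays the per-launch cost once per batch rather than once per request, which is the overhead the abstract says the proposed design aims to reduce.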
