3.8 Proceedings Paper

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

出版社

ASSOC COMPUTING MACHINERY
DOI: 10.1145/2966884.2966912

关键词

-

向作者/读者索取更多资源

Emerging paradigms like High Performance Data Analytics (HPDA) and Deep Learning (DL) pose at least two new design challenges for existing MPI runtimes. First, these paradigms require an efficient support for communicating unusually large messages across processes. And second, the communication buffers used by HPDA applications and DL frameworks generally reside on a GPU's memory. In this context, we observe that conventional MPI runtimes have been optimized over decades to achieve lowest possible communication latency for relatively smaller message sizes (up-to 1 Megabyte) and that too for CPU memory buffers. With the advent of CUDA-Aware MPI runtimes, a lot of research has been conducted to improve performance of GPU buffer based communication. However, little exists in current state of the art that deals with very large message communication of GPU buffers. In this paper, we investigate these new challenges by analyzing the performance bottlenecks in existing CUDA-Aware MPI runtimes like MVAPICH2-GDR, and propose hierarchical collective designs to improve communication latency of the MPI_Bcast primitive by exploiting a new communication library called NCCL. To the best of our knowledge, this is the first work that addresses these new requirements where GPU buffers are used for communication with message sizes surpassing hundreds of megabytes. We highlight the design challenges for our work along with the details of design and implementation. In addition, we provide a comprehensive performance evaluation using a Micro-benchmark and a CUDA-Aware adaptation of Microsoft CNTK DL framework. We report up to 47% improvement in training time for CNTK using the proposed hierarchical MPI_Bcast design.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

3.8
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据