Proceedings Paper

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

PROCEEDINGS OF THE 23RD EUROPEAN MPI USERS' GROUP MEETING (EUROMPI 2016) (2016)

Journal

PROCEEDINGS OF THE 23RD EUROPEAN MPI USERS' GROUP MEETING (EUROMPI 2016)

Volume -, Issue -, Pages 15-22

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/2966884.2966912

Category

Computer Science, Theory & Methods

Abstract

Emerging paradigms like High Performance Data Analytics (HPDA) and Deep Learning (DL) pose at least two new design challenges for existing MPI runtimes. First, these paradigms require efficient support for communicating unusually large messages across processes. Second, the communication buffers used by HPDA applications and DL frameworks generally reside in GPU memory. In this context, we observe that conventional MPI runtimes have been optimized over decades to achieve the lowest possible communication latency for relatively small message sizes (up to 1 megabyte), and only for CPU memory buffers. With the advent of CUDA-Aware MPI runtimes, a lot of research has been conducted to improve the performance of GPU-buffer-based communication. However, little in the current state of the art deals with very large message communication of GPU buffers. In this paper, we investigate these new challenges by analyzing the performance bottlenecks in existing CUDA-Aware MPI runtimes like MVAPICH2-GDR, and propose hierarchical collective designs that improve the communication latency of the MPI_Bcast primitive by exploiting a new communication library called NCCL. To the best of our knowledge, this is the first work that addresses these new requirements, where GPU buffers are used for communication with message sizes surpassing hundreds of megabytes. We highlight the design challenges for our work along with the details of design and implementation. In addition, we provide a comprehensive performance evaluation using a micro-benchmark and a CUDA-Aware adaptation of the Microsoft CNTK DL framework. We report up to 47% improvement in training time for CNTK using the proposed hierarchical MPI_Bcast design.
