Article

Optimized large-message broadcast for deep learning workloads: MPI, MPI plus NCCL, or NCCL2?

Journal

PARALLEL COMPUTING
Volume 85, Pages 141-152

Publisher

ELSEVIER
DOI: 10.1016/j.parco.2019.03.005

Keywords

HPC; Distributed deep learning; MPI_Bcast; NCCL; CUDA-Aware MPI

Funding

  1. National Science Foundation [CNS-1513120, CCF-1565414]


Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively smaller number of nodes, efficient communication schemes need to be designed for such systems. This, coupled with new application workloads brought forward by Deep Learning (DL) frameworks such as Caffe and Microsoft Cognitive Toolkit (CNTK), poses additional design constraints due to the very large message communication of GPU buffers during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have emerged to deal with DL workloads. In this paper, we address these new challenges for MPI runtimes and propose two new designs to deal with them: (1) a pipelined chain (PC) design for MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and (2) a topology-aware pipelined chain (TA-PC) design for systems with multiple GPUs that fully exploits all the PCIe links available within a multi-GPU node. To highlight the benefits of our designs, we present an in-depth performance landscape for the proposed MPI_Bcast (MPI) designs, our earlier NCCL-based MPI_Bcast (MPI+NCCL1) design, and the ncclBroadcast (NCCL2) design. The proposed designs offer up to 14x and 16.6x better performance than MPI+NCCL1-based solutions for intra- and inter-node broadcast latency, respectively. With the recent introduction of the inter-node-capable NCCL2 library, we have enhanced our performance results by adding comparisons between the proposed MPI_Bcast designs and the ncclBroadcast (NCCL2) design. We report up to 10x better performance for small and medium message sizes and comparable performance for large message sizes. We also observed that the TA-PC design is up to 50% better than the PC design for MPI_Bcast to 64 GPUs. Furthermore, we provide an application-level performance comparison using a CUDA-Aware version of CNTK called CA-CNTK. The proposed MPI_Bcast designs provide up to 7% improvement over MPI+NCCL-based solutions for data-parallel training of the VGG network on 128 GPUs. We present our performance evaluation on three GPU clusters with diverse characteristics: (1) KESCH, a dense multi-GPU system with 8 K80 GPU cards per node, (2) RI2, with a single K80 GPU card per node, and (3) Owens, with a single P100 GPU per node. (C) 2019 Elsevier B.V. All rights reserved.
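To make the two broadcast paths compared in the abstract concrete, the sketch below shows how an application would broadcast a large GPU-resident buffer either through a CUDA-Aware MPI_Bcast (the interface behind the proposed PC/TA-PC designs) or through NCCL2's ncclBroadcast bootstrapped over MPI. This is a minimal illustration, not the paper's implementation: the buffer size, root rank, device selection, and omitted error handling are assumptions made for brevity.

/* Broadcasting a large GPU buffer: (a) CUDA-Aware MPI_Bcast, (b) NCCL2 ncclBroadcast. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int root = 0;
    const size_t count = 64UL * 1024 * 1024;    /* 256 MB of floats: a "large message" (illustrative) */
    float *d_buf;
    cudaSetDevice(0);                           /* assumes one GPU per MPI process */
    cudaMalloc((void **)&d_buf, count * sizeof(float));

    /* (a) CUDA-Aware MPI: pass the device pointer directly to MPI_Bcast;
     * the MPI runtime decides how to move the GPU buffer across PCIe/network. */
    MPI_Bcast(d_buf, (int)count, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* (b) NCCL2: bootstrap a NCCL communicator over MPI, then broadcast on a stream. */
    ncclUniqueId id;
    if (rank == root) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, root, MPI_COMM_WORLD);

    ncclComm_t comm;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    ncclCommInitRank(&comm, nranks, id, rank);
    ncclBroadcast(d_buf, d_buf, count, ncclFloat, root, comm, stream);  /* in-place broadcast */
    cudaStreamSynchronize(stream);

    ncclCommDestroy(comm);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

From the application's point of view the two calls are interchangeable for a broadcast; the paper's contribution lies inside path (a), where the pipelined chain and topology-aware pipelined chain designs determine how the large device buffer is chunked and relayed across intra-node PCIe links and inter-node interconnects.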
