Article

GradientFlow: Optimizing Network Performance for Large-Scale Distributed DNN Training

Journal

IEEE TRANSACTIONS ON BIG DATA
Volume 8, Issue 2, Pages 495-507

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TBDATA.2019.2957478

Keywords

Training; Graphics processing units; Computational modeling; Servers; Data models; Computer architecture; Bandwidth; Distributed computing; deep learning; computer network

Funding

  1. GDCR [NRF2015ENC-GDCR01001003]
  2. FogChain [NRF2017EWT-EP003-023]
  3. BSEWWT [BSEWWT2017_2_06]

Abstract

Scaling out deep neural network (DNN) training is important for reducing model training time, but high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our investigation shows that popular open-source DNN systems achieve only a 2.5 speedup ratio on 64 GPUs connected by a 56 Gbps network. To address this problem, we propose a communication backend named GradientFlow for distributed DNN training and employ a set of network optimization techniques. First, we integrate ring-based allreduce, mixed-precision training, and computation/communication overlap into GradientFlow. Second, we propose lazy allreduce, which improves network throughput by fusing multiple communication operations into a single one, and design coarse-grained sparse communication, which reduces network traffic by transmitting only important gradient chunks. When training AlexNet and ResNet-50 on the ImageNet dataset using 512 GPUs, our approach achieves speedup ratios of 410.2 and 434.1, respectively.
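
To make the two ideas named in the abstract concrete, below is a minimal Python sketch, not the authors' implementation: the allreduce function is a stub standing in for a real ring-based collective, and the fusion threshold (fuse_bytes), chunk size, and keep ratio are illustrative values rather than parameters reported in the paper.

```python
# Sketch of lazy allreduce (fusing many small collectives into few large ones)
# and coarse-grained sparse communication (sending only important gradient chunks).
# NumPy arrays stand in for per-layer gradients.
import numpy as np

def allreduce(buf: np.ndarray) -> np.ndarray:
    """Stand-in for a ring-based allreduce across workers (identity here)."""
    return buf

def lazy_allreduce(grads, fuse_bytes=4 * 1024 * 1024):
    """Queue per-layer gradients and reduce them in fused buffers.

    Gradients accumulate until the pending buffer exceeds `fuse_bytes`,
    so many small allreduce calls become a few large ones.
    """
    pending, pending_bytes, reduced = [], 0, []

    def flush():
        nonlocal pending, pending_bytes
        flat = allreduce(np.concatenate([p.ravel() for p in pending]))  # one fused collective
        offset = 0
        for p in pending:  # scatter the fused result back to per-layer shapes
            reduced.append(flat[offset:offset + p.size].reshape(p.shape))
            offset += p.size
        pending, pending_bytes = [], 0

    for g in grads:
        pending.append(g)
        pending_bytes += g.nbytes
        if pending_bytes >= fuse_bytes:
            flush()
    if pending:  # flush the tail
        flush()
    return reduced

def important_chunks(grad, chunk_size=32768, keep_ratio=0.1):
    """Split a gradient into fixed-size chunks and keep only the high-magnitude ones.

    Chunks are ranked by mean absolute value; only the top `keep_ratio`
    fraction would be transmitted (residual accumulation is omitted here).
    """
    flat = grad.ravel()
    n_chunks = max(1, flat.size // chunk_size)
    chunks = np.array_split(flat, n_chunks)
    scores = np.array([np.abs(c).mean() for c in chunks])
    k = max(1, int(keep_ratio * n_chunks))
    keep = np.argsort(scores)[-k:]  # indices of chunks worth sending
    return [(int(i), chunks[i]) for i in sorted(keep)]
```

Under these assumptions, lazy allreduce trades a little buffering latency for fewer, larger collectives, and chunk-level selection keeps indexing overhead low compared with element-wise sparsification, which is consistent with the throughput and traffic reductions the abstract describes.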
