Article

GradientFlow: Optimizing Network Performance for Large-Scale Distributed DNN Training

Journal

IEEE TRANSACTIONS ON BIG DATA
Volume 8, Issue 2, Pages 495-507

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TBDATA.2019.2957478

Keywords

Training; Graphics processing units; Computational modeling; Servers; Data models; Computer architecture; Bandwidth; Distributed computing; deep learning; computer network

Funding

  1. GDCR [NRF2015ENC-GDCR01001003]
  2. FogChain [NRF2017EWT-EP003-023]
  3. BSEWWT [BSEWWT2017_2_06]

Abstract

Scaling out deep neural network (DNN) training is important for reducing model training time, but high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our investigation shows that popular open-source DNN systems achieve only a 2.5 speedup ratio on 64 GPUs connected by a 56 Gbps network. To address this problem, we propose a communication backend named GradientFlow for distributed DNN training and employ a set of network optimization techniques. First, we integrate ring-based allreduce, mixed-precision training, and computation/communication overlap into GradientFlow. Second, we propose lazy allreduce, which improves network throughput by fusing multiple communication operations into a single one, and design coarse-grained sparse communication, which reduces network traffic by transmitting only important gradient chunks. When training AlexNet and ResNet-50 on the ImageNet dataset using 512 GPUs, our approach achieves speedup ratios of 410.2 and 434.1, respectively.
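
To make the two ideas named in the abstract concrete, below is a minimal Python sketch, not the authors' implementation: the allreduce function is a stub standing in for a real ring-based collective, and the fusion threshold (fuse_bytes), chunk size, and keep ratio are illustrative values rather than parameters reported in the paper.

```python
# Sketch of lazy allreduce (fusing many small collectives into few large ones)
# and coarse-grained sparse communication (sending only important gradient chunks).
# NumPy arrays stand in for per-layer gradients.
import numpy as np

def allreduce(buf: np.ndarray) -> np.ndarray:
    """Stand-in for a ring-based allreduce across workers (identity here)."""
    return buf

def lazy_allreduce(grads, fuse_bytes=4 * 1024 * 1024):
    """Queue per-layer gradients and reduce them in fused buffers.

    Gradients accumulate until the pending buffer exceeds `fuse_bytes`,
    so many small allreduce calls become a few large ones.
    """
    pending, pending_bytes, reduced = [], 0, []

    def flush():
        nonlocal pending, pending_bytes
        flat = allreduce(np.concatenate([p.ravel() for p in pending]))  # one fused collective
        offset = 0
        for p in pending:  # scatter the fused result back to per-layer shapes
            reduced.append(flat[offset:offset + p.size].reshape(p.shape))
            offset += p.size
        pending, pending_bytes = [], 0

    for g in grads:
        pending.append(g)
        pending_bytes += g.nbytes
        if pending_bytes >= fuse_bytes:
            flush()
    if pending:  # flush the tail
        flush()
    return reduced

def important_chunks(grad, chunk_size=32768, keep_ratio=0.1):
    """Split a gradient into fixed-size chunks and keep only the high-magnitude ones.

    Chunks are ranked by mean absolute value; only the top `keep_ratio`
    fraction would be transmitted (residual accumulation is omitted here).
    """
    flat = grad.ravel()
    n_chunks = max(1, flat.size // chunk_size)
    chunks = np.array_split(flat, n_chunks)
    scores = np.array([np.abs(c).mean() for c in chunks])
    k = max(1, int(keep_ratio * n_chunks))
    keep = np.argsort(scores)[-k:]  # indices of chunks worth sending
    return [(int(i), chunks[i]) for i in sorted(keep)]
```

Under these assumptions, lazy allreduce trades a little buffering latency for fewer, larger collectives, and chunk-level selection keeps indexing overhead low compared with element-wise sparsification, which is consistent with the throughput and traffic reductions the abstract describes.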
