Journal
IEEE NETWORK
Volume 35, Issue 3, Pages 230-237
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/MNET.011.2000537
Keywords
Training; Computational modeling; Distributed databases; Parallel processing; Data models; Tensors; Task analysis
Funding
- Hong Kong RGC GRF grants [HKBU 12200418, HKUST 16206417, 16207818]
- National Natural Science Foundation of China [62002240]
This article presents a quantitative survey of communication optimization techniques for data parallel distributed deep learning, identifying major challenges and classifying solutions at different levels. A comparative study of seven common methods on a 32-GPU cluster with 100Gb/s InfiniBand reveals the difficulties in scaling DL models with low model intensity, and highlights the critical impact of system architecture and scheduling algorithms on performance. Discussions on open issues for further investigation are also provided.
Nowadays, large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, in which extensive communications between workers pose serious scaling problems. In this article, we present a quantitative survey of communication optimization techniques for data parallel distributed DL. We first identify the major communication challenges and classify the existing solutions into three levels, namely the learning algorithm, the system architecture, and the network infrastructure. We present the state-of-the-art communication optimization techniques and conduct a comparative study of seven common lossless distributed DL methods on a 32-GPU cluster with 100Gb/s InfiniBand (IB). We show that the DL models with low model intensity (such as BERT and BERT-Large) are difficult to scale out even with the best available lossless algorithm over 100Gb/s IB; and the system architecture and scheduling algorithms have a critical impact on the scaling property. We conclude the article with discussions of open issues for further investigation.
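The data-parallel setting the abstract describes keeps an identical model replica on every worker, computes gradients on local data shards, and synchronizes the replicas by averaging gradients (typically via all-reduce) before each update; this gradient exchange is the communication cost the survey analyzes. The following is a minimal illustrative sketch of that loop, not code from the article: the model, data, and learning rate are invented for illustration, and `allreduce_mean` is a hypothetical stand-in for a real ring or tree all-reduce.

```python
# Illustrative sketch: synchronous data-parallel SGD on a 1-D linear model.
# Each "worker" holds a data shard and an identical replica of w; after the
# local gradient computation, gradients are averaged across workers (the
# all-reduce step) so every replica applies the same update.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)

def allreduce_mean(values):
    """Stand-in for an all-reduce: average one scalar gradient per worker."""
    return sum(values) / len(values)

def train(shards, w=0.0, lr=0.05, steps=200):
    for _ in range(steps):
        grads = [local_gradient(w, s) for s in shards]  # computed in parallel in practice
        w -= lr * allreduce_mean(grads)                 # identical update on every replica
    return w

# Two workers, data drawn from y = 3x: training recovers w close to 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
```

In a real system the all-reduce moves one gradient tensor per parameter tensor each iteration, which is why models with low model intensity (little computation per communicated byte, such as BERT in the authors' measurements) become communication-bound even on fast interconnects.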