4.7 Article

A Quantitative Survey of Communication Optimizations in Distributed Deep Learning

Journal

IEEE NETWORK
Volume 35, Issue 3, Pages 230-237

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/MNET.011.2000537

Keywords

Training; Computational modeling; Distributed databases; Parallel processing; Data models; Tensors; Task analysis

Funding

  1. Hong Kong RGC GRF grants [HKBU 12200418, HKUST 16206417, 16207818]
  2. National Natural Science Foundation of China [62002240]

This article presents a quantitative survey of communication optimization techniques for data-parallel distributed deep learning, identifying the major challenges and classifying existing solutions into three levels. A comparative study of seven common methods on a 32-GPU cluster with 100 Gb/s InfiniBand shows that DL models with low model intensity are difficult to scale, and that system architecture and scheduling algorithms have a critical impact on performance. Open issues for further investigation are also discussed.
Large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, where extensive communication between workers poses serious scaling problems. In this article, we present a quantitative survey of communication optimization techniques for data-parallel distributed DL. We first identify the major communication challenges and classify the existing solutions into three levels: the learning algorithm, the system architecture, and the network infrastructure. We then present the state-of-the-art communication optimization techniques and conduct a comparative study of seven common lossless distributed DL methods on a 32-GPU cluster with 100 Gb/s InfiniBand (IB). We show that DL models with low model intensity (such as BERT and BERT-Large) are difficult to scale out even with the best available lossless algorithm over 100 Gb/s IB, and that the system architecture and scheduling algorithms have a critical impact on the scaling property. We conclude the article with a discussion of open issues for further investigation.
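To give a rough sense of why gradient synchronization can limit scaling in data-parallel training, the sketch below estimates a bandwidth-only lower bound on the time for one all-reduce of the gradients. It is not taken from the article: it assumes a ring all-reduce (one common choice, not necessarily any of the seven surveyed methods), 32-bit gradients, no latency term, and no overlap between computation and communication. The function names, the parameter count of roughly 340 million (a commonly quoted approximate size for BERT-Large), and the 0.5 s compute time are illustrative assumptions only.

```python
def ring_allreduce_seconds(num_params, num_workers, bandwidth_gbps, bytes_per_param=4):
    """Bandwidth-only lower bound for one ring all-reduce of the gradients.

    Assumption: latency and protocol overhead are ignored, so real
    measurements will be slower than this estimate.
    """
    if num_workers < 2:
        return 0.0  # a single worker has nothing to synchronize
    message_bytes = num_params * bytes_per_param
    # In a ring all-reduce each worker sends 2 * (N - 1) / N of the message.
    bytes_on_wire = 2 * (num_workers - 1) / num_workers * message_bytes
    bytes_per_second = bandwidth_gbps * 1e9 / 8  # Gb/s -> bytes/s
    return bytes_on_wire / bytes_per_second


def scaling_efficiency(compute_seconds, comm_seconds):
    """Fraction of an iteration spent computing when nothing is overlapped."""
    return compute_seconds / (compute_seconds + comm_seconds)


if __name__ == "__main__":
    # Hypothetical inputs: ~340M parameters, 32 workers, 100 Gb/s links,
    # and an assumed 0.5 s of computation per iteration.
    comm = ring_allreduce_seconds(340e6, 32, 100)
    print(f"estimated all-reduce time: {comm:.3f} s")
    print(f"estimated scaling efficiency: {scaling_efficiency(0.5, comm):.2f}")
```

Under these assumptions, the communication time depends only on the model size and the link bandwidth, so a model that performs little computation per parameter spends a larger share of each iteration waiting on the network. Overlapping communication with computation through scheduling, which the article identifies as critical, is exactly what this simple estimate leaves out.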
