Journal
IEEE NETWORK
Volume 35, Issue 3, Pages 230-237
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/MNET.011.2000537
Keywords
Training; Computational modeling; Distributed databases; Parallel processing; Data models; Tensors; Task analysis
Funding
- Hong Kong RGC GRF grants [HKBU 12200418, HKUST 16206417, 16207818]
- National Natural Science Foundation of China [62002240]
This article presents a quantitative survey of communication optimization techniques for data parallel distributed deep learning, identifying major challenges and classifying solutions at different levels. A comparative study of seven common methods on a 32-GPU cluster with 100Gb/s InfiniBand reveals the difficulties in scaling DL models with low model intensity, and highlights the critical impact of system architecture and scheduling algorithms on performance. Discussions on open issues for further investigation are also provided.
Nowadays, large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, in which extensive communications between workers pose serious scaling problems. In this article, we present a quantitative survey of communication optimization techniques for data parallel distributed DL. We first identify the major communication challenges and classify the existing solutions into three levels, namely the learning algorithm, the system architecture, and the network infrastructure. We present the state-of-the-art communication optimization techniques and conduct a comparative study of seven common lossless distributed DL methods on a 32-GPU cluster with 100Gb/s InfiniBand (IB). We show that the DL models with low model intensity (such as BERT and BERT-Large) are difficult to scale out even with the best available lossless algorithm over 100Gb/s IB; and the system architecture and scheduling algorithms have a critical impact on the scaling property. We conclude the article with discussions of open issues for further investigation.
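The data-parallel setting the abstract describes keeps an identical model replica on every worker, computes gradients on local data shards, and synchronizes the replicas by averaging gradients (typically via all-reduce) before each update; this gradient exchange is the communication cost the survey analyzes. The following is a minimal illustrative sketch of that loop, not code from the article: the model, data, and learning rate are invented for illustration, and `allreduce_mean` is a hypothetical stand-in for a real ring or tree all-reduce.

```python
# Illustrative sketch: synchronous data-parallel SGD on a 1-D linear model.
# Each "worker" holds a data shard and an identical replica of w; after the
# local gradient computation, gradients are averaged across workers (the
# all-reduce step) so every replica applies the same update.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)

def allreduce_mean(values):
    """Stand-in for an all-reduce: average one scalar gradient per worker."""
    return sum(values) / len(values)

def train(shards, w=0.0, lr=0.05, steps=200):
    for _ in range(steps):
        grads = [local_gradient(w, s) for s in shards]  # computed in parallel in practice
        w -= lr * allreduce_mean(grads)                 # identical update on every replica
    return w

# Two workers, data drawn from y = 3x: training recovers w close to 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
```

In a real system the all-reduce moves one gradient tensor per parameter tensor each iteration, which is why models with low model intensity (little computation per communicated byte, such as BERT in the authors' measurements) become communication-bound even on fast interconnects.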