Article

Breaking (Global) Barriers in Parallel Stochastic Optimization With Wait-Avoiding Group Averaging

Journal

IEEE Transactions on Parallel and Distributed Systems

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TPDS.2020.3040606

Keywords

Computational modeling; Training; Convergence; Program processors; Stochastic processes; Deep learning; Task analysis; Stochastic gradient descent; distributed deep learning; decentralized optimization

Funding

  1. European Research Council (ERC) under the European Union [678880, 801039]
  2. ERC [805223]
  3. Swiss National Science Foundation [185778]
  4. ETH Postdoctoral Fellowship

Abstract

The study introduces a wait-avoiding stochastic optimizer, WAGMA-SGD, that reduces global communication through subgroup weight exchange while maintaining convergence rates similar to globally communicating SGD. Empirical results show significant advantages of the method across different tasks, particularly in training throughput and time-to-solution.
Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
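The core idea of group model averaging can be illustrated with a short PyTorch-style sketch. This is only a minimal synchronous sketch under assumed settings: the group layout, the averaging interval, and the helper names (build_groups, group_average, train_step) are illustrative, and the actual WAGMA-SGD uses a wait-avoiding (non-blocking) group allreduce and a different group schedule than shown here.

```python
# Minimal sketch: local SGD with periodic weight averaging inside subgroups,
# assuming torch.distributed has been initialized (e.g., init_process_group).
import torch
import torch.distributed as dist


def build_groups(group_size: int):
    """Partition all ranks into disjoint subgroups of `group_size` ranks.
    Every process must create every group handle, so all ranks call this."""
    world = dist.get_world_size()
    groups = []
    for start in range(0, world, group_size):
        ranks = list(range(start, min(start + group_size, world)))
        groups.append((ranks, dist.new_group(ranks)))
    return groups


def group_average(model: torch.nn.Module, group, group_ranks):
    """Average model weights within one subgroup via a group allreduce."""
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM, group=group)
        param.data /= len(group_ranks)


def train_step(model, optimizer, loss_fn, batch, groups, step, avg_every=4):
    """One local SGD step; every `avg_every` steps, average weights within
    the subgroup containing this rank instead of communicating globally."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    optimizer.step()

    if step % avg_every == 0:
        rank = dist.get_rank()
        for ranks, group in groups:
            if rank in ranks:
                group_average(model, group, ranks)
                break
```

In this sketch each averaging step only involves `group_size` ranks rather than the full world, which is what reduces the global communication volume; the paper additionally overlaps this exchange with computation to avoid waiting on stragglers.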
