Proceedings Paper

Co-designing the Topology/Algorithm to Accelerate Distributed Training

Publisher

IEEE
DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00141

Keywords

topology; hardware training platform; distributed training; collective communication

Funding

  1. Science and Technology Innovation project of Hunan Province [2018RS3083]
  2. National Key Research and Development Project [2018YFB0204301]

Abstract

With the development of Deep Learning (DL), Deep Neural Network (DNN) models have become more complex. At the same time, the growth of the Internet makes it easy to obtain large datasets for DL training. Large-scale model parameters and training data raise the level of AI by improving the accuracy of DNN models, but they also present more severe challenges to the hardware training platform, because training a large model requires computing and memory resources that can easily exceed the capacity of a single processor. In this context, integrating more processors into a hierarchical system to conduct distributed training is a direction for the development of training platforms. In distributed training, collective communication operations (including all-to-all, all-reduce, and all-gather) account for a large share of training time, making the interconnection network between computing nodes one of the most critical factors affecting system performance. The hierarchical torus topology, combined with the Ring All-Reduce collective communication algorithm, is one of the current mainstream distributed interconnection networks. However, we believe its communication performance is not optimal. In this work, we first designed a new intra-package communication topology, i.e., a switch-based fully connected topology, which shortens the time consumed by cross-node communication. Then, considering the characteristics of this topology, we carefully devised more efficient all-reduce and all-gather communication algorithms. Finally, combined with the torus topology, we implemented a novel distributed DL training platform. Compared with the hierarchical torus, our platform improves communication efficiency and provides a 1.16-2.68x speedup in distributed training of DNN models.
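For context, the sketch below is a minimal single-process simulation of the Ring All-Reduce schedule that the abstract cites as the baseline collective on the hierarchical torus; it is not the paper's switch-based topology or its improved all-reduce/all-gather algorithms. The function name ring_all_reduce, the node count, the chunking, and the sum reduction are illustrative assumptions.

# Illustrative sketch only (not the paper's implementation): a single-process
# simulation of the classic Ring All-Reduce schedule referenced in the abstract.
import numpy as np

def ring_all_reduce(node_buffers):
    """Sum-reduce equal-length vectors so every 'node' ends with the full
    result, using (n-1) reduce-scatter steps followed by (n-1) all-gather
    steps around a logical ring."""
    n = len(node_buffers)
    # Each node splits its buffer into n chunks; chunk sizes at the same
    # index match across nodes because all buffers have equal length.
    chunks = [list(np.array_split(buf.astype(float), n)) for buf in node_buffers]

    # Phase 1: reduce-scatter. In step s, node i receives chunk (i-1-s) mod n
    # from its ring predecessor and accumulates it into its own copy.
    for s in range(n - 1):
        incoming = [chunks[(i - 1) % n][(i - 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[i][(i - 1 - s) % n] += incoming[i]

    # Phase 2: all-gather. The fully reduced chunks circulate around the ring
    # until every node holds all of them.
    for s in range(n - 1):
        incoming = [chunks[(i - 1) % n][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[i][(i - s) % n] = incoming[i]

    return [np.concatenate(c) for c in chunks]

# Example: four "nodes", each holding a gradient vector of length 8; after
# the call every node holds the element-wise sum of all four vectors.
grads = [np.arange(8) * (k + 1) for k in range(4)]
print(ring_all_reduce(grads)[0])   # [ 0. 10. 20. 30. 40. 50. 60. 70.]

Ring All-Reduce is bandwidth-efficient (each node transfers roughly 2(n-1)/n of its buffer), but every one of its 2(n-1) steps crosses a ring link, so per-hop latency accumulates; that cross-node communication time is what the paper's switch-based fully connected intra-package topology and its tailored collectives target.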
