Journal
Publisher
IEEE
DOI: 10.1109/ISPASS48437.2020.00018
Keywords
Distributed training; Collective communication; Training parallelism; High performance training systems
Category
Funding
- Facebook Faculty Research Award
Modern deep learning systems rely heavily on distributed training over high-performance accelerator-based hardware platforms (e.g., TPU, GPU); examples today include Google's Cloud TPU and Facebook's Zion. DNN training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the end-point accelerator. As AI/ML models continue to evolve at an accelerating pace, there is a need for a comprehensive methodology to understand and navigate this complex SW/HW design space so that future systems can support efficient training of future DNN models. In this work, we make the following contributions: (i) we establish the SW/HW design space for distributed training over a hierarchical scale-up fabric, (ii) we develop a network simulator for navigating this design space, and (iii) we demonstrate the promise of algorithm-topology co-design for speeding up end-to-end training.
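The abstract's point about algorithm-topology co-design can be made concrete with the standard alpha-beta cost model. The sketch below is not the paper's simulator; the function names, the two-level hierarchy, and all latency/bandwidth numbers are illustrative assumptions. It compares a flat ring all-reduce against a hierarchical variant that keeps most traffic on the faster intra-node (scale-up) links.

```python
# Minimal alpha-beta cost model for comparing collective-communication
# algorithms on a two-level (scale-up / scale-out) hierarchy.
# All parameter values below are illustrative assumptions.

def ring_allreduce_time(n, size_bytes, alpha, bw_bytes_per_s):
    """Ring all-reduce over n endpoints: 2*(n-1) latency steps plus
    2*(n-1)/n of the buffer crossing each link."""
    if n == 1:
        return 0.0
    steps = 2 * (n - 1)
    bytes_moved = 2 * (n - 1) / n * size_bytes
    return steps * alpha + bytes_moved / bw_bytes_per_s

def hierarchical_allreduce_time(nodes, gpus_per_node, size_bytes,
                                alpha_local, bw_local,
                                alpha_global, bw_global):
    """Reduce-scatter inside each node, all-reduce the resulting shards
    across nodes, then all-gather inside each node. Each intra-node phase
    moves (g-1)/g of the data; the inter-node phase handles only 1/g."""
    g = gpus_per_node
    local_phase = ((g - 1) * alpha_local
                   + (g - 1) / g * size_bytes / bw_local)
    global_phase = ring_allreduce_time(nodes, size_bytes / g,
                                       alpha_global, bw_global)
    return 2 * local_phase + global_phase

if __name__ == "__main__":
    size = 100e6  # 100 MB gradient buffer (assumed)
    flat = ring_allreduce_time(64, size, alpha=5e-6, bw_bytes_per_s=25e9)
    hier = hierarchical_allreduce_time(8, 8, size,
                                       alpha_local=1e-6, bw_local=150e9,
                                       alpha_global=5e-6, bw_global=25e9)
    print(f"flat ring over 64 GPUs:     {flat * 1e3:.2f} ms")
    print(f"hierarchical (8 nodes x 8): {hier * 1e3:.2f} ms")
```

Under these assumed numbers the hierarchical schedule comes out roughly 4x faster than the flat ring, which is the kind of gap a design-space simulator is meant to expose; actual results depend on the real fabric's latencies, bandwidths, and topology.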
Authors