Journal
Publisher
IEEE
DOI: 10.1109/ISPASS48437.2020.00018
Keywords
Distributed training; Collective communication; Training parallelism; High performance training systems
Category
Funding
- Facebook Faculty Research Award
Modern deep learning systems rely heavily on distributed training over high-performance accelerator-based hardware platforms (e.g., TPU, GPU); examples today include Google's Cloud TPU and Facebook's Zion. DNN training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the end-point accelerator. As AI/ML models continue to evolve at an accelerating pace, there is a need for a comprehensive methodology to understand and navigate this complex SW/HW design space so that future systems can support efficient training of future DNN models. In this work, we make the following contributions: (i) we establish the SW/HW design space for distributed training over a hierarchical scale-up fabric, (ii) we develop a network simulator for navigating this design space, and (iii) we demonstrate the promise of algorithm-topology co-design for speeding up end-to-end training.
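The abstract's point about algorithm-topology co-design can be made concrete with the standard alpha-beta cost model. The sketch below is not the paper's simulator; the function names, the two-level hierarchy, and all latency/bandwidth numbers are illustrative assumptions. It compares a flat ring all-reduce against a hierarchical variant that keeps most traffic on the faster intra-node (scale-up) links.

```python
# Minimal alpha-beta cost model for comparing collective-communication
# algorithms on a two-level (scale-up / scale-out) hierarchy.
# All parameter values below are illustrative assumptions.

def ring_allreduce_time(n, size_bytes, alpha, bw_bytes_per_s):
    """Ring all-reduce over n endpoints: 2*(n-1) latency steps plus
    2*(n-1)/n of the buffer crossing each link."""
    if n == 1:
        return 0.0
    steps = 2 * (n - 1)
    bytes_moved = 2 * (n - 1) / n * size_bytes
    return steps * alpha + bytes_moved / bw_bytes_per_s

def hierarchical_allreduce_time(nodes, gpus_per_node, size_bytes,
                                alpha_local, bw_local,
                                alpha_global, bw_global):
    """Reduce-scatter inside each node, all-reduce the resulting shards
    across nodes, then all-gather inside each node. Each intra-node phase
    moves (g-1)/g of the data; the inter-node phase handles only 1/g."""
    g = gpus_per_node
    local_phase = ((g - 1) * alpha_local
                   + (g - 1) / g * size_bytes / bw_local)
    global_phase = ring_allreduce_time(nodes, size_bytes / g,
                                       alpha_global, bw_global)
    return 2 * local_phase + global_phase

if __name__ == "__main__":
    size = 100e6  # 100 MB gradient buffer (assumed)
    flat = ring_allreduce_time(64, size, alpha=5e-6, bw_bytes_per_s=25e9)
    hier = hierarchical_allreduce_time(8, 8, size,
                                       alpha_local=1e-6, bw_local=150e9,
                                       alpha_global=5e-6, bw_global=25e9)
    print(f"flat ring over 64 GPUs:     {flat * 1e3:.2f} ms")
    print(f"hierarchical (8 nodes x 8): {hier * 1e3:.2f} ms")
```

Under these assumed numbers the hierarchical schedule comes out roughly 4x faster than the flat ring, which is the kind of gap a design-space simulator is meant to expose; actual results depend on the real fabric's latencies, bandwidths, and topology.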
Authors