Proceedings Paper

Strategies to Deploy and Scale Deep Learning on the Summit Supercomputer

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/DLS49591.2019.00016

Keywords

HPC; software deployment; performance evaluation; scalable machine learning

Funding

  1. U.S. Department of Energy [DE-AC05-00OR22725]

Abstract

The rapid growth and wide applicability of Deep Learning (DL) frameworks pose challenges to computing centers, which need to deploy and support the software, and to domain scientists, who must keep up with the system environment and scale up scientific exploration through DL. We offer recommendations for deploying and scaling DL frameworks on the Summit supercomputer, currently atop the Top500 list, at the Oak Ridge Leadership Computing Facility (OLCF). We discuss DL software deployment in the form of containers and compare the performance of natively built frameworks against containerized deployment. Software containers show no noticeable negative performance impact, exhibit faster Python loading times, and promise easier maintenance. To explore strategies for scaling up DL model training campaigns, we assess DL compute kernel performance, discuss and recommend I/O data formats and staging, and identify communication needs for scalable message exchange in DL runs at scale. As a best practice, we recommend that users take a step-wise tuning approach, beginning with algorithmic kernel choice, followed by node I/O configuration and communications tuning. We present a baseline example of 87% scaling efficiency for a ResNet50 run on 1024 nodes (6144 V100 GPUs).
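
The 87% figure is the usual ratio of measured aggregate throughput to ideal linear scaling from a single node. Below is a minimal Python sketch of that arithmetic; the per-node throughput value is an assumed placeholder for illustration, not a number reported in the paper.

    # Illustrative scaling-efficiency arithmetic; the 2,000 images/s per-node
    # figure is an assumption, not a measurement from the paper.

    def scaling_efficiency(per_node_throughput: float,
                           n_nodes: int,
                           measured_throughput: float) -> float:
        """Efficiency = measured aggregate throughput / ideal linear scaling."""
        ideal = per_node_throughput * n_nodes
        return measured_throughput / ideal

    if __name__ == "__main__":
        # If one Summit node (6 V100s) sustained 2,000 images/s on ResNet50,
        # ideal scaling to 1024 nodes would be 2,048,000 images/s; an 87%
        # efficient run would sustain about 1.78 M images/s in aggregate.
        eff = scaling_efficiency(2000.0, 1024, 0.87 * 2000.0 * 1024)
        print(f"scaling efficiency: {eff:.2%}")  # -> 87.00%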
