Journal
IEEE MICRO
Volume 40, Issue 1, Pages 35-43
Publisher
IEEE COMPUTER SOC
DOI: 10.1109/MM.2019.2949986
Funding
- National Science Foundation (NSF) [CNS-1513120, ACI-1450440, CCF-1565414, ACI-1664137]
Abstract
Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch, in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like Volta V100, and various interconnects to analyze the following metrics: 1) message size with Horovod's tensor fusion; 2) message size without tensor fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, illustrate significantly smoother allreduce latency profiles, and report cases where we observed improvements in end-to-end training.
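
To illustrate the padding idea, here is a minimal Python sketch, not the authors' Horovod patch: each message is zero-padded to the next power-of-two element count before the allreduce, and the padding is discarded afterward. The mpi4py usage, the SUM reduction, and the helper names (next_pow2, padded_allreduce) are assumptions for illustration.

# Minimal sketch of padding a message to a power-of-two size before
# allreduce (illustrative only; not the authors' Horovod implementation).
import numpy as np
from mpi4py import MPI

def next_pow2(n: int) -> int:
    # Smallest power of two >= n, for n >= 1.
    return 1 << (n - 1).bit_length()

def padded_allreduce(comm, buf):
    # Zero-pad to a power-of-two element count; zeros are neutral for a
    # SUM reduction, so the padded tail can be safely discarded.
    n = buf.size
    padded = np.zeros(next_pow2(n), dtype=buf.dtype)
    padded[:n] = buf
    out = np.empty_like(padded)
    comm.Allreduce(padded, out, op=MPI.SUM)
    return out[:n]

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # A non-power-of-two gradient of 3,145,000 elements is padded to
    # 4,194,304 elements before the collective call.
    grad = np.ones(3_145_000, dtype=np.float32)
    reduced = padded_allreduce(comm, grad)

The trade-off is a few extra bytes on the wire in exchange for avoiding the slow non-power-of-two code paths in the underlying MPI/NCCL collective algorithms, which is consistent with the performance variations reported above.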