Article

Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects

Journal

IEEE MICRO
Volume 40, Issue 1, Pages 35-43

Publisher

IEEE Computer Society
DOI: 10.1109/MM.2019.2949986

Keywords

-

Funding

  1. National Science Foundation (NSF) [CNS-1513120, ACI-1450440, CCF-1565414, ACI-1664137]

Abstract

Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch, in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like Volta V100, and various interconnects to analyze the following metrics: 1) message size with Horovod's tensor fusion; 2) message size without tensor fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, illustrate significantly smoother allreduce latency profiles, and report cases where we observed improvement for end-to-end training.
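
The message-padding idea described in the abstract can be illustrated with a small sketch. The code below is not the authors' Horovod modification; it is a minimal, hypothetical mpi4py/NumPy example (the names pad_to_pow2 and padded_allreduce are invented for illustration) that zero-pads a buffer to the next power-of-two length before MPI_Allreduce and truncates the result, which is the general technique the paper applies inside Horovod to smooth allreduce latency for non-power-of-two message sizes.

```python
# Minimal sketch, assuming mpi4py and NumPy are available.
# pad_to_pow2 and padded_allreduce are illustrative names, not Horovod APIs.
import numpy as np
from mpi4py import MPI


def pad_to_pow2(buf: np.ndarray) -> np.ndarray:
    """Return buf zero-padded so its element count is the next power of two."""
    n = buf.size
    padded_len = 1 if n == 0 else 1 << (n - 1).bit_length()
    if padded_len == n:
        return buf
    out = np.zeros(padded_len, dtype=buf.dtype)
    out[:n] = buf
    return out


def padded_allreduce(comm: MPI.Comm, sendbuf: np.ndarray) -> np.ndarray:
    """Sum-allreduce sendbuf across ranks, padding to a power-of-two size first."""
    padded = pad_to_pow2(np.ascontiguousarray(sendbuf))
    recvbuf = np.empty_like(padded)
    comm.Allreduce(padded, recvbuf, op=MPI.SUM)
    return recvbuf[: sendbuf.size]  # drop the padding before returning


if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # A non-power-of-two message (e.g., 1,000,003 floats) is the kind of size
    # for which the paper reports erratic allreduce latency.
    grad = np.random.rand(1_000_003).astype(np.float32)
    reduced = padded_allreduce(comm, grad)
    if comm.rank == 0:
        print("reduced size:", reduced.size)
```

Zero padding preserves a sum reduction because the extra elements contribute nothing to the result; the trade-off is transferring a few extra bytes in exchange for staying on the collective library's typically faster power-of-two code paths.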
