Journal
IEEE MICRO
Volume 40, Issue 1, Pages 35-43
Publisher
IEEE COMPUTER SOC
DOI: 10.1109/MM.2019.2949986
Funding
- National Science Foundation (NSF) [CNS-1513120, ACI-1450440, CCF-1565414, ACI-1664137]
Abstract
Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch, in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like Volta V100, and various interconnects to analyze the following metrics: 1) message size with Horovod's tensor fusion; 2) message size without tensor fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, illustrate significantly smoother allreduce latency profiles, and report cases where we observed improvements in end-to-end training.
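
To illustrate the padding idea, here is a minimal Python sketch, not the authors' Horovod patch: each message is zero-padded to the next power-of-two element count before the allreduce, and the padding is discarded afterward. The mpi4py usage, the SUM reduction, and the helper names (next_pow2, padded_allreduce) are assumptions for illustration.

# Minimal sketch of padding a message to a power-of-two size before
# allreduce (illustrative only; not the authors' Horovod implementation).
import numpy as np
from mpi4py import MPI

def next_pow2(n: int) -> int:
    # Smallest power of two >= n, for n >= 1.
    return 1 << (n - 1).bit_length()

def padded_allreduce(comm, buf):
    # Zero-pad to a power-of-two element count; zeros are neutral for a
    # SUM reduction, so the padded tail can be safely discarded.
    n = buf.size
    padded = np.zeros(next_pow2(n), dtype=buf.dtype)
    padded[:n] = buf
    out = np.empty_like(padded)
    comm.Allreduce(padded, out, op=MPI.SUM)
    return out[:n]

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # A non-power-of-two gradient of 3,145,000 elements is padded to
    # 4,194,304 elements before the collective call.
    grad = np.ones(3_145_000, dtype=np.float32)
    reduced = padded_allreduce(comm, grad)

The trade-off is a few extra bytes on the wire in exchange for avoiding the slow non-power-of-two code paths in the underlying MPI/NCCL collective algorithms, which is consistent with the performance variations reported above.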