3.8 Proceedings Paper

Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3448016.3452773

Keywords

Distributed machine learning; Heterogeneity; All-Reduce

Funding

  1. National Key Research and Development Program of China [2018YFB1004403]
  2. National Natural Science Foundation of China [61832001, U1936104, 61702015]
  3. PKU-Tencent joint research Lab
  4. CAAI Huawei MindSpore Open Fund
  5. Fundamental Research Funds for the Central Universities [2020RC25]
  6. Beijing Academy of Artificial Intelligence (BAAI)

The paper introduces partial-reduce, a novel variant of all-reduce that improves heterogeneity tolerance and performance by decomposing the synchronous all-reduce operation into parallel-asynchronous partial-reduce operations, while converging at a sub-linear rate similar to that of distributed SGD.
All-reduce is the key communication primitive in distributed data-parallel training because of its high performance in homogeneous environments. However, all-reduce is sensitive to stragglers and communication delays, which matters as deep learning is increasingly deployed in heterogeneous environments such as the cloud. In this paper, we propose and analyze a novel variant of all-reduce, called partial-reduce, which provides high heterogeneity tolerance and performance by decomposing the synchronous all-reduce primitive into parallel-asynchronous partial-reduce operations. We provide theoretical guarantees, proving that partial-reduce converges to a stationary point at a sub-linear rate similar to that of distributed SGD. To ensure the convergence of the partial-reduce primitive, we further propose a dynamic staleness-aware distributed averaging algorithm and implement a novel group generation mechanism to prevent possible update isolation in heterogeneous environments. We build a prototype system in a real production cluster and validate its performance under different workloads. The experiments show that it is 1.21x-2x faster than state-of-the-art baselines.
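
The difference between the two primitives can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's implementation: the function names `all_reduce_mean` and `partial_reduce_mean` are hypothetical, the groups are chosen by hand rather than by the paper's group generation mechanism, and a plain unweighted average stands in for the dynamic staleness-aware averaging, whose details the abstract does not give.

```python
from typing import Dict, List


def partial_reduce_mean(models: Dict[int, List[float]], group: List[int]) -> None:
    """Average parameters only among the workers in `group`, in place.

    Assumption: plain unweighted mean; the paper's staleness-aware
    weighting is not specified in the abstract and is not modeled here.
    """
    dim = len(models[group[0]])
    mean = [sum(models[w][i] for w in group) / len(group) for i in range(dim)]
    for w in group:
        models[w] = list(mean)


def all_reduce_mean(models: Dict[int, List[float]]) -> None:
    """Synchronous baseline: every worker must take part in the average."""
    partial_reduce_mean(models, list(models))


# Four workers with one-parameter models; worker 3 plays the straggler.
models = {0: [1.0], 1: [3.0], 2: [5.0], 3: [7.0]}

# Hand-picked groups (hypothetical): the ready workers synchronize
# without waiting for worker 3.
partial_reduce_mean(models, group=[0, 1])
partial_reduce_mean(models, group=[1, 2])  # overlap with the first group spreads updates
```

In this toy run the straggler never blocks the other workers, and its parameters stay untouched until it joins a group. A worker that is never grouped with the others would be isolated from their updates, which appears to be the situation the group generation mechanism in the abstract is meant to prevent. Calling `all_reduce_mean(models)` instead would require all four workers, including the straggler, to participate in every step.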
