Proceedings Paper

Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce

Publisher

Association for Computing Machinery (ACM)
DOI: 10.1145/3448016.3452773

Keywords

Distributed machine learning; Heterogeneity; All-Reduce

Funding

  1. National Key Research and Development Program of China [2018YFB1004403]
  2. National Natural Science Foundation of China [61832001, U1936104, 61702015]
  3. PKU-Tencent joint research Lab
  4. CAAI Huawei MindSpore Open Fund
  5. Fundamental Research Funds for the Central Universities [2020RC25]
  6. Beijing Academy of Artificial Intelligence (BAAI)

The paper introduces a novel variant of All-reduce called partial-reduce, which improves heterogeneity tolerance and performance by decomposing the synchronous operation into parallel-asynchronous partial-reduce operations, while retaining a sub-linear convergence rate similar to that of distributed SGD.
All-reduce is the key communication primitive used in distributed data-parallel training because of its high performance in homogeneous environments. However, All-reduce is sensitive to stragglers and communication delays, and deep learning is increasingly deployed in heterogeneous environments such as the cloud. In this paper, we propose and analyze a novel variant of all-reduce, called partial-reduce, which provides high heterogeneity tolerance and performance by decomposing the synchronous all-reduce primitive into parallel-asynchronous partial-reduce operations. We provide theoretical guarantees, proving that partial-reduce converges to a stationary point at a sub-linear rate similar to that of distributed SGD. To enforce the convergence of the partial-reduce primitive, we further propose a dynamic staleness-aware distributed averaging algorithm and implement a novel group generation mechanism to prevent possible update isolation in heterogeneous environments. We build a prototype system on a real production cluster and validate its performance under different workloads. The experiments show that it is 1.21x-2x faster than other state-of-the-art baselines.
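The following toy sketch illustrates the general idea described in the abstract: rather than waiting for all workers as in synchronous all-reduce, only the subset of workers that is currently ready averages its updates, and contributions are down-weighted by staleness. The group size, weighting scheme, and update rule here are illustrative assumptions, not the paper's exact algorithm or implementation.

```python
# Toy simulation of a partial-reduce step with staleness-aware averaging.
# All constants and the weighting function are assumptions for illustration.
import random

NUM_WORKERS = 8
GROUP_SIZE = 4                      # assumed minimum group size for a partial-reduce
model = 0.0                         # toy scalar "model"
staleness = [0] * NUM_WORKERS       # iterations since each worker last synchronized

def local_gradient(worker_id, model):
    """Toy local gradient: pulls the model toward a worker-specific target."""
    target = worker_id * 0.1
    return model - target

for step in range(100):
    # Workers that happen to be ready this round (stragglers simulated randomly).
    ready = [w for w in range(NUM_WORKERS) if random.random() < 0.6]
    if len(ready) < GROUP_SIZE:
        # Not enough ready workers: everyone just accumulates staleness.
        for w in range(NUM_WORKERS):
            staleness[w] += 1
        continue

    group = ready[:GROUP_SIZE]

    # Staleness-aware averaging (assumed weighting): fresher updates count more.
    weights = [1.0 / (1 + staleness[w]) for w in group]
    total = sum(weights)
    avg_grad = sum(wt * local_gradient(w, model)
                   for w, wt in zip(group, weights)) / total

    model -= 0.1 * avg_grad         # SGD step with the partially reduced gradient

    for w in range(NUM_WORKERS):
        staleness[w] = 0 if w in group else staleness[w] + 1

print("final model:", model)
```

In this sketch the partial-reduce group is simply the first few ready workers; the paper instead describes a dedicated group generation mechanism to prevent update isolation, which is not reproduced here.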
