Proceedings Paper

Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce

Publisher

Association for Computing Machinery (ACM)
DOI: 10.1145/3448016.3452773

Keywords

Distributed machine learning; Heterogeneity; All-Reduce

Funding

  1. National Key Research and Development Program of China [2018YFB1004403]
  2. National Natural Science Foundation of China [61832001, U1936104, 61702015]
  3. PKU-Tencent joint research Lab
  4. CAAI Huawei MindSpore Open Fund
  5. Fundamental Research Funds for the Central Universities [2020RC25]
  6. Beijing Academy of Artificial Intelligence (BAAI)

The paper introduces a novel variant of All-reduce called partial-reduce, which improves heterogeneity tolerance and performance by decomposing the synchronous operation into parallel-asynchronous partial-reduce operations, while retaining a sub-linear convergence rate similar to that of distributed SGD.
All-reduce is the key communication primitive used in distributed data-parallel training because of its high performance in homogeneous environments. However, All-reduce is sensitive to stragglers and communication delays, and deep learning is increasingly deployed in heterogeneous environments such as the cloud. In this paper, we propose and analyze a novel variant of all-reduce, called partial-reduce, which provides high heterogeneity tolerance and performance by decomposing the synchronous all-reduce primitive into parallel-asynchronous partial-reduce operations. We provide theoretical guarantees, proving that partial-reduce converges to a stationary point at a sub-linear rate similar to that of distributed SGD. To enforce the convergence of the partial-reduce primitive, we further propose a dynamic staleness-aware distributed averaging algorithm and implement a novel group generation mechanism to prevent possible update isolation in heterogeneous environments. We build a prototype system on a real production cluster and validate its performance under different workloads. The experiments show that it is 1.21x-2x faster than other state-of-the-art baselines.
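The following toy sketch illustrates the general idea described in the abstract: rather than waiting for all workers as in synchronous all-reduce, only the subset of workers that is currently ready averages its updates, and contributions are down-weighted by staleness. The group size, weighting scheme, and update rule here are illustrative assumptions, not the paper's exact algorithm or implementation.

```python
# Toy simulation of a partial-reduce step with staleness-aware averaging.
# All constants and the weighting function are assumptions for illustration.
import random

NUM_WORKERS = 8
GROUP_SIZE = 4                      # assumed minimum group size for a partial-reduce
model = 0.0                         # toy scalar "model"
staleness = [0] * NUM_WORKERS       # iterations since each worker last synchronized

def local_gradient(worker_id, model):
    """Toy local gradient: pulls the model toward a worker-specific target."""
    target = worker_id * 0.1
    return model - target

for step in range(100):
    # Workers that happen to be ready this round (stragglers simulated randomly).
    ready = [w for w in range(NUM_WORKERS) if random.random() < 0.6]
    if len(ready) < GROUP_SIZE:
        # Not enough ready workers: everyone just accumulates staleness.
        for w in range(NUM_WORKERS):
            staleness[w] += 1
        continue

    group = ready[:GROUP_SIZE]

    # Staleness-aware averaging (assumed weighting): fresher updates count more.
    weights = [1.0 / (1 + staleness[w]) for w in group]
    total = sum(weights)
    avg_grad = sum(wt * local_gradient(w, model)
                   for w, wt in zip(group, weights)) / total

    model -= 0.1 * avg_grad         # SGD step with the partially reduced gradient

    for w in range(NUM_WORKERS):
        staleness[w] = 0 if w in group else staleness[w] + 1

print("final model:", model)
```

In this sketch the partial-reduce group is simply the first few ready workers; the paper instead describes a dedicated group generation mechanism to prevent update isolation, which is not reproduced here.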
