Article

Gradient Coding With Dynamic Clustering for Straggler-Tolerant Distributed Learning

IEEE TRANSACTIONS ON COMMUNICATIONS (2023)

Journal

IEEE TRANSACTIONS ON COMMUNICATIONS

Volume 71, Issue 6, Pages 3317-3332

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TCOMM.2022.3166902

Keywords

Distributed coded computation; gradient descent; straggler mitigation; gradient coding; clustering

Categories

Engineering, Electrical & Electronic; Telecommunications

Abstract

Distributed implementations are crucial in speeding up large-scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers. Coded distributed computation techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers. In this paper, we introduce a novel paradigm of dynamic coded computation, which assigns redundant data to workers to acquire the flexibility to dynamically choose from among a set of possible codes depending on the past straggling behavior. In particular, we propose gradient coding (GC) with dynamic clustering, called GC-DC, and regulate the number of stragglers in each cluster by dynamically forming the clusters at each iteration. With time-correlated straggling behavior, GC-DC adapts to the straggling behavior over time; in particular, at each iteration, GC-DC aims at distributing the stragglers across clusters as uniformly as possible based on the past straggler behavior. For both homogeneous and heterogeneous worker models, we numerically show that GC-DC provides significant improvements in the average per-iteration completion time without an increase in the communication load compared to the original GC scheme.
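The abstract states the goal of dynamic clustering (spreading stragglers across clusters as uniformly as possible) but not the assignment rule itself. The sketch below is only an illustration of that goal, not the paper's algorithm: it assumes each worker has a boolean "predicted straggler" flag derived from past iterations, assumes equal-size clusters, and ignores any constraint on which clusters a worker may join given the data it stores. The names spread_stragglers and stragglers_per_cluster are hypothetical.

```python
from typing import List, Sequence


def spread_stragglers(predicted_straggler: Sequence[bool],
                      num_clusters: int) -> List[List[int]]:
    """Partition worker indices into equal-size clusters, dealing the
    predicted stragglers out round-robin so that per-cluster straggler
    counts differ by at most one.

    Assumption (not from the paper): any worker may join any cluster.
    """
    num_workers = len(predicted_straggler)
    if num_clusters <= 0 or num_workers % num_clusters != 0:
        raise ValueError("num_clusters must be positive and divide the number of workers")

    # Predicted stragglers first, then the remaining workers.
    # sorted() is stable, so workers keep index order within each group.
    order = sorted(range(num_workers), key=lambda w: not predicted_straggler[w])

    clusters: List[List[int]] = [[] for _ in range(num_clusters)]
    for position, worker in enumerate(order):
        clusters[position % num_clusters].append(worker)
    return clusters


def stragglers_per_cluster(clusters: Sequence[Sequence[int]],
                           predicted_straggler: Sequence[bool]) -> List[int]:
    """Count predicted stragglers in each cluster."""
    return [sum(1 for w in cluster if predicted_straggler[w]) for cluster in clusters]


if __name__ == "__main__":
    # Hypothetical flags for 8 workers; workers 0, 1 and 4 straggled recently.
    flags = [True, True, False, False, True, False, False, False]

    static = [[0, 1], [2, 3], [4, 5], [6, 7]]   # fixed clustering by index
    dynamic = spread_stragglers(flags, num_clusters=4)

    print(stragglers_per_cluster(static, flags))   # [2, 0, 1, 0]
    print(dynamic)                                 # [[0, 3], [1, 5], [4, 6], [2, 7]]
    print(stragglers_per_cluster(dynamic, flags))  # [1, 1, 1, 0]
```

In this toy example the fixed clustering places two predicted stragglers in one cluster, while the round-robin assignment leaves at most one per cluster. A faithful implementation would also have to respect the redundant data placement the abstract mentions, which this sketch does not model.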
