4.5 Article

Coded Sparse Matrix Computation Schemes That Leverage Partial Stragglers

期刊

IEEE TRANSACTIONS ON INFORMATION THEORY
卷 68, 期 6, 页码 4156-4181

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TIT.2022.3152827

关键词

Sparse matrices; Task analysis; Encoding; Codes; Decoding; Web services; Numerical stability; Distributed computing; MDS code; stragglers; condition number; sparsity

资金

  1. National Science Foundation (NSF) [CCF-1718470, CCF-1910840, CCF-2115200]

向作者/读者索取更多资源

Distributed matrix computations can be affected by slow or failed worker nodes, but coded computation can mitigate these issues. However, using MDS codes may destroy the sparsity of sparse matrices and ignore the partial computations from slow nodes. This research proposes a scheme that leverages partial computation and reduces coding time, leading to improved numerical stability in the decoding process.
Distributed matrix computations over large clusters can suffer from the problem of slow or failed worker nodes (called stragglers) which can dominate the overall job execution time. Coded computation utilizes concepts from erasure coding to mitigate the effect of stragglers by running coded copies of tasks comprising a job; stragglers are typically treated as erasures. While this is useful, there are issues with applying, e.g., MDS codes in a straightforward manner. Several practical matrix computation scenarios involve sparse matrices. MDS codes typically require dense linear combinations of submatrices of the original matrices which destroy their inherent sparsity. This is problematic as it results in significantly higher worker computation times. Moreover, treating slow nodes as erasures ignores the potentially useful partial computations performed by them. Furthermore, some MDS techniques also suffer from significant numerical stability issues. In this work we present schemes that allow us to leverage partial computation by stragglers while imposing constraints on the level of coding that is required in generating the encoded submatrices. This significantly reduces the worker computation time as compared to previous approaches and results in improved numerical stability in the decoding process. Exhaustive numerical experiments on Amazon Web Services (AWS) clusters support our findings.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

4.5
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据