4.7 Article

Bandwidth-Aware Scheduling Repair Techniques in Erasure-Coded Clusters: Design and Analysis

Journal

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2022.3153061

Keywords

Erasure coding; recovery time; heterogeneous network; repair link

Funding

  1. National Key R&D Program of China [2018YFB1003305]
  2. National Natural Science Foundation of China [61872414]
  3. Key Laboratory of Information Storage System Ministry of Education of China

This article proposes SMFRepair, a single-node multi-level forwarding repair technique for heterogeneous networks, and MSRepair, a multi-node scheduling repair technique. SMFRepair shortens single-node recovery time by carefully selecting helper nodes and using idle nodes, while MSRepair reduces multi-node recovery time by scheduling repair links across multiple nodes.
Erasure codes offer a storage-efficient redundancy mechanism for maintaining data availability guarantees in storage clusters, yet they also incur high network traffic and long recovery times during failure repair. Extensive research has been carried out to reduce the recovery time. However, previous works either target specific erasure code constructions that are not commonly used in today's distributed storage clusters or neglect the heterogeneous bandwidth of real network environments. Since erasure-coded clusters typically consist of multiple nodes with heterogeneous bandwidth that are accessed in parallel, the overall recovery time is mainly limited by the low-bandwidth links. In this article, we propose SMFRepair, a single-node multi-level forwarding repair technique designed to improve performance in heterogeneous networks, based on Reed-Solomon codes for general fault tolerance. SMFRepair carefully selects the helper nodes and uses idle nodes, which have sufficient unused network bandwidth, to bypass low-bandwidth links. It also pipelines the repair links that are optimized by idle nodes. Furthermore, a multi-node scheduling repair technique, called MSRepair, is proposed. MSRepair carefully schedules the multi-node repair links to saturate as much unoccupied bandwidth as possible and transfers data over links with as much bandwidth as possible, with the primary objective of minimizing the recovery time. Large-scale simulations and real experiments on Amazon EC2 show that, compared to state-of-the-art repair techniques, SMFRepair can accelerate single-node recovery by up to 47.69%, and MSRepair can reduce the multi-node recovery time by 33.78% to 67.53%.
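
To make the idea of bypassing low-bandwidth links concrete, here is a minimal Python sketch. It is not the algorithm from the article. It assumes a simplified model in which helpers send to the requester in parallel, the slowest path bounds the recovery time, and each idle node relays for at most one helper. The function names, the greedy rule, and the bandwidth figures are all hypothetical, and the model ignores contention on the requester's downlink.

```python
def effective_bandwidths(bw, helpers, requester, idle):
    """Greedy illustration: let the slowest helpers relay via unused idle nodes.

    bw[a][b] is the assumed bandwidth (MB/s) of the link from a to b.
    Returns {helper: (path_bandwidth, relay_or_None)}.
    """
    free = set(idle)
    paths = {}
    # Handle the slowest direct links first, since they bound the repair time.
    for h in sorted(helpers, key=lambda h: bw[h][requester]):
        best_bw, best_relay = bw[h][requester], None
        for i in sorted(free):
            # A two-hop path is limited by its slower hop.
            relayed = min(bw[h][i], bw[i][requester])
            if relayed > best_bw:
                best_bw, best_relay = relayed, i
        if best_relay is not None:
            free.remove(best_relay)  # each idle node relays for one helper only
        paths[h] = (best_bw, best_relay)
    return paths


def recovery_time(chunk_mb, paths):
    """Parallel transfers finish when the slowest path finishes."""
    return chunk_mb / min(b for b, _ in paths.values())


# Hypothetical cluster: helpers A and B, requester R, idle node I.
bw = {
    "A": {"R": 100, "I": 100},
    "B": {"R": 10, "I": 80},   # B's direct link to R is the bottleneck
    "I": {"R": 90},
}
direct = {h: (bw[h]["R"], None) for h in ("A", "B")}
relayed = effective_bandwidths(bw, ["A", "B"], "R", ["I"])
print(recovery_time(64, direct), recovery_time(64, relayed))
```

In this toy setup, routing helper B through the idle node raises its path bandwidth from 10 MB/s to 80 MB/s, so the bottleneck no longer sits on the slow direct link.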
