4.7 Article

Multi-Framework Reliability Approach

Journal

IEEE TRANSACTIONS ON CLOUD COMPUTING
Volume 10, Issue 4, Pages 2750-2768

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCC.2021.3065906

Keywords

Big-data; cloud-computing; fault-tolerance; reliability

Funding

  1. Northrop Grumman CRC
  2. Amazon AWS
  3. DARPA Grant [N11AP20014]
  4. NSF [TC-1117065, SWF-1421910, CSR-1618923]
  5. ERC Consolidator Grant [617805]
  6. DFG
  7. BMBF Center CRISP

This article proposes the paradigm of dependable resources, which yields generic fault tolerance mechanisms by providing fault tolerance support at the level of resource management systems. A prototype, Guardian, demonstrates the benefits of this concept: it improves completion time for big data processing frameworks in the presence of failures while maintaining low overhead.
Despite advances in making datacenters dependable, failures still happen. This is particularly onerous for long-running big data applications, where partial failures can lead to significant losses and lengthy recomputations. Big data processing frameworks like Hadoop MapReduce include fault tolerance (FT) mechanisms, but these commonly target specific system/failure models and are often redundant across frameworks. This article proposes the paradigm of dependable resources: big data processing frameworks are typically built on top of resource management systems (RMSs), and providing FT support at the level of such an RMS yields generic FT mechanisms, which can be offered with low overhead by leveraging constraints on resources. We demonstrate our concepts through Guardian, a robust RMS based on Mesos and YARN. Guardian allows frameworks to run their applications with individually configurable FT granularity and degree, with only minor changes to their implementation. We demonstrate the benefits of our approach by evaluating Hadoop, Tez, Spark and Pig on a prototype of Guardian running on Amazon EC2, improving completion time by around 68 percent in the presence of failures, while maintaining around 6 percent overhead.
