Article; Proceedings Paper

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Journal

ACM TRANSACTIONS ON STORAGE
Volume 14, Issue 3, Pages -

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3242086

Keywords

Hardware fault; performance; fail-slow; fail-stutter; limpware; jitter

Funding

  1. NSF [CCF-1336580, CNS-1350499, CNS-1526304, CNS-1563956]
  2. DOE Office of Science User Facility [DE-AC02-06CH11357]

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.
