Article

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications

Journal

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2016.2517639

Keywords

Fault tolerance; silent data corruption; exascale HPC

Funding

  1. U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Program [DE-AC02-06CH11357]
  2. ANR RESCUE
  3. INRIA-Illinois-ANL-BSC Joint Laboratory on Extreme Scale Computing
  4. Center for Exascale Simulation of Advanced Reactors (CESAR) at Argonne
  5. U.S. Department of Energy Office of Science laboratory [DE-AC02-06CH11357]

For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because it gives no indication that errors occurred during execution. We propose an adaptive impact-driven method that can detect SDCs dynamically. The key contributions are threefold. (1) We carefully characterize 18 HPC applications/benchmarks and discuss their runtime data features, as well as the impact of SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve the prediction accuracy, but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution can adapt to dynamic prediction errors based on local runtime data and can automatically tune detection ranges to guarantee a low false alarm rate. Experiments show that our detector can detect 80 to 99.99 percent of SDCs with a false alarm rate of less than 1 percent of iterations in most cases. The memory cost and detection overhead are reduced to 15 and 6.3 percent, respectively, for a large majority of applications.
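The idea of a detection range that adapts to recent prediction errors and tolerates non-influential deviations can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the prediction method or the tuning rule, so the linear extrapolation predictor, the class name `ImpactDrivenDetector`, and the parameters `impact_ratio`, `window`, and `warmup` are all assumptions made for illustration.

```python
from collections import deque


class ImpactDrivenDetector:
    """Toy one-step-ahead SDC detector for one data point's time series.

    Assumptions (not from the paper): values are predicted by linear
    extrapolation, and the detection radius is the largest recent
    prediction error plus a user-tolerated fraction of the value range.
    """

    def __init__(self, impact_ratio=0.01, window=8, warmup=4):
        self.impact_ratio = impact_ratio    # hypothetical tolerated error / value range
        self.warmup = warmup                # values accepted unchecked while learning
        self.accepted = 0
        self.history = deque(maxlen=2)      # last two accepted values
        self.errors = deque(maxlen=window)  # recent prediction errors
        self.lo = float("inf")
        self.hi = float("-inf")

    def _accept(self, value, error):
        if error is not None:
            self.errors.append(error)
        self.history.append(value)
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)
        self.accepted += 1

    def check(self, value):
        """Return True if `value` is flagged as a suspected SDC."""
        if len(self.history) < 2:
            self._accept(value, None)
            return False
        predicted = 2 * self.history[-1] - self.history[-2]
        error = abs(value - predicted)
        if self.accepted < self.warmup:
            self._accept(value, error)
            return False
        # Adaptive radius: recent prediction error plus the tolerated impact.
        radius = max(self.errors, default=0.0) + self.impact_ratio * (self.hi - self.lo)
        if error > radius:
            return True  # flagged values are not added to the history
        self._accept(value, error)
        return False


# Example run on a smooth synthetic series (illustrative data only).
detector = ImpactDrivenDetector()
series = [0.01 * i * i for i in range(20)]
for v in series[:15]:
    detector.check(v)
print(detector.check(series[15] + 5.0))  # large corruption at step 15
print(detector.check(series[15]))        # recomputed, uncorrupted value
```

In this sketch a flagged value is left out of the history, on the assumption that the application would recompute the iteration and resubmit the value. A deviation smaller than the tolerated impact is accepted on purpose, which mirrors the abstract's point about detecting only influential corruptions.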
