Journal
PPOPP'22: PROCEEDINGS OF THE 27TH ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING
Volume -, Issue -, Pages 437-438
Publisher
ASSOC COMPUTING MACHINERY
DOI: 10.1145/3503221.3508414
Keywords
Error Resilience; Fault Injection; Compiler; High Performance Computing
Funding
- U.S. Department of Energy, Office of Science [DE-AC02-06CH11357]
The study finds that existing selective instruction duplication (SID) techniques lose SDC coverage in HPC applications because they are evaluated on only a single program input. To address this issue, the authors propose Sentinel, an automated compiler-based framework that mitigates the loss of SDC coverage across multiple inputs.
With the ever-shrinking size of transistors and the increasing scale of applications, silent data corruptions (SDCs) have become a common yet serious issue in HPC applications. Selective instruction duplication (SID) is a popular fault-tolerance technique that can obtain high SDC coverage with low performance overhead, because it prioritizes the most vulnerable parts of a program for protection. However, existing studies of SID evaluate it on a single program input, assuming that the error resilience of the program remains similar across inputs. This assumption leads to a drastic loss of SDC coverage when the protected program runs with different inputs. Hence, we propose Sentinel, an automated compiler-based framework to mitigate the loss of SDC coverage. Evaluation results show that Sentinel can effectively mitigate the loss of SDC coverage (up to 97.00%) across multiple inputs, which significantly hardens existing SID techniques.
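The input sensitivity described in the abstract can be illustrated with a toy model. The sketch below is not Sentinel and not any published SID implementation. It assumes a simple greedy selection that ranks instructions by SDC weight per unit of duplication cost under an overhead budget. The instruction names, costs, and per-input SDC profiles are hypothetical numbers chosen only to show how a selection tuned on one input can cover far fewer SDCs on another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    cost: int  # hypothetical overhead (in percent) of duplicating this instruction


def select_for_duplication(instrs, sdc_weight, budget):
    """Greedy sketch: take instructions with the highest SDC weight per unit cost
    until the overhead budget would be exceeded."""
    ranked = sorted(instrs, key=lambda i: sdc_weight[i.name] / i.cost, reverse=True)
    chosen, spent = set(), 0
    for ins in ranked:
        if spent + ins.cost <= budget:
            chosen.add(ins.name)
            spent += ins.cost
    return chosen


def sdc_coverage(chosen, sdc_weight):
    """Fraction of the total SDC weight that falls on protected instructions."""
    total = sum(sdc_weight.values())
    return sum(w for name, w in sdc_weight.items() if name in chosen) / total


# Hypothetical program with four instructions.
instrs = [Instr("load", 10), Instr("mul", 20), Instr("add", 10), Instr("cmp", 10)]

# Hypothetical SDC weights measured under two different program inputs.
profile_a = {"load": 40, "mul": 40, "add": 15, "cmp": 5}
profile_b = {"load": 10, "mul": 20, "add": 30, "cmp": 40}

# Protection is chosen using input A only, as in single-input evaluations.
chosen = select_for_duplication(instrs, profile_a, budget=30)
coverage_a = sdc_coverage(chosen, profile_a)
coverage_b = sdc_coverage(chosen, profile_b)
print(sorted(chosen), coverage_a, coverage_b)
```

In this toy setting the selection made from input A protects most of the SDC weight under A but much less under B, which is the kind of coverage loss the abstract attributes to single-input evaluation.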