Journal
EURO-PAR 2018: PARALLEL PROCESSING WORKSHOPS
Volume 11339, Issue -, Pages 800-812
Publisher
SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-030-10549-5_62
Keywords
Exascale systems; Resiliency; Fault detection; Monitoring; Benchmarking; Open-source
Funding
- Oprecomp-Open Transprecision Computing project
- EU [654024]
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ supports custom workloads and generates anomalous conditions by running fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing users to create and monitor a variety of highly complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool.