
Online Failure Prediction for Complex Systems: Methodology and Case Studies

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING (2023)

Journal

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING

Volume 20, Issue 4, Pages 3520-3534

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TDSC.2022.3192671

Keywords

Online failure prediction; reliability; availability; machine learning

Categories

Computer Science, Hardware & Architecture; Computer Science, Information Systems; Computer Science, Software Engineering

Summary

Online Failure Prediction (OFP) allows proactive measures to be taken before a failure occurs, but it still faces limitations in selecting optimal features and assessing predictive models. This article presents case studies on Linux and Windows, demonstrating the creation of models that can predict various types of failures while considering important factors such as operational requirements. A well-structured framework is also introduced for fair assessment and comparison of alternative predictive solutions.

Abstract

Online Failure Prediction (OFP) makes it possible to take countermeasures proactively before a failure occurs, such as saving data or restarting a system. However, despite its potential contribution to improving dependability, OFP still presents key limitations. Besides the problem of choosing the optimal set of features, assessing predictive models is complex, and common procedures for supporting comparison are not available. There is, in fact, little work on developing and assessing failure predictors for complex systems. In this article, we present two extensive case studies on distinct Operating Systems (OSs), Linux and Windows, showing that it is possible to create models that can predict different types of incoming failures, and highlighting various important considerations such as the operational requirements of the target system. To drive the case studies, we define a well-structured framework for a fair and sound assessment and comparison of alternative predictive solutions. It includes scenarios for choosing the most adequate metrics for the assessment, comparing alternative models, and selecting the best predictor, while considering the need to tolerate perturbations in the data. In practice, we show that, by following a well-defined process, it is possible to develop accurate failure predictors and establish a ranking of the models under evaluation in different scenarios and OSs.
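The idea of ranking alternative predictors under different scenarios can be illustrated with a small sketch. It is not the framework from the article: the F-beta score is swapped in here as one plausible scenario-dependent metric, and the model names, confusion counts, and the `rank` helper are all hypothetical. It only shows how the "best" predictor may change when a scenario values catching failures (recall) more than avoiding false alarms (precision), or the reverse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for one failure predictor on a labeled test set."""
    tp: int  # failures that were correctly predicted
    fp: int  # false alarms
    fn: int  # failures that were missed
    tn: int  # non-failure periods correctly left alone


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    failures = c.tp + c.fn
    return c.tp / failures if failures else 0.0


def f_beta(c: Confusion, beta: float) -> float:
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    p, r = precision(c), recall(c)
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0


def rank(models: dict[str, Confusion], beta: float) -> list[str]:
    """Return model names ordered from best to worst F-beta (hypothetical helper)."""
    return sorted(models, key=lambda name: f_beta(models[name], beta), reverse=True)


# Made-up counts for illustration only; they are not results from the article.
MODELS = {
    "model_a": Confusion(tp=80, fp=40, fn=20, tn=860),  # catches more failures, more false alarms
    "model_b": Confusion(tp=60, fp=10, fn=40, tn=890),  # fewer false alarms, misses more failures
}

if __name__ == "__main__":
    for label, beta in [("missed failures are costly", 2.0),
                        ("false alarms are costly", 0.5)]:
        print(label, "->", rank(MODELS, beta))
```

With these made-up counts, `model_a` ranks first when recall is weighted more heavily (beta = 2) and `model_b` ranks first when precision is (beta = 0.5), which is why the metric should be chosen to match the operational requirements before models are compared.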
