☆ 4.5 Article

ML-driven risk estimation for memory failure in a data center environment with convolutional neural networks, self-supervised data labeling and distribution-based model drift determination

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING (2024)

Journal

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING

Volume 185, Issue -, Pages -

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE

DOI: 10.1016/j.jpdc.2023.104800

Keywords

Convolutional neural network; Memory failures; Data center reliability; Memory error address decoding; MLOps framework

Category

Computer Science, Theory & Methods


AI Summary

With the trend towards multi-socket server systems, the demand for RAM per server has increased, resulting in more DIMM sockets per server. Because each DIMM has a probability of failure, RAM issues have become a dominant failure pattern for servers. This study introduces an ML-driven framework to estimate the probability of memory failure for each RAM module. The framework uses structural information between correctable errors (CEs) and uncorrectable errors (UEs), and is complemented by engineering measures that mitigate the impact of UEs.

Abstract

With the trend towards multi-socket server systems, the demand for random-access memory (RAM) per server has increased. The consequence is more DIMM sockets per server. Since every dual in-line memory module (DIMM), which comprises a series of dynamic random-access memory integrated circuits, has a probability of failure, RAM issues have become a dominant failure pattern for servers. The concept introduced in this work contributes to improving the reliability of data centers by avoiding RAM failures and mitigating their impact. For this purpose, an ML-driven framework is provided to estimate the probability of memory failure for each RAM module. The ML framework is based on structural information between correctable errors (CEs) and uncorrectable errors (UEs). In a common memory scenario, a corrupted bit within a module can be restored by redundancy using an error correction code (ECC), resulting in a CE. However, if there is more than one corrupted bit within a group of bits covered by the ECC, the information cannot be restored, resulting in a UE. Consequently, the task requesting the memory content and the corresponding service may crash. There is evidence that UEs have a CE history and that the CEs are structurally related. However, for UEs without a CE history, or when the ML framework makes a false decision, we extend the overall framework with engineering measures that mitigate the impact of a UE by avoiding kernel panic and using backups. The engineering measures use a mapping between physical and logical memory addresses.
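The CE/UE distinction described in the abstract can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below is not the paper's method and not how any particular memory controller works: it assumes a small extended Hamming code over 4 data bits purely for readability, whereas server memory typically protects much wider words. The names `Outcome`, `encode` and `decode` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    CLEAN = "no error"
    CE = "correctable error"
    UE = "uncorrectable error"


def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) word.

    Positions 1..7 follow the classic Hamming(7,4) layout (parity bits at
    positions 1, 2 and 4); position 0 holds an overall parity bit.
    """
    word = [0] * 8
    # Data bits go to the non-power-of-two positions.
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2  # makes the total number of set bits even
    return word


def decode(word):
    """Return (Outcome, nibble); nibble is None when the data is lost."""
    word = list(word)
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    # It is 0 for a valid codeword and equals the flipped position after
    # a single bit flip.
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    parity_odd = sum(word) % 2 == 1

    if syndrome == 0 and not parity_odd:
        outcome = Outcome.CLEAN
    elif parity_odd:
        # Odd overall parity: treated as one flipped bit, repaired in place.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        outcome = Outcome.CE
    else:
        # Non-zero syndrome with even overall parity: two bits flipped
        # inside the same protected word, so the data cannot be restored.
        return Outcome.UE, None

    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return outcome, nibble
```

Flipping one bit of `encode(0b1011)` and passing the result to `decode` yields `Outcome.CE` together with the original data, while flipping two bits of the same word yields `Outcome.UE` and no data. This mirrors the abstract's point that a second corrupted bit within one ECC-covered group turns a recoverable event into a possible crash.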
