4.5 Article

ML-driven risk estimation for memory failure in a data center environment with convolutional neural networks, self-supervised data labeling and distribution-based model drift determination

期刊

出版社

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.jpdc.2023.104800

关键词

Convolutional neural network; Memory failures; Data center reliability; Memory error address decoding; MLOps framework

向作者/读者索取更多资源

With the trend towards multi-socket server systems, the demand for RAM per server has increased, resulting in more DIMM sockets per server. RAM issues have become a dominant failure pattern for servers due to the probability of failure in each DIMM. This study introduces an ML-driven framework to estimate the probability of memory failure for each RAM module. The framework utilizes structural information between correctable (CE) and uncorrectable errors (UE) and engineering measures to mitigate the impact of UE.
With the trend towards multi-socket server systems, the demand for random access memory (RAM) per server increased. The consequence are more DIMM sockets per server. Since every dual in-line memory module (DIMM), which comprises a series of dynamic random-access memory integrated circuits, has a probability of failure, RAM issues became a dominant failure pattern for servers. The concept introduced in this work contributes to improving the reliability of data centers by avoiding RAM failures and mitigating their impact. For this purpose, an ML-driven framework is provided to estimate the probability of memory failure for each RAM module. The ML framework is based on structural information between correctable (CE) and uncorrectable errors (UE). In a common memory scenario, a corrupted bit within a module can be restored by redundancy using an error correction code (ECC), resulting in a CE. However, if there is more than one corrupted bit within a group of bits covered by the ECC, the information cannot be restored, resulting in a UE.Consequently, the related task requesting the memory content, and the corresponding service may crash. There is evidence that UEs have a CE history and structural relation between the CEs. However, for the case of UEs without a CE history or of a false decision of the ML framework, we extend the total framework by engineering measures to mitigate the impact of a UE by avoiding kernel panic and using backups. The engineering measures use a mapping between physical and logical memory addresses.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

4.5
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据