Proceedings Paper

Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-031-23220-6_18

Keywords

HPC; Anomaly detection; HPC Monitoring Systems

Funding

  1. EU [956560, 101034126, 101033975]
  2. European Processor Initiative (EPI) SGA2 [101036168]
  3. CINECA


This paper investigates the thermal anomaly detection task on Marconi100, one of the most powerful HPC systems in the world, and successfully validates the suggested method against real thermal hazard events in production.
Today, significant advances in science and technology cannot be envisioned without high computing capacity. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems that aggregate the computing capacity of thousands of computing nodes, at a cost of millions of euros per year [12]. In a data center, an anomaly is a suspicious or abnormal pattern in the monitoring signals. Anomalies vary in severity, and in extreme cases one can cause an outage of the data center. Using anomaly detection methods based on complex statistical rules, this paper investigates the thermal anomaly detection task on one of the most powerful HPC systems in the world, Marconi100, hosted at CINECA. The suggested anomaly detection method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.
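
To make the idea of a statistical rule over monitoring signals concrete, the following is a minimal illustrative sketch and not the rule set used in the paper, which the abstract does not specify. It assumes a single temperature series per node and combines two hypothetical rules: an absolute ceiling (`hard_limit`) and a deviation of more than `k` standard deviations above the mean of a trailing window. The function name, parameters, and threshold values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def detect_thermal_anomalies(temps, window=5, k=3.0, hard_limit=45.0):
    """Return the indices of readings flagged by either illustrative rule.

    All parameter names and default values are hypothetical; they are not
    taken from the paper.
    """
    flagged = []
    for i, t in enumerate(temps):
        # Rule 1 (assumed): the reading exceeds an absolute ceiling.
        if t > hard_limit:
            flagged.append(i)
            continue

        # Rule 2 (assumed): the reading sits more than k standard
        # deviations above the mean of the trailing window.
        history = temps[max(0, i - window):i]
        if len(history) >= 2:  # stdev needs at least two samples
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and t > mu + k * sigma:
                flagged.append(i)
    return flagged


if __name__ == "__main__":
    readings = [22.0, 22.1, 21.9, 22.0, 22.2, 22.1, 30.0, 22.0, 50.0]
    print(detect_thermal_anomalies(readings))
```

In this sketch a flagged reading stays in the trailing window, so it inflates the statistics used for the next few samples; a production rule set would likely need to handle that, along with correlating signals across nodes.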

