Article

Anomaly Detection and Anticipation in High Performance Computing Systems

Related References

Note: Only a subset of the references is listed here.
Review Computer Science, Theory & Methods

Deep Learning for Anomaly Detection: A Review

Guansong Pang et al.

Summary: Deep anomaly detection has emerged as a key research direction within the field of anomaly detection, with advances spanning several categories of methods. Reviewing these categories clarifies their advantages and disadvantages and shows how each addresses the field's main challenges.

ACM COMPUTING SURVEYS (2021)

Article Computer Science, Theory & Methods

A machine learning approach to online fault classification in HPC systems

Alessio Netti et al.

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE (2020)

Article Computer Science, Theory & Methods

The Landscape of Exascale Research: A Data-Driven Literature Analysis

Stijn Heldens et al.

ACM COMPUTING SURVEYS (2020)

Article Computer Science, Theory & Methods

Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

Ozan Tuncer et al.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2019)

Article Automation & Control Systems

A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems

Andrea Borghesi et al.

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Online Anomaly Detection in HPC Systems

Andrea Borghesi et al.

2019 IEEE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE CIRCUITS AND SYSTEMS (AICAS 2019) (2019)

Proceedings Paper Computer Science, Theory & Methods

Paving the Way Toward Energy-Aware and Automated Datacentre

Andrea Bartolini et al.

PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS (ICPP 2019) (2019)

Proceedings Paper Computer Science, Theory & Methods

FINJ: A Fault Injection Tool for HPC Systems

Alessio Netti et al.

EURO-PAR 2018: PARALLEL PROCESSING WORKSHOPS (2019)

Proceedings Paper Computer Science, Hardware & Architecture

Anomaly Detection in High Performance Computers: A Vicinity Perspective

Siavash Ghiasvand et al.

2019 18TH INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED COMPUTING (ISPDC 2019) (2019)

Article Computer Science, Theory & Methods

Unraveling Network-Induced Memory Contention: Deeper Insights with Machine Learning

Taylor Liles Groves et al.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2018)

Article Computer Science, Theory & Methods

Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales

Marc Gamell et al.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2017)

Article Engineering, Manufacturing

A Convolutional Neural Network for Fault Classification and Diagnosis in Semiconductor Manufacturing Processes

Ki Bum Lee et al.

IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING (2017)

Article Computer Science, Theory & Methods

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications

Sheng Di et al.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2016)

Article Computer Science, Theory & Methods

Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers

Esteban Meneses et al.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2015)

Proceedings Paper Computer Science, Theory & Methods

Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing

Mohamed Slim Bouguerra et al.

IEEE 27TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2013) (2013)

Article Computer Science, Hardware & Architecture

The Reliability Wall for Exascale Supercomputing

Xuejun Yang et al.

IEEE TRANSACTIONS ON COMPUTERS (2012)

Article Computer Science, Artificial Intelligence

Random forests

Leo Breiman

MACHINE LEARNING (2001)