
On the Safe Deployment of Matrix Multiplication in Massively Parallel Safety-Related Systems

APPLIED SCIENCES-BASEL (2022)

Journal

APPLIED SCIENCES-BASEL

Volume 12, Issue 8

Publisher

MDPI

DOI: 10.3390/app12083779

Keywords

safety; reliability; CNN; matrix multiplication; GPU; fault detection

Funding

European Union [871465]

Automated Summary

This paper presents a safe matrix-matrix multiplication software implementation for GPUs with random hardware error-detection capabilities, which serves as a foundation for the implementation of safe deep learning libraries for GPUs. The performance impact and achievable diagnostic coverage of these mechanisms are measured with a set of representative matrix dimensions.

Abstract

Deep learning technology has enabled the development of increasingly complex safety-related autonomous systems using high-performance computing devices, such as graphics processing units (GPUs), which provide the high computing performance required to execute parallel computing algorithms, such as matrix-matrix multiplications (a central computing element of deep learning software libraries). However, the safety certification of parallel computing software algorithms and GPU-based safety-related systems remains a challenge, for example, achieving the required fault tolerance and diagnostic coverage for random hardware errors. This paper contributes a safe matrix-matrix multiplication software implementation for GPUs with random hardware error-detection capabilities (permanent, transient) that can be used with different architectural patterns for fault tolerance, and which serves as a foundation for the implementation of safe deep learning libraries for GPUs. The proposed contribution is complementary to, and can be combined with, other techniques, such as algorithm-based fault tolerance. In particular, (i) we extend the high-performance CUTLASS matrix multiplication library with a catalog of diagnostic mechanisms to detect random hardware errors down to the arithmetic operation level; and (ii) we measure the performance impact incurred by the adoption of these mechanisms and their achievable diagnostic coverage with a set of representative matrix dimensions. To that end, we implement these algebraic operations targeting CUDA cores with single-instruction, multiple-thread (SIMT) math instructions on an NVIDIA Xavier NX GPU.
