Journal
SIAM JOURNAL ON SCIENTIFIC COMPUTING
Volume 45, Issue 1, Pages C1-C19
Publisher
SIAM PUBLICATIONS
DOI: 10.1137/21M1465032
Keywords
matrix multiplication; numerical linear algebra; rounding error analysis; floating-point arithmetic; multiword arithmetic; reduced precision; mixed precision; GPUs; NVIDIA V100; NVIDIA A100; tensor cores; rounding modes; blocked summation; FABsum
Abstract
In multiword arithmetic, a matrix is represented as the unevaluated sum of two or more lower precision matrices, and a matrix product is formed by multiplying the constituents in low precision. We investigate the use of multiword arithmetic for improving the performance-accuracy tradeoff of matrix multiplication with mixed precision block fused multiply-add (FMA) hardware, focusing especially on the tensor cores available on NVIDIA GPUs. Building on a general block FMA framework, we develop a comprehensive error analysis of multiword matrix multiplication. After confirming the theoretical error bounds experimentally by simulating low precision in software, we use the cuBLAS and CUTLASS libraries to implement a number of matrix multiplication algorithms using double-fp16 (double-binary16) arithmetic. When running the algorithms on NVIDIA V100 and A100 GPUs, we find that double-fp16 is not as accurate as fp32 (binary32) arithmetic despite satisfying the same worst-case error bound. Using probabilistic error analysis, we explain why this issue is likely to be caused by the rounding mode used by the NVIDIA tensor cores, and we propose a parameterized blocked summation algorithm that alleviates the problem and significantly improves the performance-accuracy tradeoff.
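To make the double-fp16 scheme described in the abstract concrete, here is a minimal NumPy sketch that, like the paper's verification experiments, simulates the low precision in software rather than using the authors' cuBLAS/CUTLASS implementation. Each operand is split into an unevaluated sum of two binary16 matrices, and the product is assembled from fp16 partial products accumulated in fp32, mimicking a mixed precision block FMA. The function names and the choice to drop the smallest partial product A2*B2 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def split_fp16(A):
    """Split A into an unevaluated sum A1 + A2 of two binary16 matrices."""
    A1 = A.astype(np.float16)                            # leading word
    A2 = (A - A1.astype(np.float64)).astype(np.float16)  # trailing word
    return A1, A2

def multiword_matmul(A, B):
    """Approximate A @ B from fp16 words with fp32 accumulation."""
    A1, A2 = split_fp16(A)
    B1, B2 = split_fp16(B)
    f32 = np.float32
    # Three of the four partial products; A2 @ B2 is below working precision.
    C  = A1.astype(f32) @ B1.astype(f32)
    C += A1.astype(f32) @ B2.astype(f32)
    C += A2.astype(f32) @ B1.astype(f32)
    return C

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
exact = A @ B
for name, approx in [("fp32", A.astype(np.float32) @ B.astype(np.float32)),
                     ("double-fp16", multiword_matmul(A, B))]:
    err = np.max(np.abs(approx - exact)) / np.max(np.abs(exact))
    print(f"{name:12s} max relative error: {err:.2e}")
```

The parameterized blocked summation the abstract mentions (FABsum appears among the keywords) can be sketched in the same hedged style: each block of size b is summed in the fast, low precision, and the block results are combined in a more accurate precision. The signature and precision choices below are hypothetical.

```python
import numpy as np

def fabsum(x, b, fast=np.float16, accurate=np.float64):
    """Sum x in blocks of size b: fast precision within blocks,
    accurate precision across blocks."""
    s = accurate(0.0)
    for i in range(0, len(x), b):
        block_sum = fast(0.0)
        for v in x[i:i + b]:                    # in-block sum, fast precision
            block_sum = fast(block_sum + fast(v))
        s = accurate(s + accurate(block_sum))   # cross-block sum, accurate
    return s

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
print("blocked fp16 sum (b=256):", fabsum(x, 256))
print("float64 reference:       ", x.sum())
```

A larger block size b does more of the work in the fast precision and is therefore faster but less accurate, which is the speed-accuracy tradeoff the block size parameterizes.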