Article

MIXED PRECISION BLOCK FUSED MULTIPLY-ADD: ERROR ANALYSIS AND APPLICATION TO GPU TENSOR CORES

Journal

SIAM JOURNAL ON SCIENTIFIC COMPUTING
Volume 42, Issue 3, Pages C124-C141

Publisher

SIAM PUBLICATIONS
DOI: 10.1137/19M1289546

Keywords

fused multiply-add; tensor cores; floating-point arithmetic; rounding error analysis; NVIDIA GPU; matrix multiplication; LU factorization

Funding

  1. Engineering and Physical Sciences Research Council [EP/P020720/1]
  2. MathWorks
  3. Royal Society
  4. EPSRC [EP/P020720/1] Funding Source: UKRI

Abstract

Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision, and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give detailed rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, about which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these units we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, and they show that the accuracy boost is obtained with almost no performance loss.
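To make the TC16/TC32 distinction concrete, the following is a minimal NumPy sketch (not code from the paper) of the block FMA model the abstract describes: fp16 input blocks are multiplied and accumulated in fp32, and the output of each block FMA is rounded either to fp16 (modeling the TC16 variant) or kept in fp32 (modeling the TC32 variant). The helper names `block_fma` and `blocked_matmul`, the block size, and the test matrices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def block_fma(A, B, C, out_dtype):
    """Simulate one b-by-b block FMA D = C + A*B.

    A and B are stored in fp16; products and accumulation are carried out
    in fp32 (as on Volta/Turing tensor cores); the result is rounded to
    out_dtype: fp16 models the TC16 variant, fp32 the TC32 variant.
    """
    A16 = A.astype(np.float16)
    B16 = B.astype(np.float16)
    D = C.astype(np.float32) + A16.astype(np.float32) @ B16.astype(np.float32)
    return D.astype(out_dtype)

def blocked_matmul(A, B, out_dtype, b=4):
    """Multiply n-by-n matrices by chaining block FMAs along the k dimension."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=out_dtype)
    for i in range(0, n, b):
        for j in range(0, n, b):
            acc = np.zeros((b, b), dtype=out_dtype)
            for k in range(0, n, b):
                acc = block_fma(A[i:i+b, k:k+b], B[k:k+b, j:j+b], acc, out_dtype)
            C[i:i+b, j:j+b] = acc
    return C

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n)) / np.sqrt(n)
B = rng.standard_normal((n, n)) / np.sqrt(n)
exact = A @ B  # fp64 reference

for name, dt in [("TC16-like (fp16 output)", np.float16),
                 ("TC32-like (fp32 output)", np.float32)]:
    C = blocked_matmul(A, B, dt)
    err = np.linalg.norm(C - exact) / np.linalg.norm(exact)
    print(f"{name}: relative error {err:.2e}")
```

In this toy model the only difference between the two variants is the precision to which the chained accumulator is rounded between block FMAs, which is the mechanism the paper's analysis identifies as the source of the accuracy gap between TC16 and TC32.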
