Article

Future Scaling of Memory Hierarchy for Tensor Cores and Eliminating Redundant Shared Memory Traffic Using Inter-Warp Multicasting

Journal

IEEE TRANSACTIONS ON COMPUTERS
Volume 71, Issue 12, Pages 3115-3126

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TC.2022.3207134

Keywords

GPU performance; deep neural network; tensor core; inter-warp multicasting

Funding

  1. Samsung Advanced Institute of Technology
  2. Samsung Electronics Co., Ltd.
  3. National Research Foundation (NRF) of Korea - Korean Government, MSIT [2018R1A5A1059921]
  4. IC Design Education Center (IDEC), Korea

Abstract

This paper investigates the differences in computational performance and memory bandwidth between the CUDA core and the Tensor core in NVIDIA GPUs. Through comparison and analysis of Tensor cores across GPU generations, a new method to reduce shared memory traffic is proposed. Experimental results show that inter-warp multicasting significantly improves the performance of deep neural network workloads.
The CUDA core of NVIDIA GPUs has long been one of the most efficient computation units for parallel computing. However, the recent rapid development of deep neural networks demands an even higher level of computational performance, and to meet this requirement NVIDIA has introduced the Tensor core in recent GPU generations. The Tensor core's impressive gains in computational performance, however, place new pressure on the memory hierarchy. In this paper, we first measure, on actual GPU hardware, the memory bandwidth required at each level of the memory hierarchy as computational performance increases. Through a comparison of the CUDA core and the Tensor core in the V100, we find that the Tensor core's tremendous performance increase demands much higher memory bandwidth than the CUDA core does. Moreover, we thoroughly investigate memory bandwidth requirements across the Tensor core generations of the V100, RTX TITAN, and A100. Lastly, we analyze a hypothetical next-generation Tensor core through GPU simulation, based on which we propose an inter-warp multicasting microarchitecture that reduces redundant shared memory (SMEM) traffic during the GEMM process. Our evaluation shows that inter-warp multicasting reduces SMEM bandwidth pressure by 33% and improves performance by 19% on average across all layers of ResNet-152 and BERT-Large.
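To make the targeted redundancy concrete, the following minimal CUDA sketch (not code from the paper) shows the pattern inter-warp multicasting addresses: in a typical WMMA-based GEMM, every warp in a thread block issues its own wmma::load_matrix_sync on the same shared-memory A tile, so identical data is read from SMEM once per warp. The 16x16x16 tile shape, the four-warps-per-block layout, and the kernel name wmma_strip are illustrative assumptions, not the paper's configuration.

```cuda
// Minimal WMMA GEMM strip: 4 warps, one shared 16x16 A tile.
// Illustrative sketch only -- tile shape, layout, and names are assumptions.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

constexpr int WM = 16, WN = 16, WK = 16;   // WMMA tile shape (m, n, k)
constexpr int WARPS = 4;                   // warps per thread block

// Computes a 16x64 strip of C = A * B for a single K-step.
// A: 16x16 row-major, B: 16x64 row-major, C: 16x64 row-major.
__global__ void wmma_strip(const half* A, const half* B, float* C,
                           int ldb, int ldc) {
    __shared__ __align__(32) half a_tile[WM * WK];  // one A tile, shared by all warps

    // Cooperative global->shared copy of the A tile (256 halfs, 128 threads).
    for (int i = threadIdx.x; i < WM * WK; i += blockDim.x)
        a_tile[i] = A[i];
    __syncthreads();

    int warp = threadIdx.x / warpSize;     // 0..3, one output tile per warp

    wmma::fragment<wmma::matrix_a, WM, WN, WK, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WM, WN, WK, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, WM, WN, WK, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // All four warps load the SAME a_tile: four separate SMEM reads of
    // identical data. This is the redundant SMEM traffic that the paper's
    // inter-warp multicasting would serve with a single SMEM access.
    wmma::load_matrix_sync(a_frag, a_tile, WK);

    // Each warp loads a DISTINCT 16x16 column block of B; no inter-warp
    // redundancy in this load (read from global memory for brevity).
    wmma::load_matrix_sync(b_frag, B + warp * WN, ldb);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C + warp * WN, c_frag, ldc, wmma::mem_row_major);
}
```

Launched as, e.g., wmma_strip<<<1, WARPS * 32>>>(dA, dB, dC, 64, 64) on an sm_70+ GPU, the kernel reads a_tile from SMEM once per warp; these same-address, cross-warp loads are the kind of traffic the proposed multicasting microarchitecture is designed to collapse into a single access.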

