Article

Tensorox: Accelerating GPU Applications via Neural Approximation on Unused Tensor Cores

Journal

IEEE Transactions on Parallel and Distributed Systems

Publisher

IEEE Computer Society
DOI: 10.1109/TPDS.2021.3093239

Keywords

Hardware; Tensors; Neural networks; Deep learning; Graphics processing units; Task analysis; Training; Parallel programming; Approximate computing; Tensor processing unit; GPGPU

Funding

  1. Singapore Ministry of Education [T1-251RES1818, MOE2016-T2-2-150]

Abstract
Driven by the demands of deep learning, many hardware accelerators, including GPUs, have begun to include specialized tensor processing units to accelerate matrix operations. However, general-purpose GPU applications with few or no large dense matrix operations cannot benefit from these tensor units. This article proposes Tensorox, a framework that exploits the half-precision tensor cores available on recent GPUs for approximable, non-deep-learning applications. In essence, a shallow neural network is trained on the input-output mapping of the function to be approximated. The key innovation in our implementation is the use of the small, dimension-restricted tensor operations in Nvidia GPUs to run multiple instances of the approximation neural network in parallel. With proper scaling and training methods, our approximation yields an overall accuracy higher than naively running the original programs in half precision. Furthermore, Tensorox allows the degree of approximation to be adjusted at runtime. For the 10 benchmarks we tested, we achieved speedups from 2x to 112x over the original single-precision floating-point implementations, while keeping the error introduced by the approximation below 10 percent in most applications.
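The abstract's key mechanism is batching many instances of a small approximation network so that they execute as a few dense matrix multiplies, the shape of work tensor cores are built for. The sketch below illustrates this in NumPy under stated assumptions: the two-layer ReLU network, the layer sizes, and the mixed-precision emulation (fp16 operands, fp32 accumulation) are all illustrative, not the authors' actual implementation.

```python
import numpy as np

def mm_fp16(a, b):
    """Matrix multiply with fp16 operands and fp32 accumulation.

    Rough emulation of tensor-core mixed precision: inputs are
    quantized to half precision, products accumulate in single.
    """
    return a.astype(np.float16).astype(np.float32) @ b.astype(np.float16).astype(np.float32)

def batched_approx_net(x, w1, b1, w2, b2):
    """Evaluate N instances of a shallow MLP with two batched matmuls.

    x: (N, d_in) -- one row per parallel approximation instance.
    Stacking instances as rows turns N tiny networks into dense GEMMs
    instead of N separate small evaluations.
    """
    h = np.maximum(mm_fp16(x, w1) + b1, 0.0)  # hidden layer, ReLU
    return mm_fp16(h, w2) + b2                # linear output layer

# Illustrative sizes: 16 parallel instances of a 4-8-1 network.
rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 16, 4, 8, 1
x  = rng.standard_normal((n, d_in))
w1 = rng.standard_normal((d_in, d_hid)) * 0.5
b1 = rng.standard_normal(d_hid) * 0.1
w2 = rng.standard_normal((d_hid, d_out)) * 0.5
b2 = rng.standard_normal(d_out) * 0.1

y = batched_approx_net(x, w1, b1, w2, b2)     # shape (16, 1)
```

On a real GPU the two matmuls would map onto the dimension-restricted tensor-core fragments (e.g. CUDA's `wmma` API) rather than NumPy calls, but the batching principle is the same: keeping the tensor units busy requires fusing many small network evaluations into one GEMM.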

