Article

High-Performance Tensor Learning Primitives Using GPU Tensor Cores

Journal

IEEE TRANSACTIONS ON COMPUTERS
Volume 72, Issue 6, Pages 1733-1746

Publisher

IEEE Computer Society
DOI: 10.1109/TC.2022.3222955

Keywords

Tensors; Graphics processing units; Matrix decomposition; Libraries; Neural networks; Deep learning; Matrix converters; Tensor learning; tensor computing; GPU tensor cores; tensor layer; neural network


This paper presents hardware-oriented optimization strategies for tensor learning primitives on GPU tensor cores, resulting in significant speedups for tasks such as tensor decomposition and neural network compression. The proposed optimizations achieve up to 32.25x speedup compared to existing libraries like TensorLab and TensorLy, demonstrating the effectiveness of GPU-based tensor learning.
Tensor learning is a powerful tool for big data analytics and machine learning, e.g., gene analysis and deep learning. However, tensor learning algorithms are compute-intensive since their time and space complexities grow exponentially with the order of the tensors, which hinders their application. In this paper, we exploit the parallelism of tensor learning primitives using GPU tensor cores and develop high-performance tensor learning algorithms. First, we propose novel hardware-oriented optimization strategies for tensor learning primitives on GPU tensor cores. Second, for big data analytics, we employ the optimized tensor learning primitives to accelerate CP tensor decomposition and apply it to gene analysis. Third, we optimize Tucker tensor decomposition and propose a novel Tucker tensor layer to compress deep neural networks. We train the networks with natural gradients, which involve only a forward pass without backpropagation and are therefore well suited to GPU computation. Compared with the TensorLab and TensorLy libraries on an A100 GPU, our third-order CP tensor decomposition achieves up to 16.32x and 32.25x speedups, and our third-order Tucker tensor decomposition achieves 6.09x and 6.72x speedups. The proposed fourth-order CP and Tucker tensor decompositions achieve up to 30.65x and 5.41x speedups over TensorLab. Our CP tensor decomposition for gene analysis achieves up to 5.88x speedup over TensorLy. Compared with a conventional fully connected neural network, our Tucker tensor layer network achieves 97.9% accuracy, a 4.47x speedup, and a 2.92x compression ratio, at the cost of a 0.4% drop in accuracy.
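For readers unfamiliar with the CP decomposition the abstract accelerates, the primitive can be illustrated, independently of the paper's GPU tensor-core kernels, with a plain NumPy alternating-least-squares (ALS) sketch. All function names here are illustrative, not the paper's API; the key point is that each factor update reduces to dense matrix products (the MTTKRP), which is the kind of workload that maps onto tensor cores.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: move the chosen axis to the front, flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    # Column-wise Kronecker product: (I*J) x R from I x R and J x R factors.
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

def cp_als(T, rank, n_iter=100, seed=0):
    # ALS for a third-order CP decomposition T ~ sum_r a_r (outer) b_r (outer) c_r.
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    for _ in range(n_iter):
        # Each update solves the normal equations of a linear least-squares
        # problem; the dominant cost is the dense MTTKRP product.
        A = unfold(T, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(T, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(T, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C
```

On an exact low-rank tensor this recovers the factors up to scaling and permutation; on real data (e.g., the gene-expression tensors mentioned above) it yields a best-effort rank-`rank` approximation.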

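The Tucker decomposition behind the proposed Tucker tensor layer can likewise be sketched with the classical truncated higher-order SVD (HOSVD); this is a generic NumPy illustration of the primitive, not the paper's optimized implementation, and the helper names are hypothetical. Truncating the per-mode ranks is what yields the compression ratio reported for the neural-network layers.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: move the chosen axis to the front, flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    # Truncated HOSVD: one SVD per mode gives the factor matrices; the core
    # is T contracted with each factor's transpose along the matching mode.
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

def tucker_to_tensor(core, factors):
    # Reconstruct: multiply the core by each factor matrix along its mode.
    T = core
    for mode, U in enumerate(factors):
        T = np.moveaxis(np.tensordot(U, np.moveaxis(T, mode, 0), axes=1), 0, mode)
    return T
```

With full ranks the reconstruction is exact; shrinking `ranks` trades accuracy for a smaller core and factors, which is the storage saving a Tucker tensor layer exploits when it replaces a dense weight matrix.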
