Article

Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning

Journal

IEEE Transactions on Parallel and Distributed Systems

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2022.3146257

Keywords

Convolution; Program processors; Libraries; Tensors; Shape; Codes; Artificial intelligence; AI; convolution; deep learning

Funding

  1. National Key Research and Development Program of China [2018YFB0204403]
  2. Strategic Priority CAS Project [XDB38050100]
  3. National Science Foundation of China [U1813203]
  4. Shenzhen Basic Research Fund [RCYX2020071411473419, KQTD20200820113106007, JSGG20190220164202211]
  5. CAS Key Lab [2011DP173015]
  6. JST, PRESTO [JPMJPR20MA]
  7. JSPS KAKENHI [JP21K17750]
  8. AIST Emerging Research, Japan [AAZ2029701B]
  9. Artificial Intelligence Initiative at Oak Ridge National Laboratory
  10. U.S. Department of Energy [DE-AC05-00OR22725]

Abstract

FastConv is a template-based, open-source code auto-generation library that generates high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. It addresses the challenge of optimizing convolution layers of different shapes and achieves performance portability by automatically selecting the best combination of kernel shapes, cache tiles, loop orders, packing strategies, access patterns, and computations. FastConv outperforms NNPACK, ARM NN, and FeatherCNN on the Kunpeng 920 CPU, with speedups ranging from 1.02x to 2.48x. It also demonstrates performance portability across a variety of convolution shapes, achieving significant speedups over the Winograd paths of NNPACK and ARM NN on Kunpeng 920 as well as on other CPUs such as Snapdragon, Apple M1, and AWS Graviton2.
We present FastConv, a template-based, open-source code auto-generation library that can automatically generate high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. FastConv is based on the Winograd algorithm, which is reportedly the highest-performing algorithm for the time-consuming layers of convolutional neural networks. ARM CPUs cover a wide range of designs and specifications, from embedded devices to HPC-grade CPUs. This leads to the dilemma of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple shapes of tuned kernel variants suitable for tall-and-skinny matrices. As a performance-portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, loop-order scheduling, packing strategies, access patterns, and online/offline computations. Auto-tuning is used to search the parameter configuration space for the best performance for a given target architecture and problem size. Results show speedups of 1.02x to 1.40x, 1.14x to 2.17x, and 1.22x to 2.48x over NNPACK, ARM NN, and FeatherCNN, respectively, on Kunpeng 920. Furthermore, performance portability experiments with various convolution shapes show that FastConv achieves 1.2x to 1.7x and 2x to 22x speedups over the Winograd paths of the NNPACK and ARM NN inference engines on Kunpeng 920. A CPU performance portability evaluation on VGG-16 shows average speedups over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on Kunpeng 920, Snapdragon 835, 855, 888, Apple M1, and AWS Graviton2, respectively.
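For reference only, the following minimal, self-contained C++ sketch illustrates the 1D Winograd F(2,3) building block that Winograd-based convolution kernels like those described above tile and vectorize in 2D. It is not FastConv's actual code or API (all names are illustrative); it simply shows how two outputs of a 3-tap filter are computed with four multiplications instead of six, checked against direct convolution.

#include <array>
#include <cstdio>

// Winograd F(2,3): two outputs of a 3-tap filter from four inputs,
// using 4 multiplications instead of the 6 needed by direct convolution.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    // Filter transform U = G g (done once per filter, possibly offline).
    const float u0 = g[0];
    const float u1 = 0.5f * (g[0] + g[1] + g[2]);
    const float u2 = 0.5f * (g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // Input transform V = B^T d.
    const float v0 = d[0] - d[2];
    const float v1 = d[1] + d[2];
    const float v2 = d[2] - d[1];
    const float v3 = d[1] - d[3];
    // Element-wise product in the transformed domain (the 4 multiplications).
    const float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
    // Output transform Y = A^T m.
    return { m0 + m1 + m2, m1 - m2 - m3 };
}

int main() {
    const std::array<float, 4> d = {1.0f, 2.0f, 3.0f, 4.0f};
    const std::array<float, 3> g = {0.5f, -1.0f, 2.0f};
    const auto y = winograd_f23(d, g);
    // Direct (valid) convolution as a reference.
    const float r0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2];
    const float r1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2];
    std::printf("winograd: %.3f %.3f\n", y[0], y[1]);
    std::printf("direct:   %.3f %.3f\n", r0, r1);
    return 0;
}

In a full 2D convolution kernel, the same idea is applied as a tile transform and the remaining work reduces to batches of small matrix multiplications; that is where the abstract's auto-tuned choices of kernel shape, cache tiling, loop order, and packing strategy come into play.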
