Article

Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning

Journal

IEEE Transactions on Parallel and Distributed Systems

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2022.3146257

Keywords

Convolution; Program processors; Libraries; Tensors; Shape; Codes; Artificial intelligence; AI; convolution; deep learning

Funding

  1. National Key Research and Development Program of China [2018YFB0204403]
  2. Strategic Priority CAS Project [XDB38050100]
  3. National Science Foundation of China [U1813203]
  4. Shenzhen Basic Research Fund [RCYX2020071411473419, KQTD20200820113106007, JSGG20190220164202211]
  5. CAS Key Lab [2011DP173015]
  6. JST, PRESTO [JPMJPR20MA]
  7. JSPS KAKENHI [JP21K17750]
  8. AIST Emerging Research, Japan [AAZ2029701B]
  9. Artificial Intelligence Initiative at Oak Ridge National Laboratory
  10. U.S. Department of Energy [DE-AC05-00OR22725]

Abstract

FastConv is a template-based, open-source code auto-generation library that generates high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. It addresses the challenge of optimizing convolution layers of different shapes and achieves performance portability by automatically selecting the best combination of kernel shapes, cache tiles, loop orders, packing strategies, access patterns, and computations. FastConv outperforms NNPACK, ARM NN, and FeatherCNN on the Kunpeng 920 CPU, with speedups ranging from 1.02x to 2.48x. It also demonstrates performance portability across a variety of convolution shapes, achieving significant speedups over the Winograd paths of NNPACK and ARM NN on Kunpeng 920 as well as on other CPUs such as Snapdragon, Apple M1, and AWS Graviton2.
We present FastConv, a template-based, open-source code auto-generation library that can automatically generate high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. FastConv is based on the Winograd algorithm, which is reportedly the highest-performing algorithm for the time-consuming layers of convolutional neural networks. ARM CPUs cover a wide range of designs and specifications, from embedded devices to HPC-grade CPUs. This leads to the dilemma of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple shapes of tuned kernel variants suitable for tall-and-skinny matrices. As a performance-portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, loop-order scheduling, packing strategies, access patterns, and online/offline computations. Auto-tuning is used to search the parameter configuration space for the best performance for a given target architecture and problem size. Results show speedups of 1.02x to 1.40x, 1.14x to 2.17x, and 1.22x to 2.48x over NNPACK, ARM NN, and FeatherCNN, respectively, on Kunpeng 920. Furthermore, performance portability experiments with various convolution shapes show that FastConv achieves 1.2x to 1.7x and 2x to 22x speedups over the Winograd paths of the NNPACK and ARM NN inference engines on Kunpeng 920. A CPU performance portability evaluation on VGG-16 shows average speedups over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on Kunpeng 920, Snapdragon 835, 855, 888, Apple M1, and AWS Graviton2, respectively.
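For reference only, the following minimal, self-contained C++ sketch illustrates the 1D Winograd F(2,3) building block that Winograd-based convolution kernels like those described above tile and vectorize in 2D. It is not FastConv's actual code or API (all names are illustrative); it simply shows how two outputs of a 3-tap filter are computed with four multiplications instead of six, checked against direct convolution.

#include <array>
#include <cstdio>

// Winograd F(2,3): two outputs of a 3-tap filter from four inputs,
// using 4 multiplications instead of the 6 needed by direct convolution.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    // Filter transform U = G g (done once per filter, possibly offline).
    const float u0 = g[0];
    const float u1 = 0.5f * (g[0] + g[1] + g[2]);
    const float u2 = 0.5f * (g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // Input transform V = B^T d.
    const float v0 = d[0] - d[2];
    const float v1 = d[1] + d[2];
    const float v2 = d[2] - d[1];
    const float v3 = d[1] - d[3];
    // Element-wise product in the transformed domain (the 4 multiplications).
    const float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
    // Output transform Y = A^T m.
    return { m0 + m1 + m2, m1 - m2 - m3 };
}

int main() {
    const std::array<float, 4> d = {1.0f, 2.0f, 3.0f, 4.0f};
    const std::array<float, 3> g = {0.5f, -1.0f, 2.0f};
    const auto y = winograd_f23(d, g);
    // Direct (valid) convolution as a reference.
    const float r0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2];
    const float r1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2];
    std::printf("winograd: %.3f %.3f\n", y[0], y[1]);
    std::printf("direct:   %.3f %.3f\n", r0, r1);
    return 0;
}

In a full 2D convolution kernel, the same idea is applied as a tile transform and the remaining work reduces to batches of small matrix multiplications; that is where the abstract's auto-tuned choices of kernel shape, cache tiling, loop order, and packing strategy come into play.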
