Article

MVP: An Efficient CNN Accelerator with Matrix, Vector, and Processing-Near-Memory Units

Publisher

Association for Computing Machinery (ACM)
DOI: 10.1145/3497745

Keywords

Neural Networks; CNN; energy-efficient AI accelerator

Funding

  1. Samsung Advanced Institute of Technology
  2. Engineering Research Center Program through the National Research Foundation of Korea (NRF), funded by the Korean government (MSIT) [NRF-2018R1A5A1059921]
  3. IC Design Education Center

Abstract

Mobile and edge devices have become common platforms for inferring convolutional neural networks (CNNs) due to their superior privacy and service quality. To reduce the computational cost of convolution (CONV), recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE). However, existing area-efficient CNN accelerators are sub-optimal for these latest CNN models because they were mainly optimized for compute-intensive standard CONV layers with abundant data reuse, which can be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse. The latter also strongly depends on the nearby CONV layers, making effective pipelining a daunting task. As a result, although DW-CONV and SE occupy only 10% of all operations, they become memory-bandwidth bound, consuming more than 60% of the processing time in systolic-array-based accelerators. We propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of the baseline systolic-array-based architecture. We suggest a specialized vector unit tailored for processing DW-CONV, including multipliers, adder trees, and multi-banked buffers, to meet its high memory bandwidth requirement. We augment the unified buffer with tiny processing elements to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV with standard CONV and thereby achieving the maximum utilization of arithmetic units. Our evaluation shows that MVP improves performance by 2.6x and reduces energy by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2, with only a 9% area overhead compared to the baseline.
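The data-reuse gap the abstract describes can be made concrete with a small functional model. The sketch below is a hypothetical NumPy illustration, not code from the paper; all shapes, layer sizes, and the SE reduction ratio are assumptions chosen only to show why DW-CONV and SE have far less data reuse than standard CONV.

import numpy as np

def depthwise_conv(x, w):
    """Depth-wise CONV: each channel is filtered independently.

    x: input feature map, shape (H, W, C)
    w: per-channel kernels, shape (K, K, C)
    Each loaded activation feeds only ~K*K MACs, versus K*K*C_out MACs
    in a standard CONV, so arithmetic intensity is low.
    """
    H, W, C = x.shape
    K = w.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # (K, K, C) window times (K, K, C) kernels, summed per channel
            out[i, j] = np.sum(xp[i:i+K, j:j+K] * w, axis=(0, 1))
    return out

def squeeze_excite(x, w1, w2):
    """Squeeze-and-Excitation: global pooling plus two tiny FC layers.

    The global reduction depends on the *entire* output of the preceding
    CONV, which is why SE is hard to pipeline with neighboring layers.
    """
    s = x.mean(axis=(0, 1))                  # squeeze: (C,)
    e = np.maximum(w1 @ s, 0.0)              # excite FC1 + ReLU: (C//r,)
    scale = 1.0 / (1.0 + np.exp(-(w2 @ e)))  # excite FC2 + sigmoid: (C,)
    return x * scale                         # channel-wise re-scaling

rng = np.random.default_rng(0)
H = W = 14
C, r = 64, 4                                 # hypothetical layer sizes
x = rng.standard_normal((H, W, C))
y = depthwise_conv(x, rng.standard_normal((3, 3, C)))
y = squeeze_excite(y, rng.standard_normal((C // r, C)),
                      rng.standard_normal((C, C // r)))

# Rough counts for this layer: DW-CONV does H*W*K*K*C MACs over about
# H*W*C activations (~9 MACs per value loaded), while a standard 3x3
# CONV with C output channels would do C times more MACs on the same
# activations -- the reuse gap the MVP vector unit is designed around.
print(y.shape)  # (14, 14, 64)

With only about nine MACs per loaded activation, DW-CONV cannot keep a large systolic array busy, and SE's global pooling must wait for the entire preceding CONV output; together these illustrate the abstract's observation that such layers consume over 60% of processing time despite being only 10% of the operations.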
