Article

SPCIM: Sparsity-Balanced Practical CIM Accelerator With Optimized Spatial-Temporal Multi-Macro Utilization

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSI.2022.3216735

Keywords

Computer architecture; Common Information Model (computing); Topology; Parallel processing; Organizations; Artificial neural networks; Spatial databases; Compute-in-memory (CIM); neural network; sparsity; CIM dataflow; CIM accelerator

Abstract

Compute-in-memory (CIM) is a promising technique for reducing data movement in neural network acceleration. However, practical multi-macro accelerators suffer from spatial and temporal under-utilization. To address these problems, we propose a Sparsity-balanced Practical CIM accelerator (SPCIM) that optimizes both the dataflow and the hardware architecture. Experimental results show that SPCIM achieves 2.94× speedup and 2.86× energy saving compared with the baseline sparse CIM accelerator.
Compute-in-memory (CIM) is a promising technique that reduces data movement in neural network (NN) acceleration. To achieve higher efficiency, some recent CIM accelerators exploit NN sparsity based on CIM's small-grained operation unit (OU) feature. However, new problems arise in a practical multi-macro accelerator: the mismatch between workload parallelism and CIM macro organization causes spatial under-utilization, and the differing computation times of the multiple macros lead to temporal under-utilization. To solve these under-utilization problems, we propose a Sparsity-balanced Practical CIM accelerator (SPCIM), comprising an optimized dataflow and a hardware architecture design. For the CIM dataflow design, we first propose a reconfigurable cluster topology for CIM macro organization. We then regularize weight sparsity in the OU-height pattern and reorder the weight matrix based on the sparsity ratio. The cluster topology can be reshaped to match workload parallelism for higher spatial utilization, and each CIM cluster's workload is dynamically rebalanced for higher temporal utilization. Our hardware architecture supports the proposed dataflow with a spatial input dispatcher and a temporal workload allocator. Experimental results show that, compared with the baseline sparse CIM accelerator that suffers from spatial and temporal under-utilization, SPCIM achieves 2.94× speedup and 2.86× energy saving. The proposed sparsity-balanced dataflow and architecture are generic and scalable, and can be applied to other CIM accelerators. We strengthen two state-of-the-art CIM accelerators with the SPCIM techniques, improving their energy efficiency by 1.92× and 5.59×, respectively.
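The reordering and rebalancing idea described in the abstract can be illustrated with a rough sketch. The abstract does not give the algorithm, so everything below is an assumption made for illustration: the function names (`ou_block_mask`, `reorder_by_sparsity`, `balance_columns`) are hypothetical, a column's workload is approximated by the number of OU-height segments in it that contain a nonzero weight, and balancing uses a plain greedy longest-processing-time heuristic, which is a static stand-in for the paper's dynamic workload allocator and not the authors' method.

```python
import numpy as np


def ou_block_mask(weights, ou_height):
    """Mark which OU-height column segments contain at least one nonzero weight.

    Returns a boolean array of shape (n_blocks, n_cols). Rows are zero-padded
    so that the row count is a multiple of ou_height (an assumption).
    """
    rows, cols = weights.shape
    pad = (-rows) % ou_height
    padded = np.pad(weights, ((0, pad), (0, 0)))
    blocks = padded.reshape(-1, ou_height, cols)
    return np.any(blocks != 0, axis=1)


def reorder_by_sparsity(weights, ou_height):
    """Order columns from densest to sparsest, measured in nonzero OU blocks."""
    work = ou_block_mask(weights, ou_height).sum(axis=0)
    order = np.argsort(-work, kind="stable")
    return order, work


def balance_columns(work, n_clusters):
    """Greedy assignment: give the next-heaviest column to the least-loaded cluster."""
    loads = [0] * n_clusters
    assignment = [[] for _ in range(n_clusters)]
    for col in np.argsort(-work, kind="stable"):
        target = loads.index(min(loads))
        assignment[target].append(int(col))
        loads[target] += int(work[col])
    return assignment, loads


if __name__ == "__main__":
    # Toy 4x4 weight matrix with an assumed OU height of 2.
    W = np.array([[1, 1, 0, 0],
                  [0, 1, 0, 0],
                  [0, 1, 0, 0],
                  [0, 1, 0, 2]])
    order, work = reorder_by_sparsity(W, ou_height=2)
    assignment, loads = balance_columns(work, n_clusters=2)
    print("column order:", order.tolist())
    print("per-cluster columns:", assignment, "loads:", loads)
```

In this toy case the two clusters end up with equal block counts, which is the property the temporal rebalancing aims for: no cluster sits idle waiting for a more heavily loaded one to finish.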
