Journal
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS
Volume 70, Issue 1, Pages 214-227
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSI.2022.3216735
Keywords
Computer architecture; Common Information Model (computing); Topology; Parallel processing; Organizations; Artificial neural networks; Spatial databases; Compute-in-memory (CIM); neural network; sparsity; CIM dataflow; CIM accelerator
Compute-in-memory (CIM) is a promising technique for reducing data movement in neural network acceleration. However, practical multi-macro accelerators face issues of spatial and temporal under-utilization. To address these problems, we propose a Sparsity-balanced Practical CIM accelerator (SPCIM) that optimizes the dataflow and hardware architecture design. Experimental results show that SPCIM achieves significant speedup and energy savings compared to the baseline sparse CIM accelerator.
Compute-in-memory (CIM) is a promising technique that reduces data movement in neural network (NN) acceleration. To achieve higher efficiency, some recent CIM accelerators exploit NN sparsity based on CIM's small-grained operation unit (OU) feature. However, two new problems arise in a practical multi-macro accelerator: the mismatch between workload parallelism and CIM macro organization causes spatial under-utilization, and the differing computation times of the multiple macros lead to temporal under-utilization. To solve these under-utilization problems, we propose a Sparsity-balanced Practical CIM accelerator (SPCIM), which includes an optimized dataflow and hardware architecture design. For the CIM dataflow design, we first propose a reconfigurable cluster topology for CIM macro organization. We then regularize weight sparsity in the OU-height pattern and reorder the weight matrix based on the sparsity ratio. The cluster topology can be reshaped to match workload parallelism for higher spatial utilization, and each CIM cluster's workload is dynamically rebalanced for higher temporal utilization. Our hardware architecture supports the proposed dataflow with a spatial input dispatcher and a temporal workload allocator. Experimental results show that, compared with the baseline sparse CIM accelerator that suffers from spatial and temporal under-utilization, SPCIM achieves 2.94× speedup and 2.86× energy saving. The proposed sparsity-balanced dataflow and architecture are generic and scalable, and can be applied to other CIM accelerators. We strengthen two state-of-the-art CIM accelerators with the SPCIM techniques, improving their energy efficiency by 1.92× and 5.59×, respectively.
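The abstract does not spell out how the workload is rebalanced, so the following is only an illustrative sketch of the general idea, not the SPCIM algorithm. It assumes that the cost of a weight column is the number of OU-height row blocks containing at least one nonzero weight, and it substitutes a simple greedy longest-processing-time heuristic for the paper's reordering and dynamic rebalancing. All names (`ou_block_costs`, `balance_columns`, `temporal_utilization`) and the toy matrix are hypothetical.

```python
from typing import List, Sequence, Tuple


def ou_block_costs(weights: Sequence[Sequence[float]], ou_height: int) -> List[int]:
    """Per column, count the OU-height row blocks that hold any nonzero weight.

    Assumption: an all-zero block can be skipped, so the count is a proxy
    for the compute cycles that column needs.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    costs = []
    for c in range(n_cols):
        cost = 0
        for r0 in range(0, n_rows, ou_height):
            block = range(r0, min(r0 + ou_height, n_rows))
            if any(weights[r][c] != 0 for r in block):
                cost += 1
        costs.append(cost)
    return costs


def balance_columns(costs: Sequence[int], n_clusters: int) -> Tuple[List[List[int]], List[int]]:
    """Greedy heuristic: hand the costliest remaining column to the least-loaded cluster."""
    order = sorted(range(len(costs)), key=lambda c: costs[c], reverse=True)
    groups: List[List[int]] = [[] for _ in range(n_clusters)]
    loads = [0] * n_clusters
    for c in order:
        k = loads.index(min(loads))  # least-loaded cluster so far
        groups[k].append(c)
        loads[k] += costs[c]
    return groups, loads


def temporal_utilization(loads: Sequence[int]) -> float:
    """Fraction of cluster-cycles doing useful work while waiting for the slowest cluster."""
    peak = max(loads)
    return sum(loads) / (len(loads) * peak) if peak else 1.0


# Toy 6x4 weight matrix with OU height 2 (three row blocks per column).
W = [
    [1, 2, 0, 0],
    [0, 0, 0, 0],
    [3, 0, 0, 0],
    [0, 4, 0, 5],
    [6, 7, 8, 0],
    [0, 0, 0, 0],
]
costs = ou_block_costs(W, ou_height=2)
naive_loads = [sum(costs[:2]), sum(costs[2:])]  # contiguous split, no reordering
groups, balanced_loads = balance_columns(costs, n_clusters=2)
print(costs, naive_loads, temporal_utilization(naive_loads))
print(groups, balanced_loads, temporal_utilization(balanced_loads))
```

In this toy case the contiguous split leaves one cluster idle for most of the run, while the reordered assignment gives both clusters equal work. A real accelerator would also have to account for the reshaped cluster topology and input dispatch, which this sketch ignores.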