Article

A Multiply-and-Accumulate Array for Machine Learning Applications Based on a 3D Nanofabric Flow

Journal

IEEE TRANSACTIONS ON NANOTECHNOLOGY
Volume 20, Issue -, Pages 873-882

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TNANO.2021.3132224

Keywords

3D logic integration; emerging technologies; hardware accelerators; nanotechnologies

Funding

  1. IMEC core partners' CMOS program

This paper builds on a 3D integration scheme called 3D Nanofabric, which patterns multiple vertical layers simultaneously to reduce manufacturing cost. Using this scheme, the authors demonstrate a low-footprint MAC unit and a systolic 3D MAC array aimed at convolutional neural networks. Compared with traditional 2D implementations, the designs improve area, area-delay product, and TOPs/mm², at the cost of an energy overhead.
To sustain the cadence of Moore's law and improve the area, delay, and power of integrated circuits, novel fabrication schemes such as parallel and monolithic 3D integration have recently been proposed. While parallel 3D is limited by the large TSV pitch, monolithic 3D suffers from the high cost of the additional masks and processing steps, which limits the number of stacked transistor layers. In our previous work, we introduced a novel 3D integration scheme called 3D Nanofabric. Inspired by the 3D NAND flash process, the flow consists of N identical vertical tiers in which multiple vertical layers can be patterned simultaneously, significantly reducing the manufacturing cost. In this paper, we propose to build low-footprint Multiply-And-Accumulate (MAC) units using our 3D Nanofabric flow. Since a MAC unit can be laid out as a regular array, we demonstrate how to arrange it in a 3D fashion across several vertical tiers of the 3D Nanofabric. Through circuit-level evaluations, we show that for a 64-input bit MAC unit consisting of 64 stacked vertical tiers, the area and area-delay product are reduced by 21.0x and 16.7x, respectively, compared to a traditional 2D implementation in a 28 nm FDSOI technology, with only a 43% energy overhead. More importantly, the total fabrication cost is reduced, producing a cost scaling roadmap. Additionally, we show how to build a systolic 3D MAC array aimed at convolutional neural networks. Through architectural evaluations, we demonstrate that when running VGG-16, our 3D MAC array can improve the TOPs/mm² by 2.8x compared to a TPU-like 2D systolic array.
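
For readers unfamiliar with the building blocks named in the abstract, the following is a minimal behavioral sketch of a MAC operation and of a 2D systolic array of MAC units computing a matrix product. It is not the paper's 3D design: the vertical tiers are not modeled, and the function names `mac` and `systolic_matmul`, as well as the output-stationary dataflow, are assumptions chosen only for brevity.

```python
def mac(acc, a, b):
    """One multiply-and-accumulate step: returns acc + a * b."""
    return acc + a * b


def systolic_matmul(A, B):
    """Cycle-by-cycle model of an output-stationary systolic array (assumed dataflow).

    Processing element (i, j) keeps the accumulator for C[i][j]. Rows of A
    enter from the left and columns of B from the top, each skewed by one
    cycle per row/column, so the pair (A[i][k], B[k][j]) meets at
    element (i, j) on cycle t = i + j + k.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    # The last operand pair arrives at t = (n-1) + (p-1) + (m-1).
    cycles = n + p + m - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(p):
                k = t - i - j  # which operand pair reaches (i, j) this cycle
                if 0 <= k < m:
                    C[i][j] = mac(C[i][j], A[i][k], B[k][j])
    return C, cycles


if __name__ == "__main__":
    C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, cycles)
```

Each processing element performs at most one MAC per cycle, which is why the array's throughput per unit area depends directly on the footprint of the individual MAC unit.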
