Article

Fully On-Chip MAC at 14 nm Enabled by Accurate Row-Wise Programming of PCM-Based Weights and Parallel Vector-Transport in Duration-Format

Journal

IEEE TRANSACTIONS ON ELECTRON DEVICES
Volume 68, Issue 12, Pages 6629-6636

Publisher

IEEE (Institute of Electrical and Electronics Engineers Inc.)
DOI: 10.1109/TED.2021.3115993

Keywords

Analog accelerator; analog AI; analog multiply-accumulate (MAC) for deep neural networks (DNNs); deep learning accelerator; inference; in-memory computing; non-volatile memory (NVM); phase-change memory (PCM)

Funding

  1. IBM Research AI Hardware Center

Abstract

This article introduces a 14-nm test chip for Analog AI inference that uses phase-change memory (PCM) devices for high-efficiency MAC operations and DNN weight storage. A closed-loop tuning algorithm achieves accurate weight programming and transfer, leading to near-software-equivalent accuracy on two different DNNs.
Hardware acceleration of deep learning using analog non-volatile memory (NVM) requires large arrays with high device yield, high-accuracy multiply-accumulate (MAC) operations, and routing frameworks for implementing arbitrary deep neural network (DNN) topologies. In this article, we present a 14-nm test chip for Analog AI inference. It contains multiple arrays of phase-change memory (PCM) devices, each capable of storing 512 × 512 unique DNN weights and executing massively parallel MAC operations at the location of the data. DNN excitations are transported across the chip using a duration representation on a parallel and reconfigurable 2-D mesh. To accurately transfer inference models to the chip, we describe a closed-loop tuning (CLT) algorithm that programs the four PCM conductances in each weight, achieving <3% average weight error. A row-wise programming scheme and associated circuitry allow us to execute CLT on up to 512 weights concurrently. We show that the test chip can achieve near-software-equivalent accuracy on two different DNNs. We demonstrate tile-to-tile transport with a fully on-chip two-layer network for MNIST (accuracy degradation of approximately 0.6%) and show resilience to error propagation across long sequences (up to 10,000 characters) with a recurrent long short-term memory (LSTM) network, implementing off-chip activation and vector-vector operations to generate the recurrent inputs used in the next on-chip MAC.
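
The abstract compresses two ideas worth unpacking: a weight stored as four PCM conductances and tuned by a closed program-and-verify loop (CLT), and a MAC whose inputs arrive as pulse durations. The sketch below is a toy numerical model of the first idea, not the paper's implementation: the split into a most-significant pair and a least-significant pair with a significance factor F, the conductance range, the pulse response, and the noise level are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each weight is stored in four PCM conductances. A common encoding in the
# analog-AI literature splits the weight into a most-significant pair
# (Gp, Gm) and a least-significant pair (gp, gm) scaled by a factor F;
# F, the conductance range, and all device parameters here are assumed.
F = 8.0        # significance factor between the two pairs (assumed)
G_MAX = 25.0   # maximum device conductance, arbitrary units (assumed)

def weight(G):
    """Effective weight encoded by the four conductances [Gp, Gm, gp, gm]."""
    Gp, Gm, gp, gm = G
    return (Gp - Gm) + (gp - gm) / F

def program_pulse(device_g, target_g):
    """Crude device model: one programming pulse moves the conductance
    partway toward its target, with stochastic programming noise."""
    step = 0.5 * (target_g - device_g)
    return float(np.clip(device_g + step + rng.normal(0.0, 0.1), 0.0, G_MAX))

def closed_loop_tune(w_target, tol=0.03, max_iters=30):
    """Program-and-verify loop in the spirit of the paper's CLT: read the
    effective weight back, then pulse one of the four devices to cancel
    the residual error, stopping within tolerance (<3% here, per the paper)
    or after a fixed pulse budget."""
    G = np.zeros(4)  # [Gp, Gm, gp, gm], all devices start in RESET
    for _ in range(max_iters):
        err = w_target - weight(G)
        if abs(err) < tol * max(abs(w_target), 0.1):
            break
        if abs(err) * F > 1.0:      # large residual: correct on the MSP
            idx = 0 if err > 0 else 1
            G[idx] = program_pulse(G[idx], G[idx] + abs(err))
        else:                       # small residual: correct on the LSP
            idx = 2 if err > 0 else 3
            G[idx] = program_pulse(G[idx], G[idx] + F * abs(err))
    return G
```

A second sketch models the duration-format MAC: each excitation is broadcast as a pulse whose length encodes its value, each weight conducts for that duration, and the column integrates the resulting charge. The 8-bit duration quantization and the use of signed effective weights (rather than differential columns) are likewise assumptions.

```python
def duration_mac(weights, excitations, t_max=255):
    """Duration-format MAC model: excitation -> pulse length, each weight
    conducts for that duration, and the column integrates the charge.
    Assumes excitations normalized to [0, 1] and 8-bit durations."""
    durations = np.round(np.clip(excitations, 0.0, 1.0) * t_max)
    charge = np.dot(weights, durations)   # integrated column charge
    return charge / t_max                 # rescale back to weight * value

# Usage: program one 512-weight row with CLT, then run a duration MAC.
w_true = rng.uniform(-1.0, 1.0, size=512)
w_prog = np.array([weight(closed_loop_tune(w)) for w in w_true])
x = rng.uniform(0.0, 1.0, size=512)
print("programmed MAC:", duration_mac(w_prog, x))
print("ideal MAC:     ", float(w_true @ x))
```

In this toy model the printed values agree to within a few percent, the regime the paper's <3% average weight error targets; the real chip additionally contends with read noise, drift, and ADC quantization that this sketch omits.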
