Proceedings Paper

Achieving Full Parallelism in LSTM via a Unified Accelerator Design

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/ICCD50377.2020.00086

Keywords

Long Short-Term Memory; Edge computing; Accelerator; FPGA; ASIC

Funding

  1. National Science Foundation [NSF CCF-1820537]

Abstract

Recently, Long Short-Term Memory (LSTM), a type of recurrent neural network, has been widely employed in real-time applications such as speech recognition, word segmentation, and machine translation. While existing works demonstrate that LSTM can be deployed efficiently on cloud platforms, the high communication latency between cloud and edge drastically reduces its efficiency, so efficient LSTM accelerators at the edge are in high demand. The limited resources of edge devices and the heterogeneous operations in LSTM (e.g., the LSTM gates) make the accelerator design challenging. It seems straightforward to implement each operation as a dedicated hardware kernel; however, the data dependency among gates leads to significant running stalls in existing heterogeneous-kernel accelerators, resulting in low parallelism and low resource utilization. To overcome these challenges, this work proposes a novel generic LSTM accelerator design for Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) platforms, in which the two fundamental computing patterns (i.e., element-wise multiplication and addition) are incorporated into a unified computing kernel that executes the operations of all LSTM gates simultaneously. The running stalls caused by heterogeneous kernels are thus eliminated, achieving full parallelism in LSTM. The proposed technique and architecture are validated on a Xilinx PYNQ-Z1 FPGA, where they fully utilize the available resources and achieve 10x faster inference and a 15.2x improvement in computing power efficiency compared with the state-of-the-art LSTM accelerator.
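The unified-kernel idea described in the abstract is easiest to see in software. The sketch below is a minimal NumPy illustration, not the authors' hardware design; the function name, shapes, and variable names are hypothetical. It fuses the pre-activations of all four gates into a single pass (avoiding per-gate dispatch, the source of stalls in heterogeneous-kernel designs) and then reduces the recurrent update to the two element-wise patterns, multiplication and addition, that the paper's unified kernel implements in hardware.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step_fused(x, h_prev, c_prev, W, U, b):
        """One LSTM step with all four gates computed in a single fused pass.

        Hypothetical shapes, for illustration only:
          x: (input_dim,)    h_prev, c_prev: (hidden_dim,)
          W: (4*hidden_dim, input_dim)    U: (4*hidden_dim, hidden_dim)
          b: (4*hidden_dim,)
        """
        H = h_prev.shape[0]
        # All gate pre-activations in one pass: no separate kernels
        # for i, f, o, g, hence no inter-gate dispatch stalls.
        z = W @ x + U @ h_prev + b
        i = sigmoid(z[0:H])        # input gate
        f = sigmoid(z[H:2*H])      # forget gate
        o = sigmoid(z[2*H:3*H])    # output gate
        g = np.tanh(z[3*H:4*H])    # candidate cell state
        # Element-wise stage: only multiplies and adds, the two
        # computing patterns the unified kernel executes in parallel.
        c = f * c_prev + i * g
        h = o * np.tanh(c)
        return h, c

In software this fusion is a well-known optimization; the paper's contribution is realizing the analogous element-wise multiply/add structure as a single hardware kernel on FPGA and ASIC so that all gates proceed without stalling on one another.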
