Proceedings Paper

Achieving Full Parallelism in LSTM via a Unified Accelerator Design

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/ICCD50377.2020.00086

Keywords

Long Short-Term Memory; Edge computing; Accelerator; FPGA; ASIC

Funding

  1. National Science Foundation [NSF CCF-1820537]

Abstract

Recently, Long Short-Term Memory (LSTM), a type of recurrent neural network, has been widely employed in real-time applications such as speech recognition, word segmentation, and machine translation. While existing works demonstrate that LSTM can be deployed efficiently on cloud platforms, the high communication latency between cloud and edge drastically reduces its efficiency, so efficient LSTM accelerators at the edge are in high demand. The limited resources of edge devices and the heterogeneous operations in LSTM (e.g., the LSTM gates) make the accelerator design challenging. It seems straightforward to implement each operation as a dedicated hardware kernel; however, the data dependency among gates leads to significant running stalls in existing heterogeneous-kernel accelerators, resulting in low parallelism and low resource utilization. To overcome these challenges, this work proposes a novel generic LSTM accelerator design for Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) platforms, in which the two fundamental computing patterns (i.e., element-wise multiplication and addition) are incorporated into a unified computing kernel that executes the operations of all LSTM gates simultaneously. The running stalls caused by heterogeneous kernels are thus eliminated, achieving full parallelism in LSTM. The proposed technique and architecture are validated on a Xilinx PYNQ-Z1 FPGA, where they fully utilize the available resources and achieve 10x faster inference and a 15.2x improvement in computing power efficiency compared with the state-of-the-art LSTM accelerator.
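The unified-kernel idea described in the abstract is easiest to see in software. The sketch below is a minimal NumPy illustration, not the authors' hardware design; the function name, shapes, and variable names are hypothetical. It fuses the pre-activations of all four gates into a single pass (avoiding per-gate dispatch, the source of stalls in heterogeneous-kernel designs) and then reduces the recurrent update to the two element-wise patterns, multiplication and addition, that the paper's unified kernel implements in hardware.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step_fused(x, h_prev, c_prev, W, U, b):
        """One LSTM step with all four gates computed in a single fused pass.

        Hypothetical shapes, for illustration only:
          x: (input_dim,)    h_prev, c_prev: (hidden_dim,)
          W: (4*hidden_dim, input_dim)    U: (4*hidden_dim, hidden_dim)
          b: (4*hidden_dim,)
        """
        H = h_prev.shape[0]
        # All gate pre-activations in one pass: no separate kernels
        # for i, f, o, g, hence no inter-gate dispatch stalls.
        z = W @ x + U @ h_prev + b
        i = sigmoid(z[0:H])        # input gate
        f = sigmoid(z[H:2*H])      # forget gate
        o = sigmoid(z[2*H:3*H])    # output gate
        g = np.tanh(z[3*H:4*H])    # candidate cell state
        # Element-wise stage: only multiplies and adds, the two
        # computing patterns the unified kernel executes in parallel.
        c = f * c_prev + i * g
        h = o * np.tanh(c)
        return h, c

In software this fusion is a well-known optimization; the paper's contribution is realizing the analogous element-wise multiply/add structure as a single hardware kernel on FPGA and ASIC so that all gates proceed without stalling on one another.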
