Article

R2F: A Remote Retraining Framework for AIoT Processors With Computing Errors

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS (2021)

Journal

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Volume 29, Issue 11, Pages 1955-1966

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TVLSI.2021.3089224

Keywords

Program processors; Artificial neural networks; Training; Servers; Fault tolerant systems; Computational modeling; Data communication; Fault tolerance; redundancy; reliability

Categories

Computer Science, Hardware & Architecture; Engineering, Electrical & Electronic

Funding

National Key Research and Development Program of China [2020YFB1600201]
National Natural Science Foundation of China (NSFC) [61902375, 61876173]

Summary

AIoT processors fabricated with newer technology nodes are increasingly susceptible to soft errors, especially in deep learning accelerators. To address this issue, a remote retraining framework and an optimized partial triple modular redundancy strategy are proposed. The experiments show that this approach allows for tradeoffs between model accuracy and performance penalty, while a data transmission optimization method reduces retraining time significantly.

Abstract

Artificial Intelligence of Things (AIoT) processors fabricated with newer technology nodes suffer from increasing soft errors due to shrinking transistor sizes and lower supply voltages. Soft errors on AIoT processors, particularly on deep learning accelerators (DLAs) that perform massive amounts of computation, may cause substantial computing errors. These computing errors are difficult to capture with conventional training on general-purpose processors such as CPUs and GPUs in a server. Directly deploying offline-trained neural network models to edge accelerators with errors may therefore lead to considerable loss of prediction accuracy. To address this problem, we propose a remote retraining framework (R2F) for remote AIoT processors with computing errors. It places the remote AIoT processor with soft errors in the training loop, so that the on-site computing errors can be learned with the application data on the server and the retrained models become resilient to the soft errors. In addition, we propose an optimized partial triple modular redundancy (TMR) strategy to enhance the retraining. According to our experiments, R2F enables elastic design tradeoffs between model accuracy and performance penalty. The top-5 model accuracy can be improved by 1.93%-13.73% with a 0%-200% performance penalty at a high fault error rate. We also observe that the retraining requires massive data transmission, which can even dominate the training time. We therefore propose a sparse increment compression approach to optimize the data transmission, which reduces the retraining time by 38%-88% on average, with negligible accuracy loss, compared with straightforward remote retraining.
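
The abstract names two building blocks without giving details: triple modular redundancy and sparse increment transmission. The Python sketch below illustrates the general ideas only and is not the authors' implementation. All function names (`tmr_vote`, `encode_increment`, `apply_increment`) are hypothetical, the weights are assumed to be quantized integers so that increments apply exactly, and the paper's strategy for choosing which parts of the model receive partial TMR is not reproduced here.

```python
from collections import Counter


def tmr_vote(run, x):
    """Generic TMR: run the same computation three times, return the majority.

    `run` is a hypothetical callable standing in for one execution on
    error-prone hardware.
    """
    results = [run(x) for _ in range(3)]
    value, count = Counter(results).most_common(1)[0]
    # If all three results disagree there is no majority; fall back to the first.
    return value if count >= 2 else results[0]


def encode_increment(old, new, threshold=0):
    """Encode the change from `old` to `new` as sparse (index, delta) pairs.

    Only entries whose change exceeds `threshold` are kept, so a mostly
    unchanged weight vector yields a short list to transmit.
    """
    return [
        (i, n - o)
        for i, (o, n) in enumerate(zip(old, new))
        if abs(n - o) > threshold
    ]


def apply_increment(old, increment):
    """Rebuild the updated weights on the receiving side."""
    updated = list(old)
    for i, delta in increment:
        updated[i] += delta
    return updated
```

As a usage illustration, a receiver holding `old = [10, -3, 7, 0, 5]` can recover `new = [10, -2, 7, 0, 4]` from the two pairs returned by `encode_increment(old, new)` instead of from all five values. Raising `threshold` above zero drops small changes, which shrinks the payload further at the cost of an inexact reconstruction.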
