Article

Radiation-Tolerant Deep Learning Processor Unit (DPU)-Based Platform Using Xilinx 20-nm Kintex UltraScale FPGA

Journal

IEEE TRANSACTIONS ON NUCLEAR SCIENCE
Volume 70, Issue 4, Pages 714-721

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TNS.2022.3216360

Keywords

Deep learning processing unit (DPU); fault-aware training (FAT); neural networks (NNs); ResNet; single-event functional interrupt (SEFI); single-event upset (SEU); Xilinx 20-nm UltraScale


This article presents a platform and design approach for enabling radiation-tolerant deep learning acceleration. The platform includes solutions to mitigate radiation-induced interrupts and network datapath corruptions. Test results show significant improvements in system response and reduction in errors using the radiation-tolerant platform compared to standard nonmitigated approaches.
This article presents a platform and design approach for enabling radiation-tolerant deep learning acceleration on static random access memory (SRAM)-based 20-nm Kintex UltraScale field-programmable gate arrays (FPGAs), for terrestrial and high-radiation environments. The platform is suitable for deep neural network (DNN) implementations, with an emphasis on image classification, and includes solutions to mitigate both radiation-induced single-event functional interrupts (SEFIs) and network datapath corruptions. To mitigate SEFIs, the platform combines Xilinx's deep learning processing unit (DPU) IP, a triple modular redundancy (TMR) MicroBlaze soft processor IP, and the soft error mitigation (SEM) IP. In addition, a technique known as fault-aware training (FAT) was applied to mitigate single-event effects in the datapath. Test results from a high-energy proton beam (>60 MeV) experiment using the ResNet-18 convolutional neural network (CNN) for image classification are presented. The single-event upset (SEU) rate, system-level SEFI rate, and neural network classification/datapath performance are compared between the radiation-tolerant platform and a standard, nonmitigated approach. Results show that datapath classification errors dominate the system response (90%), with SEFIs accounting for the remaining 10%. Compared to standard nonmitigated training techniques, the radiation-tolerant platform using FAT methods shows marked improvements in overall system response: the overall single-event cross section was reduced by half, and a 40% reduction in misclassification errors was observed. In addition, datapath events with classification accuracy degradation larger than 5% were completely mitigated. The SEFI rate was reduced by a factor of 100 with the implemented solutions and can be further reduced by optimizing the physical separation between TMR modules.
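
The triple modular redundancy mentioned in the abstract can be illustrated with a generic bitwise 2-of-3 majority voter. The following is only a minimal sketch of the general TMR idea, not the MicroBlaze TMR implementation used in the article; the function names `majority_vote` and `flip_bit` and the example word `golden` are hypothetical and chosen purely for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of a word."""
    return word ^ (1 << bit)


# Hypothetical 8-bit value held identically by three redundant modules.
golden = 0b1011_0010

# Case 1: an upset in one replica is outvoted by the other two.
single_upset = majority_vote(golden, flip_bit(golden, 3), golden)

# Case 2: the same bit is upset in two replicas, so the voter
# passes the corrupted value through.
double_upset = majority_vote(flip_bit(golden, 3), flip_bit(golden, 3), golden)

print(single_upset == golden)  # upset masked
print(double_upset == golden)  # upset not masked
```

The second case shows why a voter alone is not sufficient when one event can disturb two replicas at once, which is consistent with the abstract's remark that the SEFI rate can be reduced further by optimizing the physical separation between TMR modules.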


