Journal
IEEE EMBEDDED SYSTEMS LETTERS
Volume 14, Issue 1, Pages 15-18
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/LES.2021.3087707
Keywords
Graphics processing units; Pipeline processing; Throughput; Optimization; Deep learning; Engines; Space exploration; Acceleration; deep learning (DL); optimization
Funding
- National Research Foundation of Korea (NRF) - Korea Government (MSIT) [NRF-2019R1A2B5B02069406]
As deep learning (DL) inference applications proliferate, embedded devices increasingly carry neural processing units (NPUs) in addition to a CPU and a GPU. For fast and efficient development of DL applications, NVIDIA provides TensorRT as the software development kit for its hardware platform, including an optimizer and a runtime that deliver low latency and high throughput for DL inference. Like most DL frameworks, TensorRT assumes that inference is executed on a single processing element, either a GPU or an NPU, not both. In this letter, we propose a parallelization methodology that maximizes the throughput of a single DL application by using both the GPU and the NPU, exploiting various types of parallelism on top of TensorRT. With six real-life benchmarks, we achieved an 81%-391% throughput improvement over the baseline inference using the GPU only.
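The abstract's central idea is that a stream of inference requests can be split across two processing elements so their throughputs add up. The sketch below illustrates that idea only in the abstract; it is not the authors' implementation and does not use the actual TensorRT API. The two "engines" are stand-ins with assumed per-inference latencies, and the worker threads draining a shared request queue model the parallel dispatch to GPU and NPU.

```python
import threading
import queue
import time

def run_engine(name, latency_s, requests, results):
    """Consume requests from a shared queue, simulating one processing element."""
    while True:
        item = requests.get()
        if item is None:          # poison pill: shut this worker down
            requests.task_done()
            break
        time.sleep(latency_s)     # stand-in for the actual inference call
        results.append((name, item))
        requests.task_done()

requests, results = queue.Queue(), []
# Hypothetical latencies: the "GPU" engine is twice as fast as the "NPU" one.
workers = [
    threading.Thread(target=run_engine, args=("gpu", 0.002, requests, results)),
    threading.Thread(target=run_engine, args=("npu", 0.004, requests, results)),
]
for w in workers:
    w.start()

for i in range(100):              # enqueue 100 inference requests
    requests.put(i)
requests.join()                   # wait until every request has been served

for _ in workers:                 # one poison pill per worker
    requests.put(None)
for w in workers:
    w.join()

print(len(results))
```

Because both workers drain the same queue, the faster engine naturally takes a larger share of the requests, which is a simple form of the load balancing a real GPU+NPU pipeline must perform.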
Authors