4.3 Article

Deep Learning Inference Parallelization on Heterogeneous Processors With TensorRT

Journal

IEEE Embedded Systems Letters
Volume 14, Issue 1, Pages 15-18

Publisher

IEEE (Institute of Electrical and Electronics Engineers), Inc.
DOI: 10.1109/LES.2021.3087707

Keywords

Graphics processing units; Pipeline processing; Throughput; Optimization; Deep learning (DL); Engines; Space exploration; Acceleration

Funding

  1. National Research Foundation of Korea (NRF) - Korea Government (MSIT) [NRF-2019R1A2B5B02069406]

Abstract

As deep learning (DL) inference applications proliferate, embedded devices increasingly include neural processing units (NPUs) in addition to a CPU and a GPU. For fast and efficient development of DL applications, NVIDIA provides TensorRT as the software development kit for its hardware platforms; it includes an optimizer and a runtime that deliver low latency and high throughput for DL inference. Like most DL frameworks, TensorRT assumes that inference runs on a single processing element, either the GPU or the NPU, not both. In this letter, we propose a parallelization methodology that maximizes the throughput of a single DL application by using both the GPU and the NPU, exploiting various types of parallelism on top of TensorRT. On six real-life benchmarks, we achieve an 81%-391% throughput improvement over the baseline inference that uses the GPU only.
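
For illustration only, the sketch below shows the kind of GPU/NPU partitioning that TensorRT exposes on NVIDIA Jetson-class devices, where the NPU role is typically played by the Deep Learning Accelerator (DLA): one engine is built for the GPU and one for the DLA, and an inference request is issued to each through its own CUDA stream so the two can run concurrently. It assumes the TensorRT 8.x Python bindings, PyCUDA, a network with static shapes and a single input and output, and a hypothetical ONNX file "model.onnx"; it is not the authors' parallelization methodology, only a minimal starting point.

# A minimal sketch, not the authors' method: build one TensorRT engine for the
# GPU and one for the DLA (the "NPU" on NVIDIA Jetson-class devices), then issue
# one inference request to each through its own CUDA stream so the two can run
# concurrently.  Assumptions: TensorRT 8.x Python bindings, PyCUDA, a network
# with static shapes and a single input/output, and a hypothetical ONNX file
# "model.onnx".
import numpy as np
import pycuda.autoinit          # creates the CUDA context
import pycuda.driver as cuda
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.WARNING)


def build_engine(onnx_path, use_dla=False):
    """Parse an ONNX model and build an engine for the GPU or the DLA."""
    builder = trt.Builder(LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(str(parser.get_error(0)))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)              # DLA requires FP16 or INT8
    if use_dla:
        config.default_device_type = trt.DeviceType.DLA
        config.DLA_core = 0
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # DLA-unsupported layers fall back to the GPU
    serialized = builder.build_serialized_network(network, config)
    return trt.Runtime(LOGGER).deserialize_cuda_engine(serialized)


def make_buffers(engine):
    """Allocate pinned host and device buffers for every binding."""
    host_bufs, dev_bufs, bindings = [], [], []
    for i in range(engine.num_bindings):
        size = trt.volume(engine.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host = cuda.pagelocked_empty(size, dtype)
        dev = cuda.mem_alloc(host.nbytes)
        host_bufs.append(host)
        dev_bufs.append(dev)
        bindings.append(int(dev))
    return host_bufs, dev_bufs, bindings


# One engine, execution context, and stream per processing element.
engines = [build_engine("model.onnx", use_dla=False),   # GPU
           build_engine("model.onnx", use_dla=True)]    # DLA ("NPU")

inflight = []
for engine in engines:
    stream = cuda.Stream()
    context = engine.create_execution_context()
    host_bufs, dev_bufs, bindings = make_buffers(engine)
    host_bufs[0][:] = np.random.rand(host_bufs[0].size).astype(host_bufs[0].dtype)
    cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_bufs[-1], dev_bufs[-1], stream)
    inflight.append((stream, context, host_bufs, dev_bufs))

# Both requests are now in flight on different devices; synchronizing only after
# both enqueues keeps the GPU and DLA executions free to overlap.
for stream, _, host_bufs, _ in inflight:
    stream.synchronize()
    print("first outputs:", host_bufs[-1][:4])

Keeping several requests outstanding per device and overlapping transfers with execution is what ultimately determines throughput; the letter's methodology goes further by exploiting multiple types of parallelism across both processing elements for a single application, which this sketch does not attempt.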
