Article

Real-Time Monocular Depth Estimation Merging Vision Transformers on Edge Devices for AIoT

Journal

IEEE Transactions on Instrumentation and Measurement

Publisher

Institute of Electrical and Electronics Engineers (IEEE)
DOI: 10.1109/TIM.2023.3264039

Keywords

Estimation; Semantics; Real-time systems; Transformers; Feature extraction; Decoding; Task analysis; Artificial intelligence of things (AIoT); attention; monocular depth estimation; real-time; transformers


Abstract

Depth estimation is requisite for building the 3-D perception capability of the artificial intelligence of things (AIoT). Real-time inference with extremely low computing-resource consumption is critical on edge devices. However, most single-view depth estimation networks focus on improving accuracy when running on high-end GPUs, which works against the real-time requirement on edge devices. To address this issue, this article proposes a novel encoder-decoder network that realizes real-time monocular depth estimation on edge devices. The proposed network merges semantic information over a global field via an efficient transformer-based module to provide more object detail for depth assignment. The transformer-based module is integrated at the lowest-resolution level of the encoder-decoder architecture, which greatly reduces the parameter count of the vision transformer (ViT). In particular, we propose a novel patch convolutional layer for low-latency feature extraction in the encoder and an SConv5 layer for effective depth assignment in the decoder. The proposed network achieves an outstanding balance between accuracy and speed on the NYU Depth v2 dataset: a low root mean square error (RMSE) of 0.554 and a fast speed of 58.98 FPS on an NVIDIA Jetson Nano with TensorRT optimization, outperforming most state-of-the-art real-time results.
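The abstract describes the design at a high level but the page carries no code, so the following is a minimal PyTorch sketch of that description, not the authors' implementation: patch convolutional layers in the encoder, a small transformer encoder applied only at the lowest-resolution feature map (so self-attention runs over few tokens and the ViT parameter count stays small), and a decoder that upsamples with SConv5-style blocks. The PatchConv and SConv5 definitions, channel widths, and layer counts below are assumptions for illustration; the paper's actual layers may differ.

```python
# Minimal sketch (NOT the authors' code) of the architecture the abstract
# describes: an encoder-decoder for monocular depth estimation with a
# transformer-based module at the lowest-resolution feature map.
# PatchConv, SConv5, and all sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class PatchConv(nn.Module):
    """Assumed low-latency feature extractor: a strided convolution that
    downsamples like ViT patch embedding while staying fully convolutional."""
    def __init__(self, in_ch, out_ch, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=patch, stride=patch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.proj(x))

class SConv5(nn.Module):
    """Assumed decoder block: depthwise-separable 5x5 convolution, a common
    way to enlarge the receptive field cheaply for depth assignment."""
    def __init__(self, ch):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, 5, padding=2, groups=ch)  # depthwise 5x5
        self.pw = nn.Conv2d(ch, ch, 1)                        # pointwise mix
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pw(self.dw(x)))

class DepthNet(nn.Module):
    """Encoder-decoder with a transformer bottleneck at the lowest resolution,
    so self-attention runs on few tokens and ViT parameters stay small."""
    def __init__(self, dim=128):
        super().__init__()
        self.enc1 = PatchConv(3, dim // 2, patch=4)    # 1/4 resolution
        self.enc2 = PatchConv(dim // 2, dim, patch=4)  # 1/16 resolution
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=2 * dim, batch_first=True)
        self.bottleneck = nn.TransformerEncoder(layer, num_layers=2)
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            SConv5(dim),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            SConv5(dim),
            nn.Conv2d(dim, 1, 3, padding=1),           # per-pixel depth map
        )

    def forward(self, x):
        f = self.enc2(self.enc1(x))                    # B x C x H/16 x W/16
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)          # B x (H*W/256) x C
        tokens = self.bottleneck(tokens)               # global self-attention
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.dec(f)

depth = DepthNet()(torch.randn(1, 3, 224, 224))        # -> 1 x 1 x 224 x 224
```

Restricting attention to the 1/16-resolution bottleneck keeps the quadratic token cost manageable on a Jetson-class device. For deployment, a model like this would typically be exported to ONNX and compiled into a TensorRT engine, which is consistent with the TensorRT-optimized Jetson Nano figures quoted in the abstract.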

