
CNN, RNN, or ViT? An Evaluation of Different Deep Learning Architectures for Spatio-Temporal Representation of Sentinel Time Series

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (2023)

Journal

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING

Volume 16, Pages 44-56

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/JSTARS.2022.3219816

Keywords

Deep learning models; land-cover classification; Sentinel images; spatio-temporal remote sensing images; vision transformer

Automated Summary

This study compared the performance of different deep learning structures in extracting spatio-temporal information. The results showed that the 3-D CNN and the Vision Transformer achieved the best performance, followed by the 2-D CNN and conventional methods. It was also found that using optical images alone was sufficient for land-cover classification, but SAR images could provide satisfactory results when optical images were unavailable.

Abstract

Rich information in multitemporal satellite images can facilitate pixel-level land-cover classification. However, which deep learning architecture is most suitable for high-dimensional spatio-temporal representation of remote sensing time series remains unclear. In this study, we theoretically analyzed the mechanisms of different deep learning structures, including the commonly used convolutional neural network (CNN), the high-dimensional CNN [three-dimensional (3-D) CNN], the recurrent neural network (RNN), and the newest vision transformer (ViT), with regard to learning and representing the temporal information in spatio-temporal data. The performance of the different models was comprehensively evaluated on large-scale Sentinel-1 and Sentinel-2 time-series images covering the whole of Slovenia. First, the 3-D CNN, long short-term memory (LSTM), and ViT, which all have specific structures that preserve temporal information, can effectively extract the spatio-temporal information, with the 3-D CNN and ViT showing the best performance. Second, the performance of the 2-D CNN, in which the temporal information is collapsed, is lower than that of the 3-D CNN, LSTM, and ViT, but it still outperforms the conventional methods. Third, using both optical and synthetic aperture radar (SAR) images performs almost the same as using only optical images, indicating that the information that can be extracted from optical images is sufficient for land-cover classification. However, when optical images are unavailable, SAR images can provide satisfactory classification results. Finally, the modern deep learning methods can effectively overcome adverse imaging conditions in which parts of an image, or the images of some periods, are missing. The testing data are available at gpcv.whu.edu.cn/data.
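The distinction the abstract draws between architectures that preserve the temporal axis and the 2-D CNN that collapses it can be illustrated by how each one would typically receive the same image time series. The NumPy sketch below is only an illustration under stated assumptions, not the authors' implementation: the tensor layout `(T, C, H, W)`, the array sizes, the patch size, and all function names are hypothetical, and the "convolutions" are toy stand-ins (a mean filter along time and a random 1x1 channel mix) chosen only to show what happens to the time axis.

```python
import numpy as np


def as_2d_cnn_input(x):
    """(T, C, H, W) -> (T*C, H, W): dates are folded into the channel axis."""
    t, c, h, w = x.shape
    return x.reshape(t * c, h, w)


def as_3d_cnn_input(x):
    """(T, C, H, W) -> (C, T, H, W): time stays a separate axis to convolve over."""
    return x.transpose(1, 0, 2, 3)


def as_lstm_input(x):
    """(T, C, H, W) -> (H*W, T, C): one length-T spectral sequence per pixel."""
    t, c, h, w = x.shape
    return x.transpose(2, 3, 0, 1).reshape(h * w, t, c)


def as_vit_tokens(x, patch):
    """(T, C, H, W) -> (T * n_patches, C*patch*patch): one token per patch per date.

    Assumption: tokens are built per date; other tokenizations are possible.
    """
    t, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    x = x[:, :, : gh * patch, : gw * patch]
    x = x.reshape(t, c, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # (T, gh, gw, C, patch, patch)
    return x.reshape(t * gh * gw, c * patch * patch)


def temporal_mean_filter(x_cthw, kernel_t):
    """Toy temporal kernel ('valid' padding): output keeps a time axis of T - k + 1."""
    t = x_cthw.shape[1]
    windows = [x_cthw[:, i : i + kernel_t].mean(axis=1) for i in range(t - kernel_t + 1)]
    return np.stack(windows, axis=1)


def pointwise_2d_conv(x_chw, out_channels, rng):
    """Toy 1x1 2-D convolution: each output channel mixes all input channels,
    so after one layer the stacked dates are no longer a distinct axis."""
    weights = rng.standard_normal((out_channels, x_chw.shape[0]))
    return np.einsum("oc,chw->ohw", weights, x_chw)


rng = np.random.default_rng(0)
# Hypothetical series: 12 acquisition dates, 4 bands, 16x16 pixels.
series = rng.standard_normal((12, 4, 16, 16))

cnn2d_in = as_2d_cnn_input(series)
cnn3d_in = as_3d_cnn_input(series)
lstm_in = as_lstm_input(series)
vit_in = as_vit_tokens(series, patch=4)

cnn2d_out = pointwise_2d_conv(cnn2d_in, out_channels=8, rng=rng)
cnn3d_out = temporal_mean_filter(cnn3d_in, kernel_t=3)

for name, arr in [
    ("2-D CNN input", cnn2d_in),
    ("2-D CNN after one layer", cnn2d_out),
    ("3-D CNN input", cnn3d_in),
    ("3-D CNN after temporal kernel", cnn3d_out),
    ("LSTM input", lstm_in),
    ("ViT tokens", vit_in),
]:
    print(f"{name}: {arr.shape}")
```

In this sketch the 2-D path has no time dimension left after its first layer, whereas the 3-D path still carries an explicit (shortened) time axis, the LSTM input keeps the dates in order for every pixel, and the ViT tokens retain one group of patches per date.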
