Article

A ViT-Based Multiscale Feature Fusion Approach for Remote Sensing Image Segmentation

Publisher

IEEE (Institute of Electrical and Electronics Engineers Inc.)
DOI: 10.1109/LGRS.2022.3187135

Keywords

Feature extraction; Image segmentation; Transformers; Decoding; Three-dimensional displays; Dams; Transforms; Dimension attention module (DAM); remote sensing image; semantic segmentation; vision transformer (ViT)

Funding

  1. Natural Science Foundation of Hunan Province [2019JJ80105]
  2. Changsha City Science and Technology Plan Project [kq2004071]
  3. Scientific Research Project of the Hunan Provincial Department of Education [20C1249]

Abstract

Semantic segmentation plays an indispensable role in the automatic analysis of remote sensing image data. However, the abundant semantic information and irregular shape patterns in remote sensing images are difficult to exploit, making it hard to segment such images using only convolution and single-scale feature maps. To achieve better segmentation performance, a multiscale feature pyramid decoder (MFPD) is proposed to fuse image features extracted by a vision transformer (ViT). The decoder employs a novel 2-D-to-3-D transform to obtain multiscale feature maps that contain rich context information and fuses them by channel concatenation. Furthermore, a dimension attention module (DAM) is designed to further aggregate the context information of the extracted remote sensing image features. The approach achieves superior mean intersection over union (mIoU) on the Gaofen2-CZ dataset (60.42%) and the GID-5 dataset (68.21%), and experimental results indicate that its overall performance exceeds that of the compared segmentation methods based on convolutional neural networks (CNNs) and ViT.
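The decoder pipeline described in the abstract can be sketched in NumPy. This is a minimal illustration under assumptions of my own, not the authors' implementation: I fold the ViT token sequence back onto the patch grid (one plausible reading of the "2-D-to-3-D transform"), build the pyramid with average pooling at a few strides, upsample with nearest-neighbour repetition, and fuse by channel concatenation as the abstract states. All function names here are hypothetical.

```python
import numpy as np

def tokens_to_feature_map(tokens, grid_h, grid_w):
    """Fold a ViT token sequence (N, D) into a feature map (D, H, W).

    Assumption: N == grid_h * grid_w, i.e. the class token (if any)
    has already been dropped.
    """
    n, d = tokens.shape
    assert n == grid_h * grid_w, "token count must match the patch grid"
    return tokens.reshape(grid_h, grid_w, d).transpose(2, 0, 1)

def multiscale_fuse(fmap, scales=(1, 2, 4)):
    """Build a feature pyramid by average pooling at several strides,
    upsample each level back (nearest neighbour), and fuse the levels
    by channel concatenation."""
    d, h, w = fmap.shape
    levels = []
    for s in scales:
        # Crop so the spatial size divides evenly, then average-pool
        # s x s windows via a reshape-and-mean.
        cropped = fmap[:, :h - h % s, :w - w % s]
        pooled = cropped.reshape(d, h // s, s, w // s, s).mean(axis=(2, 4))
        # Nearest-neighbour upsampling back to the full resolution.
        up = pooled.repeat(s, axis=1).repeat(s, axis=2)
        levels.append(up[:, :h, :w])
    # Channel concatenation, as described in the abstract.
    return np.concatenate(levels, axis=0)
```

With a 16 x 16 patch grid and 8-channel tokens, `multiscale_fuse` at three scales yields a `(24, 16, 16)` fused map; the scale-1 level passes through unchanged.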
