Article

A Swin Transformer-Based Encoding Booster Integrated in U-Shaped Network for Building Extraction

Journal

REMOTE SENSING
Volume 14, Issue 11

Publisher

MDPI
DOI: 10.3390/rs14112611

Keywords

building extraction; deep learning; U-shaped network; swin Transformer; encoding booster; self-attention; semantic information

Funding

  1. NSFC [61901341, 61403291]
  2. China Postdoctoral Science Foundation [2021TQ0260]
  3. Natural Science Foundation of Shaanxi Province [2020JQ-301]
  4. GHfund [202107020822, 202202022633]


This paper proposes a shifted-window (swin) Transformer-based encoding booster for efficient extraction of building areas in remote sensing images. By integrating the encoding booster in a specially designed U-shaped network, the feature-level fusion of local and large-scale semantics is achieved. Experimental results demonstrate that the proposed method achieves higher accuracy in extracting buildings of different scales compared to state-of-the-art networks.
Building extraction is a popular topic in remote sensing image processing. Efficient building extraction algorithms can identify and segment building areas to provide informative data for downstream tasks. Currently, building extraction is mainly achieved by deep convolutional neural networks (CNNs) based on the U-shaped encoder-decoder architecture. However, the local receptive field of the convolutional operation makes it difficult for CNNs to fully capture the semantic information of large buildings, especially in high-resolution remote sensing images. Motivated by the recent success of the Transformer in computer vision tasks, in this paper we first propose a shifted-window (swin) Transformer-based encoding booster. The proposed encoding booster includes a swin Transformer pyramid containing patch merging layers for down-sampling, which enables the encoding booster to extract semantics from multi-level features at different scales. Most importantly, the receptive field is significantly expanded by the self-attention mechanism of the swin Transformer, allowing the encoding booster to capture large-scale semantic information effectively and transcend the limitations of CNNs. Furthermore, we integrate the encoding booster into a specially designed U-shaped network in a novel manner, named the Swin Transformer-based Encoding Booster U-shaped Network (STEB-UNet), to achieve the feature-level fusion of local and large-scale semantics. Remarkably, compared with other Transformer-based networks, the computational complexity and memory requirements of the STEB-UNet are significantly reduced by the swin design, making the network much easier to train. Experimental results show that the STEB-UNet can effectively discriminate and extract buildings of different scales, and it demonstrates higher accuracy than state-of-the-art networks on public datasets.
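The two swin-specific operations the abstract relies on — window-limited self-attention and patch merging for the pyramid's down-sampling — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the shapes, function names, and the 7x7 window size are illustrative assumptions following the standard swin design. The key point is that attention is computed within fixed-size windows, so its cost grows linearly with image area rather than quadratically, which is why the abstract notes reduced complexity compared with other Transformer-based networks.

```python
import numpy as np

def window_partition(x, window_size):
    # Split an (H, W, C) feature map into non-overlapping windows of
    # window_size x window_size tokens. Self-attention is then computed
    # within each window, so the quadratic cost applies only to
    # window_size**2 tokens, not to all H*W tokens.
    H, W, C = x.shape
    M = window_size
    x = x.reshape(H // M, M, W // M, M, C)
    # -> (num_windows, tokens_per_window, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

def patch_merging(x):
    # 2x2 patch merging: halve the spatial resolution and concatenate
    # the four neighbours along channels (C -> 4C). Stacking this layer
    # builds the multi-scale swin Transformer pyramid described above.
    x0 = x[0::2, 0::2, :]
    x1 = x[1::2, 0::2, :]
    x2 = x[0::2, 1::2, :]
    x3 = x[1::2, 1::2, :]
    return np.concatenate([x0, x1, x2, x3], axis=-1)

feat = np.zeros((56, 56, 96))              # an assumed stage-1 feature map
wins = window_partition(feat, 7)
print(wins.shape)                          # (64, 49, 96): 64 windows, 49 tokens each
merged = patch_merging(feat)
print(merged.shape)                        # (28, 28, 384): next pyramid level
```

The "shifted" half of shifted-window attention simply offsets the grid (e.g. `np.roll(feat, (-3, -3), axis=(0, 1))` before partitioning) on alternate layers so that information flows between neighbouring windows.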

Authors

