Article

Transformer-Based Semantic Segmentation for Extraction of Building Footprints from Very-High-Resolution Images

Journal

SENSORS
Volume 23, Issue 11

Publisher

MDPI
DOI: 10.3390/s23115166

Keywords

vision transformer; hyperparameter; building; self-attention; deep learning

This article explores the role of Vision Transformer networks in extracting building footprints from high-resolution satellite images. Different hyperparameter values were used to design and compare Transformer-based models, and their impact on accuracy was analyzed. The results suggest that smaller image patches and higher-dimensional embeddings contribute to higher accuracy. Furthermore, the Transformer-based network is shown to be scalable and can be trained with general-scale GPUs while achieving higher accuracy than convolutional neural networks.
Semantic segmentation with deep learning networks has become an important approach to extracting objects from very-high-resolution (VHR) remote sensing images. Vision Transformer networks have shown significant improvements in performance over traditional convolutional neural networks (CNNs) in semantic segmentation. Vision Transformer networks have architectures that differ from those of CNNs: image patches, linear embedding, and multi-head self-attention (MHSA) are among their main hyperparameters. How these should be configured for object extraction from VHR images, and how they affect network accuracy, has not been sufficiently investigated. This article explores the role of Vision Transformer networks in the extraction of building footprints from VHR images. Transformer-based models with different hyperparameter values were designed and compared, and their impact on accuracy was analyzed. The results show that smaller image patches and higher-dimensional embeddings lead to better accuracy. In addition, the Transformer-based network is shown to be scalable and can be trained with general-scale graphics processing units (GPUs), with model sizes and training times comparable to those of CNNs, while achieving higher accuracy. The study provides valuable insights into the potential of Vision Transformer networks for object extraction from VHR images.
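
Because the abstract centers on patch size, embedding dimension, and MHSA configuration as the key hyperparameters, a minimal PyTorch-style sketch of a Vision Transformer encoder may help make those settings concrete. This is an illustrative assumption, not the authors' implementation: the class names, the default values (patch_size=8, embed_dim=256, num_heads=8, depth=6), and the omission of positional embeddings and a segmentation decoder are all simplifications made here for brevity.

    # Illustrative sketch only -- not the network described in the article.
    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split an image into non-overlapping patches and linearly embed them."""
        def __init__(self, in_channels=3, patch_size=8, embed_dim=256):
            super().__init__()
            # A strided convolution is a common way to implement patching + linear embedding.
            self.proj = nn.Conv2d(in_channels, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                      # x: (B, C, H, W)
            x = self.proj(x)                       # (B, embed_dim, H/p, W/p)
            return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)

    class ViTSegEncoder(nn.Module):
        """Patch embedding followed by a stack of MHSA-based encoder layers."""
        def __init__(self, patch_size=8, embed_dim=256, num_heads=8, depth=6):
            super().__init__()
            self.patch_embed = PatchEmbedding(patch_size=patch_size, embed_dim=embed_dim)
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, x):
            tokens = self.patch_embed(x)           # one token per image patch
            return self.encoder(tokens)            # contextualized patch features

    # Example: a 256x256 image with 8x8 patches yields 1024 tokens of width 256.
    model = ViTSegEncoder(patch_size=8, embed_dim=256, num_heads=8, depth=6)
    out = model(torch.randn(1, 3, 256, 256))
    print(out.shape)                               # torch.Size([1, 1024, 256])

In this sketch, shrinking patch_size produces more, finer-grained tokens per image, and raising embed_dim widens every token representation; these are the two settings the abstract reports as improving building-footprint extraction accuracy.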
