Article

C3Net: Cross-Modal Feature Recalibrated, Cross-Scale Semantic Aggregated and Compact Network for Semantic Segmentation of Multi-Modal High-Resolution Aerial Images

Journal

REMOTE SENSING
Volume 13, Issue 3, Article 528

Publisher

MDPI
DOI: 10.3390/rs13030528

Keywords

semantic segmentation; multi-modal learning; deep neural network design

Funding

  1. National Natural Science Foundation of China [41701508, 61725105]


This study introduces C3Net, an efficient model for the semantic segmentation of multi-modal remote sensing images that strikes a balance between speed and accuracy. Modality-specific backbone networks and a plug-and-play module extract and recalibrate multi-modal features, the semantic contextual extraction module is redesigned around lightweight convolutional groups to reduce the number of model parameters, and a multi-level knowledge distillation strategy enhances the performance of the compact model.
Semantic segmentation of multi-modal remote sensing images is an important branch of remote sensing image interpretation. Multi-modal data have been shown to provide rich complementary information for dealing with complex scenes, and deep learning methods have achieved remarkable segmentation results in recent years. It is common to simply concatenate multi-modal data or to extract multi-modal features separately with parallel branches. However, most existing works ignore the noise and redundant features introduced by different modalities, which can compromise the results. On the one hand, existing networks neither learn the complementary information of different modalities nor suppress the mutual interference between them, which may decrease segmentation accuracy. On the other hand, the introduction of multi-modal data greatly increases the running time of pixel-level dense prediction. In this work, we propose an efficient C3Net that strikes a balance between speed and accuracy. More specifically, C3Net contains several backbones for extracting the features of different modalities. A plug-and-play module is then designed to effectively recalibrate and aggregate the multi-modal features. To reduce the number of model parameters while maintaining model performance, we redesign the semantic contextual extraction module based on lightweight convolutional groups. In addition, a multi-level knowledge distillation strategy is proposed to improve the performance of the compact model. Experiments on the ISPRS Vaihingen dataset demonstrate the superior performance of C3Net, with 15x fewer FLOPs than the state-of-the-art baseline network while providing comparable overall accuracy.
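
The abstract leaves the internals of the plug-and-play recalibration module unspecified. Below is a minimal PyTorch sketch of one plausible reading: channel-wise gating of two modality streams (e.g., optical imagery plus a DSM on Vaihingen) followed by summation. The class name, the SE-style gating, the two-stream signature, and the reduction ratio are all illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalRecalibration(nn.Module):
    """Hypothetical channel-wise recalibration of two modality streams.

    Pools both modalities, predicts per-channel gates from the joint
    descriptor, and rescales each stream before summing -- one plausible
    reading of "recalibrate and aggregate multi-modal features".
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels),
            nn.Sigmoid(),
        )

    def forward(self, x_img: torch.Tensor, x_dsm: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x_img.shape
        # Global descriptors of both modalities, concatenated per sample.
        desc = torch.cat([self.pool(x_img).flatten(1),
                          self.pool(x_dsm).flatten(1)], dim=1)
        gates = self.mlp(desc).view(b, 2, c, 1, 1)
        # Suppress noisy/redundant channels per stream, then aggregate.
        return x_img * gates[:, 0] + x_dsm * gates[:, 1]
```

Being parameter-light and shape-preserving, a module like this can be dropped in after each backbone stage, which is consistent with the "plug-and-play" description.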
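Similarly, "lightweight convolutional groups" is not defined in the abstract. A common way to shrink a dilated context module is to replace each standard 3x3 convolution with a depthwise dilated 3x3 plus a pointwise 1x1; the sketch below assumes that reading, and the dilation rates and module name are hypothetical.

```python
import torch
import torch.nn as nn

class LightweightContextModule(nn.Module):
    """Assumed multi-dilation context extractor built from
    depthwise-separable ("lightweight") convolution groups."""
    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                # Depthwise dilated 3x3: one filter per channel.
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),
                # Pointwise 1x1 mixes information across channels.
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        # Fuse the concatenated branch outputs back to `channels`.
        self.fuse = nn.Conv2d(len(dilations) * channels, channels, 1,
                              bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```

Per branch, this substitution cuts the weight count from roughly 9C^2 (standard 3x3) to 9C + C^2 (depthwise plus pointwise), which illustrates how such a redesign reduces model parameters.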
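Finally, "multi-level knowledge distillation" is commonly implemented as feature matching at several intermediate stages plus logit-level distillation on the output. The loss below sketches that combination under the assumption that student and teacher feature maps have already been brought to matching shapes (e.g., via 1x1 adapter convolutions); the weighting scheme and temperature are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_level_kd_loss(student_feats, teacher_feats,
                        student_logits, teacher_logits,
                        temperature: float = 4.0, alpha: float = 0.5):
    """Hypothetical multi-level distillation loss.

    Combines (i) MSE between intermediate student/teacher feature maps
    at several stages and (ii) KL divergence between softened class
    logits. Assumes feature pairs share shapes (e.g., via adapters).
    """
    # Feature-level term: match each distilled stage to the teacher.
    feat_loss = sum(F.mse_loss(s, t.detach())
                    for s, t in zip(student_feats, teacher_feats))
    # Logit-level term: softened per-pixel class distributions.
    T = temperature
    kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       F.softmax(teacher_logits.detach() / T, dim=1),
                       reduction="batchmean") * T * T
    return alpha * feat_loss + (1 - alpha) * kd_loss
```

In this setup, the large baseline network would serve as the teacher and the compact C3Net as the student, which matches the abstract's goal of recovering accuracy in the smaller model.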
