Article

Aggregating Rich Hierarchical Features for Scene Classification in Remote Sensing Imagery

Publisher

IEEE - Institute of Electrical and Electronics Engineers Inc.
DOI: 10.1109/JSTARS.2017.2705419

Keywords

Convolutional neural networks (CNNs); mixed-resolution representation; remote sensing scene classification; vector of locally aggregated descriptors (VLAD)

Funding

  1. National Natural Science Foundation of China [61403375, 61472119, 61573352, 61375024, 91338202, 91646207]
  2. Priority Academic Program Development of Jiangsu Higher Education Institutions
  3. Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology


Scene classification is one of the most important problems in remote sensing image processing. To obtain a highly discriminative feature representation for an image to be classified, traditional methods usually densely accumulate hand-crafted low-level descriptors (e.g., scale-invariant feature transform) with feature encoding techniques. However, the performance is largely limited by these hand-crafted descriptors, as they cannot describe the rich semantic information contained in diverse remote sensing images. To alleviate this problem, we propose a novel method that extracts discriminative image features from the rich hierarchical information contained in convolutional neural networks (CNNs). Specifically, the low-level and middle-level intermediate convolutional features are each encoded with the vector of locally aggregated descriptors (VLAD) and then reduced by principal component analysis to obtain hierarchical global features; meanwhile, the fully connected features are average pooled and subsequently normalized to form additional global features. The proposed encoded mixed-resolution representation (EMR) is the concatenation of all of the above global features. Because of these encoding strategies (VLAD and average pooling), our method can handle images of different sizes. In addition, to reduce the computational cost of training, we extract EMR directly from VGG-VD and ResNet models pretrained on the ImageNet dataset. We show in this paper that CNNs pretrained on a natural image dataset transfer more readily to a remote sensing dataset when the local structural similarity between the two datasets is higher. Experimental evaluations on the UC-Merced and Brazilian Coffee Scenes datasets demonstrate that our method is superior to the state of the art.
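The abstract outlines the EMR pipeline: VLAD-encode low- and middle-level convolutional feature maps, reduce each encoding with PCA, normalize the pooled fully connected features, and concatenate everything into one vector. Below is a minimal sketch of that pipeline, assuming the intermediate feature maps and fully connected activations have already been extracted from a pretrained CNN (e.g., VGG-VD or ResNet). The function names (vlad_encode, build_emr), the codebook size, and the PCA dimension are illustrative assumptions, not the authors' implementation.

```python
# Sketch of an EMR-style feature pipeline (illustrative, not the paper's code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def vlad_encode(descriptors, kmeans):
    """Aggregate local descriptors (N x D) into a VLAD vector of length K*D."""
    centers = kmeans.cluster_centers_                  # (K, D) codebook
    labels = kmeans.predict(descriptors)               # nearest center per descriptor
    vlad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = descriptors[labels == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)  # residual accumulation
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))        # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)        # L2 normalization

def build_emr(conv_low, conv_mid, fc_feats, k=64, pca_dim=512):
    """conv_low / conv_mid: lists of (H*W, C) local descriptor arrays, one per image;
    fc_feats: (n_images, D) pooled fully connected activations."""
    emr_parts = []
    for level in (conv_low, conv_mid):
        km = KMeans(n_clusters=k, n_init=10).fit(np.vstack(level))
        encoded = np.stack([vlad_encode(x, km) for x in level])
        n_comp = min(pca_dim, encoded.shape[0], encoded.shape[1])
        emr_parts.append(PCA(n_components=n_comp).fit_transform(encoded))
    # L2-normalize the fully connected features before concatenation.
    fc_global = fc_feats / (np.linalg.norm(fc_feats, axis=1, keepdims=True) + 1e-12)
    emr_parts.append(fc_global)
    return np.hstack(emr_parts)                          # final EMR representation
```

Because VLAD aggregates an arbitrary number of local descriptors into a fixed-length vector, this construction is independent of the input image size, which is the property the abstract highlights.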
