Article

AA-trans: Core attention aggregating transformer with information entropy selector for fine-grained visual classification

Journal

PATTERN RECOGNITION
Volume 140

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2023.109547

Keywords

Fine-grained visual classification; Image classification; Vision transformer; Attention aggregator; Information entropy


The task of fine-grained visual classification (FGVC) is to distinguish targets among subordinate categories. Because fine-grained images inherently exhibit small inter-class variances and large intra-class variances, FGVC is considered an extremely difficult task. Most existing approaches adopt CNN-based networks as feature extractors, so the extracted discriminative regions tend to cover most of the object and thus fail to locate the truly important parts. Recently, the vision transformer (ViT) has demonstrated its power on a wide range of image tasks; its attention mechanism captures global contextual information to establish long-range dependencies on the target and thereby extract more powerful features. Nevertheless, the ViT model still focuses more on global coarse-grained information than on local fine-grained information, which may lead to undesirable performance in fine-grained image classification. To this end, we redesign the ViT structure into an attention aggregating transformer (AATrans) that better captures minor differences among images. In detail, we propose a core attention aggregator (CAA), which enables better information sharing between transformer layers. We further propose an innovative information entropy selector (IES) to guide the network in precisely acquiring the discriminative parts of the image. Extensive experiments show that the proposed model achieves new state-of-the-art performance on several mainstream datasets.
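The abstract describes two mechanisms: aggregating attention across transformer layers (CAA) and selecting discriminative patches by information entropy (IES). The following is a minimal illustrative sketch of how such components might be composed, assuming row-stochastic attention maps over image patches; the function names, the rollout-style layer aggregation, and the lowest-entropy selection rule are assumptions for illustration, not the paper's exact CAA/IES formulation:

```python
import numpy as np

def aggregate_attention(layer_attns):
    """Combine per-layer attention maps by matrix product
    (attention-rollout style; an assumed stand-in for CAA).
    Each map is (P, P) with rows summing to 1."""
    agg = layer_attns[0]
    for attn in layer_attns[1:]:
        agg = attn @ agg  # propagate attention through successive layers
    return agg

def entropy_select(attn, k, eps=1e-12):
    """Pick the k patches whose attention distributions have the lowest
    information entropy, i.e. the most sharply focused rows
    (an assumed stand-in for the IES criterion)."""
    h = -np.sum(attn * np.log(attn + eps), axis=1)  # per-patch entropy
    return np.argsort(h)[:k]

# Toy example: patch 0 attends sharply, patch 1 uniformly.
attn = np.array([[0.98, 0.01, 0.01],
                 [1/3,  1/3,  1/3 ],
                 [0.50, 0.25, 0.25]])
selected = entropy_select(aggregate_attention([attn]), k=1)
```

A convenient property of this sketch: the product of row-stochastic matrices is itself row-stochastic, so the aggregated map remains a valid probability distribution per patch and the entropy criterion stays well defined.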

