Article

Saliency Inside: Learning Attentive CNNs for Content-Based Image Retrieval

Journal

IEEE Transactions on Image Processing
Volume 28, Issue 9, Pages 4580-4593 (2019)

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TIP.2019.2913513

Keywords

Visual saliency; content-based image retrieval; bag-of-words; convolutional neural networks

Funding

  1. National Key Research and Development Program of China [2017YFC1703503]
  2. National Natural Science Foundation of China [61572065, 61532005, 61672072]
  3. Beijing Nova Program [Z181100006218063]
  4. Fundamental Research Funds for the Central Universities [2018JBZ001]

Abstract

In content-based image retrieval (CBIR), one of the most challenging and ambiguous tasks is to correctly understand the human query intention and measure its semantic relevance to the images in the database. Because visual saliency is remarkably effective at predicting human visual attention, which is closely related to query intention, this paper sets out to explicitly uncover the effect of visual saliency on CBIR via qualitative and quantitative experiments. Toward this end, we first generate fixation density maps for images from a widely used CBIR dataset by using an eye-tracking apparatus. These ground-truth saliency maps are then used to measure the influence of visual saliency on the CBIR task by exploring several plausible ways of incorporating such saliency cues into the retrieval process. We find that visual saliency is indeed beneficial to CBIR, and that the best scheme for incorporating saliency may differ across image retrieval models. Inspired by these findings, this paper presents a two-stream attentive convolutional neural network (CNN) with saliency embedded inside for CBIR. The proposed network has two streams that simultaneously handle two tasks. The main stream focuses on extracting discriminative visual features that are tightly related to semantic attributes. Meanwhile, the auxiliary stream facilitates the main stream by redirecting feature extraction toward the salient image content that a human would attend to. By fusing these two streams into the Main and Auxiliary CNNs (MAC), image similarity can be computed as a human would: conspicuous content is preserved and irrelevant regions are suppressed. Extensive experiments show that the proposed model achieves impressive retrieval performance on four public datasets.
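The architectural details are not given in this abstract, so the following is a minimal PyTorch sketch of the two-stream fusion idea described above, not the authors' implementation. The class name MACSketch, the ResNet-18 backbone, the shape of the auxiliary stream, and the sigmoid-gated weighted pooling are all illustrative assumptions: a main stream extracts semantic feature maps, an auxiliary stream predicts a saliency map, and the saliency map re-weights the feature maps before pooling into a retrieval descriptor.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class MACSketch(nn.Module):
    """Hypothetical two-stream sketch (not the paper's exact network):
    the main stream extracts semantic features, the auxiliary stream
    predicts a saliency map that gates those features before pooling."""

    def __init__(self, embed_dim=512):
        super().__init__()
        # Main stream: ResNet-18 truncated before global pooling
        # (a stand-in backbone; the paper's backbone is not specified here).
        backbone = models.resnet18(weights=None)
        self.main_stream = nn.Sequential(*list(backbone.children())[:-2])
        # Auxiliary stream: a small conv net producing a 1-channel
        # saliency map, later resized to the feature-map resolution.
        self.aux_stream = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )
        self.fc = nn.Linear(512, embed_dim)

    def forward(self, x):
        feats = self.main_stream(x)                  # (B, 512, H', W')
        sal = self.aux_stream(x)                     # (B, 1, h, w)
        sal = F.interpolate(sal, size=feats.shape[-2:],
                            mode="bilinear", align_corners=False)
        sal = torch.sigmoid(sal)                     # attention weights in (0, 1)
        weighted = feats * sal                       # suppress irrelevant regions
        pooled = weighted.mean(dim=(-2, -1))         # saliency-weighted pooling
        return F.normalize(self.fc(pooled), dim=1)   # L2-normalized descriptor

# Retrieval then reduces to cosine similarity between descriptors.
if __name__ == "__main__":
    model = MACSketch().eval()
    with torch.no_grad():
        q = model(torch.randn(1, 3, 224, 224))       # query image
        db = model(torch.randn(8, 3, 224, 224))      # database images
        scores = db @ q.t()                          # (8, 1) similarity scores
        print(scores.squeeze(1).argsort(descending=True))

Gating the feature maps before pooling is one simple way to realize "preserving conspicuous content and suppressing irrelevant regions"; the paper may fuse the two streams differently.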
