Article

Multi-level similarity learning for image-text retrieval

Journal

INFORMATION PROCESSING & MANAGEMENT
Volume 58, Issue 1

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2020.102432

Keywords

Cross modal retrieval; Semantic extraction; Graph matching

Funding

  1. China Postdoctoral Science Foundation [2020M680884]
  2. National Natural Science Foundation of China [61902277, 61772359, 61872267]
  3. Tianjin New Generation Artificial Intelligence Major Program [19ZXZNGX00110, 18ZXZNGX00150]
  4. Elite Scholar Program of Tianjin University [2019XRX-0035]
  5. Baidu Pinecone Program


This paper proposes a multi-level representation learning method that improves image-text retrieval by exploiting semantic-level, structural-level, and contextual-level information. Experiments on two commonly used datasets demonstrate the superiority of the method.
The image-text retrieval task has been a popular research topic and attracts growing interest because it bridges the computer vision and natural language processing communities and involves two different modalities. Although many methods have made great progress on image-text retrieval, the task remains challenging because of the difficulty of learning the correspondence between two heterogeneous modalities. In this paper, we propose a multi-level representation learning method for image-text retrieval, which utilizes semantic-level, structural-level, and contextual-level information to improve the quality of visual and textual representations. To utilize semantic-level information, we first extract high-frequency nouns, adjectives, and numerals as semantic labels and adopt a multi-label convolutional neural network framework to encode them. To explore the structural-level information of an image-text pair, we first construct two graphs that encode the visual and textual information within each modality, and then apply graph matching with a triplet loss to reduce the cross-modality discrepancy. To further improve the retrieval results, we utilize contextual-level information from the two modalities to refine the ranked list and enhance the retrieval quality. Extensive experiments on Flickr30k and MSCOCO, two commonly used datasets for image-text retrieval, demonstrate the superiority of our proposed method.
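The abstract states that graph matching is trained with a triplet loss to pull matched image-text pairs together and push mismatched pairs apart. The paper's exact loss is not given here, so the following is only a minimal sketch of the widely used max-of-hinges triplet ranking loss (VSE++-style) applied to pooled graph embeddings; the function name, the margin value, and the assumption of batch-aligned pairs are illustrative, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
        """Hinge-based triplet ranking loss over a batch of matched pairs.

        img_emb, txt_emb: (batch, dim) tensors whose i-th rows form a matched
        image-text pair, e.g. pooled node features from the visual and
        textual graphs described in the abstract.
        """
        img_emb = F.normalize(img_emb, dim=1)
        txt_emb = F.normalize(txt_emb, dim=1)
        scores = img_emb @ txt_emb.t()            # cosine similarity matrix
        pos = scores.diag().view(-1, 1)           # matched-pair scores

        # Hinge against the hardest negative in each row (image-to-text)
        # and each column (text-to-image).
        cost_i2t = (margin + scores - pos).clamp(min=0)
        cost_t2i = (margin + scores - pos.t()).clamp(min=0)

        # Zero out the diagonal so matched pairs are not treated as negatives.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_i2t = cost_i2t.masked_fill(mask, 0)
        cost_t2i = cost_t2i.masked_fill(mask, 0)
        return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()

Under this formulation, minimizing the loss drives each matched pair's similarity to exceed that of its hardest in-batch negative by the margin, which is one common way to realize the "reduce the cross-modality discrepancy" objective the abstract describes.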

