Article

Multi-level knowledge-driven feature representation and triplet loss optimization network for image-text retrieval

INFORMATION PROCESSING & MANAGEMENT (2024)

Journal

INFORMATION PROCESSING & MANAGEMENT

Volume 61, Issue 1

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.ipm.2023.103575

Keywords

Image-text retrieval; Cross-modal retrieval; Prior knowledge; Triplet loss optimization

Automated Summary

Image-text retrieval is important in connecting vision and language. This paper proposes a method that utilizes prior knowledge to enhance feature representations and optimize network training for better retrieval results.

Abstract

Image-text retrieval plays a considerable role in associating vision and language. Existing mainstream approaches focus on fine-grained alignment but ignore both the influence of prior knowledge on model performance and the limitation of a fixed margin in the triplet loss. In this paper, we propose a Multi-level Knowledge-driven feature representation and Triplet Loss Optimization Network (MKTLON) that exploits prior knowledge to enhance visual and textual feature representations and uses an adaptive triplet-loss margin to optimize network training. Specifically, we first present an Enhanced feature Representation scheme based on Self-Attention (ERSA), a module that incorporates prior knowledge, randomly initialized from a uniform distribution, into the matrices K and V of the self-attention mechanism. Subsequently, we adopt cascaded ERSA modules to encode images and texts, obtaining multi-level visual and textual features with prior knowledge. Furthermore, we develop an adaptive margin optimization strategy that models the relevance scores of positive and negative samples as two independent Gaussian distributions and obtains the optimized margin by minimizing the intersection of these two distributions. Extensive experiments on two benchmarks, Flickr30K (155,000 image-text pairs) and MSCOCO (616,435 image-text pairs), show that the proposed MKTLON achieves 5.7% and 4.3% improvements on rSum, respectively, compared to the state-of-the-art method. The source code will be released at https://github.com/FlyCuteBird/MKTLON.
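The two mechanisms named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation, and several things in it are assumptions: all function names are hypothetical, the attention omits the learned Q/K/V projections and multiple heads, the prior slots are simply appended to K and V, the uniform bound of 0.1 is arbitrary, and the overlap of the two Gaussians is measured numerically. The abstract does not say how the margin is derived from that overlap, so the sketch stops at measuring the quantity to be minimized and shows a plain hinge-style triplet loss that takes the margin as an argument.

```python
import math
import random
import statistics


def init_prior(num_slots, dim, rng, bound=0.1):
    """Prior slots drawn from U(-bound, bound); the bound is an assumption."""
    return [[rng.uniform(-bound, bound) for _ in range(dim)]
            for _ in range(num_slots)]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_prior(queries, keys, values, prior_k, prior_v):
    """Scaled dot-product attention with prior slots appended to K and V."""
    ext_k = keys + prior_k    # input keys followed by prior key slots
    ext_v = values + prior_v  # input values followed by prior value slots
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in ext_k]
        w = softmax(logits)
        out.append([sum(wi * v[j] for wi, v in zip(w, ext_v))
                    for j in range(len(ext_v[0]))])
    return out


def fit_gaussian(scores):
    """Mean and sample standard deviation of a list of relevance scores."""
    return statistics.fmean(scores), statistics.stdev(scores)


def gaussian_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gaussian_overlap(mu_p, sd_p, mu_n, sd_n, steps=20000):
    """Area under min(pdf_pos, pdf_neg), integrated with the midpoint rule."""
    lo = min(mu_p - 8 * sd_p, mu_n - 8 * sd_n)
    hi = max(mu_p + 8 * sd_p, mu_n + 8 * sd_n)
    dx = (hi - lo) / steps
    area = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        area += min(gaussian_pdf(x, mu_p, sd_p),
                    gaussian_pdf(x, mu_n, sd_n)) * dx
    return area


def triplet_loss(s_pos, s_neg, margin):
    """Hinge triplet loss on similarity scores: max(0, margin - s_pos + s_neg)."""
    return max(0.0, margin - s_pos + s_neg)
```

In this sketch, a training loop would call `fit_gaussian` on the current positive and negative relevance scores, pass the two fitted distributions to `gaussian_overlap`, and feed whatever margin the chosen update rule produces into `triplet_loss`. Appending the prior slots to K and V leaves the number of output tokens equal to the number of queries, which is what lets such modules be cascaded.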
