Journal
INFORMATION PROCESSING & MANAGEMENT
Volume 61, Issue 1
Publisher
ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2023.103575
Keywords
Image-text retrieval; Cross-modal retrieval; Prior knowledge; Triplet loss optimization
Image-text retrieval is important in connecting vision and language. This paper proposes a method that utilizes prior knowledge to enhance feature representations and optimize network training for better retrieval results.
Image-text retrieval plays a considerable role in associating vision and language. Existing mainstream approaches focus on fine-grained alignment while ignoring two factors: the influence of prior knowledge on model performance, and the limitation of using a fixed margin in the triplet loss. In this paper, we propose a Multi-level Knowledge-driven feature representation and Triplet Loss Optimization Network (MKTLON) that exploits prior knowledge to enhance visual and textual feature representations and uses an adaptive triplet-loss margin to optimize network training. Specifically, we first present an Enhanced feature Representation scheme based on the Self-Attention (ERSA) module, which incorporates prior knowledge, randomly initialized from a uniform distribution, into the matrices K and V of the self-attention mechanism. Subsequently, we adopt cascaded ERSA modules to encode images and texts and obtain multi-level visual and textual features with prior knowledge. Furthermore, we develop an adaptive margin optimization strategy that models the relevance scores of positive and negative samples as two independent Gaussian distributions and obtains the optimized margin by minimizing the intersection of these two distributions. Extensive experiments on two benchmarks, Flickr30K (155,000 image-text pairs) and MSCOCO (616,435 image-text pairs), show that the proposed MKTLON achieves 5.7% and 4.3% improvements on rSum, respectively, compared to the state-of-the-art method. The source code will be released at https://github.com/FlyCuteBird/MKTLON.
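The adaptive-margin idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the margin is derived from the two distributions, so the code only measures the overlap between Gaussians fitted to positive and negative relevance scores, and shows where a margin enters a standard hinge-style triplet loss. The function names `score_overlap` and `triplet_hinge` and all score values are hypothetical.

```python
from statistics import NormalDist


def score_overlap(pos_scores, neg_scores):
    """Overlap (0 to 1) of Gaussians fitted to positive and negative scores.

    Assumption: each score set is summarized by its sample mean and
    standard deviation; the paper's actual estimator is not given here.
    """
    pos = NormalDist.from_samples(pos_scores)
    neg = NormalDist.from_samples(neg_scores)
    # NormalDist.overlap returns the shared area under the two densities.
    return pos.overlap(neg)


def triplet_hinge(pos_score, neg_score, margin):
    """Standard hinge triplet loss on similarity scores for one triplet."""
    return max(0.0, margin - pos_score + neg_score)


# Hypothetical relevance scores for matched and mismatched pairs.
pos = [0.80, 0.90, 0.70, 0.85]
easy_neg = [0.10, 0.20, 0.30, 0.20]
hard_neg = [0.60, 0.70, 0.80, 0.70]

print("overlap with easy negatives:", score_overlap(pos, easy_neg))
print("overlap with hard negatives:", score_overlap(pos, hard_neg))
print("loss for one triplet:", triplet_hinge(0.5, 0.4, margin=0.2))
```

A smaller overlap means the positive and negative scores are better separated. Under the abstract's description, the margin would be chosen to drive this quantity down during training rather than being fixed in advance.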