Journal
INFORMATION PROCESSING & MANAGEMENT
Volume 61, Issue 1
Publisher
ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2023.103575
Keywords
Image-text retrieval; Cross-modal retrieval; Prior knowledge; Triplet loss optimization
Image-text retrieval is important for connecting vision and language. This paper proposes a method that uses prior knowledge to enhance feature representations and optimizes network training for better retrieval results.
Image-text retrieval plays a crucial role in connecting vision and language. Existing mainstream approaches focus on fine-grained alignment while ignoring the influence of prior knowledge on model performance and the limitation of using a fixed margin in the triplet loss. In this paper, we propose a Multi-level Knowledge-driven feature representation and Triplet Loss Optimization Network (MKTLON) that exploits prior knowledge to enhance visual and textual feature representations and utilizes an adaptive margin in the triplet loss to optimize network training. Specifically, we first present an Enhanced feature Representation scheme based on the Self-Attention (ERSA) module, which incorporates prior knowledge, randomly initialized from a uniform distribution, into the matrices K and V of the self-attention mechanism. Subsequently, we adopt cascaded ERSA modules to encode images and texts, obtaining multi-level visual and textual features enriched with prior knowledge. Furthermore, we develop an adaptive margin optimization strategy that models the relevance scores of positive and negative samples as two independent Gaussian distributions and obtains the optimized margin by minimizing the intersection of these two distributions. Extensive experiments on two benchmarks, Flickr30K (155,000 image-text pairs) and MSCOCO (616,435 image-text pairs), show that the proposed MKTLON achieves improvements of 5.7% and 4.3% in rSum, respectively, over the state-of-the-art method. The source code will be released at https://github.com/FlyCuteBird/MKTLON.
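The abstract gives enough detail to sketch the ERSA idea: learnable prior-knowledge embeddings, randomly initialized from a uniform distribution, are concatenated to the key and value matrices of a standard self-attention layer, so every query also attends to the shared prior. The PyTorch sketch below is a reconstruction from the abstract alone, not the authors' released code; the class name ERSA comes from the paper, but the projection layout, the number of prior slots (`n_prior`), and the initialization range are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ERSA(nn.Module):
    """Sketch of the Enhanced feature Representation scheme based on
    Self-Attention: learnable prior-knowledge slots, initialized from a
    uniform distribution, are appended to the keys and values so that
    every query can also attend to the shared prior knowledge.
    (Illustrative reconstruction from the abstract, not the paper's code.)"""

    def __init__(self, dim: int, n_prior: int = 16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Prior knowledge randomly initialized by a uniform distribution,
        # as described in the abstract; the range (-0.1, 0.1) is assumed.
        self.prior_k = nn.Parameter(torch.empty(n_prior, dim).uniform_(-0.1, 0.1))
        self.prior_v = nn.Parameter(torch.empty(n_prior, dim).uniform_(-0.1, 0.1))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) region features (images) or word features (texts)
        b = x.size(0)
        q = self.q(x)                                                  # (b, n, d)
        k = torch.cat([self.k(x), self.prior_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v(x), self.prior_v.expand(b, -1, -1)], dim=1)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)   # (b, n, n+p)
        return attn @ v                                                # (b, n, d)
```

Cascading several such modules, e.g. `nn.Sequential(*[ERSA(dim) for _ in range(3)])`, and keeping each module's output would yield the multi-level visual and textual features the abstract describes.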
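The adaptive margin can likewise be sketched from the description: fit one Gaussian to the positive-pair relevance scores and one to the negative-pair scores, locate the point where the two densities intersect, and derive the margin from it. Equating the two Gaussian pdfs and taking logarithms gives a quadratic in the score, which the helper below solves; how the crossing point is mapped to the final margin (here, its distance to the negative mean, clamped to a fixed range) is our illustrative choice, not the paper's formula.

```python
import torch

def adaptive_margin(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Sketch of the adaptive-margin idea: model positive and negative
    relevance scores as two independent Gaussians and place the decision
    boundary at the crossing point of their densities.
    (Illustrative reconstruction; the mapping from crossing point to
    margin is an assumption, not the paper's formula.)"""
    mu_p, std_p = pos_scores.mean(), pos_scores.std().clamp_min(1e-6)
    mu_n, std_n = neg_scores.mean(), neg_scores.std().clamp_min(1e-6)
    # Equating N(x; mu_p, std_p) and N(x; mu_n, std_n) and taking logs
    # yields a*x^2 + b*x + c = 0 in the score x.
    a = 1.0 / std_n**2 - 1.0 / std_p**2
    b = 2.0 * (mu_p / std_p**2 - mu_n / std_n**2)
    c = mu_n**2 / std_n**2 - mu_p**2 / std_p**2 + 2.0 * torch.log(std_n / std_p)
    if torch.abs(a) < 1e-8:
        # Equal variances: the quadratic degenerates to a single crossing.
        x_star = -c / b
    else:
        disc = (b**2 - 4 * a * c).clamp_min(0.0).sqrt()
        roots = torch.stack([(-b + disc) / (2 * a), (-b - disc) / (2 * a)])
        # Keep the root lying between the two means; fall back to the midpoint.
        lo, hi = torch.minimum(mu_n, mu_p), torch.maximum(mu_n, mu_p)
        inside = (roots >= lo) & (roots <= hi)
        x_star = roots[inside][0] if inside.any() else (mu_p + mu_n) / 2
    # Margin as the gap between the negative mean and the crossing point,
    # clamped to a sane range (illustrative choice).
    return (x_star - mu_n).clamp(0.05, 0.5)
```

During training, one would recompute this per batch from detached similarity scores, e.g. `margin = adaptive_margin(pos.detach(), neg.detach())`, and plug it into the usual hinge-based triplet loss in place of the fixed margin.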