Journal
INFORMATION PROCESSING & MANAGEMENT
Volume 61, Issue 1
Publisher
ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2023.103575
Keywords
Image-text retrieval; Cross-modal retrieval; Prior knowledge; Triplet loss optimization
Image-text retrieval is important for connecting vision and language. This paper proposes a method that uses prior knowledge to enhance feature representations and optimizes network training for better retrieval results.
Image-text retrieval plays a crucial role in connecting vision and language. Existing mainstream approaches focus on fine-grained alignment while ignoring the influence of prior knowledge on model performance and the limitation of using a fixed margin in the triplet loss. In this paper, we propose a Multi-level Knowledge-driven feature representation and Triplet Loss Optimization Network (MKTLON) that exploits prior knowledge to enhance visual and textual feature representations and utilizes an adaptive margin in the triplet loss to optimize network training. Specifically, we first present an Enhanced feature Representation scheme based on the Self-Attention (ERSA) module, which incorporates prior knowledge, randomly initialized from a uniform distribution, into the matrices K and V of the self-attention mechanism. Subsequently, we adopt cascaded ERSA modules to encode images and texts, obtaining multi-level visual and textual features enriched with prior knowledge. Furthermore, we develop an adaptive margin optimization strategy that models the relevance scores of positive and negative samples as two independent Gaussian distributions and obtains the optimized margin by minimizing the intersection of these two distributions. Extensive experiments on two benchmarks, Flickr30K (155,000 image-text pairs) and MSCOCO (616,435 image-text pairs), show that the proposed MKTLON achieves improvements of 5.7% and 4.3% in rSum, respectively, over the state-of-the-art method. The source code will be released at https://github.com/FlyCuteBird/MKTLON.
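The abstract gives enough detail to sketch the ERSA idea: learnable prior-knowledge embeddings, randomly initialized from a uniform distribution, are concatenated to the key and value matrices of a standard self-attention layer, so every query also attends to the shared prior. The PyTorch sketch below is a reconstruction from the abstract alone, not the authors' released code; the class name ERSA comes from the paper, but the projection layout, the number of prior slots (`n_prior`), and the initialization range are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ERSA(nn.Module):
    """Sketch of the Enhanced feature Representation scheme based on
    Self-Attention: learnable prior-knowledge slots, initialized from a
    uniform distribution, are appended to the keys and values so that
    every query can also attend to the shared prior knowledge.
    (Illustrative reconstruction from the abstract, not the paper's code.)"""

    def __init__(self, dim: int, n_prior: int = 16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Prior knowledge randomly initialized by a uniform distribution,
        # as described in the abstract; the range (-0.1, 0.1) is assumed.
        self.prior_k = nn.Parameter(torch.empty(n_prior, dim).uniform_(-0.1, 0.1))
        self.prior_v = nn.Parameter(torch.empty(n_prior, dim).uniform_(-0.1, 0.1))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) region features (images) or word features (texts)
        b = x.size(0)
        q = self.q(x)                                                  # (b, n, d)
        k = torch.cat([self.k(x), self.prior_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v(x), self.prior_v.expand(b, -1, -1)], dim=1)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)   # (b, n, n+p)
        return attn @ v                                                # (b, n, d)
```

Cascading several such modules, e.g. `nn.Sequential(*[ERSA(dim) for _ in range(3)])`, and keeping each module's output would yield the multi-level visual and textual features the abstract describes.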
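The adaptive margin can likewise be sketched from the description: fit one Gaussian to the positive-pair relevance scores and one to the negative-pair scores, locate the point where the two densities intersect, and derive the margin from it. Equating the two Gaussian pdfs and taking logarithms gives a quadratic in the score, which the helper below solves; how the crossing point is mapped to the final margin (here, its distance to the negative mean, clamped to a fixed range) is our illustrative choice, not the paper's formula.

```python
import torch

def adaptive_margin(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Sketch of the adaptive-margin idea: model positive and negative
    relevance scores as two independent Gaussians and place the decision
    boundary at the crossing point of their densities.
    (Illustrative reconstruction; the mapping from crossing point to
    margin is an assumption, not the paper's formula.)"""
    mu_p, std_p = pos_scores.mean(), pos_scores.std().clamp_min(1e-6)
    mu_n, std_n = neg_scores.mean(), neg_scores.std().clamp_min(1e-6)
    # Equating N(x; mu_p, std_p) and N(x; mu_n, std_n) and taking logs
    # yields a*x^2 + b*x + c = 0 in the score x.
    a = 1.0 / std_n**2 - 1.0 / std_p**2
    b = 2.0 * (mu_p / std_p**2 - mu_n / std_n**2)
    c = mu_n**2 / std_n**2 - mu_p**2 / std_p**2 + 2.0 * torch.log(std_n / std_p)
    if torch.abs(a) < 1e-8:
        # Equal variances: the quadratic degenerates to a single crossing.
        x_star = -c / b
    else:
        disc = (b**2 - 4 * a * c).clamp_min(0.0).sqrt()
        roots = torch.stack([(-b + disc) / (2 * a), (-b - disc) / (2 * a)])
        # Keep the root lying between the two means; fall back to the midpoint.
        lo, hi = torch.minimum(mu_n, mu_p), torch.maximum(mu_n, mu_p)
        inside = (roots >= lo) & (roots <= hi)
        x_star = roots[inside][0] if inside.any() else (mu_p + mu_n) / 2
    # Margin as the gap between the negative mean and the crossing point,
    # clamped to a sane range (illustrative choice).
    return (x_star - mu_n).clamp(0.05, 0.5)
```

During training, one would recompute this per batch from detached similarity scores, e.g. `margin = adaptive_margin(pos.detach(), neg.detach())`, and plug it into the usual hinge-based triplet loss in place of the fixed margin.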