Article

Global-local fusion based on adversarial sample generation for image-text matching

Journal

INFORMATION FUSION
Volume 103, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.inffus.2023.102084

Keywords

Image-text matching; Global-local cognition; Adversarial sample generation; Dynamic fusion; Loss adjustment


In the increasingly popular era of adversarial machine learning (AML), developing more robust and generalized algorithms has become a key research topic. Image-text matching, the foundation of tasks such as video Q&A and text-to-image generation, also faces various attacks in AML. Current image-text matching methods based on the similarity of matching fragments focus only on local matching results and do not establish a comprehensive cognition of the content in text and image, resulting in mismatching of abstract scenes when facing complex attacks. Meanwhile, existing methods are not sensitive enough to identify the internal relationships between objects in different local areas, which also confuses matching. Aiming at the above problems, a global similarity matching module is proposed, which is dynamically fused with local similarity to measure matching results flexibly and improve the understanding of abstract scenes. Furthermore, a global-local cognition fusion training mechanism based on relationship adversarial sample generation is proposed to enhance understanding of the internal relationships between objects in different local areas through adversarial sample generation. A global loss is introduced to train the overall model, and the proportion of global-local loss is adjusted during training to better identify the relationships between objects in different local areas and to avoid the confusion in matching caused by the similarity of matched objects. Experimental results show that the proposed method outperforms the SOTA method by 7.4% (rSum) on the Flickr30K dataset and by 4.0% (rSum, 1K test set) on the MS-COCO dataset. The proposed global-local fusion (GLF) based on adversarial sample generation for image-text matching improves the accuracy and robustness of image-text matching, performs well in the face of security challenges, and promotes the development of visual and linguistic modality fusion.
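The abstract's two central ideas, a dynamic fusion of global and local similarity scores and a training objective whose global-local loss proportion is adjusted, can be sketched as follows. The paper does not publish its interface here, so all function names, the convex-combination fusion form, and the use of a VSE++-style hardest-negative triplet ranking loss are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def fuse_similarity(local_sim, global_sim, alpha=0.5):
    # Illustrative dynamic fusion: a convex combination of the local
    # fragment-level and global scene-level similarity matrices
    # (rows = images, columns = texts); alpha weights the global term.
    return alpha * global_sim + (1.0 - alpha) * local_sim

def triplet_ranking_loss(sim, margin=0.2):
    # Hardest-negative triplet ranking loss commonly used in image-text
    # matching (VSE++-style); sim[i, j] scores image i against text j,
    # with the matched pairs on the diagonal.
    n = sim.shape[0]
    pos = np.diag(sim)
    mask = ~np.eye(n, dtype=bool)
    neg_i2t = np.where(mask, sim, -np.inf).max(axis=1)  # hardest text per image
    neg_t2i = np.where(mask, sim, -np.inf).max(axis=0)  # hardest image per text
    loss = np.maximum(0.0, margin + neg_i2t - pos) \
         + np.maximum(0.0, margin + neg_t2i - pos)
    return loss.mean()

def combined_loss(local_sim, global_sim, alpha=0.5, lam=0.5, margin=0.2):
    # Weighted sum of a global loss (on the fused similarity) and a local
    # loss; lam plays the role of the global-local proportion that the
    # training mechanism adjusts over the course of training.
    l_local = triplet_ranking_loss(local_sim, margin)
    l_global = triplet_ranking_loss(
        fuse_similarity(local_sim, global_sim, alpha), margin)
    return lam * l_global + (1.0 - lam) * l_local
```

In this sketch, increasing `lam` early in training would emphasize the global scene-level cognition, while the local term keeps fragment-level discrimination; the actual adjustment schedule used in the paper is not specified in the abstract.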
