Article

Integrating Multi-Label Contrastive Learning With Dual Adversarial Graph Neural Networks for Cross-Modal Retrieval

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPAMI.2022.3188547

Keywords

Semantics; Correlation; Data models; Task analysis; Graph neural networks; Generative adversarial networks; Training; Contrastive learning; cross-modal retrieval; deep learning; graph neural networks

Abstract

With the growing amount of multimodal data, cross-modal retrieval has attracted more and more attention and become a hot research topic. To date, most of the existing techniques mainly convert multimodal data into a common representation space where similarities in semantics between samples can be easily measured across multiple modalities. However, these approaches may suffer from the following limitations: 1) They overcome the modality gap by introducing loss in the common representation space, which may not be sufficient to eliminate the heterogeneity of various modalities; 2) They treat labels as independent entities and ignore label relationships, which is not conducive to establishing semantic connections across multimodal data; 3) They ignore the non-binary values of label similarity in multi-label scenarios, which may lead to inefficient alignment of representation similarity with label similarity. To tackle these problems, in this article, we propose two models to learn discriminative and modality-invariant representations for cross-modal retrieval. First, the dual generative adversarial networks are built to project multimodal data into a common representation space. Second, to model label relation dependencies and develop inter-dependent classifiers, we employ multi-hop graph neural networks (consisting of Probabilistic GNN and Iterative GNN), where the layer aggregation mechanism is suggested for using propagation information of various hops. Third, we propose a novel soft multi-label contrastive loss for cross-modal retrieval, with the soft positive sampling probability, which can align the representation similarity and the label similarity. Additionally, to adapt to incomplete-modal learning, which can have wider applications, we propose a modal reconstruction mechanism to generate missing features. Extensive experiments on three widely used benchmark datasets, i.e., NUS-WIDE, MIRFlickr, and MS-COCO, show the superiority of our proposed method.
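
The abstract only outlines how the proposed soft multi-label contrastive loss aligns representation similarity with label similarity via a soft positive sampling probability; the exact formulation is given in the paper itself. As a rough, hypothetical illustration (not the authors' implementation), the PyTorch sketch below uses row-normalized Jaccard similarity between multi-hot label vectors as the soft positive-sampling distribution and matches it against the softmax over cross-modal cosine similarities. The function name, the Jaccard choice, and the temperature value are assumptions introduced for this example.

```python
import torch
import torch.nn.functional as F

def soft_multilabel_contrastive_loss(img_emb, txt_emb, labels, temperature=0.1):
    """Illustrative soft multi-label contrastive loss (hypothetical sketch,
    not the paper's exact formulation).

    img_emb: (N, d) image representations in the common space
    txt_emb: (N, d) text representations in the common space
    labels:  (N, C) multi-hot label matrix shared by paired samples
    """
    labels = labels.float()

    # Cosine similarity between every image and every text representation.
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature                      # (N, N)

    # Soft "positive sampling" probabilities from label overlap: here the
    # Jaccard similarity of the multi-hot label vectors, row-normalized so
    # each anchor's targets form a distribution over all candidates.
    inter = labels @ labels.t()                                # |L_i ∩ L_j|
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter
    label_sim = inter / union.clamp(min=1)                     # in [0, 1]
    targets = label_sim / label_sim.sum(dim=1, keepdim=True).clamp(min=1e-8)

    # Cross-entropy between the softmax over representation similarities and
    # the label-similarity distribution, in both retrieval directions.
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Under this reading, pairs whose label sets overlap heavily are pulled together in proportion to that overlap rather than being treated as hard binary positives, which is the alignment of representation similarity with non-binary label similarity that the abstract describes.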
