Article

Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval

Journal

NEUROCOMPUTING
Volume 440, Issue -, Pages 207-219

Publisher

ELSEVIER
DOI: 10.1016/j.neucom.2021.01.114

Keywords

Cross-modal retrieval; Domain adaptation; Cross-dataset training; Adversarial learning

Funding

  1. National Key Research and Development Program of China [2018YFB0804102, 2020YFB2103802]
  2. NSFC [61902347, 61772466, U1936215, U1836202]
  3. Zhejiang Provincial Natural Science Foundation [LQ19F020002, LR19F020003, LQ21F020010]
  4. Public Welfare Technology Research Project of Zhejiang Province [LY21F020010]
  5. Science and Technology Program of Zhejiang Province [2021C01120]
  6. Alibaba-Zhejiang University Joint Institute of Frontier Technologies
  7. Fundamental Research Funds for the Central Universities (Zhejiang University NGICS Platform)

Abstract

This paper introduces the new task of domain adaptive cross-modal retrieval, addressing the scenario where training and testing data come from different domains. The proposed Multi-level Alignment Network (MAN) effectively reduces the semantic, modality, and domain gaps, improving generalization to target data. Experiments show that MAN outperforms multiple baselines and achieves a new state-of-the-art in large-scale text-to-video retrieval.
Cross-modal retrieval is an important but challenging research task in the multimedia community. Most existing work on this task is supervised, typically training models on a large number of aligned image-text/video-text pairs under the assumption that training and testing data are drawn from the same distribution. When this assumption does not hold, traditional cross-modal retrieval methods may suffer a performance drop at evaluation time. In this paper, we introduce a new task named domain adaptive cross-modal retrieval, where training (source) data and testing (target) data come from different domains. The task is challenging, as there are not only the semantic gap and the modality gap between visual and textual items, but also the domain gap between the source and target domains. We therefore propose a Multi-level Alignment Network (MAN) with two mapping modules that project the visual and textual modalities into a common space, and three alignments that learn more discriminative features in that space: a semantic alignment reduces the semantic gap, while a cross-modality alignment and a cross-domain alignment alleviate the modality gap and the domain gap, respectively. Extensive experiments on domain-adaptive image-text retrieval and video-text retrieval demonstrate that MAN consistently outperforms multiple baselines, showing superior generalization to target data. Moreover, MAN establishes a new state-of-the-art for large-scale text-to-video retrieval on the TRECVID 2017 and 2018 Ad-hoc Video Search benchmarks. (c) 2021 Elsevier B.V. All rights reserved.
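The abstract sketches a concrete recipe: two mapping modules project each modality into a common space, and three alignment objectives (semantic, cross-modality, cross-domain) shape that space, with the latter two naturally realized adversarially given the paper's "adversarial learning" keyword. Below is a minimal PyTorch sketch of that recipe; all module names, dimensions, losses, and weights are illustrative assumptions based on the abstract, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Gradient reversal layer, a standard trick for adversarial alignment:
    # the forward pass is the identity; the backward pass flips the gradient.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class MANSketch(nn.Module):
    # Hypothetical stand-in for MAN's two mapping modules and discriminators.
    def __init__(self, vis_dim=2048, txt_dim=768, common_dim=512):
        super().__init__()
        # Two mapping modules projecting each modality into the common space.
        self.vis_proj = nn.Sequential(nn.Linear(vis_dim, common_dim), nn.Tanh())
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.Tanh())
        # Discriminators for adversarial cross-modality / cross-domain alignment.
        self.modality_disc = nn.Sequential(
            nn.Linear(common_dim, 128), nn.ReLU(), nn.Linear(128, 2))
        self.domain_disc = nn.Sequential(
            nn.Linear(common_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, vis, txt):
        # L2-normalized embeddings so cosine similarity is a dot product.
        return (F.normalize(self.vis_proj(vis), dim=-1),
                F.normalize(self.txt_proj(txt), dim=-1))

def semantic_alignment_loss(v, t, margin=0.2):
    # Semantic alignment on paired source data: a triplet ranking loss with
    # hardest in-batch negatives, standard in cross-modal retrieval.
    sim = v @ t.t()                            # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)              # similarities of true pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_t.max(1)[0].mean() + cost_v.max(0)[0].mean()

def adversarial_alignment_loss(disc, feats, labels, lam=1.0):
    # Cross-modality or cross-domain alignment: the discriminator learns to
    # tell the two groups apart, while the reversed gradient pushes the
    # mapping modules to make them indistinguishable.
    logits = disc(GradReverse.apply(feats, lam))
    return F.cross_entropy(logits, labels)

Under these assumptions, one training step would label visual vs. textual embeddings (0/1) for the modality discriminator, label source vs. target embeddings for the domain discriminator, and minimize a weighted sum such as L = L_sem + a*L_mod + b*L_dom; the paper's actual loss formulation and weighting are not given in this record.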
