☆ 4.7 Article

Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Journal

IEEE TRANSACTIONS ON MULTIMEDIA

Volume 25, Issue -, Pages 1320-1332

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TMM.2022.3141603

Keywords

Semantics; Optimization; Visualization; Training; Task analysis; Representation learning; Correlation; Image-text matching; attention network; unified adaptive relevance distinguishable learning

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

Image-text matching is a fundamental cross-modal task that bridges the gap between vision and language. We propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism to accurately learn semantic alignment by distinguishing relevant and irrelevant distributions. Experimental results show that UARDA outperforms state-of-the-arts and reduces retrieval time substantially.

Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. The core is to accurately learn semantic alignment to find relevant shared semantics in image and text. Existing methods typically attend to all fragments with word-region similarity greater than empirical threshold zero as relevant shared semantics, e.g., via a ReLU operation that forces the negative to zero and maintains the positive. However, this fixed threshold is totally isolated with feature learning, which cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarity in training, inevitably limiting the semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism, incorporating the relevance threshold into a unified learning framework, to maximally distinguish the relevant and irrelevant distributions to obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions to improve the model to learn more discriminative features. The explicit relevance threshold is well integrated into similarity matching, which kills two birds with one stone as: (1) excluding the disturbances of irrelevant fragment contents to aggregate precisely relevant shared semantics for boosting matching accuracy, and (2) avoiding the calculation of irrelevant fragment queries for reducing retrieval time. Experimental results on benchmarks show that UARDA can substantially and consistently outperform state-of-the-arts, with relative rSum improvements of 2%-4% (16.9%-35.3% for baseline SCAN), and reducing the retrieval time by 50%-73%.

Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Journal

IEEE TRANSACTIONS ON MULTIMEDIA

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Journal

IEEE TRANSACTIONS ON MULTIMEDIA

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper