Article

Image-Text Embedding Learning via Visual and Textual Semantic Reasoning

Publisher

IEEE Computer Society
DOI: 10.1109/TPAMI.2022.3148470

Keywords

Image-text retrieval; vision and language; cross-modal representation learning; graph neural networks; deep learning


As a bridge between the language and vision domains, cross-modal retrieval between images and texts has been a hot research topic in recent years. It remains challenging because current image representations usually lack the semantic concepts present in the corresponding sentence captions. To address this issue, we introduce an intuitive and interpretable model that learns a common embedding space for alignments between images and text descriptions. Specifically, our model first incorporates semantic relationship information into visual and textual features by performing region or word relationship reasoning. It then uses a gate and memory mechanism to perform global semantic reasoning on these relationship-enhanced features, selecting discriminative information and gradually growing a representation of the whole scene. Through alignment learning, the learned visual representations capture the key objects and semantic concepts of a scene, as in the corresponding text caption. Experiments on the MS-COCO [1] and Flickr30K [2] datasets validate that our method surpasses many recent state-of-the-art approaches by a clear margin. In addition to being effective, our method is also very efficient at the inference stage. Thanks to effective overall representation learning with visual semantic reasoning, our method achieves very strong performance by relying only on a simple inner product to obtain similarity scores between images and captions. Experiments validate that the proposed method is 30-75 times faster than many recent methods with publicly available code. Instead of following the recent trend of using complex local matching strategies [3], [4], [5], [6] that pursue good performance while sacrificing efficiency, we show that, within our framework, a simple global matching strategy can be effective and efficient, and can achieve even better performance.
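To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of its three stages: relationship reasoning over region or word features, gate-and-memory global semantic reasoning, and inner-product global matching. This is not the authors' released implementation; the attention-style formulation of relationship reasoning, the GRU used as the gate/memory mechanism, and all module names and dimensions (e.g., 1024-dimensional features, 36 regions per image) are illustrative assumptions based only on the abstract's wording.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionRelationshipReasoning(nn.Module):
    """Relationship reasoning sketch: each region (or word) attends to all others
    and aggregates relationship-enhanced context (an attention/graph-style stand-in)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, feats):                      # feats: (batch, n_items, dim)
        q, k, v = self.query(feats), self.key(feats), self.value(feats)
        affinity = torch.softmax(q @ k.transpose(1, 2) / feats.size(-1) ** 0.5, dim=-1)
        return feats + self.out(affinity @ v)      # residual relationship-enhanced features

class GlobalSemanticReasoning(nn.Module):
    """Gate-and-memory global reasoning sketch: a GRU scans the relationship-enhanced
    features, keeping discriminative information and growing a whole-scene embedding."""
    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, feats):                      # feats: (batch, n_items, dim)
        _, hidden = self.gru(feats)
        return F.normalize(hidden.squeeze(0), dim=-1)   # L2-normalized global embedding

def similarity(image_emb, caption_emb):
    """Global matching: inner product between normalized embeddings,
    i.e., the simple similarity the abstract relies on at inference."""
    return image_emb @ caption_emb.t()             # (n_images, n_captions) score matrix

# Hypothetical usage with random stand-ins for detected-region and word features.
regions = torch.randn(8, 36, 1024)                 # e.g., 36 detected regions per image
words = torch.randn(8, 20, 1024)                   # e.g., 20 word features per caption
img_emb = GlobalSemanticReasoning(1024)(RegionRelationshipReasoning(1024)(regions))
txt_emb = GlobalSemanticReasoning(1024)(RegionRelationshipReasoning(1024)(words))
scores = similarity(img_emb, txt_emb)              # typically trained with a triplet ranking loss
```

Under this kind of design, each image and each caption is reduced to a single vector, so caption embeddings can be precomputed and the whole score matrix obtained with one matrix multiplication; this is the main reason a global matching strategy can be far cheaper at inference than local cross-attention matching between every region-word pair.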

