Article

Cross-modal transformer with language query for referring image segmentation

Journal

NEUROCOMPUTING
Volume 536, Pages 191-205

Publisher

ELSEVIER
DOI: 10.1016/j.neucom.2023.03.011

Keywords

Referring image segmentation; Deep interaction; Cross-modal transformer; Semantics-guided detail enhancement

Abstract

A cross-modal transformer with language queries is proposed for referring image segmentation, which enables deep interaction between vision and language and improves the accuracy of segmentation.
Referring image segmentation (RIS) aims to predict a segmentation mask for a target specified by a natural language expression. However, existing methods fail to realize the deep interaction between vision and language that RIS requires, resulting in inaccurate segmentation. To address this problem, a cross-modal transformer (CMT) with language queries for referring image segmentation is proposed. First, the cross-modal encoder of CMT is designed for intra-modal and inter-modal interaction, capturing context-aware visual features. Secondly, to generate compact visual-aware language queries, a language-query encoder (LQ) embeds key visual cues into the linguistic features. In particular, the combination of the cross-modal encoder and the language-query encoder realizes the mutual guidance of vision and language. Finally, the cross-modal decoder of CMT is constructed to learn multimodal features of the referent from the context-aware visual features and the visual-aware language queries. In addition, a semantics-guided detail enhancement (SDE) module is constructed to fuse the semantic-rich multimodal features with detail-rich low-level visual features, which supplements the spatial details of the predicted segmentation masks. Extensive experiments on four referring image segmentation datasets demonstrate the effectiveness of the proposed method. (c) 2023 Elsevier B.V. All rights reserved.
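The abstract describes a four-stage pipeline: cross-modal encoding, language-query encoding, cross-modal decoding, and semantics-guided detail enhancement. The sketch below shows one way these stages could compose in PyTorch. Every module name, dimension, layer count, and the decoder's query/memory assignment here is an assumption made for illustration, not the authors' actual implementation.

```python
# Minimal sketch of the CMT pipeline from the abstract. All shapes, names,
# and layer counts are illustrative assumptions; the paper may differ.
import torch
import torch.nn as nn

class CMTSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Cross-modal encoder: self-attention over concatenated visual and
        # linguistic tokens provides intra- and inter-modal interaction.
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross_modal_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Language-query encoder: language tokens attend to visual tokens,
        # embedding key visual cues into the linguistic features.
        self.lq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-modal decoder: visual features attend to the visual-aware
        # language queries to build multimodal features of the referent.
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.cross_modal_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # SDE stand-in: fuse semantic-rich multimodal features with
        # detail-rich low-level visual features to predict mask logits.
        self.sde = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, vis_tokens, lang_tokens, low_level):
        # vis_tokens:  (B, Nv, dim) flattened visual features
        # lang_tokens: (B, Nl, dim) linguistic features
        # low_level:   (B, Nv, dim) low-level visual features, same resolution
        x = torch.cat([vis_tokens, lang_tokens], dim=1)
        x = self.cross_modal_encoder(x)
        vis_ctx = x[:, : vis_tokens.size(1)]  # context-aware visual features
        # Visual-aware language queries: language attends to visual context.
        queries, _ = self.lq_attn(lang_tokens, vis_ctx, vis_ctx)
        # Per-pixel multimodal features conditioned on the language queries.
        multimodal = self.cross_modal_decoder(vis_ctx, queries)
        fused = torch.cat([multimodal, low_level], dim=-1)
        return self.sde(fused).squeeze(-1)  # (B, Nv) mask logits

# Toy usage: a 14x14 visual grid (196 tokens) and a 12-token expression.
model = CMTSketch()
logits = model(torch.randn(2, 196, 256),
               torch.randn(2, 12, 256),
               torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 196])
```

Note the mutual guidance described in the abstract: the encoder lets language reshape the visual features, while the LQ step lets vision reshape the language queries before decoding.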
