4.7 Article

Scene-Text Oriented Referring Expression Comprehension

Related References

Note: Only a subset of the references is listed here. Download the original article for the complete reference information.
Article Computer Science, Information Systems

RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Zan-Xia Jin et al.

Summary: This paper proposes RUArt, a novel text-centered method for text-based visual question answering. RUArt reads the image, understands the question, the OCR'd text, and the objects, and mines the relationships among them. Experimental results show that RUArt effectively exploits contextual information and the stable relationships between text and objects.

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Article Computer Science, Artificial Intelligence

Unambiguous Text Localization, Retrieval, and Recognition for Cluttered Scenes

Xuejian Rong et al.

Summary: This paper proposes a method to accurately localize and recognize specific text instances in cluttered images through natural language descriptions. The method includes a dense text localization network, a context reasoning text retrieval model, and a recurrent text recognition module, which together achieve text instance detection, text bounding box ranking, and text verification or transcription in cluttered scenes.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2022)

Proceedings Paper Computer Science, Artificial Intelligence

TransVG: End-to-End Visual Grounding with Transformers

Jiajun Deng et al.

Summary: In this paper, a neat and effective transformer-based framework, TransVG, is presented for visual grounding task. By leveraging transformers to establish multi-modal correspondence, the complex fusion modules are replaced by simple transformer encoder layers, resulting in higher performance. This approach reframes visual grounding as a direct coordinates regression problem and avoids making predictions out of a set of candidates.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)
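
The coordinate-regression formulation summarized above can be illustrated with a minimal Python sketch: visual and language tokens are fused by a plain transformer encoder and a box is regressed from a learnable [REG] token. All module names, dimensions, and token counts are assumptions for illustration, not the TransVG authors' implementation.

# Minimal sketch of transformer-based visual grounding as direct box regression
# (illustrative only; dimensions and module names are assumptions).
import torch
import torch.nn as nn

class ToyGroundingHead(nn.Module):
    def __init__(self, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [REG] token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)       # replaces hand-crafted fusion modules
        self.box_mlp = nn.Sequential(                              # regress (cx, cy, w, h) in [0, 1]
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, D) flattened image features; text_tokens: (B, Nt, D)
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([reg, visual_tokens, text_tokens], dim=1))
        return self.box_mlp(fused[:, 0])                           # box predicted from the [REG] token

if __name__ == "__main__":
    head = ToyGroundingHead()
    box = head(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
    print(box.shape)  # torch.Size([2, 4])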

Proceedings Paper Computer Science, Artificial Intelligence

Towards Accurate Text-based Image Captioning with Content Diversity Exploration

Guanghui Xu et al.

Summary: Understanding the relationship between scene text and visual content is crucial for machines to interpret complex scenes, yet this faces challenges of comprehensive description and content diversity. Existing methods generate a single global caption, which cannot adequately describe the complex text and visual information in an image. To address this, a novel Anchor-Captioner method is proposed to generate diverse descriptions for images.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Scene Text Retrieval via Joint Text Detection and Similarity Learning

Hao Wang et al.

Summary: This paper proposes a method for scene text retrieval by directly learning cross-modal similarity between query text and text instances from natural images. By jointly optimizing scene text detection and similarity learning with an end-to-end trainable network, the proposed method achieves better performance than state-of-the-art approaches on benchmark datasets.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)
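
The cross-modal similarity idea in the summary above can be illustrated with a small Python sketch that ranks images by the cosine similarity between a query embedding and the embeddings of detected text instances. The joint detection-and-embedding training described in the paper is not reproduced; all names and shapes are illustrative.

# Minimal sketch of ranking images by query/text-instance embedding similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_images(query_emb, images):
    """images: dict image_id -> array of shape (num_text_instances, dim)."""
    scores = {}
    for image_id, inst_embs in images.items():
        # score an image by its best-matching text instance
        scores[image_id] = max(cosine(query_emb, e) for e in inst_embs)
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=64)
    gallery = {f"img_{i}": rng.normal(size=(5, 64)) for i in range(3)}
    print(rank_images(query, gallery))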

Proceedings Paper Computer Science, Artificial Intelligence

Improving OCR-based Image Captioning by Incorporating Geometrical Relationship

Jing Wang et al.

Summary: This paper proposes a method for OCR-based image captioning that enhances image descriptions by utilizing the geometrical relationships between OCR tokens. The LSTM-R architecture is introduced to learn relations between OCR tokens and achieves state-of-the-art performance on TextCaps.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)
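
The geometrical relationships mentioned in the summary above can be approximated by simple pairwise box features. The sketch below computes normalized center offsets and log size ratios between OCR token boxes; it is illustrative only and does not reproduce the LSTM-R architecture.

# Minimal sketch of pairwise geometric features between OCR token boxes.
import numpy as np

def pairwise_geometry(boxes):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; returns (N, N, 4) relation features."""
    boxes = np.asarray(boxes, dtype=float)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    feats = np.zeros((len(boxes), len(boxes), 4))
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            # normalized center offsets and log size ratios between token i and token j
            feats[i, j] = [(cx[j] - cx[i]) / w[i],
                           (cy[j] - cy[i]) / h[i],
                           np.log(w[j] / w[i]),
                           np.log(h[j] / h[i])]
    return feats

if __name__ == "__main__":
    ocr_boxes = [[10, 10, 60, 30], [70, 12, 120, 32], [15, 40, 65, 60]]
    print(pairwise_geometry(ocr_boxes).shape)  # (3, 3, 4)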

Proceedings Paper Computer Science, Artificial Intelligence

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

Zhengyuan Yang et al.

Summary: The paper proposes a Text-Aware Pre-training (TAP) method for Text-VQA and Text-Caption tasks, which incorporates scene text during pretraining to improve aligned representation learning among text word, visual object, and scene text modalities. Pre-trained on a large-scale OCR-CC dataset, the approach outperforms the state of the art by large margins on multiple tasks.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)
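
As a rough sketch of combining the three modalities named in the summary above (text words, visual objects, scene-text tokens) into one input sequence for pre-training, the snippet below adds modality-type embeddings before fusion. Embedding sizes, projections, and type ids are assumptions, not the TAP authors' code.

# Minimal sketch of composing a joint three-modality input sequence.
import torch
import torch.nn as nn

class JointInput(nn.Module):
    TYPE_WORD, TYPE_OBJECT, TYPE_SCENE_TEXT = 0, 1, 2

    def __init__(self, vocab_size=30522, d_model=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.obj_proj = nn.Linear(2048, d_model)    # projects detector object features
        self.ocr_proj = nn.Linear(300, d_model)     # projects OCR-token features
        self.type_emb = nn.Embedding(3, d_model)    # marks which modality a token belongs to

    def forward(self, word_ids, obj_feats, ocr_feats):
        words = self.word_emb(word_ids) + self.type_emb.weight[self.TYPE_WORD]
        objs = self.obj_proj(obj_feats) + self.type_emb.weight[self.TYPE_OBJECT]
        ocr = self.ocr_proj(ocr_feats) + self.type_emb.weight[self.TYPE_SCENE_TEXT]
        # the fused sequence would then feed a transformer trained with
        # masked-language-modeling-style pre-training objectives
        return torch.cat([words, objs, ocr], dim=1)

if __name__ == "__main__":
    m = JointInput()
    seq = m(torch.randint(0, 30522, (2, 12)), torch.randn(2, 36, 2048), torch.randn(2, 10, 300))
    print(seq.shape)  # torch.Size([2, 58, 256])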

Proceedings Paper Computer Science, Artificial Intelligence

Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression

Chen Gao et al.

Summary: The Remote Embodied Referring Expression (REVERIE) task requires an agent to navigate to a remote object based on high-level language instructions, emphasizing goal-oriented exploration over strict instruction-following. This paper proposes a Cross-modality Knowledge Reasoning (CKR) model based on transformer architecture, with a Room-and-Object Aware Attention mechanism and a Knowledge-enabled Entity Relationship Reasoning module to address the unique challenges of REVERIE. Evaluation on the REVERIE benchmark shows that the CKR model significantly improves SPL and REVERIE-success rate by 64.67% and 46.05%, respectively.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding

Binbin Huang et al.

Summary: A 'Look Before You Leap' (LBYL) network is proposed for end-to-end trainable one-stage visual grounding, utilizing a landmark feature convolution module to encode relative spatial relations between objects based on language descriptions. This approach outperforms existing methods on ReferitGame and shows comparable or better results on RefCOCO and RefCOCO+ datasets, demonstrating its effectiveness in visual grounding tasks.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)
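
The idea of encoding context relative to a location, as described in the summary above, can be loosely illustrated by pooling features in each direction around a grid cell. The sketch below is hand-written pooling for illustration only; the actual landmark feature convolution module is learned and differs from it.

# Minimal sketch of gathering directional context around a feature-map cell.
import torch

def directional_context(feat_map, row, col):
    """feat_map: (C, H, W); returns a (4*C,) vector of mean features
    above / below / left of / right of the given cell."""
    C, H, W = feat_map.shape
    parts = [
        feat_map[:, :row, :] if row > 0 else feat_map.new_zeros(C, 1, W),          # above
        feat_map[:, row + 1:, :] if row < H - 1 else feat_map.new_zeros(C, 1, W),  # below
        feat_map[:, :, :col] if col > 0 else feat_map.new_zeros(C, H, 1),          # left
        feat_map[:, :, col + 1:] if col < W - 1 else feat_map.new_zeros(C, H, 1),  # right
    ]
    return torch.cat([p.mean(dim=(1, 2)) for p in parts])

if __name__ == "__main__":
    fm = torch.randn(256, 20, 20)
    print(directional_context(fm, 5, 7).shape)  # torch.Size([1024])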

Article Computer Science, Information Systems

Referring Expression Comprehension: A Survey of Methods and Datasets

Yanyuan Qiao et al.

Summary: Referring expression comprehension aims to localize the target object described by a natural language expression. The task has attracted growing attention in both computer vision and natural language processing; existing methods include CNN-RNN models, modular networks, and graph-based models.

IEEE TRANSACTIONS ON MULTIMEDIA (2021)

Proceedings Paper Computer Science, Artificial Intelligence

StacMR: Scene-Text Aware Cross-Modal Retrieval

Andres Mafla et al.

Summary: This paper introduces a new dataset for cross-modal retrieval involving scene-text instances, proposes approaches leveraging scene text, and conducts experiments to confirm the benefits of utilizing scene text.

2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Label or Message: A Large-Scale Experimental Survey of Texts and Objects Co-Occurrence

Koki Takeshita et al.

Summary: This paper conducts a large-scale survey of the co-occurrence between visual objects and scene texts, focusing on the function of label texts. Analyzing this co-occurrence yields insights into how scene texts help object recognition and vice versa (a toy tally is sketched after this entry).

2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) (2021)
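
The co-occurrence analysis described above amounts to tallying which object classes appear together with which scene-text words. The short Python sketch below shows such a tally on toy annotations and is purely illustrative; the survey's actual analysis is far more detailed.

# Minimal sketch of tallying object / scene-text co-occurrence per image.
from collections import Counter

def cooccurrence_counts(annotations):
    """annotations: iterable of (object_classes, text_words) pairs, one per image."""
    counts = Counter()
    for object_classes, text_words in annotations:
        for obj in set(object_classes):
            for word in set(text_words):
                counts[(obj, word.lower())] += 1
    return counts

if __name__ == "__main__":
    data = [
        (["bus", "person"], ["STOP", "Main St"]),
        (["bus"], ["stop"]),
    ]
    print(cooccurrence_counts(data).most_common(3))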

Article Computer Science, Artificial Intelligence

Relationship-Embedded Representation Learning for Grounding Referring Expressions

Sibei Yang et al.

Summary: Grounding referring expressions in images requires a joint understanding of natural language and image content, extracting the necessary information from both modalities, and aligning cross-modal semantic concepts. This paper proposes a Cross-Modal Relationship Inference Network that uses a relationship extractor and a graph convolutional network to capture and align multi-order relationships between the expression and the image objects and their context, significantly surpassing existing state-of-the-art methods (a toy propagation step is sketched after this entry).

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2021)
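
The graph-based propagation mentioned in the summary above can be illustrated with a single plain graph-convolution step over an object relation graph, as sketched below; the paper's gated, language-guided propagation is not reproduced here, and all shapes are illustrative.

# Minimal sketch of one graph-convolution step over an object relation graph.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, D) object features; adj: (N, N) relation weights
        adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)  # row-normalize the graph
        return torch.relu(self.proj(adj @ node_feats))            # aggregate neighbor features

if __name__ == "__main__":
    layer = SimpleGCNLayer()
    feats = torch.randn(5, 256)
    adj = torch.rand(5, 5)
    print(layer(feats, adj).shape)  # torch.Size([5, 256])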

Article Computer Science, Artificial Intelligence

Mask R-CNN

Kaiming He et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2020)

Article Robotics

INGRESS: Interactive visual grounding of referring expressions

Mohit Shridhar et al.

INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH (2020)

Article Computer Science, Artificial Intelligence

Unambiguous Scene Text Segmentation With Referring Expression Comprehension

Xuejian Rong et al.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2020)

Proceedings Paper Computer Science, Artificial Intelligence

Scene Text Visual Question Answering

Ali Furkan Biten et al.

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) (2019)

Article Computer Science, Information Systems

Bundled Object Context for Referring Expressions

Xiangyang Li et al.

IEEE TRANSACTIONS ON MULTIMEDIA (2018)

Article Computer Science, Information Systems

Words Matter: Scene Text for Image Classification and Retrieval

Sezer Karaoglu et al.

IEEE TRANSACTIONS ON MULTIMEDIA (2017)

Article Computer Science, Artificial Intelligence

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna et al.

INTERNATIONAL JOURNAL OF COMPUTER VISION (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Unambiguous Text Localization and Retrieval for Cluttered Scenes

Xuejian Rong et al.

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Mask R-CNN

Kaiming He et al.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)

Article Computer Science, Artificial Intelligence

Image Geo-Localization Based on Multiple Nearest Neighbor Feature Matching Using Generalized Graphs

Amir Roshan Zamir et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2014)