Article

Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Related references

Note: Only some of the references are listed.
Article Computer Science, Information Systems

RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Zan-Xia Jin et al.

Summary: This paper proposes RUArt, a novel text-centered method for text-based visual question answering. It reads the image, understands the question, the OCR'd text, and the objects, and mines the relationships among them. Experimental results show that RUArt effectively explores contextual information and stable relationships between text and objects.

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Article Computer Science, Information Systems

Multilevel attention and relation network based image captioning model

Himanshu Sharma et al.

Summary: This paper presents a method for image captioning using a Local Relation Network (LRN) to understand the semantic concepts of objects and their relationships in an image. The proposed model achieves superior performance on three benchmark datasets, demonstrating its effectiveness in generating natural language descriptions.

MULTIMEDIA TOOLS AND APPLICATIONS (2023)

Article Computer Science, Information Systems

Vision-Language Transformer for Interpretable Pathology Visual Question Answering

Usman Naseem et al.

Summary: Pathology visual question answering (PathVQA) aims to answer medical questions using pathology images. Existing methods are limited in capturing the high- and low-level interactions between vision and language features required for VQA, and they lack interpretability in justifying the retrieved answers. To address these limitations, a vision-language transformer called TraP-VQA is introduced, which embeds vision and language features for interpretable PathVQA. Experiments demonstrate that TraP-VQA outperforms state-of-the-art methods on medical VQA datasets and validate the robustness of the integrated vision-language model; visualization results explain the reasoning behind the retrieved answers.

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS (2023)
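The vision-language interactions described above, and the two-way co-attention mechanism named in the title of this article, can be made concrete with a small example. Below is a minimal sketch of one two-way co-attention step, assuming PyTorch; the dot-product affinity, feature sizes, and variable names are illustrative assumptions, not the formulation of any cited paper.

import torch
import torch.nn.functional as F

def co_attention(V, Q):
    # V: (n_regions, d) image-region features; Q: (n_words, d) question features
    A = V @ Q.T                          # affinity between every region and word
    Q_ctx = F.softmax(A, dim=1) @ Q      # each region attends over the words
    V_ctx = F.softmax(A.T, dim=1) @ V    # each word attends over the regions
    return V_ctx, Q_ctx

V = torch.randn(36, 512)   # e.g. 36 detected region features
Q = torch.randn(14, 512)   # e.g. 14 word embeddings
V_ctx, Q_ctx = co_attention(V, Q)

The two attended outputs are what a fusion layer would then combine to predict an answer.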

Article Computer Science, Information Systems

Improving visual question answering by combining scene-text information

Himanshu Sharma et al.

Summary: This paper presents a model that combines multiple inputs to address visual question answering problems related to text in natural scenes. The experimental results demonstrate the effectiveness of the proposed model on different datasets, especially with significant improvements compared to previous models on the TextVQA and ST-VQA datasets.

MULTIMEDIA TOOLS AND APPLICATIONS (2022)

Article Computer Science, Artificial Intelligence

An Improved Attention and Hybrid Optimization Technique for Visual Question Answering

Himanshu Sharma et al.

Summary: This paper proposes a new VQA model that uses effective image features and a graph neural network to answer questions about foreground objects and background regions, and to generate image captions based on visual relationships. The model's performance is improved by combining two attention modules with a hybrid optimization algorithm.

NEURAL PROCESSING LETTERS (2022)

Article Computer Science, Artificial Intelligence

A multimodal attention fusion network with a dynamic vocabulary for TextVQA

Jiajia Wu et al.

Summary: Visual question answering (VQA) is a significant problem in computer vision, and text-based VQA tasks have gained increasing attention. This study introduces an attention-based encoder-decoder network that integrates multimodal visual, linguistic, and location features, improving model performance and efficiency through an attention mechanism and an attention-map loss.

PATTERN RECOGNITION (2022)
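Dynamic-vocabulary decoding, as named in the title above, is commonly realized by scoring a fixed answer vocabulary alongside pointer scores over the OCR tokens of the current image, so the decoder can copy scene text it has never seen before. Below is a minimal sketch of such an output head, assuming PyTorch; the class name, sizes, and bilinear pointer scoring are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class DynamicVocabHead(nn.Module):
    def __init__(self, d, fixed_vocab=5000):
        super().__init__()
        self.fixed = nn.Linear(d, fixed_vocab)  # scores for common answer words
        self.ptr = nn.Linear(d, d)              # bilinear scoring of OCR tokens

    def forward(self, h, ocr):
        # h: (batch, d) decoder state; ocr: (batch, n_tokens, d) OCR features
        fixed_logits = self.fixed(h)                           # (batch, fixed_vocab)
        ptr_logits = torch.einsum('bd,bnd->bn', self.ptr(h), ocr)
        # one softmax over the union of fixed words and copyable OCR tokens
        return torch.cat([fixed_logits, ptr_logits], dim=-1)

head = DynamicVocabHead(d=768)
logits = head(torch.randn(4, 768), torch.randn(4, 12, 768))   # (4, 5012)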

Article Computer Science, Artificial Intelligence

A framework for visual question answering with the integration of scene-text using PHOCs and fisher vectors

Himanshu Sharma et al.

Summary: This study proposes a novel VQA model based on text cues in images to enhance accuracy. It utilizes PHOC (Pyramidal Histogram Of Characters) and Fisher vector representations together with a transformer model and dynamic pointer networks for answer decoding, and shows its effectiveness over existing models on popular datasets.

EXPERT SYSTEMS WITH APPLICATIONS (2022)
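The PHOC descriptor mentioned above is simple enough to sketch directly. The following is a minimal Python implementation of a Pyramidal Histogram Of Characters in the spirit of Almazan et al. (cited at the end of this list); the pyramid levels, alphabet, and half-overlap occupancy rule are illustrative assumptions.

import string

ALPHABET = string.ascii_lowercase + string.digits

def phoc(word, levels=(2, 3, 4, 5)):
    word = word.lower()
    n = len(word)
    vec = []
    for L in levels:
        for r in range(L):                       # r-th region at pyramid level L
            lo, hi = r / L, (r + 1) / L
            bits = [0] * len(ALPHABET)
            for i, ch in enumerate(word):
                # the i-th character occupies the interval [i/n, (i+1)/n)
                c_lo, c_hi = i / n, (i + 1) / n
                overlap = min(hi, c_hi) - max(lo, c_lo)
                # assign the character to regions covering at least half of it
                if ch in ALPHABET and overlap / (c_hi - c_lo) >= 0.5:
                    bits[ALPHABET.index(ch)] = 1
            vec.extend(bits)
    return vec

print(len(phoc("text")), sum(phoc("text")))   # descriptor length, active bits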

Article Computer Science, Information Systems

Image captioning improved visual question answering

Himanshu Sharma et al.

Summary: A novel VQA model based on image captioning is proposed in this paper, which integrates knowledge learned from the image captioning task and transfers it to the VQA task, resulting in improved answer generation accuracy on various VQA datasets.

MULTIMEDIA TOOLS AND APPLICATIONS (2022)

Article Computer Science, Artificial Intelligence

Structured Multimodal Attentions for TextVQA

Chenyu Gao et al.

Summary: This paper proposes an end-to-end structured multimodal attention neural network for text-based visual question answering. The model encodes the object-object, object-text, and text-text relationships in the image using a structural graph representation and reasons over them with a multimodal graph attention network. The proposed model outperforms state-of-the-art models on the TextVQA and ST-VQA datasets, and it won first place in the TextVQA Challenge 2020.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2022)
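The multimodal graph attention used for reasoning here follows the general graph-attention pattern. Below is a minimal single-head sketch over a mixed graph of object and OCR-token nodes, assuming PyTorch; the feature sizes, fully connected adjacency, and single head are illustrative simplifications of the paper's structural graph.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)      # shared node projection
        self.a = nn.Linear(2 * d, 1, bias=False)  # pairwise edge scoring

    def forward(self, X, adj):
        # X: (n, d) node features; adj: (n, n) 0/1 adjacency mask
        H = self.W(X)
        n = H.size(0)
        pairs = torch.cat([H.unsqueeze(1).expand(n, n, -1),
                           H.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs)).squeeze(-1)   # raw attention scores
        e = e.masked_fill(adj == 0, float('-inf'))    # keep only graph edges
        alpha = F.softmax(e, dim=1)                   # normalize over neighbours
        return alpha @ H                              # aggregate neighbour features

# one graph mixing 36 object nodes and 10 OCR-token nodes
X = torch.randn(46, 256)
adj = torch.ones(46, 46)
out = GraphAttention(256)(X, adj)   # (46, 256)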

Article Computer Science, Artificial Intelligence

CE-Text: A context-aware and embedded text detector in natural scene images

Yirui Wu et al.

Summary: Researchers have made significant progress in text detection with deep learning architectures in recent years, but applying these architectures directly in embedded systems often leads to low accuracy. To address this issue, the authors propose CE-Text, a lightweight, context-aware deep convolutional neural network that runs in embedded systems and achieves accurate text detection.

PATTERN RECOGNITION LETTERS (2022)

Article Engineering, Electrical & Electronic

Graph neural network-based visual relationship and multilevel attention for image captioning

Himanshu Sharma et al.

Summary: This study proposes an image captioning method based on local relation networks and multilevel attention approach. By considering the relationship between objects and image regions, the method generates significant context-based features corresponding to each region and enhances image representation through attention mechanisms. Experimental results demonstrate the superiority of the proposed method in caption generation over existing methods.

JOURNAL OF ELECTRONIC IMAGING (2022)

Proceedings Paper Computer Science, Artificial Intelligence

LaTr: Layout-Aware Transformer for Scene-Text VQA

Ali Furkan Biten et al.

Summary: We propose LaTr, a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), and reveal the importance of the language module, especially when enriched with layout information. Our single-objective pre-training scheme, which requires only text and spatial cues, improves model performance on scanned documents and enhances robustness to OCR errors. By leveraging a vision transformer and performing vocabulary-free decoding, LaTr outperforms existing STVQA methods on multiple datasets.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Grounding Answers for Visual Questions Asked by Visually Impaired People

Chongyan Chen et al.

Summary: This article introduces the VizWiz-VQA-Grounding dataset and compares it with five other datasets. The research reveals that current VQA and VQA-grounding models often fail to identify the correct visual evidence, particularly when the evidence occupies a small fraction of the image, when images are of higher quality, and when questions require text-recognition skills.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) (2022)

Article Computer Science, Information Systems

Multiple attention encoded cascade R-CNN for scene text detection

Yirui Wu et al.

Summary: Inspired by instance segmentation algorithms, the researchers propose a multiple-context-aware, cascaded CNN structure that improves the performance of segmentation-based methods, especially for text detection in complex natural scenes. The method consists of two stages, feature generation and cascade detection, and shows high accuracy and efficiency in comparative experiments.

JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION (2021)

Article Computer Science, Artificial Intelligence

Visual question answering model based on graph neural network and contextual attention

Himanshu Sharma et al.

Summary: Visual Question Answering (VQA) is an emerging research area in computer vision and natural language processing, aiming to predict answers to natural questions related to images. However, current VQA approaches often overlook the relationship and reasoning among regions of interest. The proposed VQA model introduced in this paper considers previously attended visual content, leading to improved accuracy in answer prediction.

IMAGE AND VISION COMPUTING (2021)

Article Computer Science, Artificial Intelligence

DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation

Weifeng Zhang et al.

Summary: Visual Question Answering (VQA) has attracted extensive attention in the artificial intelligence community, with multimodal reasoning and fusion being a central component of recent models. Existing VQA models often lack the ability to reason over and fuse clues from multiple modalities, as well as interpretability. This paper proposes the Deep Multimodal Reasoning and Fusion Network (DMRFNet), built on a Multi-Graph Reasoning and Fusion (MGRF) layer, to tackle these challenges, and enhances interpretability through an explanation generation module.

INFORMATION FUSION (2021)

Article Engineering, Electrical & Electronic

Multi-scale relation reasoning for multi-modal Visual Question Answering

Yirui Wu et al.

Summary: This paper proposes a deep neural network for multi-modal relation reasoning that combines a regional attention scheme with multi-scale properties to accurately answer questions about images.

SIGNAL PROCESSING-IMAGE COMMUNICATION (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling

Xiaopeng Lu et al.

Summary: The paper proposes a novel model named LOGOS to tackle the Text-VQA task from multiple aspects, outperforming previous state-of-the-art methods without using additional OCR annotation data. Ablation studies demonstrate the capability of LOGOS to bridge different modalities and better understand scene text.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021) (2021)

Proceedings Paper Computer Science, Artificial Intelligence

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

Zhengyuan Yang et al.

Summary: The paper proposes a Text-Aware Pre-training (TAP) method for Text-VQA and Text-Caption tasks, which incorporates scene text during pretraining to improve aligned representation learning among text word, visual object, and scene text modalities. Pre-trained on a large-scale OCR-CC dataset, the approach outperforms the state of the art by large margins on multiple tasks.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

Li Xu et al.

Summary: This paper discusses the importance of traffic event cognition and reasoning in videos and introduces SUTD-TrafficQA, a novel dataset for benchmarking cognitive capability in complex traffic scenarios. By proposing six challenging reasoning tasks and introducing the Eclipse method, the study achieves computation-efficient and reliable video reasoning with superior performance.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Article Computer Science, Artificial Intelligence

Re-Attention for Visual Question Answering

Wenya Guo et al.

Summary: In this paper, a re-attention framework is proposed to utilize answer information for describing visual contents in VQA. Experiments show that the proposed model performs favorably against state-of-the-art methods.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2021)

Article Engineering, Electrical & Electronic

Visual question answering model based on visual relationship detection

Yuling Xi et al.

SIGNAL PROCESSING-IMAGE COMMUNICATION (2020)

Article Physics, Applied

Incorporating external knowledge for image captioning using CNN and LSTM

Himanshu Sharma et al.

MODERN PHYSICS LETTERS B (2020)

Article Computer Science, Information Systems

Multi-source Multi-level Attention Networks for Visual Question Answering

Dongfei Yu et al.

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Scene Text Visual Question Answering

Ali Furkan Biten et al.

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Rosetta: Large Scale System for Text Detection and Recognition in Images

Fedor Borisyuk et al.

KDD'18: PROCEEDINGS OF THE 24TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING (2018)

Proceedings Paper Computer Science, Artificial Intelligence

MUTAN: Multimodal Tucker Fusion for Visual Question Answering

Hedi Ben-younes et al.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)
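Tucker fusion admits a compact formulation: project each modality onto a small factor, contract both factors with a learned core tensor, and classify the fused vector. Below is a minimal sketch in the spirit of MUTAN, assuming PyTorch; all dimensions and the tanh activations are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    def __init__(self, dq, dv, tq=160, tv=160, to=320, n_answers=3000):
        super().__init__()
        self.Wq = nn.Linear(dq, tq)            # question factor projection
        self.Wv = nn.Linear(dv, tv)            # image factor projection
        self.core = nn.Parameter(torch.randn(tq, tv, to) * 0.01)
        self.Wo = nn.Linear(to, n_answers)     # answer classifier

    def forward(self, q, v):
        q_, v_ = torch.tanh(self.Wq(q)), torch.tanh(self.Wv(v))
        # contract the core tensor with the two modality factors
        z = torch.einsum('bi,ijk,bj->bk', q_, self.core, v_)
        return self.Wo(z)

fusion = TuckerFusion(dq=2400, dv=2048)
logits = fusion(torch.randn(8, 2400), torch.randn(8, 2048))   # (8, 3000)

The appeal of the decomposition is that the full bilinear interaction, which would need a dq x dv x n_answers tensor, is replaced by three small projections and a compact core.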

Article Computer Science, Artificial Intelligence

Word Spotting and Recognition with Embedded Attributes

Jon Almazan et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2014)