Article

Achieving Human Parity on Visual Question Answering

ACM TRANSACTIONS ON INFORMATION SYSTEMS (2023)

Journal

ACM TRANSACTIONS ON INFORMATION SYSTEMS

Volume 41, Issue 3

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/3572833

Keywords

Visual Question Answering; multi-modal pre-training; text and image content analysis; cross-modal interaction; visual reasoning

Automated Summary

This paper introduces a novel hierarchical integration of vision and language for the Visual Question Answering (VQA) task, achieving results similar to or even slightly better than human performance. A hierarchical framework is proposed to tackle practical problems in VQA through diverse visual semantics learning, enhanced multi-modal pre-training, and knowledge-guided model integration. Treating different types of visual questions with the corresponding expertise plays an important role in boosting the performance of the VQA architecture.

Abstract

The Visual Question Answering (VQA) task uses both image and language analysis to answer a textual question about an image. It has been a popular research topic over the last decade, with an increasing number of real-world applications. This paper introduces AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding), a novel hierarchical integration of vision and language that achieves results similar to or even slightly better than human performance on VQA. A hierarchical framework is designed to tackle the practical problems of VQA in a cascade manner, including: (1) diverse visual semantics learning for comprehensive image content understanding; (2) enhanced multi-modal pre-training with modality adaptive attention; and (3) knowledge-guided model integration with three specialized expert modules for the complex VQA task. Treating different types of visual questions with the corresponding expertise plays an important role in boosting the performance of our VQA architecture up to the human level. Extensive experiments and analyses are conducted to demonstrate the effectiveness of the approach.
