Article

Vision-Language Transformer for Interpretable Pathology Visual Question Answering

Journal

IEEE Journal of Biomedical and Health Informatics

Publisher

IEEE (Institute of Electrical and Electronics Engineers), Inc.
DOI: 10.1109/JBHI.2022.3163751

Keywords

Pathology images; interpretability; visual question answering; vision-language


Pathology visual question answering (PathVQA) aims to answer medical questions about pathology images. Existing methods have limitations in capturing the high- and low-level interactions between vision and language features that VQA requires, and they lack interpretability in justifying the retrieved answers. To address these limitations, a vision-language transformer called TraP-VQA is introduced, which embeds vision and language features for interpretable PathVQA. Experiments demonstrate that TraP-VQA outperforms state-of-the-art methods, validate its robustness on another medical VQA dataset, and confirm the capability of the integrated vision-language model. Visualization results explain the reasoning behind the retrieved PathVQA answers.

Pathology visual question answering (PathVQA) attempts to answer a medical question posed about pathology images. Despite its great potential in healthcare, it is not widely adopted because it requires interactions between the image (vision) and the question (language) to generate an answer. Existing methods treated vision and language features independently and were therefore unable to capture the high- and low-level interactions required for VQA. Further, these methods offered no means of interpreting the retrieved answers, which remain obscure to humans; the models' interpretability in justifying the retrieved answers has remained largely unexplored. Motivated by these limitations, we introduce a vision-language transformer that embeds vision (image) and language (question) features for interpretable PathVQA. We present an interpretable transformer-based PathVQA model (TraP-VQA), in which the transformer's encoder layers are embedded with vision and language features extracted using a pre-trained CNN and a domain-specific language model (LM), respectively. A decoder layer then upsamples the encoded features for the final PathVQA prediction. Our experiments showed that TraP-VQA outperformed state-of-the-art comparative methods on the public PathVQA dataset. Further experiments validated the robustness of our model on another medical VQA dataset, and an ablation study demonstrated the capability of our integrated transformer-based vision-language model for PathVQA. Finally, we present visualization results for both text and images, which explain the reasoning behind a retrieved answer in PathVQA.
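The abstract outlines the TraP-VQA pipeline: CNN-derived visual features and LM-derived question features are fed into transformer encoder layers, and a decoder then produces the final answer prediction. The sketch below illustrates one way such a fused encoder-decoder could be wired up in PyTorch; it is not the authors' implementation. The class name TraPVQASketch, the ResNet-50 stand-in for the pre-trained CNN, the plain embedding layer standing in for the domain-specific LM, and all layer sizes and the answer-vocabulary head are assumptions for illustration only.

```python
# Minimal sketch of a TraP-VQA-style vision-language transformer (PyTorch).
# Backbone, dimensions, and the answer-classification head are illustrative
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class TraPVQASketch(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=500, d_model=512,
                 n_heads=8, n_enc_layers=4, n_dec_layers=2):
        super().__init__()
        # Vision branch: CNN with pooling/classifier removed, so the
        # 7x7 feature map of a 224x224 input becomes 49 visual tokens.
        cnn = resnet50(weights=None)  # load pre-trained weights in practice
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])
        self.vis_proj = nn.Linear(2048, d_model)

        # Language branch: stand-in embedding layer; the paper instead uses a
        # domain-specific pre-trained language model for question features.
        self.txt_embed = nn.Embedding(vocab_size, d_model)

        # Transformer encoder fuses visual and textual tokens jointly.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)

        # Decoder attends from a learned query over the fused features to
        # produce the representation used for the final answer prediction.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_dec_layers)
        self.answer_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, image, question_ids):
        # image: (B, 3, 224, 224); question_ids: (B, L) token indices
        feat = self.cnn(image)                        # (B, 2048, 7, 7)
        vis_tokens = feat.flatten(2).transpose(1, 2)  # (B, 49, 2048)
        vis_tokens = self.vis_proj(vis_tokens)        # (B, 49, d_model)
        txt_tokens = self.txt_embed(question_ids)     # (B, L, d_model)

        fused = self.encoder(torch.cat([vis_tokens, txt_tokens], dim=1))
        query = self.answer_query.expand(image.size(0), -1, -1)
        decoded = self.decoder(query, fused)          # (B, 1, d_model)
        return self.classifier(decoded.squeeze(1))    # answer logits


if __name__ == "__main__":
    model = TraPVQASketch()
    img = torch.randn(2, 3, 224, 224)
    q = torch.randint(0, 10000, (2, 12))
    print(model(img, q).shape)  # torch.Size([2, 500])
```

Treating the answer as a classification over a fixed answer vocabulary is a common VQA design choice assumed here; the interpretability analysis described in the abstract (visualizing which image regions and question tokens drive an answer) would be built on top of the attention weights of such a fused encoder-decoder.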
