Article

Multimodal graph inference network for scene graph generation

APPLIED INTELLIGENCE (2021)

Journal

APPLIED INTELLIGENCE

Volume 51, Issue 12, Pages 8768-8783

Publisher

SPRINGER

DOI: 10.1007/s10489-021-02304-7

Keywords

Scene graph generation; Visual relationship detection; Image understanding; Semantic analysis

Category

Computer Science, Artificial Intelligence

Funding

National Natural Science Foundation of China [62076117, 61762061, 61763031]
Natural Science Foundation of Jiangxi Province, China [20161ACB20004]
Jiangxi Key Laboratory of Smart City [20192BCD40002]

Summary

A Multimodal Graph Inference Network (MGIN) is proposed in this study to improve the inference capability of triplets, especially for uncommon samples, by incorporating prior statistical knowledge and fusing visual and semantic features. The method achieves higher average recall and mean recall compared with state-of-the-art methods, with significant improvements in predicting relationships with low probability.
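The summary reports results as recall and mean recall. As a rough illustration of how these two metrics differ, the sketch below computes Recall@K and mean Recall@K in the way they are commonly defined for scene graph evaluation, heavily simplified: one image, triplets matched by labels only (real benchmarks also match subject and object boxes), and no graph constraint. The function names and the example triplets are hypothetical; this is not the paper's evaluation code.

```python
from collections import defaultdict

# Simplified triplet: (subject label, predicate, object label). Real benchmarks
# also match subject/object boxes (e.g. IoU >= 0.5); this sketch matches labels only.
Triplet = tuple[str, str, str]


def recall_at_k(gt: list[Triplet], preds: list[Triplet], k: int) -> float:
    """Fraction of ground-truth triplets that appear among the top-k ranked predictions."""
    top_k = set(preds[:k])
    hits = sum(1 for t in gt if t in top_k)
    return hits / len(gt) if gt else 0.0


def mean_recall_at_k(gt: list[Triplet], preds: list[Triplet], k: int) -> float:
    """Recall computed separately for each predicate class, then averaged over classes."""
    top_k = set(preds[:k])
    total: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for t in gt:
        predicate = t[1]
        total[predicate] += 1
        if t in top_k:
            hits[predicate] += 1
    per_class = [hits[p] / total[p] for p in total]
    return sum(per_class) / len(per_class) if per_class else 0.0


# Hypothetical single-image example: "on" is frequent, "wearing"/"holding" are rare.
gt = [("man", "on", "horse"), ("man", "wearing", "hat"),
      ("horse", "on", "grass"), ("man", "holding", "rope")]
preds = [("man", "on", "horse"), ("horse", "on", "grass"),
         ("man", "near", "horse"), ("man", "wearing", "hat")]

print(recall_at_k(gt, preds, 3))       # 0.5
print(mean_recall_at_k(gt, preds, 3))  # 0.333...
```

At K = 3 the model recovers half of the triplets, but only the frequent predicate "on" is ever hit, so mean recall falls to one third. This is why mean recall is the more sensitive measure for uncommon relationships. In full benchmark implementations, per-predicate recall is typically averaged over images before averaging over predicates.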

Abstract

A scene graph can describe images concisely and structurally. However, existing scene graph generation methods have a limited ability to infer certain relationships, because they lack semantic information and depend heavily on the statistical distribution of the training set. To alleviate these problems, this study proposes a Multimodal Graph Inference Network (MGIN), which includes two modules: Multimodal Information Extraction (MIE) and Target with Multimodal Feature Inference (TMFI). MGIN improves the inference capability for triplets, especially for uncommon samples. In the proposed MIE module, prior statistical knowledge of the training set is incorporated into the network in a reprocessing step to relieve overfitting to the training set. Visual and semantic features are extracted in the MIE module and fused into unified multimodal features in the TMFI module. These features help the inference module improve the prediction capability of MGIN, especially for uncommon samples. The proposed method achieves 27.0% average mean recall and 55.9% average recall, improvements of 0.48% and 0.50%, respectively, over state-of-the-art methods. It also increases the average recall of the 20 relationships with the lowest probability by 4.91%.
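The abstract attributes weak inference on uncommon relationships to heavy dependence on training-set statistics, and describes combining prior statistical knowledge with visual and semantic features. The sketch below shows one common, generic way such a prior can enter relationship prediction: a smoothed frequency table P(predicate | subject class, object class) built from training triplets, added as a weighted log-prior to scores from a visual branch and a semantic branch. This is an assumption-laden illustration, not MGIN's MIE or TMFI design: MGIN fuses features into unified multimodal features, whereas this sketch uses simple score-level fusion, and every function name, weight, and number here is hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_frequency_prior(train_triplets, predicates, alpha=1.0):
    """Estimate P(predicate | subject class, object class) from training triplets,
    with add-alpha (Laplace) smoothing so unseen predicates keep non-zero mass."""
    counts = defaultdict(Counter)
    for subj, pred, obj in train_triplets:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, c in counts.items():
        denom = sum(c.values()) + alpha * len(predicates)
        prior[pair] = {p: (c[p] + alpha) / denom for p in predicates}
    return prior


def softmax(scores):
    """Numerically stable softmax over a {label: logit} dict."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}


def fuse_and_predict(visual_logits, semantic_logits, prior, pair, predicates, w_prior=1.0):
    """Hypothetical score-level fusion: visual + semantic logits + w_prior * log prior.
    Pairs never seen in training fall back to a uniform prior."""
    uniform = {p: 1.0 / len(predicates) for p in predicates}
    pair_prior = prior.get(pair, uniform)
    fused = {p: visual_logits[p] + semantic_logits[p] + w_prior * math.log(pair_prior[p])
             for p in predicates}
    return softmax(fused)


predicates = ["on", "riding", "near"]
train = [("man", "on", "horse")] * 8 + [("man", "riding", "horse")] * 2
prior = build_frequency_prior(train, predicates)

visual = {"on": 1.0, "riding": 2.0, "near": 0.0}    # hypothetical visual-branch scores
semantic = {"on": 0.5, "riding": 0.5, "near": 0.0}  # hypothetical semantic-branch scores

for w in (0.0, 1.0):
    probs = fuse_and_predict(visual, semantic, prior, ("man", "horse"), predicates, w_prior=w)
    print(w, max(probs, key=probs.get))
# 0.0 riding
# 1.0 on
```

In this toy run the visual and semantic evidence favor "riding", but a full-weight prior flips the prediction to the more frequent "on". This is the kind of bias toward the training distribution that the abstract identifies as harmful for uncommon relationships; here `w_prior` simply controls how strongly the prediction leans on training statistics.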
