3.8 Proceedings Paper

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/CVPR46437.2021.01113

Keywords

-

Funding

  1. CRA DREU program
  2. Stanford HAI Institute
  3. Brown Institute

Abstract

The paper introduces AGQA, a new benchmark for evaluating compositional spatio-temporal reasoning that minimizes bias by balancing answer distributions and types of question structures. Evaluations of existing models show that the best model achieves only 47.74% accuracy.
Visual events are a composition of temporal actions involving actors spatially interacting with objects. When developing computer vision models that can reason about compositional spatio-temporal events, we need benchmarks that can analyze progress and uncover shortcomings. Existing video question answering benchmarks are useful, but they often conflate multiple sources of error into one accuracy metric and have strong biases that models can exploit, making it difficult to pinpoint model weaknesses. We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question-answer pairs for 9.6K videos. We also provide a balanced subset of 3.9M question-answer pairs, 3 orders of magnitude larger than existing benchmarks, that minimizes bias by balancing the answer distributions and types of question structures. Although human evaluators marked 86.02% of our question-answer pairs as correct, the best model achieves only 47.74% accuracy. In addition, AGQA introduces multiple training/test splits to test for various reasoning abilities, including generalization to novel compositions, to indirect references, and to more compositional steps. Using AGQA, we evaluate modern visual reasoning systems, demonstrating that the best models barely perform better than non-visual baselines exploiting linguistic biases and that none of the existing models generalize to novel compositions unseen during training.
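For readers unfamiliar with answer-distribution balancing, the sketch below shows one generic way such a step can be implemented: downsampling so that no answer string dominates the dataset. It assumes question-answer pairs are stored as dictionaries with an "answer" field and is purely illustrative; it is not the AGQA authors' actual procedure, which also balances over types of question structures.

```python
import random
from collections import defaultdict

def balance_answer_distribution(qa_pairs, seed=0):
    """Downsample qa_pairs so every answer appears at most as often as
    the rarest answer, flattening the overall answer distribution.

    qa_pairs: list of dicts, each with at least an "answer" key.
    Returns a new, shuffled, balanced list.
    """
    rng = random.Random(seed)

    # Group question-answer pairs by their answer string.
    by_answer = defaultdict(list)
    for qa in qa_pairs:
        by_answer[qa["answer"]].append(qa)

    # Cap every answer's count at the size of the rarest answer group.
    cap = min(len(group) for group in by_answer.values())

    balanced = []
    for group in by_answer.values():
        balanced.extend(rng.sample(group, cap))

    rng.shuffle(balanced)
    return balanced
```

A scheme like this trades dataset size for reduced linguistic bias: a model can no longer score well by always predicting the most frequent answer, which is the failure mode the balanced AGQA subset is designed to expose.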

