Article

Visual relationship detection with recurrent attention and negative sampling

Journal

NEUROCOMPUTING
Volume 434, Pages 55-66

Publisher

ELSEVIER
DOI: 10.1016/j.neucom.2020.12.099

Keywords

Computer vision; Neural networks; Visual relations

Funding

  1. National Key R&D Program of China [2018YFB1308000]
  2. National Natural Science Foundation of China [61772508, U1713213, 61976143]
  3. Shenzhen Technology Project [JCYJ20170413152535587]
  4. CAS Key Technology Talent Program
  5. Guangdong Technology Program [2016B010108010, 2016B010125003, 2017B010110007]
  6. Shenzhen Engineering Laboratory for 3D Content Generating Technologies [[2017] 476]
  7. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences [2014DP173025]
  8. Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems [2019B121205007]
  9. CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems

This paper presents a fast method for visual relationship detection based on recurrent attention and negative sampling. It uses a Word2Vec model to learn non-visual semantic features and binary masks to represent spatial location features, and it applies an undersampling technique to alleviate the influence of imbalanced annotations. Experiments show that the proposed method achieves state-of-the-art results on the benchmark VRD and Visual Genome (VG) datasets in most cases.
Detecting relationships between objects is important for the complete understanding of visual scenes, and is helpful for applications such as visual question answering, image search, and robotic interaction. It is, however, a challenging task because object appearance and interactions vary widely and annotations are often incomplete. In this paper, we propose a fast method for visual relationship detection based on recurrent attention and negative sampling. First, to learn non-visual features, we use the Word2Vec model to extract semantic embedding features of object categories, and we use binary masks to represent spatial location features. Second, we integrate a recurrent attention mechanism into the detection pipeline, enabling the network to focus on several specific parts of an image when scoring predicates for a given object pair. Third, we use an undersampling technique to alleviate the influence of imbalanced annotations, particularly for zero-shot detection. The proposed method is simple, but experiments show that it is efficient and achieves state-of-the-art results on the benchmark VRD and Visual Genome (VG) datasets in most cases. (c) 2021 Elsevier B.V. All rights reserved.
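
To make two of the ideas in the abstract concrete, here is a minimal sketch of (a) encoding a bounding box as a binary mask to serve as a spatial location feature and (b) undersampling frequent predicate classes to reduce annotation imbalance. This is an illustration only, not the authors' implementation: the function names (`box_to_mask`, `pair_masks`, `undersample`), the 32 x 32 mask resolution, the cell-center rasterization rule, and the per-class cap are all assumptions, since the abstract does not specify them.

```python
import random
from collections import defaultdict


def box_to_mask(box, image_size, mask_size=32):
    """Rasterize a pixel box (x1, y1, x2, y2) into a mask_size x mask_size binary grid.

    A cell is set to 1 when its center falls inside the box (assumed rule).
    """
    width, height = image_size
    x1, y1, x2, y2 = box
    mask = []
    for r in range(mask_size):
        cy = (r + 0.5) * height / mask_size  # cell center, in pixels
        row = []
        for c in range(mask_size):
            cx = (c + 0.5) * width / mask_size
            row.append(1 if x1 <= cx <= x2 and y1 <= cy <= y2 else 0)
        mask.append(row)
    return mask


def pair_masks(subject_box, object_box, image_size, mask_size=32):
    """Stack subject and object masks as a two-channel spatial feature."""
    return [
        box_to_mask(subject_box, image_size, mask_size),
        box_to_mask(object_box, image_size, mask_size),
    ]


def undersample(samples, max_per_class, seed=0):
    """Cap the number of samples per predicate label (hypothetical policy).

    `samples` is a list of (features, predicate_label) tuples.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    kept = []
    for group in by_label.values():
        if len(group) > max_per_class:
            group = rng.sample(group, max_per_class)  # drop the excess at random
        kept.extend(group)
    rng.shuffle(kept)
    return kept
```

In a pipeline of this kind, the two-channel mask would typically be passed through a small convolutional branch and combined with the visual and Word2Vec features before predicate scoring; how the paper fuses them is not stated in the abstract.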
