Article

What Happens in Crowd Scenes: A New Dataset About Crowd Scenes for Image Captioning

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 25, Pages 5400-5412

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TMM.2022.3192729

Keywords

CrowdCaption; crowd scenes; image captioning; multimodal understanding


Abstract
Making machines capable of effectively understanding and analyzing crowd scenes is of paramount importance for building smart cities that serve people, and has far-reaching significance for guiding dense crowds and preventing accidents such as crushes and stampedes. As a typical multimodal scene-understanding task, image captioning has long attracted widespread attention. However, captioning for crowd scene understanding has rarely been studied due to the lack of related datasets, making it difficult to know what happens in crowd scenes. To fill this research gap, we propose a crowd scene caption dataset named CrowdCaption, which offers crowd-topic scenes, comprehensive and complex caption descriptions, typical relationships, and detailed grounding annotations. The complexity and diversity of the descriptions and the specificity of the crowd scenes make this dataset extremely challenging for most current methods. We therefore propose a Multi-hierarchical Attribute Guided Crowd Caption Network (MAGC), built on crowd objects, actions, and status attributes (such as position, dress, and posture), to generate crowd-specific detailed descriptions. Extensive experiments on CrowdCaption show that the proposed method reaches state-of-the-art (SoTA) performance. We hope the CrowdCaption dataset can assist future studies of crowd scenes in the multimodal domain.
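The core idea of attribute-guided captioning can be illustrated with a toy sketch. This is purely illustrative and is not the authors' MAGC architecture (which uses learned attribute features and a neural decoder); here, attributes at the three hierarchies named in the abstract (objects, actions, status) guide a template-based composer, with all attribute names and values chosen hypothetically:

```python
from dataclasses import dataclass, field

@dataclass
class CrowdAttributes:
    """Toy multi-hierarchical attributes: objects, actions, and status."""
    objects: list = field(default_factory=list)   # e.g. ["a group of people"]
    actions: list = field(default_factory=list)   # e.g. ["walking"]
    status: dict = field(default_factory=dict)    # e.g. {"position": "on the street"}

def guided_caption(attrs: CrowdAttributes) -> str:
    """Compose a caption from the three attribute levels.

    A simple template stands in for the learned attribute-guided
    decoder a real captioning network would use.
    """
    parts = [" and ".join(attrs.objects) if attrs.objects else "a crowd"]
    if attrs.status.get("dress"):
        parts[-1] += f" in {attrs.status['dress']}"
    if attrs.actions:
        parts.append(" ".join(attrs.actions))
    if attrs.status.get("position"):
        parts.append(attrs.status["position"])
    return " ".join(parts).capitalize() + "."

caption = guided_caption(CrowdAttributes(
    objects=["a group of people"],
    actions=["walking"],
    status={"position": "on a crowded street", "dress": "winter coats"},
))
print(caption)  # A group of people in winter coats walking on a crowded street.
```

The sketch shows why multiple attribute hierarchies matter: dropping any one level (objects, actions, or status) yields a strictly less informative description of the same scene.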

