Article

Chinese Image Caption Generation via Visual Attention and Topic Modeling

Journal

IEEE Transactions on Cybernetics
Volume 52, Issue 2, Pages 1247-1257

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TCYB.2020.2997034

Keywords

Visualization; Decoding; Semantics; Predictive models; Feature extraction; Natural language processing; Chinese image caption; deep neural network; topic modeling; visual attention

Funding

  1. Major Projects of National Social Science Foundation of China [11ZD189]
  2. Innovation Teams in Colleges and Universities in Jinan [2018GXRC014]

Abstract

Automatic image captioning performs the cross-modal conversion from image visual content to natural-language text. Involving computer vision (CV) and natural language processing (NLP), it has become one of the most sophisticated research issues in artificial intelligence. Built on deep neural networks, the neural image caption (NIC) model has achieved remarkable performance in image captioning, yet essential challenges remain, such as the deviation between the sentences generated by the model and the intrinsic content expressed by the image, the low accuracy of image scene descriptions, and the monotony of generated sentences. In addition, most current datasets and methods for image captioning are in English. Given the syntactic and semantic distinctions between Chinese and English, it is necessary to develop specialized Chinese image caption generation methods that accommodate these differences. To solve the aforementioned problems, we design the NICVATP2L model via visual attention and topic modeling, in which the visual attention mechanism reduces the deviation and the topic model improves the accuracy and diversity of generated sentences. Specifically, in the encoding phase, a convolutional neural network (CNN) and a topic model are used to extract visual and topic features of the input images, respectively. In the decoding phase, an attention mechanism is applied to the image visual features to obtain visual region features. Finally, the topic features and the visual region features are combined to guide a two-layer long short-term memory (LSTM) network in generating Chinese image captions. To validate our model, we conducted experiments on the Chinese AIC-ICC image dataset. The experimental results show that our model automatically generates more informative and descriptive Chinese captions in a more natural way, and that it outperforms the existing NIC model.
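To make the decoding phase described above concrete, the following is a minimal PyTorch sketch of a topic-guided decoder in the spirit of the abstract: soft (additive) visual attention over CNN region features, whose output is combined with topic features at each step to drive a two-layer LSTM. All names, dimensions, and layer choices (e.g., TopicGuidedDecoder, Bahdanau-style attention) are illustrative assumptions, not the authors' published NICVATP2L implementation.

```python
# Hypothetical sketch of the decoding phase: attention over CNN region
# features + topic features feeding a two-layer LSTM. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicGuidedDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512,
                 region_dim=2048, topic_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive (Bahdanau-style) attention over image region features.
        self.att_region = nn.Linear(region_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Two-layer LSTM; input is [word embedding; attended region; topic].
        self.lstm = nn.LSTM(embed_dim + region_dim + topic_dim,
                            hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, regions, h):
        # regions: (B, R, region_dim); h: (B, hidden_dim) -> (B, region_dim)
        scores = self.att_score(torch.tanh(
            self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)      # attention weights over regions
        return (alpha * regions).sum(dim=1)   # attended visual region feature

    def forward(self, regions, topic, tokens):
        # regions: CNN region features; topic: topic-model feature vector;
        # tokens: (B, T) ground-truth caption ids (teacher forcing).
        B, T = tokens.shape
        h = torch.zeros(2, B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            context = self.attend(regions, h[-1])
            step_in = torch.cat([self.embed(tokens[:, t]),
                                 context, topic], dim=-1).unsqueeze(1)
            out, (h, c) = self.lstm(step_in, (h, c))
            logits.append(self.out(out.squeeze(1)))
        return torch.stack(logits, dim=1)     # (B, T, vocab_size)
```

Feeding the topic vector at every decoding step, rather than only at initialization, is one plausible reading of "guiding" the LSTM; the paper itself should be consulted for the exact wiring.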
