Article

I(2)Transformer: Intra- and Inter-Relation Embedding Transformer for TV Show Captioning

Journal

IEEE TRANSACTIONS ON IMAGE PROCESSING
Volume 31, Pages 3565-3577

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TIP.2022.3159472

Keywords

Transformers; Semantics; Task analysis; Visualization; TV; Electronic mail; Graph neural networks; TV Show captioning; video and subtitle; intra-relation embedding; inter-relation embedding; transformer

Funding

  1. National Key Research and Development Plan [2018AAA0102000, 2019QY1801, 2019QY1802, 2019QY1800]
  2. National Natural Science Foundation of China [61972186, 61732005, U21B2027]
  3. Yunnan High-Tech Industry Development Project [201606]
  4. Yunnan Provincial Major Science and Technology Special Plan Projects [202103AA080015, 202002AD080001-5]
  5. Yunnan Basic Research Project [202001AS070014]
  6. Reserve Talents for Academic and Technological Leaders in Yunnan Province [202105AC160018]
  7. CAAI-Huawei MindSpore Open Fund
  8. Youth Innovation Promotion Association of the Chinese Academy of Sciences [2020108]
  9. CCF-Baidu Open Fund [2021PP15002000]


TV show captioning generates a linguistic sentence from a video and its subtitle, facing challenges such as scattered subtitle information and the semantic gap between modalities. The proposed I(2)Transformer captures intra- and inter-relations, achieves state-of-the-art performance, and generalizes well to other relevant tasks.
TV show captioning aims to generate a linguistic sentence based on a video and its associated subtitle. Compared to purely video-based captioning, the subtitle can provide the captioning model with useful semantic clues such as actors' sentiments and intentions. However, making effective use of the subtitle is also very challenging, because it consists of scattered pieces of information and has a semantic gap with the visual modality. To organize this scattered information and yield a powerful omni-representation across all modalities, an effective captioning model must understand the video content, the subtitle semantics, and the relations between them. In this paper, we propose an Intra- and Inter-relation Embedding Transformer (I(2)Transformer), consisting of an Intra-relation Embedding Block (IAE) and an Inter-relation Embedding Block (IEE) under the framework of a Transformer. First, the IAE captures the intra-relation within each modality by constructing learnable graphs. Then, the IEE learns cross-attention gates and selects useful information from each modality based on their inter-relations, deriving the omni-representation that is fed into the Transformer. Experimental results on the public dataset show that the I(2)Transformer achieves state-of-the-art performance. We also evaluate the effectiveness of the IAE and IEE on two other relevant video-with-text tasks, i.e., TV show retrieval and video-guided machine translation. The encouraging performance further validates that the IAE and IEE blocks have good generalization ability. The code is available at https://github.com/tuyunbin/I2Transformer.
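For readers who want a concrete picture of the two blocks described above, the sketch below shows one plausible PyTorch realization. It is a minimal illustration only: it assumes scaled dot-product affinities for the learnable graph and a sigmoid gate for cross-modal selection, and all class names, dimensions, and hyperparameters here are hypothetical rather than taken from the authors' released code (see the GitHub link above for the official implementation).

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraRelationEmbedding(nn.Module):
    # IAE sketch: capture intra-relations within one modality
    # via a learnable graph over its tokens.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, tokens, dim)
        # Learnable adjacency: pairwise affinities between tokens.
        adj = torch.softmax(
            self.query(x) @ self.key(x).transpose(1, 2) / x.size(-1) ** 0.5,
            dim=-1)
        # One step of graph propagation with a residual connection.
        return x + F.relu(self.proj(adj @ x))

class InterRelationEmbedding(nn.Module):
    # IEE sketch: cross-attention plus a learned gate that decides how much
    # cross-modal evidence to mix into each video token.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, video, subtitle):
        # Video tokens attend to subtitle tokens (inter-relation).
        attended, _ = self.cross_attn(video, subtitle, subtitle)
        # Per-feature gate between visual and cross-modal information.
        g = torch.sigmoid(self.gate(torch.cat([video, attended], dim=-1)))
        return g * attended + (1 - g) * video  # omni-representation

# Toy usage: 8 video tokens and 12 subtitle tokens with 256-d features.
video, subtitle = torch.randn(2, 8, 256), torch.randn(2, 12, 256)
iae_v, iae_s = IntraRelationEmbedding(256), IntraRelationEmbedding(256)
iee = InterRelationEmbedding(256)
omni = iee(iae_v(video), iae_s(subtitle))  # (2, 8, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, nhead=4, batch_first=True), num_layers=2)
print(encoder(omni).shape)  # torch.Size([2, 8, 256])

The gated residual at the end is one common way to let the model fall back on purely visual features when the subtitle is uninformative, which matches the abstract's description of selecting useful information from each modality based on their inter-relations.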

