Article

VR plus HD: Video Semantic Reconstruction From Spatio-Temporal Scene Graphs

Journal

IEEE Journal of Selected Topics in Signal Processing

Publisher

IEEE (Institute of Electrical and Electronics Engineers), Inc.
DOI: 10.1109/JSTSP.2023.3323654

Keywords

Scene graph; video generation; VR; spatio-temporal; semantic communication

Abstract

With the advancements in computer science and deep learning networks, artificial intelligence generation technology is becoming increasingly mature. This study focuses on the challenges of generating HD videos using deep learning models. The proposed model, StSg2vid, utilizes a spatio-temporal scene graph to represent the semantic information of each frame in the video. By leveraging graph convolutional neural networks, the model predicts the scene layout and generates frame images. Compared to state-of-the-art algorithms, the videos generated by our model achieve better results in both quantitative and qualitative evaluations. Additionally, this model has potential applications in the generation of virtual reality videos.
With the development of computer science and deep learning networks, AI generation technology is becoming increasingly mature. Video has become one of the most important information carriers in our daily life because of the large amount of data and information it conveys. However, that same volume of information and the complexity of video semantics make video generation, especially High Definition (HD) video generation, a difficult problem in deep learning; video semantic representation and semantic reconstruction are likewise difficult tasks. Because video content is changeable and its information is highly correlated, we propose an HD video generation model driven by a spatio-temporal scene graph: the spatio-temporal scene graph to video (StSg2vid) model. First, a spatio-temporal scene graph sequence is fed in as the semantic representation of the information in each frame of the video. The scene graph describing the semantic information of a frame also encodes the motion progress of each object in the video at that moment, effectively acting as a clock. The spatio-temporal scene graph propagates relationship information between objects through a graph convolutional neural network and predicts the scene layout for that moment. Finally, an image generation model predicts the frame image for the current moment; the frame at each moment depends on the scene layout at the current moment and on the frame and scene layout at the previous moment. We introduce a flow net, a warping-based prediction model, and the spatially-adaptive normalization (SPADE) network to generate the predicted image of each frame. We use the Action Genome dataset. Compared with current state-of-the-art algorithms, the videos generated by our model achieve better results in both quantitative metrics and user evaluations. In addition, we generalize the StSg2vid model to virtual reality (VR) videos of indoor scenes, preliminarily exploring a generation method for VR videos and achieving good results.
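As a rough illustration of the first stage described above, the sketch below shows how message passing over scene-graph triples with a graph convolutional network can produce per-object features, from which a small head predicts a bounding-box scene layout. This is a minimal PyTorch sketch assuming a (subject, predicate, object) triple representation of the graph; all module names, dimensions, and design choices are hypothetical and are not taken from the authors' implementation.

```python
import torch
import torch.nn as nn


class SceneGraphConv(nn.Module):
    """One round of message passing over (subject, predicate, object) triples."""

    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, obj_feats, pred_feats, edges):
        # obj_feats: (N, dim) object embeddings; pred_feats: (E, dim) predicate
        # embeddings; edges: (E, 2) indices of (subject, object) for each predicate.
        subj = obj_feats[edges[:, 0]]
        obj = obj_feats[edges[:, 1]]
        msg = self.edge_mlp(torch.cat([subj, pred_feats, obj], dim=-1))
        # Average the incoming messages at each object node.
        agg = torch.zeros_like(obj_feats)
        agg.index_add_(0, edges[:, 1], msg)
        counts = torch.bincount(edges[:, 1], minlength=obj_feats.size(0)).clamp(min=1)
        agg = agg / counts.unsqueeze(-1).to(agg.dtype)
        return self.node_mlp(torch.cat([obj_feats, agg], dim=-1))


class LayoutHead(nn.Module):
    """Predict a normalized bounding box (x, y, w, h) per object: the scene layout."""

    def __init__(self, dim):
        super().__init__()
        self.box = nn.Linear(dim, 4)

    def forward(self, obj_feats):
        return torch.sigmoid(self.box(obj_feats))


if __name__ == "__main__":
    dim = 64
    obj_feats = torch.randn(5, dim)   # 5 objects in one frame's scene graph
    pred_feats = torch.randn(4, dim)  # 4 relationships between them
    edges = torch.tensor([[0, 1], [1, 2], [3, 2], [4, 0]])
    gconv = SceneGraphConv(dim)
    layout = LayoutHead(dim)(gconv(obj_feats, pred_feats, edges))
    print(layout.shape)  # torch.Size([5, 4]) -> one box per object
```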
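The second stage, as described in the abstract, conditions each new frame on the previous frame and on the current layout. The sketch below illustrates, under the same hypothetical assumptions, the two generic ingredients mentioned there: flow-based warping of the previous frame and a SPADE-style spatially-adaptive normalization block modulated by a rasterized layout map. Again, this is an illustrative sketch, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPADEBlock(nn.Module):
    """Spatially-adaptive normalization modulated by the rendered layout map."""

    def __init__(self, channels, layout_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(layout_channels, channels, 3, padding=1)
        self.beta = nn.Conv2d(layout_channels, channels, 3, padding=1)

    def forward(self, feats, layout_map):
        layout_map = F.interpolate(layout_map, size=feats.shape[-2:], mode="nearest")
        return self.norm(feats) * (1 + self.gamma(layout_map)) + self.beta(layout_map)


def warp(prev_frame, flow):
    """Warp the previous frame (B, C, H, W) with a dense flow field (B, 2, H, W)."""
    b, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=prev_frame.device),
        torch.arange(w, device=prev_frame.device),
        indexing="ij",
    )
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)  # (B, H, W, 2)
    return F.grid_sample(prev_frame, grid, align_corners=True)


if __name__ == "__main__":
    prev_frame = torch.randn(1, 3, 64, 64)
    flow = torch.zeros(1, 2, 64, 64)        # zero flow -> identity warp
    layout_map = torch.randn(1, 8, 64, 64)  # hypothetical 8-channel rasterized layout
    warped = warp(prev_frame, flow)
    refined = SPADEBlock(3, 8)(warped, layout_map)
    print(warped.shape, refined.shape)
```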
