Article

Video super-resolution via mixed spatial-temporal convolution and selective fusion

Journal

PATTERN RECOGNITION
Volume 126, Issue -, Pages -

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2022.108577

Keywords

Video super-resolution; Mixed spatial-temporal convolution; Selective feature fusion

Funding

  1. NSF of China [61871328, 61901384, 61231016]
  2. 111 Project [B16039]
  3. Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing
  4. Centre for Augmented Reasoning at the Australian Institute for Machine Learning
  5. National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology

Abstract

Video super-resolution aims to recover the high-resolution (HR) contents from the low-resolution (LR) observations by compositing the spatial-temporal information in the LR frames. It is crucial to model the spatial-temporal information jointly, since video sequences are three-dimensional spatial-temporal signals. Compared with explicitly estimating motions between the 2D frames, 3D convolutional neural networks (CNNs) have shown their efficiency and effectiveness for video super-resolution (SR), as a natural way of spatial-temporal data modelling. Though promising, the performance of 3D CNNs is still far from satisfactory: the high computational and memory requirements limit the development of more advanced designs that could extract and fuse information at larger spatial and temporal scales. We thus propose a Mixed Spatial-Temporal Convolution (MSTC) block that simultaneously extracts the spatial information and the supplementary temporal dependency among frames by jointly applying 2D and 3D convolution. To further fuse the learned features corresponding to different frames, we propose a novel similarity-based selective feature fusion strategy, unlike previous methods that directly stack the learned features. Additionally, an attention-based motion compensation module is applied to alleviate the influence of misalignment between frames. Experiments on three widely used benchmark datasets and a real-world dataset show that, relying on its superior feature extraction and fusion ability, the proposed network outperforms previous state-of-the-art methods, especially in recovering confusing details. (c) 2022 Elsevier Ltd. All rights reserved.
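
The abstract describes three components: an MSTC block that pairs 2D and 3D convolutions, a similarity-based selective feature fusion step, and attention-based motion compensation. The following minimal PyTorch sketch illustrates only the first two ideas; the module names, channel sizes, the residual combination of the two branches, and the cosine-similarity softmax weighting are assumptions made for illustration, not the paper's exact design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixedSpatialTemporalBlock(nn.Module):
        """Illustrative MSTC-style block: a 2D conv branch for per-frame
        spatial detail plus a 3D conv branch for temporal dependency.
        (The residual sum of the two branches is an assumption.)"""

        def __init__(self, channels: int):
            super().__init__()
            self.conv2d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, frames, height, width)
            b, c, t, h, w = x.shape
            # 2D branch: apply the spatial conv to every frame independently.
            frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
            spatial = self.conv2d(frames).reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
            # 3D branch: temporal dependency across neighbouring frames.
            temporal = self.conv3d(x)
            return F.relu(x + spatial + temporal)

    def selective_fusion(features: torch.Tensor, center: int) -> torch.Tensor:
        """Fuses per-frame features by weighting each frame with its cosine
        similarity to the reference (center) frame, instead of stacking."""
        b, c, t, h, w = features.shape
        ref = features[:, :, center].reshape(b, 1, c * h * w)
        frames = features.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        sim = F.cosine_similarity(frames, ref, dim=2)        # (b, t)
        weights = F.softmax(sim, dim=1).view(b, 1, t, 1, 1)
        # Weighted temporal sum replaces naive feature stacking.
        return (features * weights).sum(dim=2)               # (b, c, h, w)

    # Usage: fuse features from 5 LR frames around the center frame.
    block = MixedSpatialTemporalBlock(channels=64)
    x = torch.randn(2, 64, 5, 32, 32)
    fused = selective_fusion(block(x), center=2)             # (2, 64, 32, 32)

In the paper itself, the fusion and the motion compensation are attention-based and learned end to end; the cosine-similarity weighting above merely stands in for the selective-weighting idea.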
