Article

Deepfake Video Detection via Predictive Representation Learning

Publisher

Association for Computing Machinery (ACM)
DOI: 10.1145/3536426

Keywords

Deepfake video detection; representation learning; deep learning; video understanding

Funding

  1. National Key Research and Development Plan [2020AAA0140001]
  2. Beijing Natural Science Foundation [19L2040]
  3. National Natural Science Foundation of China [61772513]
  4. Youth Innovation Promotion Association, Chinese Academy of Sciences

Abstract

This paper proposes a predictive representation learning approach called Latent Pattern Sensing for deepfake video detection. The approach captures semantic change characteristics in deepfake videos by analyzing temporal inconsistencies and uses a unified framework to describe latent patterns across frames. Experimental results show the effectiveness of the approach, achieving high AUC scores on benchmark datasets.
Increasingly advanced deepfake approaches have made the detection of deepfake videos very challenging. We observe that deepfake videos generally exhibit appearance-level temporal inconsistencies in some facial components between frames, resulting in discriminative spatiotemporal latent patterns among semantic-level feature maps. Inspired by this finding, we propose a predictive representation learning approach termed Latent Pattern Sensing to capture these semantic change characteristics for deepfake video detection. The approach cascades a Convolutional Neural Network (CNN)-based encoder, a ConvGRU-based aggregator, and a single-layer binary classifier. The encoder and aggregator are pretrained in a self-supervised manner to form representative spatiotemporal context features. Then, the classifier is trained to classify the context features, distinguishing fake videos from real ones. Finally, we propose a selective self-distillation fine-tuning method to further improve the robustness and performance of the detector. In this manner, the extracted features can simultaneously describe the latent patterns of videos across frames spatially and temporally in a unified way, leading to an effective and robust deepfake video detector. Extensive experiments and comprehensive analysis prove the effectiveness of our approach, e.g., achieving a very high Area Under Curve (AUC) score of 99.94% on the FaceForensics++ benchmark and surpassing 12 state-of-the-art methods by at least 7.90%@AUC and 8.69%@AUC on the challenging DFDC and Celeb-DF(v2) benchmarks, respectively.
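To make the cascaded design in the abstract concrete, the sketch below shows one plausible arrangement of the three stages it names: a CNN encoder applied per frame, a ConvGRU aggregator that accumulates spatiotemporal context across frames, and a single-layer binary classifier on the aggregated feature. This is a minimal illustration, not the authors' implementation; all module names, layer sizes, and the pooling step are assumptions, and the self-supervised pretraining and selective self-distillation fine-tuning described in the paper are omitted.

# Minimal sketch (assumed structure, not the authors' code) of the cascade
# described in the abstract: per-frame CNN encoder -> ConvGRU aggregator
# -> single-layer binary classifier.
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    """Minimal ConvGRU cell: update/reset gates computed with 2D convolutions."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=padding)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=padding)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde


class LatentPatternDetector(nn.Module):
    """Hypothetical cascade: CNN encoder -> ConvGRU aggregator -> linear classifier."""
    def __init__(self, feat_channels=64):
        super().__init__()
        # CNN encoder: maps each RGB frame to a semantic-level feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_channels, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # ConvGRU aggregator: accumulates temporal context across frame features.
        self.aggregator = ConvGRUCell(feat_channels)
        # Single-layer binary classifier on the pooled context feature.
        self.classifier = nn.Linear(feat_channels, 2)

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        b, t, c, h, w = clip.shape
        feats = self.encoder(clip.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, *feats.shape[1:])
        context = torch.zeros_like(feats[:, 0])
        for i in range(t):                          # aggregate frame by frame
            context = self.aggregator(feats[:, i], context)
        pooled = context.mean(dim=(2, 3))           # global average pooling
        return self.classifier(pooled)              # real-vs-fake logits


if __name__ == "__main__":
    logits = LatentPatternDetector()(torch.randn(2, 8, 3, 112, 112))
    print(logits.shape)  # torch.Size([2, 2])

In the paper's setting, the encoder and aggregator would first be pretrained with a self-supervised predictive objective and only the final classifier (and later the fine-tuned detector) trained on real-vs-fake labels; the sketch above covers only the forward pass of the cascade.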
