Article

Deepfake Video Detection via Predictive Representation Learning

Publisher

Association for Computing Machinery (ACM)
DOI: 10.1145/3536426

Keywords

Deepfake video detection; representation learning; deep learning; video understanding

Funding

  1. National Key Research and Development Plan [2020AAA0140001]
  2. Beijing Natural Science Foundation [19L2040]
  3. National Natural Science Foundation of China [61772513]
  4. Youth Innovation Promotion Association, Chinese Academy of Sciences

This paper proposes a predictive representation learning approach called Latent Pattern Sensing for deepfake video detection. The approach captures semantic-level change characteristics in deepfake videos by analyzing temporal inconsistencies between frames and describes the resulting latent patterns in a unified framework. Experimental results demonstrate the effectiveness of the approach, which achieves high AUC scores on benchmark datasets.
Increasingly advanced deepfake approaches have made the detection of deepfake videos very challenging. We observe that deepfake videos often exhibit appearance-level temporal inconsistencies in some facial components between frames, resulting in discriminative spatiotemporal latent patterns among semantic-level feature maps. Inspired by this finding, we propose a predictive representation learning approach termed Latent Pattern Sensing to capture these semantic change characteristics for deepfake video detection. The approach cascades a Convolutional Neural Network-based encoder, a ConvGRU-based aggregator, and a single-layer binary classifier. The encoder and aggregator are pretrained in a self-supervised manner to form representative spatiotemporal context features. The classifier is then trained to classify these context features, distinguishing fake videos from real ones. Finally, we propose a selective self-distillation fine-tuning method to further improve the robustness and performance of the detector. In this manner, the extracted features describe the latent patterns of videos across frames both spatially and temporally in a unified way, leading to an effective and robust deepfake video detector. Extensive experiments and comprehensive analysis prove the effectiveness of our approach, e.g., achieving the highest Area Under Curve (AUC) score of 99.94% on the FaceForensics++ benchmark and surpassing 12 state-of-the-art methods by at least 7.90% and 8.69% AUC on the challenging DFDC and Celeb-DF(v2) benchmarks, respectively.
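To make the cascade described in the abstract concrete, below is a minimal PyTorch-style sketch of a CNN encoder feeding a ConvGRU aggregator and a single-layer binary classifier. All module names, layer sizes, the hand-rolled ConvGRU cell, and the pooling step are illustrative assumptions, not the authors' implementation; the self-supervised pretraining and the selective self-distillation fine-tuning stages are omitted.

    # Minimal sketch (assumed PyTorch): per-frame CNN encoder -> ConvGRU aggregator
    # -> single-layer binary classifier. Shapes and module sizes are illustrative.
    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        """A simple ConvGRU cell: gates computed with 2D convolutions."""
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            p = k // 2
            self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update/reset gates
            self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)       # candidate state
            self.hid_ch = hid_ch

        def forward(self, x, h):
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_tilde

    class LatentPatternSensingSketch(nn.Module):
        """Cascade of CNN encoder, ConvGRU aggregator, and one linear classifier."""
        def __init__(self, feat_ch=64, hid_ch=64):
            super().__init__()
            self.encoder = nn.Sequential(              # toy CNN; a deeper backbone in practice
                nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.aggregator = ConvGRUCell(feat_ch, hid_ch)
            self.classifier = nn.Linear(hid_ch, 1)     # single-layer binary classifier

        def forward(self, frames):                     # frames: (B, T, 3, H, W)
            b, t, c, h, w = frames.shape
            feats = self.encoder(frames.reshape(b * t, c, h, w)).reshape(b, t, -1, h // 4, w // 4)
            state = torch.zeros(b, self.aggregator.hid_ch, h // 4, w // 4, device=frames.device)
            for i in range(t):                         # aggregate spatiotemporal context frame by frame
                state = self.aggregator(feats[:, i], state)
            context = state.mean(dim=(2, 3))           # pool the spatiotemporal context feature
            return self.classifier(context)            # real/fake logit

    logits = LatentPatternSensingSketch()(torch.randn(2, 8, 3, 64, 64))  # -> shape (2, 1)

In the paper's setup the encoder and aggregator would first be pretrained with a self-supervised predictive objective, and only the resulting context features would be passed to the binary classifier; the sketch above shows only the inference-time cascade.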
