Ali Bou Nassif1, Qassim Nasir2, Manar Abu Talib3, Omar Mohamed Gouda1.
Abstract
Creating deepfake multimedia, and especially deepfake videos, has become much easier due to the availability of deepfake tools and the virtually unlimited number of face images found online. Research and industry communities have dedicated time and resources to developing detection methods that expose these fake videos. Although detection methods have advanced over the past few years, synthesis methods have also progressed, producing deepfake videos that are increasingly hard to distinguish from real ones. This research paper proposes an improved optical flow estimation-based method to detect and expose the discrepancies between video frames. Several augmentations and modifications are tested in an attempt to improve the system's overall accuracy. Furthermore, the system is trained on graphics processing units (GPUs) and tensor processing units (TPUs) to explore the effects and benefits of each type of hardware in deepfake detection. TPUs were found to have shorter training times than GPUs. VGG-16 is the best-performing model when used as a backbone for the system, achieving around 82.0% detection accuracy when trained on GPUs and 71.34% accuracy on TPUs.
Keywords: GPU; convolutional neural networks (CNNs); deepfake; optical flow; tensor processing units (TPU)
Year: 2022 PMID: 35408114 PMCID: PMC9002804 DOI: 10.3390/s22072500
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Difference between real (top) and fake (bottom) frames passed to optical flow estimator.
Summary of related work in deepfake detection.
| Research Paper | Year | Method | Domain | Datasets | Hardware | Accuracy |
|---|---|---|---|---|---|---|
| DeepRhythm [ | 2020 | Heartbeat rhythms using PPG with attention network | Dual-spatial-temporal | FF++ * | GPU | Accuracy: 98.0% |
| FDFtNet [ | 2020 | Augmentation of pretrained CNN | Pixel-Level detection | PGGAN, | GPU | AUROC: 0.994 |
| Face X-Ray [ | 2020 | Detection of blending boundaries in the image | Pixel-Level detection | FF++ * | GPU | AUC: 95.4 |
| Visual Artifacts [ | 2019 | Visual artifacts (eyes, teeth and nose and face border) | Pixel-Level detection | Glow, | GPU | AUROC: 0.866 |
| Optical Flow [ | 2019 | Inter-frame correlations using optical flow | Spatio-temporal | FF++ –Face2Face | GPU | Accuracy: 81.61% |
| Recurrent Neural Networks [ | 2019 | Recurrent Neural Network | Spatio-temporal | HOHA | GPU | Accuracy: 97.1% |
| FF++ –Xception [ | 2019 | CNN-based Image classification | Pixel-Level detection | FF++ * | GPU | Accuracy: 96.36% |
| Eye Blinking [ | 2018 | Discrepancies in eye blinking across the frames | Spatio-temporal | CEW | GPU | AUROC: 0.98 |
| Edges & Optical flow [ | 2020 | Edges of optical flow images with XceptionNet | Spatio-temporal | FF++ * | GPU | Accuracy on DFDC-mini: 97.94% |
| Optical flow based CNN [ | 2021 | Optical flow-based CNN | Spatio-temporal | FF++ * | GPU | Accuracy on Optical flow only: 82.99% |
| This research paper | 2022 | Inter-frame correlations using optical flow | Spatio-temporal | FF++ –Deepfake, | GPU, TPU | AUROC: 0.879 |
* Means all types of manipulation methods were used in the paper.
Types of deepfake manipulation.
| Type | Photo | Audio | Video |
|---|---|---|---|
| Description | This type includes manipulations done on images, i.e., to generate a non-existent face image. | This type includes any type of manipulation done on audio records, i.e., impersonating or changing a person’s voice. | This type includes manipulations done on videos. |
| Class | Face and body swapping. | Impersonating person’s voice. Changing a person’s voice. Speech to text usage to change part of audio to a specific text. | Face-swapping. Face-morphing. Full body puppetry. |
| Example | FaceApp [ | Synthesizing Obama: Learning lip sync from audio [ Waveglow [ | Face2Face [ Pose transfer [ |
Deepfake datasets.
| Dataset | Year | Size (Videos) | Techniques |
|---|---|---|---|
| FF++ [ | 2019 | 1000 real 7000 fake (all techniques) | Deepfakes, Face2Face, face swap, NeuralTextures |
| Celeb-DF v2 [ | 2020 | 590/5639 | Deepfakes |
| DFDC [ | 2020 | 19,154/100,000 | 8 different deepfakes techniques |
Figure 2. PWC-Net network architecture.
Figure 3. Overall proposed system architecture.
Figure 4. GPU architecture of the system.
Figure 5. Four different augmentations: (a) Augmentation 1: applying the proposed augmentation by Jeon et al.; (b) Augmentation 2: training the FTT block alongside the CNN; (c) Augmentation 3: attaching MobileNet block at the end of the CNN; (d) Augmentation 4: attaching FTT block at the end of the CNN.
Figure 6. Basic architecture for the TPU approach.
Frames used in the experiments.
| Dataset | Videos Used | Original Frames | Optical Flow Frames | Training/Validation/Test |
|---|---|---|---|---|
| FaceForensics++ –DF | 631 | 240,000 | 120,000 | 80,000/20,000/20,000 |
| FaceForensics++ –F2F | 545 | 240,000 | 120,000 | 80,000/20,000/20,000 |
| Celeb-DF | 1254 | 240,000 | 120,000 | 80,000/20,000/20,000 |
| DFDC | 962 | 240,000 | 120,000 | 80,000/20,000/20,000 |
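The frame counts in the table above are consistent with pairing consecutive frames without overlap (240,000 frames giving 120,000 optical flow frames) and then splitting the flow frames 80,000/20,000/20,000. The following is a minimal sketch of that bookkeeping. The non-overlapping pairing scheme is an assumption inferred from the 2:1 ratio, and the helper names `pair_frames` and `split_counts` are hypothetical.

```python
def pair_frames(frame_ids):
    """Group frames into non-overlapping consecutive pairs (assumed pairing scheme)."""
    return [(frame_ids[i], frame_ids[i + 1]) for i in range(0, len(frame_ids) - 1, 2)]


def split_counts(total, train=80_000, val=20_000, test=20_000):
    """Check that the split sizes from the table add up to the number of flow frames."""
    if train + val + test != total:
        raise ValueError("split sizes do not sum to the total")
    return {"train": train, "val": val, "test": test}


frames = list(range(240_000))      # stand-in IDs for the 240,000 extracted frames
pairs = pair_frames(frames)        # one optical flow frame per pair of input frames
splits = split_counts(len(pairs))  # training/validation/test sizes from the table
print(len(pairs), splits)
```

If the estimator were instead run on overlapping pairs (frames 1-2, 2-3, and so on), the same 240,000 frames would yield 239,999 flow frames, which does not match the table.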
Training parameters.
| Approach | Optimizer | Learning Rate | Loss Function | Last Dense Layer (Units, Activation) | Epochs |
|---|---|---|---|---|---|
| GPU | Adam | 1e-4 | categorical_crossentropy | 2, softmax | 25 |
| GPU-Original | Adam | 1e-4 | binary_crossentropy | 1, sigmoid | 25 |
| Augmented | Adam | Default | categorical_crossentropy | 2, softmax | 25 |
| TPU | Adamax | 1e-4 | sparse_categorical_crossentropy | 2, softmax | 25 |
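The three loss settings in the table correspond to three label encodings of the same two-class problem: a single sigmoid unit with a 0/1 label, a two-unit softmax with one-hot labels, and a two-unit softmax with integer labels. The sketch below illustrates this in plain Python. The function names mirror the Keras loss identifiers in the table, but the code is an illustration rather than the Keras implementation, and the class order (0 = real, 1 = fake) and the example probabilities are assumptions.

```python
import math


def binary_crossentropy(y_true, p_fake):
    """Loss for one sigmoid output; p_fake is the predicted probability of class 1."""
    return -(y_true * math.log(p_fake) + (1 - y_true) * math.log(1 - p_fake))


def categorical_crossentropy(one_hot, probs):
    """Loss for a softmax output with a one-hot label such as [0, 1]."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def sparse_categorical_crossentropy(label, probs):
    """Same loss as above, but the label is an integer class index."""
    return -math.log(probs[label])


probs = [0.3, 0.7]  # hypothetical softmax output: [P(real), P(fake)]

loss_binary = binary_crossentropy(1, probs[1])
loss_categorical = categorical_crossentropy([0, 1], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)
print(loss_binary, loss_categorical, loss_sparse)
```

For a single sample with matching probabilities, all three values coincide, so the choice between them mainly reflects how the labels are stored and how the last dense layer is shaped.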
Figure 7. Backbone CNNs validation accuracy vs. epochs: (a) linear scale; (b) logarithmic scale.
Backbone CNNs accuracy comparison. Values in bold are the best values in each category.
| Model | Time Per Epoch | Total Time | Accuracy |
|---|---|---|---|
| Inception V3 [ | 800 s | 335 min | 62.1% |
| ResNet 50 [ | | | 60.64% |
| ResNet 101 [ | 1207 s | 507 min | 65.89% |
| ResNet 152 [ | 1994 s | 839 min | 65.79% |
| Xception [ | 633 s | 264 min | 52.0% |
| VGG-19 | 698 s | 294 min | 80.1% |
| VGG-16 Binary (Amerini’s) [ | 446 s | 187 min | 75.27% |
| VGG-16 (Proposed) | 440 s | 183 min | 82% |
Figure 8. Accuracy comparison between the proposed and the original trained on different datasets: (a) linear scale; (b) logarithmic scale.
Dataset evaluation on proposed vs. original. The highlighted values in bold are the best performing for each dataset.
| Model | Dataset | Accuracy | Overall Accuracy |
|---|---|---|---|
| Proposed | FaceForensics++ –DF | | |
| | FaceForensics++ –F2F | | |
| | Celeb-DF v2 | | |
| | DFDC | | |
| Original [ | FaceForensics++ –DF | 75.27% | 63.435% |
| | FaceForensics++ –F2F | 67.37% | |
| | Celeb-DF v2 | 50.0% | |
| | DFDC | 61.1% | |
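The overall accuracy reported for the original model matches the unweighted mean of its four per-dataset accuracies. A short check of that arithmetic follows; that the authors computed the figure this way is an assumption, and the variable names are illustrative.

```python
# Per-dataset accuracies (%) of the original model, copied from the table above.
original = {
    "FaceForensics++ -DF": 75.27,
    "FaceForensics++ -F2F": 67.37,
    "Celeb-DF v2": 50.0,
    "DFDC": 61.1,
}

# Unweighted mean: every dataset contributes equally because each one
# supplies the same number of test frames in the experiments.
overall = sum(original.values()) / len(original)
print(round(overall, 3))
```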
Accuracy comparison of the VGG-16 model trained and tested on different datasets. The values in bold are the best performing for each dataset.
| Trained on (rows) / Validated on (columns) | FF++ –Deepfake | FF++ –Face2Face | DFDC | Celeb-DF | Overall |
|---|---|---|---|---|---|
| FF++ –Deepfake | | AUROC: 0.710618, Acc: 0.6478 | AUROC: 0.521114, Acc: 0.5184 | AUROC: 0.528509, Acc: 0.5241 | |
| FF++ –Face2Face | AUROC: 0.766970, Acc: 0.6913 | | AUROC: 0.480113, Acc: 0.4859 | AUROC: 0.531422, Acc: 0.52645 | 0.6001 |
| DFDC | AUROC: 0.519190, Acc: 0.5142 | AUROC: 0.476737, Acc: 0.48485 | | AUROC: 0.476790, Acc: 0.4792 | 0.5226 |
| Celeb-DF | AUROC: 0.529061, Acc: 0.525 | AUROC: 0.529152, Acc: 0.5185 | AUROC: 0.464086, Acc: 0.4742 | | 0.5650 |
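The table reports two metrics per cell. Accuracy depends on a decision threshold, while AUROC measures how well the scores rank fake samples above real ones regardless of threshold. The sketch below computes both from scratch. The 0.5 threshold, the label convention (1 = fake), and the toy scores are assumptions for illustration and are not values from the paper.

```python
def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


labels = [0, 0, 1, 1]            # toy ground truth: 0 = real, 1 = fake
scores = [0.1, 0.4, 0.35, 0.8]   # toy predicted probabilities of "fake"

print(auroc(labels, scores), accuracy(labels, scores))
```

An AUROC near 0.5, as in most off-diagonal cells, means the scores rank fake and real samples no better than chance on the unseen dataset.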
Figure 9. Comparison between the best performing GPU and other models trained on the TPU: (a) linear scale; (b) logarithmic scale.
Different CNN models trained on TPU compared with the best-performing GPU model.
| Model | Time Per Epoch | Total Time | Accuracy |
|---|---|---|---|
| VGG-16-GPU | 440 s | 183 min | 82% |
| VGG-16 | 52 s | 22 min | 71.34% |
| VGG-19 | 57 s | 24.5 min | 63.56% |
| InceptionV3 | 72 s | 30.2 min | 58.72% |
| Xception | 70 s | 30 min | 52.10% |
| ResNet50V2 | 55 s | 23.1 min | 68.37% |
| ResNet101V2 | 85 s | 35.7 min | 69.27% |
| ResNet152V2 | 110 s | 46.3 min | 70.50% |
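The trade-off between the two VGG-16 rows can be made explicit with a little arithmetic: the per-epoch speed-up on the TPU, the accuracy given up, and a check that total time is roughly time per epoch multiplied by the 25 training epochs. This is a sketch using only the numbers in the table, and the variable names are illustrative.

```python
EPOCHS = 25  # number of training epochs used in the experiments

gpu = {"epoch_s": 440, "accuracy": 82.0}   # VGG-16 trained on GPU
tpu = {"epoch_s": 52, "accuracy": 71.34}   # VGG-16 trained on TPU

speedup = gpu["epoch_s"] / tpu["epoch_s"]        # how many times faster one TPU epoch is
accuracy_gap = gpu["accuracy"] - tpu["accuracy"]  # percentage points lost on the TPU

# Total training time in minutes implied by time per epoch.
gpu_total_min = gpu["epoch_s"] * EPOCHS / 60
tpu_total_min = tpu["epoch_s"] * EPOCHS / 60

print(round(speedup, 2), round(accuracy_gap, 2))
print(round(gpu_total_min, 1), round(tpu_total_min, 1))
```

The implied totals (about 183 min and 22 min) agree with the Total Time column for both VGG-16 rows.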
Figure 10. The effect of each augmentation on validation accuracy over 25 epochs: (a) linear scale; (b) logarithmic scale.
Test results for all augmentations (1–4).
| Augmentation | Training Time | Accuracy |
|---|---|---|
| No Augmentations | | |
| Augmentation 1 | 672 min | 77.5% |
| Augmentation 2 | 612 min | 61.5% |
| Augmentation 3 | 212 min | 76.0% |
| Augmentation 4 | 204 min | 75.45% |