Bhuvanesh Singh, Dilip Kumar Sharma.
Abstract
Social media are the main contributors to the spread of fake images. Fake images are manipulated images, altered through software or by other means, that change the information they convey. Fake images propagated over microblogging platforms generate misrepresentation and stimulate polarization among people. Detecting fake images shared over social platforms is therefore critical to mitigating their spread. Fake images are often accompanied by textual data, so multi-modal frameworks that learn both visual and textual features are employed. However, the few multi-modal frameworks proposed so far depend on additional subtasks to learn the correlation between modalities. In this paper, an efficient multi-modal approach is proposed that detects fake images on microblogging platforms without requiring any additional subcomponents. The proposed framework uses the convolutional neural network EfficientNet-B0 for images and a sentence transformer for text analysis. The visual and textual feature embeddings are passed through dense layers and then fused to predict fake images. To validate its effectiveness, the proposed model is tested on two publicly available microblogging datasets, MediaEval (Twitter) and Weibo, where prediction accuracies of 85.3% and 81.2%, respectively, are observed. The model is also verified against a newly created Twitter dataset containing images based on India's significant events in 2020. The experimental results show that the proposed model performs better than other state-of-the-art multi-modal frameworks.
Keywords: EfficientNet; Fake images; Sentence transformer; Social media; Swish activation
Year: 2021 PMID: 34054227 PMCID: PMC8143443 DOI: 10.1007/s00521-021-06086-4
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.606
Fig. 1 Examples of fake images: a a child with three eyes (copy-move technique); b Zuckerberg with Modi (image splicing technique). (a and b are from boomlive)
Summary of prior studies on fake image detection
| Study | Methodology | Technique | Dataset | Performance |
|---|---|---|---|---|
| [ | Forensic (Copy&Move) | DWT + Surf | 50 Images from MICC-F2000 | Acc 95% |
| [ | Forensic (Copy&Move) MWLD | Multiscale Weber’s Law Descriptor | CASIA 1.0 and CASIA 2.0 | Acc 92.62 and 96.52% |
| [ | Forensic (Copy&Move) | FrQZMs | GRIP, FAU (Factor 1.2) | Pixel level F-measure of 0.8848 over GRIP and 0.9296 over FAU |
| [ | Forensic (image splicing) | Markov features in DCT domain | CASIA 1.0 and CASIA 2.0 | Acc 98.77 and 97.59%, respectively |
| [ | Forensic (image splicing) | SVD + DCT + PCA | Columbia DVMM | Acc 80.79 (No PCA) and 98.78% (PCA) |
| [ | Machine learning (image splicing) | Logistic Regression using DWT + HOG + LBP | CASIA 1.0, CASIA 2.0, Columbia | Accuracy of 98.3, 99.5 & 98.8%, respectively |
| [ | Graph convolutional networks | Spatial & Temporal Structures | MediaEval 15/16 | Acc 75.2 and 77.3% |
| [ | Web retrieval | Image reverse search | BuzzFeed, PolitiFact, and BuzzFeed election | Acc 85.3, 88.0 and 86% |
| [ | Deep neural networks—text-based | Text embedding with CNN | Kaggle dataset | Acc 98.36% |
| [ | Deep neural networks | Prediction error filters + CNN | Self-generated images from 12 digital cameras | Acc 98.40% (original images) |
| [ | Deep neural networks | CNN with random mini-batches | CASIA-FASD, replay-attack | HTER 4.59 and 5.74 (Intra database) |
| [ | Deep neural networks | Fusion network with binary classification and residual loss branch | Columbia | mAP 0.99 |
| [ | Multi-modal—EANN | VGG19 for Image, Text CNN for text | Twitter, Weibo | Acc 0.648 and 0.795 (EANN-) |
| [ | Multi-modal—MVAE | VGG19 for Image, RNN + LSTM for text | Twitter, Weibo | Acc 0.745 & 0.824 |
| [ | Multi-modal—SAME | Image, text, and user profile. Text and user profile analysis using adversarial loss and image through CNN | PolitiFact, GossipCop | Micro F1 Score of 76.31 and 81.58 |
| [ | Multi-modal—SAFE | Text CNN for text and Image2Sentence for images. Cross-modal similarity | PolitiFact, GossipCop | Acc 0.874 and 0.838 |
| [ | Multi-modal—SpotFake | VGG19 for Image, BERT for text | Twitter, Weibo | Acc 0.77 and 0.892 |
Acc, accuracy; AUROC, area under the ROC curve; F1 score, harmonic mean of precision and recall; HTER, half total error rate (average of FPR and FNR); mAP, mean average precision (the average of AP); BERT, bidirectional encoder representations from transformers; CNN, convolutional neural network; RNN, recurrent neural network; LSTM, long short-term memory
Fig. 2 Architecture of the proposed system
Fig. 3 Comparison of the graphs of the ReLU and Swish activation functions [45]
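The Swish activation highlighted in Fig. 3 is simply x · sigmoid(x). A minimal sketch contrasting it with ReLU on a few sample inputs (the sample points are illustrative, not from the paper):

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x); zero gradient for all negative inputs
    return np.maximum(0.0, x)

def swish(x):
    # Swish: x * sigmoid(x); smooth and non-monotonic, unlike ReLU
    return x / (1.0 + np.exp(-x))

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(relu(xs))
print(np.round(swish(xs), 4))
```

Unlike ReLU, Swish lets small negative activations pass through attenuated rather than zeroing them, which is the behavioral difference the figure illustrates.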
Fig. 4 Architecture of the baseline network EfficientNet-B0 as provided by the authors [11]
Fig. 5 Plot diagram of the proposed architecture with dimension information
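The two-branch design described in the abstract (EfficientNet-B0 features for the image, a sentence-transformer embedding for the text, each passed through dense layers and then fused) can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the encoder outputs are random stand-ins, the 1280-d/768-d dimensions and 128-unit branches are assumptions, and the weights are untrained placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):
    # Activation used throughout the dense layers of the proposed model
    return x / (1.0 + np.exp(-x))

def dense(x, out_dim):
    # Randomly initialised dense layer (weights are placeholders, not trained)
    w = rng.normal(scale=0.01, size=(x.shape[1], out_dim))
    return swish(x @ w)

# Stand-ins for the two encoders' outputs (dimensions are assumptions):
# EfficientNet-B0 pooled image features and a sentence-transformer
# embedding of the accompanying post text.
img_feat = rng.normal(size=(1, 1280))
txt_feat = rng.normal(size=(1, 768))

# Project each modality through its own dense branch, then fuse by
# concatenation and classify with a single sigmoid unit -> P(fake)
img_h = dense(img_feat, 128)
txt_h = dense(txt_feat, 128)
fused = np.concatenate([img_h, txt_h], axis=1)
logit = fused @ rng.normal(scale=0.01, size=(fused.shape[1], 1))
p_fake = float(1.0 / (1.0 + np.exp(-logit)))
print(fused.shape, p_fake)
```

Concatenation-then-classification is what lets this design avoid the extra correlation-learning subtasks the abstract mentions: the fusion happens inside a single jointly trained head.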
Experimental results for different variants of EfficientNet
| Dataset (accuracy) | EfficientNet-B0 | EfficientNet-B1 | EfficientNet-B2 | EfficientNet-B3 | EfficientNet-B4 | EfficientNet-B5 |
|---|---|---|---|---|---|---|
| MediaEval | 0.853 | 0.793 | 0.778 | 0.795 | 0.753 | 0.692 |
| Weibo | 0.812 | 0.809 | 0.809 | 0.801 | 0.798 | 0.793 |
Performance metrics of the proposed model
| Dataset | Accuracy | F1-Score | F2-Score |
|---|---|---|---|
| CASIA 2.0 | 87.13% | 0.87 | 0.877 |
| MediaEval | 85.3% | 0.85 | 0.843 |
| Weibo | 81.2% | 0.80 | 0.80 |
| Indian dataset | 58.3% | 0.54 | 0.565 |
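The table above reports both F1 and F2 scores; both are instances of the general F-beta score, where beta > 1 weights recall more heavily than precision. A minimal sketch, using the fake-class precision and recall the paper reports for MediaEval (0.821 and 0.943) purely as illustrative inputs:

```python
def f_beta(precision, recall, beta):
    # Weighted harmonic mean of precision and recall; beta = 1 gives the
    # standard F1 score, beta = 2 (F2) rewards catching more fakes even
    # at some cost in precision
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Fake-class precision/recall reported for MediaEval in the comparison table
p, r = 0.821, 0.943
f1 = f_beta(p, r, beta=1)
f2 = f_beta(p, r, beta=2)
print(round(f1, 3), round(f2, 3))
```

Reporting F2 alongside F1 makes sense for this task, since missing a fake image (a recall failure) is typically costlier than flagging a real one.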
Fig. 6 Accuracy comparison across modalities: a MediaEval; b Weibo
Performance comparison of the proposed model with other models
| Dataset | Model | Accuracy | Fake Precision | Fake Recall | Fake F1-Score | Real Precision | Real Recall | Real F1-Score |
|---|---|---|---|---|---|---|---|---|
| MediaEval (Twitter) | VQA [ | 0.631 | 0.765 | 0.509 | 0.611 | 0.55 | 0.794 | 0.65 |
| | att-RNN [ | 0.664 | 0.749 | 0.615 | 0.676 | 0.589 | 0.728 | 0.651 |
| | EANN- [ | 0.648 | 0.81 | 0.498 | 0.617 | 0.584 | 0.759 | 0.66 |
| | SpotFake [ | 0.7777 | 0.751 | 0.9 | 0.82 | 0.832 | 0.606 | 0.701 |
| | MVAE [ | 0.745 | 0.801 | 0.719 | 0.758 | 0.689 | 0.777 | 0.730 |
| | Proposed | 0.853 | 0.821 | 0.943 | 0.877 | 0.913 | 0.745 | 0.820 |
| Weibo (Chinese) | VQA [ | 0.736 | 0.797 | 0.634 | 0.706 | 0.695 | 0.838 | 0.76 |
| | att-RNN [ | 0.772 | 0.797 | 0.713 | 0.692 | 0.684 | 0.84 | 0.754 |
| | EANN- [ | 0.795 | 0.827 | 0.697 | 0.756 | 0.752 | 0.863 | 0.804 |
| | SpotFake [ | 0.8923 | 0.902 | 0.964 | 0.932 | 0.847 | 0.656 | 0.739 |
| | MVAE [ | 0.824 | 0.854 | 0.769 | 0.809 | 0.802 | 0.875 | 0.837 |
| | Proposed | 0.812 | 0.851 | 0.784 | 0.816 | 0.744 | 0.826 | 0.782 |
Fig. 7 MediaEval dataset: a confusion matrix; b ROC curve; c precision–recall graph
Fig. 8 Weibo dataset: a confusion matrix; b ROC curve; c precision–recall graph
Fig. 9 Examples of fake images from the Indian dataset: a Kamala Harris morphed photo; b an altered Kamal Nath poster; c a morphed placard held by CPM party members
Fig. 10 Indian dataset v.01: a accuracy comparison across modalities; b ROC curve; c precision–recall graph