Yuyang Chen, Feng Pan.
Abstract
Detrimental to individuals and society, online hateful messages have recently become a major social issue. Among them, a new type of hateful message, the "hateful meme", has emerged and poses difficulties for traditional deep learning-based detection. Because hateful memes combine text captions and images to express users' intents, they cannot be accurately identified by analyzing the embedded text captions or the images alone. To detect hateful memes effectively, an algorithm must possess strong vision-and-language fusion capability. In this study, we move closer to this goal by stacking the visual features, object tags, and text features of memes, generated by the object detection model Visual features in Vision-Language (VinVL) and optical character recognition (OCR) technology, into a triplet and feeding it into a Transformer-based Vision-Language Pre-Training Model (VL-PTM), OSCAR+, to perform cross-modal learning on memes. After fine-tuning and connecting to a random forest (RF) classifier, our model (OSCAR+RF) achieved an average accuracy and AUROC of 0.684 and 0.768, respectively, on the hateful meme detection task on a public test set, higher than the other eleven (11) published baselines. In conclusion, this study demonstrates that VL-PTMs with the addition of anchor points can improve deep learning-based detection of hateful memes by enforcing a stronger alignment between the text caption and the visual information.
Year: 2022 PMID: 36095029 PMCID: PMC9467312 DOI: 10.1371/journal.pone.0274300
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
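The abstract describes a two-stage pipeline: cross-modal embeddings for each meme (VinVL visual features, object tags, and OCR text, fused by OSCAR+) are stacked and passed to a random forest classifier. The sketch below illustrates only the final fusion-and-classification stage, using random vectors as stand-ins for the learned embeddings; the dimensions, variable names, and RF settings are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the feature-fusion + RF stage described in the abstract.
# The three embedding matrices below are random placeholders; in the paper
# they would come from the fine-tuned OSCAR+ cross-modal representation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_memes = 200

visual_feats = rng.normal(size=(n_memes, 64))  # stand-in: VinVL region features
tag_feats = rng.normal(size=(n_memes, 16))     # stand-in: object-tag embeddings
text_feats = rng.normal(size=(n_memes, 32))    # stand-in: OCR caption embeddings
labels = rng.integers(0, 2, size=n_memes)      # 1 = hateful, 0 = benign

# Stack the triplet into one joint representation, then classify with RF.
X = np.concatenate([visual_feats, tag_feats, text_feats], axis=1)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

probs = clf.predict_proba(X)[:, 1]  # per-meme hateful probability
print(X.shape)  # (200, 112)
```

With real embeddings, `probs` would be thresholded (e.g. at 0.5) to produce the hateful/benign decisions scored in the table below.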
Comparisons of our models with other published baselines.
| Models | Batch size | Detecting modality | Loss | Optimizer | Learning rate | Acc. (dev set, n = 500) | AUROC (dev set, n = 500) | Acc. (test set, n = 1000) | AUROC (test set, n = 1000) |
|---|---|---|---|---|---|---|---|---|---|
| — | 32 | Image | Cross entropy | AdamW | 1.00E-05 | 0.500±0.045 | 0.516±0.027 | 0.511±0.023 | 0.514±0.018 |
| — | 32 | Image | Cross entropy | AdamW | 5.00E-05 | 0.513±0.032 | 0.549±0.030 | 0.531±0.023 | 0.561±0.039 |
| — | 64 | Text | Cross entropy | AdamW | 5.00E-05 | 0.569±0.020 | 0.625±0.047 | 0.586±0.024 | 0.639±0.006 |
| — | 32 | Image&Text | Cross entropy | AdamW | 5.00E-05 | 0.589±0.031 | 0.641±0.040 | 0.619±0.011 | 0.679±0.018 |
| — | 32 | Image&Text | Cross entropy | AdamW | 1.00E-05 | 0.576±0.038 | 0.645±0.012 | 0.622±0.023 | 0.682±0.017 |
| — | 32 | Image&Text | Cross entropy | AdamW | 1.00E-05 | 0.603±0.042 | 0.672±0.018 | 0.631±0.014 | 0.694±0.006 |
| — | 32 | Image&Text | Cross entropy | AdamW | 5.00E-05 | 0.605±0.059 | 0.649±0.067 | 0.642±0.032 | 0.690±0.046 |
| — | 32 | Image&Text | Cross entropy | AdamW | 1.00E-05 | 0.633±0.020 | 0.717±0.035 | 0.659±0.007 | 0.732±0.015 |
| — | 32 | Image&Text | Cross entropy | AdamW | 5.00E-05 | 0.638±0.023 | 0.722±0.010 | 0.664±0.013 | 0.748±0.011 |
| — | 32 | Image&Text | Cross entropy | AdamW | 1.00E-05 | 0.656±0.009 | 0.730±0.035 | 0.664±0.009 | 0.739±0.016 |
| — | 32 | Image&Text | Cross entropy | AdamW | 5.00E-05 | 0.648±0.032 | 0.732±0.017 | 0.664±0.020 | 0.737±0.025 |
| — | 50 | Image&Tag&Text | Cross entropy | AdamW | 5.00E-06 | 0.666±0.038 | 0.758±0.042 | 0.677±0.010 | 0.762±0.016 |
| OSCAR+RF (ours) | 50 | Image&Tag&Text | Cross entropy | AdamW | 5.00E-06 | 0.667±0.034 | 0.759±0.014 | 0.684±0.002 | 0.768±0.021 |

Footnotes: Acc., accuracy; AUROC, area under the receiver operating characteristic curve; —, model name missing from this record (the final row is identifiable as OSCAR+RF from the test-set results reported in the abstract).
*Values are mean±standard error, calculated from evaluations of four final models.
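The Acc. and AUROC columns above are standard binary-classification metrics; a minimal sketch of how they are computed, using scikit-learn on toy labels and scores (the numbers below are illustrative, not from the study):

```python
# Toy example of the two metrics reported in the table: accuracy on
# thresholded predictions, and AUROC on the raw scores.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # ground truth
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1])   # model scores

acc = accuracy_score(y_true, (y_score >= 0.5).astype(int))  # threshold at 0.5
auroc = roc_auc_score(y_true, y_score)                      # threshold-free
print(round(acc, 3), round(auroc, 4))  # 0.75 0.9375
```

AUROC is threshold-independent, which is why the paper reports it alongside accuracy: it reflects how well the model ranks hateful memes above benign ones across all possible decision thresholds.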