Md Abul Bashar, Richi Nayak, Khanh Luong, Thirunavukarasu Balasubramaniam.
Abstract
In this era of information and experience sharing, microblogging sites are commonly used to express people's feelings, including fear, panic, hate and abuse. Monitoring and controlling abuse on social media, especially during pandemics such as COVID-19, can help keep public sentiment and morale positive. Developing machine-learning-based fear and hate detection methods requires labelled data. However, obtaining labelled data in suddenly changed circumstances such as a pandemic is expensive, and acquiring it in a short time is impractical. Related labelled hate data from other domains or previous incidents may be available. However, the predictive accuracy of these hate detection models decreases significantly if the data distribution of the target domain, where the prediction will be applied, is different. To address this problem, we propose a novel concept of unsupervised progressive domain adaptation based on a deep-learning language model generated through multiple text datasets. We showcase the efficacy of the proposed method in hate speech and fear detection on a collection of tweets posted during COVID-19, where labelled information is unavailable.
Keywords: Domain adaptation; Fear prediction; Hate speech; Small dataset; Text mining
Year: 2021 PMID: 34341673 PMCID: PMC8319196 DOI: 10.1007/s13278-021-00780-w
Source DB: PubMed Journal: Soc Netw Anal Min
Fig. 1 Word cloud obtained from the random hate data (i.e. source data)
Fig. 2 Word cloud obtained from the East Asian hate data (i.e. target data)
Fig. 3 Progressive transfer learning for domain adaptation
Fig. 4 A single-layered RNN model
Fig. 5 Progressive domain adaptation of the language model
Fig. 6 Model architecture for progressive domain adaptation
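The progressive adaptation idea illustrated in Figs. 3 and 5 — continue training one language model on a sequence of corpora, from general text to target-domain text — can be sketched with a toy example. The `UnigramLM` class and the three tiny corpora below are illustrative stand-ins of my own; the paper's actual model is a neural (LSTM-based) language model trained on Wiki103, RGT and CAST.

```python
from collections import Counter

class UnigramLM:
    """Toy unigram language model used only to illustrate progressive
    domain adaptation; the paper uses a neural language model."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def fine_tune(self, corpus):
        """Continue training on a new corpus without resetting
        previously learned counts (the 'progressive' step)."""
        for doc in corpus:
            for tok in doc.lower().split():
                self.counts[tok] += 1
                self.total += 1

    def prob(self, tok, vocab_size=10_000):
        # Add-one smoothing so unseen tokens get non-zero probability.
        return (self.counts[tok] + 1) / (self.total + vocab_size)

# Hypothetical stand-ins for the three corpora
# (general text -> random hate tweets -> COVID-19 target tweets).
wiki = ["the history of science", "the theory of relativity"]
rgt = ["hate speech is toxic", "abusive tweets spread hate"]
cast = ["wuhanvirus panic spreads fear", "fear and hate in covid tweets"]

lm = UnigramLM()
for corpus in (wiki, rgt, cast):   # order matters: W -> R -> C
    lm.fine_tune(corpus)

# After progressive adaptation the model assigns higher probability to
# target-domain vocabulary than a model trained on general text alone.
base = UnigramLM()
base.fine_tune(wiki)
print(lm.prob("fear") > base.prob("fear"))   # True
```

The point of the sketch is the training loop: each corpus refines the same model rather than training a fresh one, which is what lets knowledge transfer from the source domains to the target domain.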
Comparing models for domain adaptation (Ac: accuracy, Pr: precision, Re: recall, F: F1 score, CK: Cohen's kappa, AUC: area under the ROC curve)
| Model | TP | TN | FP | FN | Ac | Pr | Re | F | CK | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| LSTM-DA | 3215 | 2152 | 1746 | 683 | | | 0.825 | | | |
| LSTM-W | 2671 | 1968 | 1930 | 1227 | 0.595 | 0.581 | 0.685 | 0.629 | 0.190 | 0.595 |
| LSTM | 1793 | 2520 | 1378 | 2105 | 0.553 | 0.565 | 0.460 | 0.507 | 0.106 | 0.553 |
| ADDA-LSTM | 1021 | 2879 | 1019 | 2877 | 0.500 | 0.500 | 0.262 | 0.344 | 0.001 | 0.500 |
| ADDA-CNN | 229 | 3669 | 0.518 | 0.509 | 0.670 | 0.036 | 0.518 | |||
| PFAN-LSTM | 3438 | 790 | 3108 | 460 | 0.542 | 0.525 | 0.882 | 0.658 | 0.085 | 0.542 |
| PFAN-CNN | 1143 | 2755 | 0.521 | 0.538 | 0.293 | 0.380 | 0.041 | 0.521 | ||
| CNN | 2612 | 1994 | 1904 | 1286 | 0.591 | 0.578 | 0.670 | 0.621 | 0.182 | 0.591 |
| CNN-W | 2783 | 1890 | 2008 | 1115 | 0.599 | 0.581 | 0.714 | 0.641 | 0.199 | 0.599 |
| DNN | 2667 | 1771 | 2127 | 1231 | 0.569 | 0.556 | 0.684 | 0.614 | 0.139 | 0.569 |
| XGB | 2366 | 1873 | 2025 | 1532 | 0.544 | 0.539 | 0.607 | 0.571 | 0.087 | 0.544 |
| RF | 2329 | 2123 | 1775 | 1569 | 0.571 | 0.567 | 0.597 | 0.582 | 0.142 | 0.571 |
| SVM-L | 2291 | 2064 | 1834 | 1607 | 0.559 | 0.555 | 0.588 | 0.571 | 0.117 | 0.559 |
| SVM-N | 2539 | 1964 | 1934 | 1359 | 0.578 | 0.568 | 0.651 | 0.607 | 0.155 | 0.578 |
| kNN | 1596 | 2525 | 1373 | 2302 | 0.529 | 0.538 | 0.409 | 0.465 | 0.057 | 0.529 |
| MNB | 2851 | 1820 | 2078 | 1047 | 0.599 | 0.578 | 0.731 | 0.646 | 0.198 | 0.599 |
| RC | 1725 | 2432 | 1466 | 2173 | 0.533 | 0.541 | 0.443 | 0.487 | 0.066 | 0.533 |
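The metric columns in the tables above follow deterministically from the confusion counts. As a consistency check, the helper below (the function name is mine, not the paper's) recomputes Ac, Pr, Re, F, CK and AUC from the LSTM-W row; note that the reported AUC coincides with balanced accuracy, as expected for a single-threshold classifier.

```python
def metrics(tp, tn, fp, fn):
    """Recompute the table's metric columns from confusion counts."""
    n = tp + tn + fp + fn
    ac = (tp + tn) / n                 # accuracy
    pr = tp / (tp + fp)                # precision
    re = tp / (tp + fn)                # recall
    f1 = 2 * pr * re / (pr + re)       # F1 score
    spec = tn / (tn + fp)              # specificity
    auc = (re + spec) / 2              # balanced accuracy
    # Cohen's kappa: agreement beyond chance.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    ck = (ac - pe) / (1 - pe)
    return {k: round(v, 3) for k, v in
            dict(Ac=ac, Pr=pr, Re=re, F=f1, CK=ck, AUC=auc).items()}

# LSTM-W row from the table above.
print(metrics(2671, 1968, 1930, 1227))
# -> {'Ac': 0.595, 'Pr': 0.581, 'Re': 0.685, 'F': 0.629, 'CK': 0.19, 'AUC': 0.595}
```

The recomputed values match the LSTM-W row exactly, which also confirms that CK here is Cohen's kappa and that Ac equals AUC because the class split makes chance agreement exactly 0.5.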
Effect of dataset order in progressive domain adaptation (W: Wiki103, R: RGT, C: CAST; WRC means the model is first pre-trained with W, then R, then C)
| Order | TP | TN | FP | FN | Ac | Pr | Re | F | CK | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| W | 3172 | 1911 | 1987 | 726 | 0.652 | 0.615 | 0.814 | 0.700 | 0.304 | 0.652 |
| W | 1714 | 2184 | 0.657 | 0.610 | 0.718 | 0.314 | 0.657 | |||
| W | 3215 | | | 683 | | | 0.825 | | | |
| W | 3339 | 1867 | 2031 | 559 | 0.668 | 0.622 | 0.857 | 0.721 | 0.336 | 0.668 |
| W | 3055 | 1967 | 1931 | 843 | 0.644 | 0.613 | 0.784 | 0.688 | 0.288 | 0.644 |
Comparing models for in-domain performance
| Model | TP | TN | FP | FN | Ac | Pr | Re | F | CK | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| LSTM-DA | 351 | 318 | 51 | 0.873 | ||||||
| LSTM-W | 346 | 297 | 103 | 34 | 0.824 | 0.771 | 0.911 | 0.835 | 0.650 | 0.827 |
| LSTM | 309 | 302 | 98 | 71 | 0.783 | 0.759 | 0.813 | 0.785 | 0.567 | 0.784 |
| CNN-W | 274 | 126 | 0.808 | 0.739 | 0.826 | 0.618 | 0.811 | |||
| CNN | 338 | 277 | 123 | 42 | 0.788 | 0.733 | 0.889 | 0.804 | 0.579 | 0.791 |
| DNN | 316 | 298 | 102 | 64 | 0.787 | 0.756 | 0.832 | 0.792 | 0.575 | 0.788 |
| XGB | 310 | 310 | 90 | 70 | 0.795 | 0.775 | 0.816 | 0.795 | 0.590 | 0.795 |
| RF | 309 | 316 | 84 | 71 | 0.801 | 0.786 | 0.813 | 0.799 | 0.603 | 0.802 |
| SVM-L | 281 | 307 | 93 | 99 | 0.754 | 0.751 | 0.739 | 0.745 | 0.507 | 0.753 |
| SVM-N | 298 | 324 | 76 | 82 | 0.797 | 0.797 | 0.784 | 0.790 | 0.594 | 0.797 |
| kNN | 204 | 321 | 79 | 176 | 0.673 | 0.721 | 0.537 | 0.615 | 0.342 | 0.670 |
| MNB | 324 | 271 | 129 | 56 | 0.763 | 0.715 | 0.853 | 0.778 | 0.528 | 0.765 |
| RC | 273 | 322 | 78 | 107 | 0.763 | 0.778 | 0.718 | 0.747 | 0.524 | 0.762 |
Filtering keywords
| Keyword list | Keywords |
|---|---|
| Fear-related (Lambert et al.) | afraid, fear, feared, fearful, fearing, fears, frantic, fright, horr, panic, scare, scaring, scary, terrified, terrifies, terrify, terror |
| East Asia-related (Vidgen et al.) | chinavirus, wuhan, wuhanvirus, chinavirusoutbreak, wuhancoronavirus, wuhaninfluenza, wuhansars, chinacoronavirus, wuhan2020, chinaflu, wuhanquarantine, chinesepneumonia, coronachina, wohan |
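The keyword lists above are used to select candidate tweets. A minimal filter along these lines (the function and sample tweets are my illustration, not the paper's pipeline; substring matching is an assumption suggested by stems such as "horr" in the list) might look like:

```python
# Fear-related keywords from the table above (Lambert et al.).
FEAR_KEYWORDS = {"afraid", "fear", "feared", "fearful", "fearing", "fears",
                 "frantic", "fright", "horr", "panic", "scare", "scaring",
                 "scary", "terrified", "terrifies", "terrify", "terror"}

def matches_keywords(tweet, keywords):
    """Return True if any keyword occurs as a substring of the
    lower-cased tweet (substring matching lets stems like 'horr'
    catch 'horror', 'horrible', etc.)."""
    text = tweet.lower()
    return any(kw in text for kw in keywords)

# Hypothetical example tweets.
tweets = [
    "This outbreak is terrifying",       # matched via the stem 'terrify'
    "Stay calm and wash your hands",     # no fear keyword
]
print([matches_keywords(t, FEAR_KEYWORDS) for t in tweets])
# -> [True, False]
```

The same filter applied with the East Asia-related list would select the target-domain (CAST) tweets on which the fear and hate predictions are made.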
Fear prediction results
| TP | TN | FP | FN | Ac | Pr | Re | F1 | CK | AUC |
|---|---|---|---|---|---|---|---|---|---|
| 67 | 227 | 24 | 71 | 0.756 | 0.736 | 0.486 | 0.585 | 0.422 | 0.695 |
Fig. 7 Word cloud obtained from fear tweets in the CAST dataset
Fig. 8 Word cloud obtained from East Asian hate tweets in CAST
Fig. 9 Fear and hate distribution over time
Fig. 10 Fear and hate distribution over time in log scale