José Alberto Benítez-Andrades1, Álvaro González-Jiménez2, Álvaro López-Brea2, Jose Aveleira-Mata3, José-Manuel Alija-Pérez3, María Teresa García-Ordás3.
Abstract
With the growth that social networks have experienced in recent years, moderating content manually has become infeasible. Existing natural language processing techniques make it possible to build predictive models that automatically classify texts into categories. However, a weakness has been detected concerning the language used to train such models. This work aimed to develop a BERT-based predictive model capable of detecting racist and xenophobic messages in tweets written in Spanish, and to compare it with other deep learning models. Five predictive models were developed: two based on BERT and three using other deep learning techniques (CNN, LSTM, and a model combining CNN + LSTM). After exhaustive analysis of the results, the model with the best metrics was BETO, a BERT-based model trained only on texts written in Spanish. BETO achieved a precision of 85.22%, compared to 82.00% for mBERT; the remaining models obtained between 79.34% and 80.48% precision. These results support the vital importance of developing native transfer learning models for solving Natural Language Processing (NLP) problems in Spanish. Our main contribution is the achievement of promising results in detecting racism and hate speech in Spanish by applying different deep learning techniques.
Keywords: BERT; BETO; Hate speech; Natural language processing; Text classification; Transfer learning
Year: 2022 PMID: 35494847 PMCID: PMC9044360 DOI: 10.7717/peerj-cs.906
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Summary of the objectives and weaknesses of the literature review conducted.
| Reference | Findings | Weaknesses |
|---|---|---|
| ( | Studies that corroborate good results in the task of text classification using conventional machine learning techniques (SVMs, Logistic regression). | These articles are not focused on obtaining predictive models for classifying categories of racism and xenophobia in Spanish texts. |
| ( | HaterNet: dataset of 6,000 tweets labelled as hate speech or not. LDA, QDA, Random Forest, Ridge Logistic Regression, SVM and an LSTM combined with an MLP applied to HaterNet. | Tweets labelled as racist, but not racist. Only a very small percentage of the tweets relate to racism, which is one of the topics of interest for this work. |
| ( | HatEval: two datasets, one in English and one in Spanish. The Spanish dataset has 5,000 labelled tweets in the training set. | Only 1,971 tweets are related to racism, and 987 of this subset were mislabelled. Tweets from some people denouncing racism are labelled as racist. |
| ( | Convolutional neural networks applied to different problems. | CNN applied to different problems, not only for text classification and not focused on racism and xenophobia. |
| ( | Theory about recurrent neural networks (RNN). | Usually, training the BERT model from scratch on a similar dataset could produce much better results ( |
| ( | Long Short-Term Memory Network (LSTM) applied to different problems. | Not all of them are focused on text classification. Those that are, do not make comparisons with BERT. |
| ( | SVM, linear regression (LR), Naive Bayes (NB), decision trees (DT), LSTM and a lexicon-based classifier applied to a dataset composed of tweets related to xenophobia and misogyny. | Tweets are labelled with value 1 if they concern xenophobia or misogyny, which may bias the results of the model. The best model reached an F1-score of 74.2%. |
| ( | LSTM, Bidirectional Long Short-Term Memory Networks (Bi-LSTM), CNN, mBERT, XLM and BETO applied to HatEval and HaterNet. | The datasets are biased because the subsets were mislabelled. |
Figure 1: Workflow of the research conducted.
Spanish keywords used to collect the dataset and their English translations.
| Spanish keyword | Translation to English |
|---|---|
| moro, mora | moor |
| gitano, gitana, gypsy | gypsy |
| puto simio | fucking ape |
| negro de mierda, negra de mierda, puto negro, puta negra | fucking nigga |
| inmigracion, inmigrante | immigration, immigrant |
| patera | small boat |
| mena (menores extranjeros no acompanados) | unaccompanied foreign minors |
| panchito, panchita, sudaca | spic, greaser |
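
The keyword-based collection step can be illustrated with a small filter. This is a minimal sketch, not the authors' collection code: it assumes exact whole-word matching after lowercasing and accent stripping, and it uses only a subset of the keywords from the table. The names `KEYWORDS`, `strip_accents` and `matched_keywords` are hypothetical.

```python
import re
import unicodedata
from typing import List

# Illustrative subset of the collection keywords listed in the table above.
KEYWORDS = ["inmigracion", "inmigrante", "patera", "mena"]


def strip_accents(text: str) -> str:
    """Remove combining marks so 'inmigración' compares equal to 'inmigracion'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# One alternation with word boundaries, so 'mena' does not match inside 'amena'.
PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b")


def matched_keywords(tweet: str) -> List[str]:
    """Return the sorted, de-duplicated keywords found in a tweet."""
    normalized = strip_accents(tweet.lower())
    return sorted(set(PATTERN.findall(normalized)))
```

Under these assumptions, plural or inflected forms (for example "inmigrantes") are not matched; a real collection pipeline would need to decide whether to include them.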
Figure 2: High-level representation of Transformers.
Figure 3: CNN architecture scheme.
Figure 4: LSTM architecture scheme.
Figure 5: CNN + LSTM architecture scheme.
The best transfer learning hyperparameters.
| Hyperparameter | Options | BETO | mBERT |
|---|---|---|---|
| Model type | (cased, uncased) | uncased | cased |
| Epochs | (2, 4, 8) | 8 | 4 |
| Batch size | (8, 16, 32, 64) | 8 | 8 |
| Optimizer | (Adam, Adafactor) | Adam | Adam |
| Learning rate | (2e−5, 3e−5, 4e−5) | 4e−5 | 4e−5 |
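
To make the size of this search space concrete, the options column can be enumerated as a grid. The sketch below assumes an exhaustive grid over all listed options, which the table does not confirm; `SEARCH_SPACE`, `iter_configs`, `BEST_BETO` and `BEST_MBERT` are hypothetical names, and no training is performed.

```python
from itertools import product

# Options column of the transfer learning table above.
SEARCH_SPACE = {
    "model_type": ["cased", "uncased"],
    "epochs": [2, 4, 8],
    "batch_size": [8, 16, 32, 64],
    "optimizer": ["Adam", "Adafactor"],
    "learning_rate": [2e-5, 3e-5, 4e-5],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(SEARCH_SPACE))

# Best settings reported in the table for each model.
BEST_BETO = {"model_type": "uncased", "epochs": 8, "batch_size": 8,
             "optimizer": "Adam", "learning_rate": 4e-5}
BEST_MBERT = {"model_type": "cased", "epochs": 4, "batch_size": 8,
              "optimizer": "Adam", "learning_rate": 4e-5}

print(len(configs))  # 2 * 3 * 4 * 2 * 3 combinations
```

If the full grid were searched, each model would require one fine-tuning run per combination, which is why the per-model runtime matters when choosing the option lists.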
The best deep learning hyperparameters.
| Hyperparameter | Options | CNN | LSTM | CNN + LSTM |
|---|---|---|---|---|
| Batch size | (16, 32, 64, 128) | 32 | 64 | 64 |
| Dropout | (0.25, 0.5) | N/A | 0.5 | 0.5 |
| Optimizer | (Adam, SGD) | Adam | Adam | Adam |
| Activation function | (ReLU, Tanh) | ReLU | ReLU | ReLU |
| Learning rate | (1e−2, 2e−2, 1e−3, 2e−3) | 2e−3 | 2e−3 | 2e−3 |
Results obtained for all models.
| Model | Non-racist P (%) | Non-racist R (%) | Non-racist F1 (%) | Racist P (%) | Racist R (%) | Racist F1 (%) | Macro-averaged P (%) | Macro-averaged R (%) | Macro-averaged F1 (%) | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|
| BETO | | | | | | | **85.22** | | | 1,230 s |
| mBERT | 83.28 | 81.11 | 82.18 | 80.73 | | 81.82 | 82.00 | 82.02 | 82.00 | 1,129 s |
| CNN | 80.13 | 81.43 | 80.78 | 80.21 | 78.84 | 79.52 | 80.17 | 80.14 | 80.15 | 840 s |
| LSTM | 78.90 | 84.04 | 81.39 | 82.05 | 76.45 | 79.15 | 80.48 | 80.24 | 80.27 | 844 s |
| CNN + LSTM | 77.58 | 83.39 | 80.38 | 81.11 | 74.74 | 77.80 | 79.34 | 79.07 | 79.09 | 938 s |
Note: The best results are highlighted in bold.
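
The relationship between the per-class and macro-averaged columns can be checked arithmetically. The sketch below is an illustration, not the authors' evaluation code: it assumes F1 is the harmonic mean of precision and recall and that each macro-averaged value is the unweighted mean of the two class values. It uses only the rows whose values are all present in the table (CNN, LSTM, CNN + LSTM); `ROWS` and `max_deviation` are hypothetical names.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


def mean(a: float, b: float) -> float:
    return (a + b) / 2


# (P, R, F1) per class and macro-averaged, copied from the results table.
ROWS = {
    "CNN": {"non_racist": (80.13, 81.43, 80.78),
            "racist": (80.21, 78.84, 79.52),
            "macro": (80.17, 80.14, 80.15)},
    "LSTM": {"non_racist": (78.90, 84.04, 81.39),
             "racist": (82.05, 76.45, 79.15),
             "macro": (80.48, 80.24, 80.27)},
    "CNN + LSTM": {"non_racist": (77.58, 83.39, 80.38),
                   "racist": (81.11, 74.74, 77.80),
                   "macro": (79.34, 79.07, 79.09)},
}


def max_deviation(row: dict) -> float:
    """Largest gap between a reported value and its recomputed counterpart."""
    nr, rc, macro = row["non_racist"], row["racist"], row["macro"]
    gaps = [
        abs(f1_score(nr[0], nr[1]) - nr[2]),  # non-racist F1
        abs(f1_score(rc[0], rc[1]) - rc[2]),  # racist F1
        abs(mean(nr[0], rc[0]) - macro[0]),   # macro precision
        abs(mean(nr[1], rc[1]) - macro[1]),   # macro recall
        abs(mean(nr[2], rc[2]) - macro[2]),   # macro F1 as mean of class F1
    ]
    return max(gaps)


for name, row in ROWS.items():
    print(name, round(max_deviation(row), 4))
```

Small deviations are expected because the table values are already rounded to two decimals before being recombined.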
Figure 6: Confusion matrices of the best performing models (BETO and mBERT).