Utkarsh Sharma, Prateek Pandey, Shishir Kumar.
Abstract
Online social media has become a major source of information for a large section of society. The volume of information flowing through online social media is enormous, whereas fact-checking sources are limited. This shortfall in fact-checking gives rise to misinformation and disinformation on online social media, which can have serious effects on the wellbeing of society. The problem spreads faster and becomes more critical during events such as the recent outbreak of Covid-19, when little or no information is available anywhere. In this scenario, there is an urgent need to identify the content available online, most of which is propagated from person to person rather than by any governing authority. To solve this problem, information available online should be verified properly before any individual accepts it. We propose a scheme to classify online social media posts (tweets) with a BERT (Bidirectional Encoder Representations from Transformers)-based model. We also compare the performance of the proposed approach with other machine learning techniques and other state-of-the-art techniques. The proposed model not only classifies tweets as relevant or irrelevant, but also creates a set of topics with which one can identify a text as relevant or irrelevant to one's needs simply by matching the keywords of the topic. To accomplish this, after classifying the tweets, we apply a topic modelling approach based on latent semantic analysis and latent Dirichlet allocation to identify which topics are most often propagated as false information. © Ohmsha, Ltd. and Springer Japan KK, part of Springer Nature 2022.
Keywords: BERT; Covid-19; Online social media; Text mining; Tweet classification
Year: 2022 PMID: 35035023 PMCID: PMC8743740 DOI: 10.1007/s00354-021-00151-1
Source DB: PubMed Journal: New Gener Comput ISSN: 0288-3635 Impact factor: 1.048
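The abstract's idea of judging a text by matching it against topic keywords can be illustrated with a minimal sketch. The keyword sets, the tokenizer, and the function names below are hypothetical and chosen only for illustration; the authors' topics come from LSA and LDA and are not reproduced here.

```python
import re

# Hypothetical topic keywords for illustration only; they are NOT the
# topics extracted in the paper.
TOPIC_KEYWORDS = {
    "relevant": {"lockdown", "tested", "negative", "positive", "vaccine"},
    "irrelevant": {"pint", "pub", "happiness", "beer"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_topics(text, topic_keywords=TOPIC_KEYWORDS):
    """Return, for each label, how many of its keywords occur in the text."""
    tokens = tokenize(text)
    return {label: len(tokens & words) for label, words in topic_keywords.items()}


def best_label(text, topic_keywords=TOPIC_KEYWORDS):
    """Pick the label with the largest keyword overlap; None if nothing matches."""
    scores = match_topics(text, topic_keywords)
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score > 0 else None
```

Plain set overlap is the simplest possible matching rule; a weighted score based on per-topic word probabilities would be a natural refinement, but that goes beyond what the abstract states.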
Sample of labelled tweets for model training
| Tweet | Label |
|---|---|
| @ramainoane @nthakoana happiness is in the air & corona non-existent | Irrelevant |
| Guy yesterday walks into pub in Hackney: pint of corona I will get the virus later | Irrelevant |
| Lockdown is work, we are supporting your decision, but stop train because corona viruses is raining no a few months… | Relevant |
| National Guard chief tested negative for corona today after an initial positive test result | Relevant |
Literature summary of previous work on the identification of relevant information on social media
| Author | Technique | Dataset | Key finding | Limitations |
|---|---|---|---|---|
| Kwon et al. [ | Tweets related to social distancing were partitioned into six facets depending on their applicability to society | Tweets related to the hashtag "coronavirus", collected for 3 months from January to March | Social distancing is described as six facets and their spatio-temporal analysis to the different states of the US is represented | Only a single keyword is used for crawling tweets (#coronavirus) |
| Singh et al. [ | A sentiment classification model is developed based on a BERT classifier with average likes over the period, average retweets over the period, intensity analysis, polarity and subjectivity, and word cloud as the five metrics for classification | Tweets scraped for 80 days from Jan 2020 to April 2020 with hashtags #COVID2019 OR #COVID19 OR coronavirus maintained as two separate datasets one for the entire world and second for India | The model achieved an accuracy of 94% with overall neutral sentiments worldwide and a few negative sentiments from the dataset of India | Classification based on only sentiment analysis. Tweets were collected in a very early period of the pandemic when the spread was not worldwide |
| Pinto et al. [ | A model has been developed to identify the relevance of the posts in social media based on the expert based initial tagging of the dataset into 7 different labels. Then text-based classification algorithms have been applied to carry on the prediction | A total of 941 documents comprising of social media posts from Twitter, Facebook posts, Facebook comments tagged as a relevant, interesting, controversial, meaningful, novel, reliable and wide scope | Classification is done on three different parameters (1) with linguistic features (65% accuracy with Naïve Bayes or SVM), (2) based on the initial prediction of six journalistic criteria (79% accuracy of Random forest) | Only traditional classification algorithms were used in the study for the classification |
| Rudra et al. [ | A summarization model is developed to categorize the tweets related to an epidemic like Ebola or MERS. Tweets will be categorized as disease-related-symptoms, prevention, disease transmission, treatment, death report and non-disease tweets | Collected 200,000 Tweets each related to Ebola and MERS using the AIDR Platform | The proposed model can achieve 80% accuracy for in-class classification and 75% accuracy for the cross-domain scenario | The size of the training dataset was only 2000 tweets |
| Madichetty [ | Classification of informative tweets during disasters has been done by using CNN based feature extraction and ANN-based classifier | Tweets about the natural disaster Hurricane Harvey are collected from August 26, 2017, to September 20, 2017, with an 80:20 ratio used for training and testing | The proposed method achieves an accuracy of 75.9 over the existing machine learning approaches that use unigram, bigram and trigram features | The dataset was collected for a very short period of time (less than a month) |
| Bhoi et al. [ | An LSTM + CNN based classification model is used to identify the relevant tweets in case of some disaster along with a ranking of tweets and also a mapping regarding some essential service requirements stated in the tweets | FIRE-2016 dataset of 49,913 tweets regarding Nepal earthquake 2015 is used and 43,816 non-disaster-Tweets are crawled from the Twitter free API | The proposed model attains an accuracy of 89.47 which as compared to other approaches found to be the highest | Training dataset classified based on the author’s perceived classes rather than using linguistic-based criteria |
Fig. 1 Character count comparison of labelled tweets
Fig. 2 Word count comparison of labelled tweets
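The character-count and word-count comparisons named in the two captions above can be sketched as follows. The four tweets are the samples from the labelled-tweet table in this record (with the garbled trailing emoji omitted); splitting on whitespace to count words is an assumption, since the record does not say how the authors tokenized.

```python
from statistics import mean

# Sample tweets copied from the labelled-tweet table in this record.
SAMPLES = [
    ("@ramainoane @nthakoana happiness is in the air & corona non-existent",
     "Irrelevant"),
    ("Guy yesterday walks into pub in Hackney: pint of corona I will get the virus later",
     "Irrelevant"),
    ("Lockdown is work, we are supporting your decision, but stop train because "
     "corona viruses is raining no a few months…", "Relevant"),
    ("National Guard chief tested negative for corona today after an initial "
     "positive test result", "Relevant"),
]


def length_stats(samples):
    """Return {label: (mean character count, mean word count)}.

    Words are counted by splitting on whitespace (an assumption).
    """
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    return {
        label: (mean(len(t) for t in texts), mean(len(t.split()) for t in texts))
        for label, texts in by_label.items()
    }


if __name__ == "__main__":
    for label, (chars, words) in length_stats(SAMPLES).items():
        print(f"{label}: {chars:.1f} characters, {words:.1f} words on average")
```

Four tweets are far too few to draw conclusions from; the sketch only shows the kind of per-label summary the figures compare.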
Parameters for BERT-base and BERT-large models
| Parameter | BERT-base | BERT-large |
|---|---|---|
| Transformer blocks | 12 | 24 |
| Hidden size | 768 | 1024 |
| Self-attention heads | 12 | 16 |
| Total parameters | 110 M | 340 M |
Fig. 3 BERT pre-training module for unlabelled sentence pairs
Fig. 4 BERT classification model
Fig. 5 Word cloud representing the frequent words in both relevant (left) and irrelevant (right) tweets
Classification result comparison of machine learning algorithms with different word embedding models
| Word-embedding | Algorithm | Accuracy | AUC | Time |
|---|---|---|---|---|
| Tf-idf1 | SVM | 0.78 | 0.59 | 5.74 s |
| Tf-idf1 | Random Forest | 0.77 | 0.55 | 34.2 s |
| Tf-idf1 | Decision Tree | 0.71 | 0.59 | 729 ms |
| Tf-idf1 | KNN | 0.75 | 0.59 | 496 ms |
| Tf-idf1 | Naïve Bayes | 0.61 | 0.57 | 1.29 s |
| Tf-idf1 | LDA | 0.58 | 0.57 | 35.2 s |
| Tf-idf1 | Logistic Regression | 0.76 | 0.55 | 547 ms |
| Tf-idf1 | ADA Boost | 0.75 | 0.52 | 2 min 29 s |
| Tf-idf1 | XG Boost | | 0.59 | 1.97 s |
| Tf-idf2 | SVM | 0.77 | 0.57 | 14.6 s |
| Tf-idf2 | Random Forest | 0.76 | 0.54 | 1 min 43 s |
| Tf-idf2 | Decision Tree | 0.72 | 0.55 | 1.53 s |
| Tf-idf2 | KNN | 0.31 | 0.53 | 438 ms |
| Tf-idf2 | Naïve Bayes | 0.66 | 0.61 | 2.22 s |
| Tf-idf2 | LDA | 0.76 | 0.54 | 2 min 41 s |
| Tf-idf2 | Logistic Regression | 0.76 | 0.57 | 571 ms |
| Tf-idf2 | ADA Boost | 0.75 | 0.55 | 7 min 29 s |
| Tf-idf2 | XG Boost | 0.74 | 0.56 | 3.9 s |
| Tf-idf3 | SVM | 0.77 | 0.58 | 25 s |
| Tf-idf3 | Random Forest | 0.77 | 0.56 | 2 min 14 s |
| Tf-idf3 | Decision Tree | 0.67 | 0.54 | 2.92 s |
| Tf-idf3 | KNN | 0.27 | 0.5 | 777 ms |
| Tf-idf3 | Naïve Bayes | 0.73 | 0.63 | 7.98 s |
| Tf-idf3 | LDA | 0.75 | 0.56 | 4 min 16 s |
| Tf-idf3 | Logistic Regression | 0.75 | 0.58 | 928 ms |
| Tf-idf3 | ADA Boost | 0.76 | 0.55 | 5 min 4 s |
| Tf-idf3 | XG Boost | 0.75 | 0.56 | 9.2 s |
| Word2Vec | SVM | 0.66 | 0.63 | 20.7 s |
| Word2Vec | Random Forest | 0.76 | 0.64 | 6.34 s |
| Word2Vec | Decision Tree | 0.65 | 0.6 | 2.05 s |
| Word2Vec | KNN | 0.43 | 0.6 | 2.76 s |
| Word2Vec | Naïve Bayes | 0.55 | 0.59 | 336 ms |
| Word2Vec | LDA | 0.67 | 0.64 | 718 ms |
| Word2Vec | Logistic Regression | 0.67 | 0.63 | 617 ms |
| Word2Vec | ADA Boost | 0.74 | 0.55 | 9 min 21 s |
| Word2Vec | XG Boost | 0.75 | 0.56 | 5.2 s |
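Many rows in this table pair an accuracy near 0.75 with an AUC near 0.55. One possible explanation, offered here as an assumption because the class balance is not given in this record, is label imbalance: a classifier that favours the majority class can score well on accuracy while ranking examples barely better than chance. The sketch below computes both metrics from scratch on a made-up example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (positive) and 0 (negative).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 8 positives, 2 negatives, and a model that gives every
# example the same score and always predicts the majority class.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10
scores = [0.9] * 10

if __name__ == "__main__":
    print("accuracy:", accuracy(y_true, y_pred))
    print("AUC:", roc_auc(y_true, scores))
```

The pairwise definition used here is quadratic in the number of examples, which is fine for a demonstration; library implementations use a sort-based equivalent.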
Tuning parameters for CNN classification model
| Layer (type) | Output shape | Parameters |
|---|---|---|
| embedding (Embedding) | (None, 35, 100) | 904,800 |
| conv1d (Conv1D) | (None, 34, 32) | 6432 |
| max_pooling1d (MaxPooling1D) | (None, 17, 32) | 0 |
| dropout (Dropout) | (None, 17, 32) | 0 |
| dense (Dense) | (None, 17, 32) | 1056 |
| dropout_1 (Dropout) | (None, 17, 32) | 0 |
| dense_1 (Dense) | (None, 17, 16) | 528 |
| global_max_pooling1d (GlobalMaxPooling1D) | (None, 16) | 0 |
| dense_2 (Dense) | (None, 1) | 17 |
| Total params: 912,833 | Trainable params: 912,833 | Non-trainable params: 0 |
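The reported total of 912,833 can be checked against the output shapes in the summary. The sketch below infers two values that the table does not state directly: a vocabulary of 9,048 tokens (from 904,800 embedding parameters at dimension 100) and a convolution kernel of size 2 (from the sequence length dropping from 35 to 34, assuming stride 1 and no padding). Pooling and dropout layers have no trainable parameters.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weight matrix plus one bias per output unit."""
    return in_units * out_units + out_units


# vocab_size=9048 and kernel_size=2 are inferred from the table, not stated in it.
layer_params = {
    "embedding": embedding_params(9048, 100),
    "conv1d": conv1d_params(2, 100, 32),
    "dense": dense_params(32, 32),
    "dense_1": dense_params(32, 16),
    "dense_2": dense_params(16, 1),
}
total = sum(layer_params.values())

if __name__ == "__main__":
    for name, count in layer_params.items():
        print(f"{name}: {count:,}")
    print(f"total: {total:,}")
```

Under these inferences the convolution layer accounts for 6,432 parameters, and the five trainable layers sum exactly to the reported total.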
Fig. 6 Learning rate calculation of the BERT model
Fig. 7 Classification accuracy (left) and cross-entropy loss (right) using BERT
Comparison with the SOTA techniques
| Model | Accuracy (%) | Precision | Recall |
|---|---|---|---|
| XGBoost | 79 | 79.4 | 82.7 |
| CNN [ | 83.1 | 84.2 | 78.3 |
| Bonet-Jover et al. [ | 75 | 80.2 | 76.8 |
| Pinto et al. [ | 79 | 80 | 84 |
| Chakraborty et al. [ | 79 | 82.13 | 84.56 |
| BERT (proposed) | 92.8 | 93.6 | 91.2 |
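The table reports precision and recall but no combined score. As a rough aid for comparing rows, the sketch below derives the F1 score (the harmonic mean of precision and recall) from the reported values. It assumes that both columns are percentages; the resulting F1 values are derived here and are not reported by the authors.

```python
# (precision, recall) pairs copied from the comparison table.
REPORTED = {
    "XGBoost": (79.4, 82.7),
    "CNN": (84.2, 78.3),
    "Bonet-Jover et al.": (80.2, 76.8),
    "Pinto et al.": (80, 84),
    "Chakraborty et al.": (82.13, 84.56),
    "BERT (proposed)": (93.6, 91.2),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, in the same unit as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Derived values, rounded to one decimal place.
f1_by_model = {
    model: round(f1_score(p, r), 1) for model, (p, r) in REPORTED.items()
}

if __name__ == "__main__":
    for model, f1 in sorted(f1_by_model.items(), key=lambda item: -item[1]):
        print(f"{model}: F1 = {f1}")
```

Because the harmonic mean penalises a gap between precision and recall, it gives a single number that is harder to inflate with one strong metric than a simple average would be.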
Fig. 8 Most common words in topics for relevant tweets using LSA
Fig. 9 Most common words in topics for relevant tweets using LDA
Fig. 10 Most common words in topics for irrelevant tweets using LSA
Fig. 11 Most common words in topics for irrelevant tweets using LDA
Fig. 12 t-SNE clustering of 8 topics using LSA approach
Fig. 13 t-SNE clustering of 8 topics using LDA approach