TClustVID: A novel machine learning classification model to investigate topics and sentiment in COVID-19 tweets.

Md Shahriare Satu¹, Md Imran Khan², Mufti Mahmud³, Shahadat Uddin⁴, Matthew A Summers^5,6, Julian M W Quinn^5,7, Mohammad Ali Moni^5,8.

Abstract

COVID-19, caused by SARS-CoV2 infection, varies greatly in its severity but presents with serious respiratory symptoms with vascular and other complications, particularly in older adults. The disease can be spread by both symptomatic and asymptomatic infected individuals. Uncertainty remains over key aspects of the virus infectiousness (particularly the newly emerging variants) and the disease has had severe economic impacts globally. For these reasons, COVID-19 is the subject of intense and widespread discussion on social media platforms including Facebook and Twitter. These public forums substantially influence public opinions and in some cases can exacerbate the widespread panic and misinformation spread during the crisis. Thus, this work aimed to design an intelligent clustering-based classification and topic extracting model named TClustVID that analyzes COVID-19-related public tweets to extract significant sentiments with high accuracy. We gathered COVID-19 Twitter datasets from the IEEE Dataport repository and employed a range of data preprocessing methods to clean the raw data, then applied tokenization and produced a word-to-index dictionary. Thereafter, different classifications were employed on these datasets which enabled the exploration of the performance of traditional classification and TClustVID. Our analysis found that TClustVID showed higher performance compared to traditional methodologies that are determined by clustering criteria. Finally, we extracted significant topics from the clusters, split them into positive, neutral and negative sentiments, and identified the most frequent topics using the proposed model. This approach is able to rapidly identify commonly prevailing aspects of public opinions and attitudes related to COVID-19 and infection prevention strategies spreading among different populations.

Keywords: COVID-19; Classification; Machine learning; TClustVID; Topics modelling; Twitter data

Year: 2021 PMID: 33972817 PMCID: PMC8099549 DOI: 10.1016/j.knosys.2021.107126

Source DB: PubMed Journal: Knowl Based Syst ISSN: 0950-7051 Impact factor: 8.038

Introduction

COVID-19 has become a global concern as a major and dangerous public health threat. The World Health Organization (WHO) declared COVID-19 a Public Health Emergency of International Concern (PHEIC) on January 30, 2020. During the 1960s, various coronaviruses, notably human coronaviruses 229E and OC43, were identified as infecting the human upper respiratory tract [1]. Numerous coronaviruses may circulate in wild mammalian populations, with some causing minor human health problems. However, this picture changed with the emergence of severe acute respiratory syndrome coronavirus (SARS-CoV) in 2002 and Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012, which infect respiratory tract epithelial tissues and cause serious and often deadly respiratory disease [2]. The coronavirus SARS-CoV2 causes the pandemic disease COVID-19, which presents with flu- and pneumonia-like symptoms and cardiovascular complications, with severity ranging from undetectable to rapidly lethal. The spread of this disease has caused huge economic disruption, personal health fears, and uncertainties that have dominated both the news and social media. The massive use of web and mobile technologies gives people the opportunity to share their opinions on social media platforms such as Facebook and Twitter. Emotion plays a significant role in effective human-to-human communication and contributes substantially to sound decision-making [3]. Text is one of the essential components of affective computing, as most people use text messages (e.g., SMS) or computer-typed text to express their opinions [4]. During the COVID-19 pandemic, various social media have been used to communicate daily activities and thoughts, including many significant messages (texts) left by users sharing their general feelings about their personal situation, health status, tips to stay well, and other related information [5]. Such messages may provide large-scale insights into behavioral responses to the pandemic.
However, it is not easy to judge whether social media content carries important information, not least because semantic ambiguity makes many messages hard to interpret. Nevertheless, machine learning and computational methods have increasingly been used to scrutinize social media data in the biomedical sector [6]. Content relating to COVID-19 may yield significant information for individuals and policy-makers. Twitter, in particular, is a popular micro-blogging and public networking service widely used for messaging and posting [7]. Automatic classification of tweets into particular classes is challenging, not least because these messages are short (140 characters or fewer) [8]. In recent years, sentiment analysis has proved useful for processing social media data such as blogs, wikis, micro-blogs and other online collaborative media [9]. It is a branch of affective computing that classifies text as either positive or negative. The analysis therefore requires identifying sentiments in Twitter messages (tweets), which contain abbreviations, spelling variations and ambiguous or informal language. The objective of this work is to investigate the types of tweets being communicated and to extract information on significant topics that are useful for understanding the COVID-19 pandemic situation.
In this study, we collected several Twitter datasets and investigated sentiment topics related to COVID-19 by designing a novel machine learning model named TClustVID. This model explores significant subsets using clustering methods and selects those that yield high classification performance. The tweets in each selected cluster were split into positive, negative and neutral groups, and latent Dirichlet allocation (LDA) was employed to extract key topics. We then interpreted these topics and identified the most significant ones. This methodology can be used to generate relevant information on public and human social behavior dealing with COVID-19 issues for researchers and policymakers. The key contributions of this work are as follows: (i) in TClustVID, we have incorporated clustering and classification to facilitate the extraction of significant topics concerning the pandemic; (ii) multiple tweet datasets were used to verify the results of the proposed model on the primary data and on the different clusters; (iii) significant topics were represented using word clouds that render them more visible and understandable; (iv) identifying the most frequently raised topics can raise awareness of the underlying matters, particularly those of widespread concern.

Literature review

Affective computing and sentiment analysis are key to the advancement of artificial intelligence and have great potential to become a sub-component technology for other systems [3]. Sentiment analysis approaches are broadly categorized as symbolic or sub-symbolic [10]. Knowledge bases of affect words, e.g., WordNet-Affect, SentiWordNet and SenticNet, are popular resources for identifying the polarity of text. In SenticNet6, logical reasoning was integrated with deep learning to infer the polarity of text [4], [10]. Dragoni et al. [9] proposed a commonsense ontology based on SenticNet that supports word embedding, domain information and polarity representation for sentiment analysis. Poria et al. [11] provided three deep learning based architectures that consider different facets of multimodal sentiment analysis. Chaturvedi et al. [12] introduced a convolutional fuzzy commonsense reasoning model which projects features into a four-dimensional space in order to increase classification performance. Jiang et al. [13] proposed joint aspect-level sentiment modification, which trains aspect-specific sentiment word extraction and aspect-level sentiment transformation modules. Baired et al. [14] presented a lexical knowledge base approach in which SenticNet was used to explore natural language concepts and fine-tune various feature types from a large-scale multimodal dataset. In addition, several NLP works have combined knowledge-based and statistical methods for sentiment analysis of short messages and microblogs (e.g., Twitter). Khatua et al. [15] worked in the context of the 2014 Ebola and 2016 Zika outbreaks and suggested that domain-specific word vectors perform better than pre-trained Word2Vec (trained on Google News) or the Stanford NLP group's Global Vectors for Word Representation (GloVe). Ahmed et al. [16] provided a query expansion model that augments the initial queries with expansion terms, using word embedding models such as Word2Vec, GloVe, and fastText trained on a tweet corpus. Behera et al. [17] proposed a hybrid model combining a convolutional neural network (CNN) and long short term memory (LSTM), called Co-LSTM, which is highly adaptable to big social data.
Like these recent sentiment analysis works, several recent studies have attempted to scrutinize COVID-19 tweets in bulk for public health research purposes, although it is likely that such tweets have also been mined for more commercial purposes. Aljameel et al. [18] gathered 242,525 tweets from five regions in Saudi Arabia to analyze their sentiments using support vector machine (SVM), k-nearest neighbor (KNN) and Naïve Bayes (NB). Alomari et al. [19] investigated 14 million tweets, extracting significant features using TF–IDF based correlation analysis and exploring relevant topics using LDA. Al-rakmi et al. [20] gathered 400,000 tweets and implemented entropy and correlation based feature selection and ensemble methods using NB, Bayes Net, KNN, C4.5, random forest (RF) and SVM. Boot-Itt and Skunkan [21] explored 109,990 tweets to analyze their sentiments using the NRC sentiment lexicon and LDA. Gencoglu et al. [22] investigated 26 million tweets using language agnostic BERT sentence embedding models and further classified sentiments using KNN, LR and Bayesian hyperparameter optimization. Kouzy et al. [23] explored tweets using 14 trending hashtags and keywords about COVID-19 and investigated the magnitude of misinformation by comparing terms and hashtags. Kaur et al. [24] translated 16,138 tweets into English and scrutinized sentiments and emotions using TextBlob and IBM Tone Analyzer, respectively. Medford et al. [25] gathered all Twitter user data from January 14th to 28th, 2020, investigated sentiments and explored topics using LDA. Mackey et al. [26] collected 4,492,954 tweets from the United States, United Kingdom, India and Australia and extracted topics using the biterm topic model (BTM) with topic clusters. Nemes and Kiss [27] analyzed tweets using TextBlob and RNN. Samual et al. [28] investigated 9000 tweets, derived non-textual variables using N-Gram and further analyzed sentiments using NB, linear regression, LR and KNN. Xiang et al. [29] gathered 82,893 tweets for sentiment analysis and topic modeling using the NRC Lexicon and LDA, respectively. Xue et al. [30] analyzed 4 million English language tweets using N-Gram, the NRC Lexicon and LDA. They also [31] scrutinized 1.9 million English language tweets using machine learning models and LDA. Yin et al. [32] analyzed 13 million tweets using VADER and a dynamic LDA model. Zhang et al. [33] examined tweets by employing an N-Gram model and TF–IDF, and explored sentiments using DT, LR, KNN, RF and SVM.

Drawbacks of previous works

Several issues and potential pitfalls arise in interpreting recently published work. Most studies have not proposed a framework for investigating tweets that employs both sentiment analysis and topic modeling. In addition, many works restricted their analysis to particular regions or languages, so their approaches cannot easily be generalized globally. For sentiment classification, only a small number of machine learning methods have been implemented, and their results were verified with only a small number of evaluation metrics. Most often, these studies focused on specific issues (e.g., psychological or human needs) and did not extract the most significant topics that individuals need to understand the pandemic situation. In studies using topic modeling, positive, negative and neutral topics were not distinguished. Hence, it is difficult to gain an understanding of the current pandemic situation from this perspective. A detailed visualization of topic orientation is given in this work.

Materials and methods

We proposed a machine learning based COVID-19 tweet analytical model that can be used to explore significant topics from Twitter datasets. To process them, different natural language processing techniques are used along with machine learning methods as illustrated in Fig. 1. The working project is provided at the following link https://github.com/shahriariit/COVID-19-Twitter-Data-Analysis.

Fig. 1

Details of the working methodology: A. Data preprocessing; B. Traditional classification and evaluation; C. Clustering, classification and evaluation; D. Comparison of the outcomes between the traditional analysis and TClustVID; E. Selection of the best clusters/datasets and identification of positive, neutral and negative clusters; F. Extraction of topics by LDA and representation of the most frequent topics.

Data description

The COVID-19 Twitter datasets were collected from the IEEE DataPort repository. They originate from an LSTM model, developed by Rabindra Lamsal, that monitors the real-time feed of COVID-19-related tweets [34]. It generates over 0.3 million requests every 24 h and its time-series graph is updated every 30 s. Almost 16 million tweets were identified before March 20th 2020. Each database (*.db) contains three attributes: the first, second, and third columns hold the date and time, the tweet, and the sentiment score, respectively. These sentiment scores were mapped to the range [0, 2], where the most negative, neutral, and positive sentiments are indicated as 0, 1 and 2, respectively. Eight Twitter datasets (corona_tweets_1M.db, corona_tweets_1M_2, corona_tweets_1M, corona_tweets_2L, corona_tweets_2M.db, corona_tweets_2M_2, corona_tweets_2M_3 and corona_tweets_3M) were investigated and deemed suitable for building models to classify tweets in this study. Each dataset contains the COVID-19-related tweets of a single day before March 20th 2020. We gathered datasets from several days in order to understand and extract the topics of each day. The first seven of these datasets are denoted as dataset-1, dataset-2, dataset-3, dataset-4, dataset-5, dataset-6, and dataset-7. In this study, corona_tweets_3M was split into dataset-8 and dataset-9 because the computational cost of processing it as a whole was very high.

Data preprocessing

In the preprocessing steps, the Twitter datasets were prepared for analysis. Raw tweets contain HTML tags, punctuation, numbers, single characters and multiple spaces, and several functions were used to clean them. Symbols were replaced with spaces. Likewise, every isolated single character, which carries no meaningful information, was replaced with a space. Finally, runs of multiple spaces were reduced to single spaces. This process was applied to the nine Twitter datasets, which were then combined for further analysis. Table 1 reports the number of tweets before and after the preprocessing steps.
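The cleaning steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `clean_tweet`, the specific regular expressions, and the decision to keep only ASCII letters (which presumes English-language tweets) are all hypothetical choices.

```python
import html
import re


def clean_tweet(text: str) -> str:
    """Apply the described cleaning steps to one raw tweet (illustrative only)."""
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # replace HTML tags with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)     # punctuation, symbols, digits -> space
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # isolated single characters -> space
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


raw = "<b>Stay</b> home!! 2 m apart &amp; wash   hands"
print(clean_tweet(raw))  # Stay home apart wash hands
```

The order of the substitutions matters: tags are removed before the non-letter filter so that tag names such as `b` do not survive as stray tokens.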

Table 1

Number of COVID-19 tweets before and after data preprocessing.

Primary dataset	# tweets before preprocessing (N = 19,797,541)	Denoted	# tweets after preprocessing (N = 19,712,979)
corona_tweets_1M.db	1,578,957	Dataset-1	1,569,619
corona_tweets_1M_2	1,889,781	Dataset-2	1,880,297
corona_tweets_1M	1,903,768	Dataset-3	1,894,526
corona_tweets_2L	280,304	Dataset-4	276,566
corona_tweets_2M.db	2,322,153	Dataset-5	2,312,104
corona_tweets_2M_2	2,268,634	Dataset-6	2,257,529
corona_tweets_2M_3	2,081,576	Dataset-7	2,072,575
corona_tweets_3M	7,472,368	Dataset-8	3,724,882
		Dataset-9	3,724,881

Tokenization

After the preprocessing steps, tokenization was used to generate a word-to-index dictionary in which each word in the corpus is a key and its unique index is the corresponding value. In the training phase, each sentence is represented as a list of indices, and these lists differ in length. The maximum list length is therefore fixed: a list that exceeds it is truncated to the maximum permitted length, and zeros are appended to the end of a shorter list until it reaches the maximum length, a process termed padding. Word embedding is useful for extracting significant words and for investigating similarity and semantic relations more precisely. Pennington et al. [35] proposed a weighted least squares model that trains on global word-word co-occurrence counts to make efficient use of corpus statistics. This model is called GloVe and is publicly available [15]. Its embedding vectors were used to create a dictionary that holds each word as a key and the corresponding vector as its value [36]. Finally, an embedding matrix is generated in which each row number matches the index of the word in the corpus. Raw tweets contain text that machine learning procedures cannot handle directly. We therefore ran the data preprocessing and tokenization steps to make the data usable for clustering and classification.
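The tokenization, padding and embedding-matrix steps described above can be illustrated with a small pure-Python sketch. The function names, the reservation of index 0 for padding, the choice to keep the first `max_len` indices when truncating, and the toy two-dimensional vectors standing in for GloVe are assumptions made for illustration; the source does not specify these details.

```python
from typing import Dict, List, Sequence

PAD_INDEX = 0  # assumption: index 0 is reserved for padding


def build_word_index(sentences: List[List[str]]) -> Dict[str, int]:
    """Map each distinct word to a unique index, starting at 1."""
    word_index: Dict[str, int] = {}
    for tokens in sentences:
        for tok in tokens:
            if tok not in word_index:
                word_index[tok] = len(word_index) + 1
    return word_index


def encode_and_pad(tokens: List[str], word_index: Dict[str, int], max_len: int) -> List[int]:
    """Convert tokens to indices, truncate to max_len, then pad with zeros at the end."""
    ids = [word_index[t] for t in tokens if t in word_index]
    ids = ids[:max_len]
    return ids + [PAD_INDEX] * (max_len - len(ids))


def build_embedding_matrix(word_index: Dict[str, int],
                           vectors: Dict[str, Sequence[float]],
                           dim: int) -> List[List[float]]:
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix


sentences = [["stay", "home"], ["wash", "hands", "stay", "safe"]]
word_index = build_word_index(sentences)
padded = [encode_and_pad(s, word_index, max_len=3) for s in sentences]
print(padded)  # [[1, 2, 0], [3, 4, 1]]
```

In practice the `vectors` dictionary would be loaded from a pre-trained GloVe file rather than written by hand.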

Traditional analysis

In the traditional process, we applied the data preprocessing and tokenization steps to the Twitter datasets and then implemented different baseline classifiers. Various well-known classifiers were applied to the primary datasets using 10-fold cross-validation, and the results were compared with those of TClustVID. Both the traditional analysis and TClustVID used the same baseline classifiers, which are described in Section 3.6.

TClustVID: Clustering-based classification and topic modeling approach

First, the preprocessing and tokenization steps were applied to the COVID-19 Twitter datasets, which were then split into several groups using the k-means method. Clustering is an unsupervised technique for partitioning a dataset into subsets (clusters). Creating clusters in this way can help improve the performance of machine learning methods. Various clustering algorithms exist, such as k-means, k-medoids, fuzzy C-means, hierarchical, and density based clustering [37], [38]. K-medoids is not the best choice for analyzing sparse data like tweets. Fuzzy C-means can be applied to large volumes of tweets, for which human annotation is very expensive, but it scales poorly. Hierarchical clustering is slower than the k-means method. Density based techniques are highly efficient for clustering unstructured data and are less prone to outliers and noise. In this work, we processed a large amount of tweet data, for which k-means is well suited: it defines the mean point (centroid) of each cluster and minimizes the Euclidean distance between each instance and its centroid in less time [38], [39]. The default value of k = 5 is mainly used in this work. Each cluster contains positive, negative and neutral tweets; within each cluster, the generated tokens were replaced by the primary tweets, which were then re-tokenized. Baseline classifiers were then used to investigate the performance of individual clusters using 10-fold cross-validation. Different evaluation metrics, namely accuracy, area under the curve (AUC), f-measure, g-mean, sensitivity and specificity, were used to assess these results. The detailed working steps of TClustVID are summarized in Algorithm 1. After comparing the classification results of the traditional approach and TClustVID, the best performing clusters are used to extract the most frequent topics. These clusters are divided into positive, neutral and negative sentiments for further analysis.
LDA was then used to explore significant positive, neutral and negative topics from the nine high-performing clusters. Twenty topics were extracted from each cluster. We represented individual topics as word clouds, each containing different words/tokens, in which each word is rendered according to the weight of its token. However, LDA cannot interpret these topics, so we manually analyzed the words/tokens of each topic to define them.
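The clustering step that precedes per-cluster classification can be sketched with a small pure-Python k-means (Lloyd's algorithm). This is an illustrative sketch, not the TClustVID implementation: the function names, the random initialization, the fixed iteration count, and the use of short numeric vectors in place of tokenized tweets are assumptions. In the full workflow described above, each resulting cluster would then be classified separately with 10-fold cross-validation.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 20,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Return (cluster label per point, centroids) after a fixed number of iterations."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: two well-separated groups standing in for tweet feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

A production pipeline over millions of tweets would use an optimized library implementation rather than this quadratic-time loop; the sketch only shows the assignment and update steps.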

Baseline classification

In previous studies, classifiers such as Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbor (KNN), Logistic Regression (LR), Multi-Layer Perceptron (MLP), Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) and XGBoost (XGB) have commonly been used to investigate Twitter datasets for sentiment analysis. These classifiers have been used in similar kinds of tasks, for example C5.0 (DT), KNN, SVM, LR and ZeroR [40], personality prediction using KNN, NB, SVM, and XGB [41], [42], spam detection using RF, NB, SMO and IBk (KNN equivalent) [43], sentiment analysis of top colleges using NB, SVM, and MLP [44], and prediction of alternation price fluctuation using GB [45]. Following this literature, we selected them to investigate the COVID-19 Twitter datasets and explore the best clusters.

Evaluation metrics

A confusion matrix is needed to estimate the performance of a classifier; it gives the number of correct and incorrect predictions with respect to the known true values. For positive and negative classes, it contains the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts.
Accuracy: the proportion of all instances that the algorithm predicts correctly.
AUC: evaluates machine learning models by considering the TP and TN rates, and represents how well the positive class is separated from the negative class.
F-measure: the harmonic mean of precision and recall.
Geometric mean (G-mean): the root of the product of the class-specific sensitivities; it trades off maximizing the accuracy on each class against keeping these accuracies balanced.
Sensitivity: the proportion of actual positives that are correctly detected.
Specificity: the proportion of actual negatives that are correctly identified.
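The threshold-based metrics above follow directly from the four confusion-matrix counts. The sketch below uses the standard binary definitions; it is an assumption that these match the exact formulas used in the study, which handles three sentiment classes and may average per-class values. AUC is omitted because it requires ranked prediction scores rather than counts, and the function name `binary_metrics` is hypothetical.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute standard binary metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "g_mean": math.sqrt(sensitivity * specificity),
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
scores = binary_metrics(tp=80, tn=90, fp=10, fn=20)
print(scores["accuracy"], scores["sensitivity"], scores["specificity"])  # 0.85 0.8 0.9
```

With these counts the F-measure equals 160/190 and the G-mean equals the square root of 0.72, which shows how the G-mean penalizes imbalance between the two class-wise recalls.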

Experimental result

Sentiment analysis through classification approach

In this study, the proposed TClustVID detected positive, negative, and neutral tweets more accurately using clustering-based classification and explored more significant thematic topics. The primary datasets were first cleaned using the data preprocessing procedures, and word-to-index dictionaries were then created using tokenization with GloVe embeddings. Several classification algorithms such as DT, GB, KNN, LR, MLP, NB, RF, SVM and XGB were used to analyze the sentiments of the COVID-19 datasets using the scikit-learn machine learning Python library [46], [47]. The results of the individual classifiers for the nine COVID-19 Twitter datasets are presented in Table 2.

Table 2

The results of sentiment classification for individual datasets.

		Traditional Analysis					TClustVID
Dataset	Classifier	Accuracy	AUC	F-Measure	Sensitivity	Specificity	Cluster	Accuracy	AUC	F-Measure	Sensitivity	Specificity
Dataset-01	LSTM	0.927	0.897	0.926	0.927	0.868	Cluster-01	0.983	0.979	0.983	0.983	0.974
	DT	0.915	0.901	0.916	0.915	0.886		0.952	0.945	0.952	0.952	0.938
	GB	0.788	0.653	0.746	0.788	0.518		0.816	0.713	0.790	0.816	0.610
	KNN	0.910	0.880	0.909	0.910	0.850		0.946	0.930	0.945	0.946	0.914
	LR	0.695	0.502	0.576	0.695	0.308		0.679	0.500	0.552	0.679	0.322
	MLP	0.840	0.766	0.830	0.840	0.692		0.901	0.869	0.899	0.901	0.836
	NB	0.654	0.503	0.597	0.654	0.352		0.644	0.502	0.577	0.644	0.359
	RF	0.924	0.898	0.923	0.924	0.873		0.957	0.944	0.957	0.957	0.932
	SVM	0.757	0.600	0.694	0.757	0.442		0.803	0.693	0.772	0.803	0.583
	XGB	0.787	0.657	0.750	0.787	0.527		0.854	0.774	0.840	0.854	0.695

Dataset-02	LSTM	0.968	0.949	0.968	0.968	0.929	Cluster-02	0.988	0.977	0.988	0.988	0.967
	DT	0.931	0.899	0.931	0.931	0.866		0.964	0.943	0.964	0.964	0.921
	GB	0.816	0.567	0.755	0.816	0.318		0.856	0.626	0.819	0.856	0.396
	KNN	0.924	0.865	0.922	0.924	0.806		0.958	0.918	0.957	0.958	0.878
	LR	0.787	0.501	0.695	0.787	0.216		0.807	0.505	0.727	0.807	0.203
	MLP	0.867	0.730	0.854	0.867	0.592		0.925	0.841	0.921	0.925	0.756
	NB	0.213	0.500	0.076	0.213	0.787		0.192	0.500	0.063	0.192	0.808
	RF	0.937	0.888	0.936	0.937	0.838		0.968	0.936	0.967	0.968	0.905
	SVM	0.787	0.501	0.695	0.787	0.215		0.806	0.502	0.724	0.806	0.197
	XGB	0.820	0.578	0.764	0.820	0.336		0.875	0.694	0.855	0.875	0.513

Dataset-03	LSTM	0.915	0.922	0.915	0.915	0.929	Cluster-03	0.985	0.987	0.985	0.985	0.988
	DT	0.911	0.930	0.911	0.911	0.950		0.960	0.967	0.960	0.960	0.973
	GB	0.699	0.717	0.674	0.699	0.735		0.846	0.838	0.836	0.846	0.830
	KNN	0.893	0.918	0.893	0.893	0.942		0.950	0.958	0.950	0.950	0.965
	LR	0.514	0.548	0.426	0.514	0.582		0.668	0.651	0.628	0.668	0.635
	MLP	0.793	0.827	0.788	0.793	0.860		0.909	0.914	0.908	0.909	0.918
	NB	0.485	0.551	0.441	0.485	0.616		0.212	0.504	0.178	0.212	0.796
	RF	0.911	0.933	0.911	0.911	0.955		0.959	0.966	0.959	0.959	0.973
	SVM	0.344	0.520	0.308	0.344	0.696		0.463	0.617	0.482	0.463	0.770
	XGB	0.722	0.766	0.710	0.722	0.809		0.847	0.846	0.838	0.847	0.845

Dataset-04	LSTM	0.904	0.915	0.903	0.904	0.926	Cluster-04	0.957	0.957	0.956	0.957	0.956
	DT	0.892	0.915	0.892	0.892	0.937		0.943	0.949	0.943	0.943	0.956
	GB	0.621	0.614	0.553	0.621	0.607		0.818	0.788	0.806	0.818	0.758
	KNN	0.873	0.901	0.873	0.873	0.929		0.930	0.939	0.930	0.930	0.948
	LR	0.547	0.556	0.457	0.547	0.565		0.747	0.722	0.728	0.747	0.697
	MLP	0.765	0.797	0.758	0.765	0.829		0.882	0.877	0.878	0.882	0.872
	NB	0.533	0.536	0.422	0.533	0.539		0.274	0.506	0.139	0.274	0.737
	RF	0.892	0.918	0.892	0.892	0.943		0.943	0.950	0.942	0.943	0.958
	SVM	0.397	0.519	0.398	0.397	0.641		0.326	0.523	0.340	0.326	0.720
	XGB	0.683	0.691	0.648	0.683	0.699		0.825	0.809	0.817	0.825	0.792

Dataset-05	LSTM	0.904	0.927	0.903	0.904	0.951	Cluster-05	0.968	0.975	0.968	0.968	0.983
	DT	0.866	0.899	0.866	0.866	0.932		0.902	0.925	0.902	0.902	0.949
	GB	0.534	0.625	0.494	0.534	0.715		0.624	0.684	0.587	0.624	0.744
	KNN	0.841	0.880	0.841	0.841	0.920		0.878	0.907	0.878	0.878	0.937
	LR	0.431	0.552	0.367	0.431	0.673		0.454	0.557	0.386	0.454	0.659
	MLP	0.624	0.712	0.622	0.624	0.801		0.749	0.800	0.744	0.749	0.851
	NB	0.419	0.529	0.305	0.419	0.639		0.429	0.524	0.344	0.429	0.619
	RF	0.865	0.900	0.865	0.865	0.934		0.900	0.924	0.900	0.900	0.949
	SVM	0.338	0.525	0.258	0.338	0.711		0.424	0.537	0.362	0.424	0.650
	XGB	0.548	0.647	0.532	0.548	0.745		0.645	0.717	0.639	0.645	0.789

Dataset-06	LSTM	0.876	0.908	0.877	0.876	0.941	Cluster-06	0.977	0.982	0.977	0.977	0.987
	DT	0.879	0.909	0.879	0.879	0.938		0.932	0.948	0.932	0.932	0.963
	GB	0.602	0.659	0.562	0.602	0.715		0.763	0.785	0.748	0.763	0.807
	KNN	0.858	0.893	0.859	0.858	0.929		0.917	0.936	0.917	0.917	0.955
	LR	0.474	0.561	0.400	0.474	0.648		0.526	0.581	0.465	0.526	0.636
	MLP	0.714	0.778	0.712	0.714	0.842		0.846	0.874	0.845	0.846	0.902
	NB	0.450	0.522	0.315	0.450	0.594		0.475	0.515	0.328	0.475	0.554
	RF	0.879	0.910	0.879	0.879	0.942		0.931	0.948	0.931	0.931	0.964
	SVM	0.418	0.530	0.341	0.418	0.643		0.536	0.568	0.433	0.536	0.600
	XGB	0.642	0.719	0.637	0.642	0.796		0.774	0.813	0.772	0.774	0.851

Dataset-07	LSTM	0.903	0.919	0.903	0.903	0.936	Cluster-07	0.983	0.986	0.983	0.983	0.990
	DT	0.908	0.929	0.908	0.908	0.951		0.955	0.965	0.955	0.955	0.975
	GB	0.664	0.718	0.656	0.664	0.773		0.810	0.830	0.806	0.810	0.850
	KNN	0.889	0.915	0.889	0.889	0.942		0.941	0.954	0.941	0.941	0.967
	LR	0.451	0.538	0.380	0.451	0.624		0.548	0.598	0.501	0.548	0.647
	MLP	0.768	0.813	0.764	0.768	0.859		0.885	0.905	0.885	0.885	0.925
	NB	0.219	0.501	0.083	0.219	0.783		0.220	0.503	0.094	0.220	0.787
	RF	0.909	0.931	0.909	0.909	0.954		0.954	0.964	0.954	0.954	0.975
	SVM	0.353	0.517	0.353	0.353	0.681		0.299	0.539	0.251	0.299	0.780
	XGB	0.635	0.705	0.632	0.635	0.774		0.815	0.843	0.814	0.815	0.871

Dataset-08	LSTM	0.908	0.921	0.907	0.908	0.935	Cluster-08	0.976	0.981	0.976	0.976	0.985
	DT	0.870	0.901	0.870	0.870	0.931		0.910	0.929	0.910	0.910	0.948
	GB	0.600	0.654	0.557	0.600	0.709		0.687	0.698	0.655	0.687	0.708
	KNN	0.847	0.884	0.847	0.847	0.921		0.853	0.884	0.853	0.853	0.915
	LR	0.501	0.582	0.440	0.501	0.663		0.516	0.547	0.428	0.516	0.578
	MLP	0.650	0.722	0.635	0.650	0.794		0.795	0.825	0.790	0.795	0.856
	NB	0.460	0.529	0.332	0.460	0.599		0.489	0.536	0.379	0.489	0.582
	RF	0.870	0.903	0.870	0.870	0.936		0.909	0.930	0.909	0.909	0.951
	SVM	0.440	0.513	0.326	0.440	0.585		0.409	0.505	0.337	0.409	0.601
	XGB	0.597	0.678	0.577	0.597	0.759		0.678	0.724	0.669	0.678	0.770

Dataset-09	LSTM	0.897	0.912	0.896	0.897	0.928	Cluster-09	0.976	0.981	0.976	0.976	0.986
	DT	0.870	0.900	0.870	0.870	0.931		0.911	0.930	0.911	0.911	0.949
	GB	0.600	0.654	0.557	0.600	0.709		0.686	0.698	0.651	0.686	0.711
	KNN	0.847	0.884	0.847	0.847	0.921		0.856	0.886	0.856	0.856	0.917
	LR	0.498	0.579	0.437	0.498	0.660		0.508	0.541	0.420	0.508	0.574
	MLP	0.650	0.715	0.633	0.650	0.780		0.802	0.830	0.797	0.802	0.859
	NB	0.221	0.500	0.083	0.221	0.780		0.250	0.507	0.191	0.250	0.764
	RF	0.869	0.902	0.870	0.869	0.936		0.910	0.931	0.910	0.910	0.952
	SVM	0.345	0.508	0.300	0.345	0.671		0.270	0.515	0.243	0.270	0.759
	XGB	0.599	0.680	0.579	0.599	0.760		0.676	0.726	0.668	0.676	0.776

In the traditional analysis, a number of classifiers, namely LSTM, DT, GB, KNN, LR, MLP, NB, RF, SVM and XGB, were implemented. For dataset-1, LSTM gave the highest accuracy, f-measure and sensitivity, and DT provided the maximum AUC and specificity. LSTM also outperformed the other classifiers in all evaluation metrics for dataset-2 and dataset-5, and in all metrics except specificity for dataset-8, where RF was marginally higher (0.936 versus 0.935). In addition, LSTM provided the highest accuracy, f-measure and sensitivity, while RF provided the best AUC and specificity, for dataset-3 and dataset-4. However, DT generated the maximum accuracy and sensitivity while RF gave the highest AUC, f-measure and specificity for dataset-6. Again, RF outperformed the other classifiers in all metrics for dataset-7. For dataset-9, LSTM showed the highest accuracy, AUC, f-measure and sensitivity, and RF provided the best specificity. The same classification methods were then applied to the Twitter datasets using TClustVID, and their results generally improved over the traditional analysis. Several clusters were produced and used to generate classification results, and TClustVID identified the clusters that gave the best classification results among them. In this case, LSTM outperformed the other classifiers in nearly all evaluation metrics for all clusters; the only exception in Table 2 is specificity for Cluster-04, where RF was marginally higher (0.958 versus 0.956). In Fig. 2(a), the average outcomes of different classifiers such as LSTM, DT, KNN, MLP, XGB, GB, SVM and LR are presented for the traditional approach. Similarly, the average results of the same classifiers under TClustVID were computed and compared with those of the traditional procedure (see Fig. 2(b)). LSTM provided the highest average accuracy, AUC, f-measure, sensitivity and specificity for both the traditional approach and TClustVID. In addition, TClustVID showed better results than the traditional approach (see Fig. 2).

Fig. 2

Average performance of the classifiers evaluated using (a) the traditional approach and (b) TClustVID across the nine experimental Twitter datasets.

The results of sentiment classification for individual datasets.

We measured Shapley additive explanations (SHAP) values of various tokens to determine positive, neutral and negative sentiments more effectively. SHAP is a game-theoretic technique for interpreting the output of any machine learning model. The result of TClustVID with LSTM was therefore evaluated in each cluster to explore which tokens are responsible for classifying positive, neutral and negative sentiments. Fig. 3 shows the SHAP values for different tokens in the nine clusters.
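To make the game-theoretic idea behind SHAP concrete, the following minimal sketch computes exact Shapley values for a toy scoring function over three tokens. The tokens, their weights, the interaction bonus and the `score` function are all hypothetical and are not taken from the study; SHAP libraries approximate such values for models like LSTM rather than enumerating every coalition as done here.

```python
from itertools import combinations
from math import factorial

# Hypothetical token weights for a toy "positive sentiment" score.
WEIGHTS = {"safe": 0.4, "hope": 0.3, "panic": -0.5}

def score(tokens):
    """Toy model: sum of token weights plus an assumed interaction bonus."""
    total = sum(WEIGHTS[t] for t in tokens)
    if "safe" in tokens and "hope" in tokens:
        total += 0.2  # hypothetical interaction between two tokens
    return total

def shapley_values(players, value):
    """Exact Shapley values, obtained by enumerating every coalition."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                base = set(coalition)
                # Marginal contribution of p when it joins the coalition.
                phi += weight * (value(base | {p}) - value(base))
        result[p] = phi
    return result

phi = shapley_values(list(WEIGHTS), score)
for token, contribution in phi.items():
    print(f"{token}: {contribution:+.3f}")
```

In this toy case the interaction bonus is shared equally between "safe" and "hope", and the contributions sum to the score of the full token set, which is the property that makes Shapley values useful for attributing a prediction to individual tokens.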

Fig. 3

SHAP values computed to determine COVID-19 (a) positive, (b) neutral and (c) negative topics.

Having observed the performance of the various classifiers, we noted that TClustVID performs better than the traditional analysis. Hence, a topic modeling approach is applied to the high-performing clusters to extract significant topics in the next section (see Fig. 4).

Fig. 4

Word cloud of various topics.

Topic modeling approach

Extraction of clusters using TClustVID

A comprehensive analysis of the classifiers under the traditional and TClustVID analyses indicated that TClustVID is the better approach for identifying significant groups of tweets in large COVID-19 Twitter datasets. The groups/clusters identified were significant because they achieved higher classification accuracy than the traditional analysis of the primary data. In the TClustVID analysis, we generated significant clusters from each of these Twitter datasets (for the positive, neutral and negative categories) that showed greatly improved results for the different classifiers. These clusters are denoted Cluster-1, Cluster-2, Cluster-3, Cluster-4, Cluster-5, Cluster-6, Cluster-7, Cluster-8 and Cluster-9, respectively.
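The cluster-then-select idea can be sketched as follows. This is only an illustration under stated assumptions: the 2-D points stand in for tweet feature vectors, the sentiment labels are invented, k-means is initialised from the first k points for reproducibility, and `majority_accuracy` is a simple stand-in scorer rather than any of the classifiers (LSTM, DT, RF and so on) evaluated in the study.

```python
from collections import Counter

def kmeans(points, k, iters=10):
    """Plain k-means; centroids start at the first k points (sketch assumption)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

def majority_accuracy(targets):
    """Stand-in scorer: accuracy of always predicting the most common label."""
    return Counter(targets).most_common(1)[0][1] / len(targets)

def best_cluster(cluster_ids, targets):
    """Return (cluster id, score) for the highest-scoring cluster."""
    scores = {}
    for c in set(cluster_ids):
        in_cluster = [t for cid, t in zip(cluster_ids, targets) if cid == c]
        scores[c] = majority_accuracy(in_cluster)
    winner = max(scores, key=scores.get)
    return winner, scores[winner]

# Hypothetical 2-D feature vectors and sentiment labels.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9),
          (0.1, 0.3), (4.8, 5.0), (0.3, 0.2), (5.1, 5.3)]
sentiments = ["positive", "negative", "positive", "neutral",
              "positive", "negative", "positive", "negative"]

cluster_ids = kmeans(points, k=2)
print(best_cluster(cluster_ids, sentiments))
```

In practice the stand-in scorer would be replaced by the cross-validated performance of a real classifier trained on each cluster, and the cluster with the best result would be retained for topic extraction.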

Topics exploration using LDA

A number of topics were then extracted from these clusters using LDA: seven of the nine clusters produced positive, neutral and negative topics, and the other two produced only positive and neutral topics. Each topic contains 10 tokens with associated weights, which can be used to prioritize the tokens. Twenty topics were identified for each category (positive, neutral and negative) in these clusters. All topics of the individual clusters are represented as word clouds in the supplementary section. In this paper, the extracted positive, neutral and negative topics of Cluster-3 are visualized as word clouds in Fig. 5, Fig. 6 and Fig. 7, respectively.
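As a small illustration of how token weights can be used to prioritize the tokens of a topic, the sketch below ranks the tokens of a single topic by weight. The tokens and weights are hypothetical and do not come from the study's LDA output; `top_tokens` is an illustrative helper, not part of any LDA library.

```python
# Hypothetical LDA topic: token -> weight (illustrative values only).
topic = {
    "kids": 0.052, "school": 0.047, "test": 0.041, "hospital": 0.035,
    "home": 0.031, "mask": 0.028, "wish": 0.024, "news": 0.019,
    "care": 0.015, "buy": 0.011,
}

def top_tokens(topic_weights, n=10):
    """Return the n highest-weighted (token, weight) pairs, heaviest first."""
    ranked = sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

for token, weight in top_tokens(topic, n=3):
    print(f"{token}\t{weight:.3f}")
```

The same ranking can drive a word cloud, where the weight of each token determines its font size.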

Fig. 5

Positive topics of Cluster-3.

Fig. 6

Neutral topics of Cluster-3.

Fig. 7

Negative topics of Cluster-3.

Qualitative analysis

As LDA cannot interpret the meaning of topics, we defined their themes manually by examining the meaning and weight values of the tokens in each group. The themes of the positive, neutral and negative topics are given in Table 3, Table 4 and Table 5, respectively. This task is not simple because many pre-processed words do not carry any semantic meaning. Moreover, it can be hard to understand the association between the different words/tokens in these topics, and these interpretations may differ slightly from those obtained by other forms of review.

Table 3

Positive themes of all significant clusters.

	Cluster-1	Cluster-2	Cluster-3	Cluster-4	Cluster-5

Theme-1	Culture	Prevention	Kids	Wish	Sunny
Theme-2	Nationality	Situation	Wish	News	Watch
Theme-3	Prevention	Situation	Testing	Situation	Affect
Theme-4	Caring	Homework	Treatment	Help	Situation
Theme-5	Blaming	News	Testing	Help	Treatment
Theme-6	Believe	News	Caring	Facts	Awareness
Theme-7	Die	News	Feeling	Control	Medicine
Theme-8	Caring	Wish	Situation	Infectious	Treatment
Theme-9	Discrimination	Awareness	Scaring	Right	Medicine
Theme-10	Situation	Financial state	Buying	Awareness	Awareness
Theme-11	Crisis	News	Fun	Wish	Prevention
Theme-12	Financial Help	Avoidness	Right	News	Situation
Theme-13	Condition	Crisis	Panic	Situation	Awareness
Theme-14	Wish	Food	Protection	Distance & Treatment	Treatment
Theme-15	Lockdown	Blaming	Health	Annoying	Awareness
Theme-16	Closing	Situation	Awareness	Situation	Humor
Theme-17	Closing	Lockdown	Panic	Job	Situation
Theme-18	Awareness	Awareness	Effect	Stay Safe	Risk
Theme-19	Financial help	Annoying	Micro-Organism	Awareness	Situation
Theme-20	Caring	Awareness	News	Wish	Risk

	Cluster-6	Cluster-7	Cluster-8	Cluster-9

Theme-1	Right	Testing & Treatment	Survive	Shut
Theme-2	Need	Interest	Flu	Honest
Theme-3	Covid	Need	Move	Media
Theme-4	Social media	Social distance	Overreact	Right
Theme-5	Awareness	Social distance	Situation	Testing
Theme-6	Flight	Epidemic	Rumor	Caring
Theme-7	Message	Social distance	Fight & Caring	Isolation
Theme-8	Right	Symptoms	Cases	Survive
Theme-9	Treatment	Effect	Disease	Home
Theme-10	Wish	Confirmed	Cases	Wish
Theme-11	Situation	Coronavirus	Awareness	Worried
Theme-12	Warning	Message	Infectious	Situation
Theme-13	Testing & Treatment	Coronavirus	Social guys	Quarantine
Theme-14	Cases	Social distance	Situation	Love
Theme-15	Message	Tourism	Quarantine	Scaring
Theme-16	Message	Tourism	Awareness	Do not Move
Theme-17	Situation	Coronavirus	Facts	Affect
Theme-18	Tourism	Outbreak	Schools	Wind
Theme-19	Coronavirus	Coronavirus	Crisis & Prevention	Awareness
Theme-20	Awareness	Awareness	Financial enrichment	Fuck

Table 4

Neutral themes of all significant clusters.

	Cluster-1	Cluster-2	Cluster-3	Cluster-4	Cluster-5

Theme-1	Financial loss	Warning	Outbreak	Situation	Awareness
Theme-2	Fact	Food	Sharing	Panic	Infectious
Theme-3	Warning	Situation	Wish	Situation	Situation
Theme-4	Estimate	Situation	Gonna	Entertainment	Need
Theme-5	Blaming	Testing	Caring	Protection	Wish
Theme-6	Pleased	Rumor	Caring	Dead	Food
Theme-7	Financial loss	Warning	Panic	Health	Break
Theme-8	Pandemic warning	Visiting	Survive	Stay Home	Treatment
Theme-9	Awareness	Joke	Awareness	Avoid	Want
Theme-10	Disease	Panic	Treatment	Fact	Prevention
Theme-11	Warning	Situation	Playing game	Awareness	Awareness
Theme-12	Caring	Panic	Coronavirus	Protection	Panic
Theme-13	Panic	Closing	Homework	Awareness	Situation
Theme-14	Panic	Panic	Ramadhan news	Situation	Awareness
Theme-15	Awareness	Panic	Sanitation	Fact	Prevention
Theme-16	Panic	Situation	Wish	Panic	Coronavirus
Theme-17	Blaming	Homework	Situation	Wish	Avoid
Theme-18	Joke	Blaming	Coronavirus	Update	Food
Theme-19	Joke	Panic	Avoid	Cases	Situation
Theme-20	Annoyed	Annoyed	Stop spreading	Hospitalize	Coronavirus

	Cluster-6	Cluster-7	Cluster-8	Cluster-9

Theme-1	Vaccine	Ruin	Situation	Tourism
Theme-2	News	Cases	Watch	Outbreak
Theme-3	Message	Coronavirus	Virus	Situation
Theme-4	Prevention	Awareness	Touch	Situation
Theme-5	Dead	Wait & Things	Symptom	Quarantine
Theme-6	News	Crisis	Problem	Education
Theme-7	Panic	Symptom	Shot	Education
Theme-8	Protection	News	Like	Virus
Theme-9	Awareness	Symptom	Situation	Pandemic
Theme-10	Situation	Infectious	Sick	Dead
Theme-11	Thread	Expose	Dead	Education
Theme-12	Wish	Caring	Body	Awareness
Theme-13	Situation	Help & Need	Flu	Body
Theme-14	Awareness	Protection	Wish	Need
Theme-15	Message	Testing	Panic	Caring
Theme-16	Situation	Blaming	Watch	Panic
Theme-17	Media	Cure	Time	Fact
Theme-18	Coronavirus	Message	Panic	Cases
Theme-19	Cases	Stay Home	Contract	Public
Theme-20	Health	Situation	Awareness	Exhibit

Table 5

Negative themes of all significant clusters.

	Cluster-1	Cluster-2	Cluster-3	Cluster-4	Cluster-5	Cluster-6	Cluster-7
Theme-1	Financial crisis	Panic	Anxiety	Warning	Serious	Financial crisis	Worry
Theme-2	Panic	Media	Die	Avoid	Blaming	Hope	Excuse
Theme-3	Panic	Food	Panic	Warning	Message	Panic	Fake News
Theme-4	Situation	Jobless	Panic	Sick	Buy	Dead	Sad
Theme-5	Isolation	Restriction	Incur	Blaming	Hate	Situation	Situation
Theme-6	Stopping	Food	Panic	Situation	Avoid	Fever	Coronavirus
Theme-7	Disease	Situation	Panic	Covid	Stopping	Awareness	Media
Theme-8	Spreading	Food	Situation	Afraid	Infectious	Situation	Catch & Game
Theme-9	Situation	Jobless	Situation	Situation	Scare	Food	Ebola
Theme-10	Avoid	Situation	Panic	Blaming	Erazi	Lack of protection	Worst
Theme-11	Treatment	Panic	Situation	Crisis	Crisis	Need	Sick
Theme-12	Panic	News	Sick	Panic	Panic	Lockdown	Quarantine
Theme-13	Fear	Closing	Coronavirus	Die	Long lasting	Fear	Disease
Theme-14	Disease	Blaming	Situation	Spreading	Propaganda	Wrong	Scare
Theme-15	Situation	Social distance	Suffer	Treatment	Fake	Toilet	Panic
Theme-16	Situation	Panic	Situation	Danger	Lock	Hate	Covid
Theme-17	Habitual Fact	Non-Reliable	Panic	Fake News	Panic	Dead	Disease
Theme-18	Humor	Infectious	Situation	Wrong	Outbreak	Danger	Situation
Theme-19	Panic	Disease	Die	Treatment	Accept	Cold	Panic
Theme-20	Panic	Care	Fake News	Dead	Hope	Ebola	Annoy

For each category of tweets, we computed the frequency of the themes that appear several times. The positive, neutral and negative topics indicate which activities arise in each context. To organize the individual topics into themes, we considered the themes that appear more than once (see Fig. 8). Examples of the positive topics of Cluster-3 are shown as a word cloud in Fig. 5. In addition, the themes of the positive topics within the different clusters are shown in Table 3 and the most frequent positive themes are shown in Fig. 8(a). For the positive cases, awareness and situation are the most frequent themes; each appears 17 times across the significant clusters. Awareness denotes the actions taken by individuals, and situation denotes the general situation of particular places/incidents, where pandemic news indicates a generic situation relating to COVID-19. Wish appears 8 times and news appears 7 times in this study. Furthermore, caring, coronavirus, right and treatment appear 5 times each, and message and social distance appear 4 times each. Cases, prevention, testing and tourism appear 3 times each. In addition, other precaution-related themes such as affect, annoying, blaming, closing, crisis, effect, facts, financial help, help, infectious, lockdown, medicine, need, panic, quarantine, risk and scaring appear 2 times each in the different clusters. They appear regularly and indicate how this condition can be improved. However, some negative themes, for instance blaming, crisis, infectious, panic and risk, appeared in the positive cases, but their frequencies are not high. Further positive issues that emerged in this analysis include financial help, help, lockdown, quarantine and medicine.
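The counting step can be sketched as follows, using the positive themes of Cluster-3 copied from Table 3. This is only an illustration: `frequent_themes` and its `min_count` parameter are hypothetical names, and because the study aggregates themes across clusters with a procedure that is not fully specified here, the counts from this single column are not expected to match Fig. 8.

```python
from collections import Counter

# Positive themes of Cluster-3 (Theme-1 to Theme-20), copied from Table 3.
cluster3_positive = [
    "Kids", "Wish", "Testing", "Treatment", "Testing", "Caring", "Feeling",
    "Situation", "Scaring", "Buying", "Fun", "Right", "Panic", "Protection",
    "Health", "Awareness", "Panic", "Effect", "Micro-Organism", "News",
]

def frequent_themes(themes, min_count=2):
    """Count themes case-insensitively and keep those seen at least min_count times."""
    counts = Counter(theme.lower() for theme in themes)
    return {theme: n for theme, n in counts.items() if n >= min_count}

print(frequent_themes(cluster3_positive))
```

Pooling the theme columns of all clusters into one list before calling the function would give cross-cluster frequencies of the kind summarized in the text.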

Fig. 8

Most frequent (a) positive, (b) neutral and (c) negative COVID-19-associated topics.

The neutral category contains a mixture of positive and negative topics, which indicates the most frequent topics in recent timeframes. An example of neutral topics is shown as a word cloud in Fig. 6. The neutral themes of the different clusters are provided in Table 4 and the most frequent themes are shown in Fig. 8(b). Situation, panic and awareness are found 19, 16 and 13 times, respectively, in this list of Twitter topics. Panic is a theme related to epidemic conditions and news. In addition, wish and coronavirus appear 6 times each, and caring appears 5 times in this category. Blaming, cases, die, warning and protection appear 4 times each, while education, food, joke, message, news, prevention and symptom appear 3 times each. The rest of the themes appear 2 times each among the neutral topics. Issues relating to conditions before and after the COVID-19 pandemic, such as financial loss, crisis, food and education, also arose in this analysis.

The negative topics are represented as a word cloud in Fig. 7. The themes of the negative topics are provided in Table 5 and the most frequent themes are shown in Fig. 8(c). In this category, panic and situation appear more often than the other topics, at 20 and 18 times, respectively. Dead and disease appear 6 and 5 times, respectively, which gives an indication of their influence. Food and blaming occur 4 times each, and treatment, sick, fake news and avoid occur 3 times each. Some themes, such as food and treatment, indicate the level of crisis perceived. The rest of the themes have a frequency of 2 in this work. Together, these themes top the list of negative feelings or perceptions relating to COVID-19.

Discussion

Comparison of TClustVID with recent published work

The proposed TClustVID overcomes many of the pitfalls evident in recent work. In the current work, we present a well-organized machine learning model that was applied to general COVID-19-oriented tweets without restricting them to specific regions, unlike previous studies [18], [28], [48], [49]. Both sentiment analysis and topic modeling were used to explore COVID-19-related themes, unlike many works [18], [20], [22], [24], [27], [48], [49], [50], [51]. Moreover, many machine learning classifiers were implemented, and we compared our proposed model with the traditional analysis to evaluate performance. In contrast, most previous studies [18], [28], [31] used only a small number of classifiers to verify their tasks. Our work was also able to extract reliable themes from positive, negative and neutral topics to explore the clusters and understand the COVID-19 situation.

Implications

Twitter is a reasonable and effective platform for assessing the efficiency of public health communication. Real-time epidemiological data are required to characterize user discussion properly and comprehensively, to support self-reporting and to evaluate the pandemic situation rapidly. In this study, we developed a machine learning based framework named TClustVID and investigated various types of public tweets related to COVID-19, identifying related sentiments and extracting associated topics from a number of localities. This efficiently provides significant insights into how people interpret mixed messages around COVID-19. This model has numerous theoretical and practical implications, which are described as follows.

Theoretical implications of the study

The proposed method extracts positive, negative and neutral topics, scrutinizes their contents and extracts significant values that provide information about related issues. TClustVID has been designed to focus on particular types of analyses such as psychological and emotional analysis. However, it can easily be generalized and adapted to analyze any specific topics of interest. This study is useful for verifying these kinds of analysis from various perspectives. In addition, demographic analysis, comparison and discussion can give a more concrete idea about the various sources. The theoretical understanding gained from this work can be used to address similar types of problems, and to do so at a lower cost. From the limitations and suggestions, researchers can take up numerous new challenges in future work.

Practical implications

Users of this model can help individuals stay isolated from one another by offering them relaxation and support via social media. It safeguards people's interests and needs in society. This analytical approach can be used for comprehensive contact tracing, to find unidentified hot spots of COVID-19 infection, and to increase the accuracy and predictability of finding COVID-19 cases. This model can be employed to explore how to improve public health campaigns on the leading topics featuring in Twitter conversations, so that agencies can give timely responses and improve their initiatives. This work has mainly focused on a number of particular common concerns relating to working conditions. Many tweets have been posted about working from home during this outbreak. The model also offers an opportunity to follow patterns of vaccine acceptance, failure or criticism. In addition, it allows assessment of real-time trends in COVID-19 treatment, medical equipment and diagnosis, cross-correlating this information with medical information and other factors. A new surveillance system can be built with this model to examine web-based content for a better understanding of public emotions and concerns. This work can be generalized to analyze other social media data such as Instagram, Facebook and YouTube. Comments from the scientific community can also be studied using this model to determine their similarity and dissimilarity to public comments. Our work has generated useful data for agencies, local leaders, health providers and municipalities. This can enable governments to coordinate the flow of information and combat misinformation about the pandemic.

Limitations of the study

Twitter provides community interaction, but its user profiles offer relatively little demographic data for further analysis. We gathered tweets using only a small number of keywords from one social media platform. This study investigated only English-language tweets. In addition, machine and deep learning methods were not applied to a large volume of COVID-19-oriented tweets. Finally, interpreting topics is a challenging task, so some of the manual interpretation in the topic modeling may be inaccurate.

Challenges and future suggestions

A number of challenges arise when investigating COVID-19 tweets for sentiment analysis and topic modeling. On social media such as Twitter, many cases of irrelevant, fake, misinformed and insufficient data have been found. In addition, these tweets need to be collected from different domains of social media. It is difficult for researchers to work with such datasets, as processing them requires a high degree of technical skill. It is often hard to define which keywords are appropriate for gathering COVID-19-related tweets and identifying the desired data. Moreover, decision makers have trouble identifying people’s sentiments on a subject or characterizing their beliefs. There also remains a lack of scientific studies from which to gather knowledge for designing a new model. These challenges will need to be faced. Along with Twitter, the records of other social media (such as Facebook, YouTube, Instagram and Reddit) need to be investigated to gain knowledge about the COVID-19 pandemic from users. To overcome the general lack of published literature on the subject, the most relevant previous works on pandemic situations can be a useful source of solutions. In addition, COVID-19-related hashtags and keywords need to be explored using recently developed academic literature and sentiment and opinion mining tasks. New developments such as TClustVID can also be used, with modifications, to analyze similar but heterogeneous records from various sources.

Conclusion

In this work, we have proposed a cluster-based machine learning model named TClustVID that gave better performance than the other methods in sentiment analysis and topic modeling of COVID-19 Twitter datasets. TClustVID first extracted various clusters from the individual datasets using the k-means algorithm [38]; the proposed model then applied the different classifiers within the clusters, one of which yielded the highest classification accuracy in each dataset. We subsequently compared the best clustering result of each dataset with the traditional analysis, with TClustVID showing the best outcomes in each case. Furthermore, the best clusters identified provided more significant topics in each dataset and represent public opinions on Twitter. The model also explored the significant information that can be abstracted from very large numbers of tweets by extracting commonly occurring topics and interpreting their themes. This model helps to identify important themes about the situation at the time the tweets were sent, and can enable the design of better strategies to counter the pandemic that take human responses and behavior into account. This knowledge was extracted from positive, neutral and negative tweets and identified the high-frequency information transmitted and commented on in response to the epidemic. As noted in the study limitations (Section 5.3) and the future guidelines of this work (Section 5.4), more COVID-19-oriented social media data from different sources can in future be collected and investigated using TClustVID (and improved versions of TClustVID) and other techniques currently in use, which will enable efficient extraction and analysis of significant information about COVID-19 and other health emergencies.

CRediT authorship contribution statement

Md. Shahriare Satu: Conceptualization, Methodology, Resources, Data curation, Writing—original draft preparation, Visualization. Md. Imran Khan: Conceptualization, Methodology, Software, Data curation, Visualization. Mufti Mahmud: Methodology, Formal analysis, validation. Shahadat Uddin: Formal analysis, validation. Matthew A. Summers: Formal analysis, validation. Julian M.W. Quinn: Writing—review and editing. Mohammad Ali Moni: Writing—review and editing, supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
