Literature DB >> 36092470

Twitter conversations predict the daily confirmed COVID-19 cases.

Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read

Abstract

As of writing this paper, COVID-19 (Coronavirus disease 2019) has spread to more than 220 countries and territories. Following the outbreak, the pandemic's seriousness has made people more active on social media, especially on microblogging platforms such as Twitter and Weibo. The pandemic-specific discourse has remained on-trend on these platforms for months now. Previous studies have confirmed the contributions of such socially generated conversations towards situational awareness of crisis events. Early forecasts of cases are essential for authorities to estimate the resources needed to cope with the consequences of the virus. Therefore, this study attempts to incorporate the public discourse in the design of forecasting models particularly targeted at the steep-hill region of an ongoing wave. We propose a sentiment-involved, topic-based latent variables search methodology for designing forecasting models from publicly available Twitter conversations. As a use case, we implement the proposed methodology on Australian COVID-19 daily cases and Twitter conversations generated within the country. Experimental results: (i) show the presence of latent social media variables that Granger-cause the daily COVID-19 confirmed cases, and (ii) confirm that those variables offer additional prediction capability to forecasting models. Further, the results show that the inclusion of social media variables introduces 48.83%–51.38% improvements in RMSE over the baseline models. We also release the large-scale COVID-19 specific geotagged global tweets dataset, MegaGeoCOV, to the public, anticipating that geotagged data of this scale will aid in understanding the conversational dynamics of the pandemic through other spatial and temporal contexts.
© 2022 Elsevier B.V. All rights reserved.

Keywords:  ARIMAX models; Granger causality; Pandemic forecast; Social media analytics; Time series analysis; Twitter analytics; VAR models

Year:  2022        PMID: 36092470      PMCID: PMC9444159          DOI: 10.1016/j.asoc.2022.109603

Source DB:  PubMed          Journal:  Appl Soft Comput        ISSN: 1568-4946            Impact factor:   8.263


Introduction

COVID-19 (Coronavirus disease 2019) is a respiratory illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first case was identified in Wuhan, China, in December 2019; the disease has since spread globally, giving rise to an ongoing pandemic [1]. The disease was declared a public health emergency of international concern on January 30, 2020, and a pandemic on March 11, 2020, by the World Health Organization. As of November 9, 2021, more than 251 million global cases and more than 5 million deaths have been confirmed [2]. During the early phase of the pandemic, countries and territories around the globe initiated partial and/or complete lockdowns to contain the spread of the virus. Mass vaccination campaigns began in late 2020 with vaccines such as Oxford-AstraZeneca, Pfizer, Moderna, Johnson and Johnson, and Sinovac [3]. In the case of Australia, the country’s first case of COVID-19 was confirmed by Victoria Health Authorities on January 25, 2020 [4]. Since then, as of November 9, 2021, 182,870 cases and 1841 deaths have been confirmed, and the country is currently facing its third wave of COVID-19 infections [2]. Fig. 1 shows the daily confirmed numbers and the cumulative numbers of COVID-19 infections in Australia between late January 2020 and early September 2021. As illustrated in Fig. 1, Australia experienced the first wave of COVID-19 infections during March–April 2020, the second wave during June–October 2020, and the ongoing third wave since June 2021. Outside the waves, the daily COVID-19 infections in Australia stayed within two digits. The highest number of cases confirmed on a single day was 497 during the first wave and 716 during the second wave, while the ongoing third wave is reporting significantly larger figures each day [2].
Fig. 1

Daily (new) and total (cumulative) COVID-19 cases reported in Australia between January 25, 2020 (first COVID-19 case reported), and September 9, 2021.

Since the outbreak, the pandemic’s gravity has made people more vocal on social media, especially on microblogging platforms such as Twitter and Weibo. As people share what they are experiencing, observing, and gathering, multiple terms related to the pandemic have emerged and remained on-trend on these platforms for months now. Previous studies have shown that such public discourse contributes to a better understanding of an ongoing crisis. With this consideration, this study attempts to incorporate the public discourse in designing pandemic-related time series forecasting models specially targeted at the steep-hill region of a pandemic’s ongoing wave. The modeling and early prediction of the prevalence of the virus are essential to provide situational information to decision-making bodies and authorities so that they can estimate the resources and equipment needed to cope with the consequences of the virus [5]. This study, therefore, focuses on the forecast of COVID-19 spread while addressing the following research questions (RQs):

RQ1: Geotagged data plays a crucial role in modeling location-specific information [6], and the inclusion of social media variables in forecast models requires a large amount of geotagged data. Therefore, we would like to know what portion of the overall Twitter volume is geotagged. With the release of Twitter’s Academic Track-based Full-archive search and count APIs, it is finally possible to address this research question; earlier, researchers were able to access only a sample of the overall Twitter volume.

RQ2: Is there a presence of latent variables within geotagged Twitter data that Granger-cause the daily COVID-19 confirmed cases time series?

RQ3: If the answer to RQ2 is ‘yes’, do those variables provide additional prediction capability to time series forecasting models?

RQ4: Is “the volume of public discourse in the last few days” predictive of the steep-hill curve of COVID-19 cases during an ongoing wave?

The paper is organized as follows: Section 2 presents related work, Section 3 explains the design of the time series dataset (including data collection, sentiment analysis, and topic modeling), Section 4 presents the experimentation and discussion, and Section 5 concludes.

Related work

In the past, modeling and forecasting of cases and transmission risks have been done across multiple areas: human West Nile virus cases and mosquito infection rates [7], hepatitis A virus infection [8], seasonal outbreaks of influenza [9] and its real-time tracking [10], the Ebola outbreak [11], H1N1-2009 [12], and the international spread of Middle East respiratory syndrome (MERS) [13]. There have also been some notable works in the area of forecasting the daily confirmed cases of the ongoing pandemic. Maleki et al. [14] modeled the total number of global confirmed cases and recovered cases using autoregressive models based on two-piece distributions for predicting the global cases between April 21 and April 30, 2020. In [15], Salgotra et al. performed time series prediction of COVID-19 confirmed and death cases across major Indian cities for the period May 15, 2020–May 25, 2020, based on genetic programming [16]. Papastefanopoulos et al. [17] used both traditional statistical methods and machine learning approaches for estimating the percentage of active cases per population, up to 7 days into the future, for ten countries including the United States, Spain, Italy, the United Kingdom, and Germany. The authors showed that, overall, traditional approaches like ARIMA (Autoregressive Integrated Moving Average) prevail over methods based on machine learning in the forecast of COVID-19 time series, owing to the lack of large amounts of data. Similarly, Saba et al. [18] observed ARIMA and SARIMA (Seasonal ARIMA) models producing relatively better results, in the forecast of daily COVID-19 cases during complete and herd lockdowns, than machine learning algorithms such as Polynomial Regression, K-nearest neighbors, Random Forests, Support Vector Machines, and Decision Trees. In [19], Singh et al. used a hybrid model with discrete wavelet decomposition and ARIMA to forecast the cases of COVID-19.
ARIMA and its variations appear to be the most favored techniques for COVID-19 cases time series forecasting. Differently parameterized ARIMA models and their variants have been used across studies targeting regions such as India [20], [21], Pakistan [22], Saudi Arabia [23], Mainland China, Italy, South Korea, Thailand [24], the US, Brazil, Russia, Spain [21], North America, South America, Africa, Asia and Europe [25], Italy, Spain, and France [26], and the most-hit countries [27], [28]. Furthermore, compartmental models such as Susceptible-Exposed-Infectious-Recovered (SEIR) and Susceptible-Infectious-Recovered (SIR) and their variations, and others such as agent-based models, curve-fitting, and logistic growth models have also been applied extensively for mathematical modeling of COVID-19 situations for forecasting purposes [29], [30], [31], [32], [33], [34]. Social media platforms, such as Facebook and Twitter, have an active user base of millions and hold an enormous amount of socially generated data through the exchange of conversations. These platforms have become an active source of information in day-to-day life as well as during mass emergencies such as the ongoing pandemic [5]. During mass emergencies, the number of user activities across these platforms increases exponentially, as people: (a) generate trends on search engines such as Google, (b) share their safety status or query the safety status of their near ones, and (c) share what they have seen, felt, or heard. These socially generated activities can be collected and analyzed to understand the relationship between public discourse and how an emergency event unfolds at the ground level [5]. For example, in [35], Chew et al. used semantic word vectors as a representation of the public’s response to the pandemic to forecast the daily growth rate in the number of global confirmed COVID-19 cases with a lead time of 1 day for the period January 25, 2020–May 11, 2020.
The authors extracted vector representations from more than 100 million English-language tweets, trained a deep neural network on the vectors alongside the historical time series of growth rates, and reported that their neural-net-based approach outperforms traditional time series and machine learning models. In [36], Qin et al. collected social media search indexes (SMSI) for the COVID-19 specific keywords – dry cough, chest distress, coronavirus, fever, and pneumonia – from December 31, 2019, to February 9, 2020. The authors used the lagged series of the search indexes to predict the COVID-19 case numbers for the same period and reported that the cases’ trend correlated significantly with the lagged series. COVID-19 specific search query volumes on Google, Baidu, and Weibo have also been observed to correlate with laboratory-confirmed and suspected cases of COVID-19 [37]. Similar results were reported by Cousins et al. [38]—search-engine query patterns were observed to be predictive of COVID-19 case rates. In [39], Li et al. collected around 115k Weibo posts originating from Wuhan, China, between December 23, 2019, and January 30, 2020, and designed a regression model showing that COVID-19 related posts are predictive of the number of cases reported. Similarly, Shen et al. [40] used more than 15 million Weibo posts created between November 1, 2019, and March 31, 2020, and designed a machine learning classifier to identify “sick”-related posts. The count of such “sick” posts was observed to Granger-cause the daily number of COVID-19 cases. In [6], Comito reported that the number of Twitter posts increases before confirmed cases follow a similar trend, suggesting that social media discourse can be a leading indicator of epidemic spreading.
The studies dealing with the early forecasts of confirmed cases, be it COVID-19 or previous epidemic outbreaks, rely mainly on the “volume of conversations” feature, i.e. an overall count, a sentiment-based count, or a specific category-based count [5]. The issue with the “volume of conversations” feature is its reliability and robustness: methodologies based on this feature get heavily affected by avalanches of autogenerated conversations. Furthermore, as per our literature search, the latent variables within publicly available social media conversations have not been studied for their possible influence on the trend of a pandemic/epidemic outbreak. While addressing these limitations, this study contributes the following to the literature: (a) the study proposes an effective representation for microblog conversations, such that the “volume of conversations” feature can be represented at a more granular level to decrease the intensity of possible forecast biases, (b) the study provides evidence confirming the significance of social media variables in forecasting the future trend of a steep-hill curve of a pandemic/epidemic outbreak, and (c) we release a large-scale COVID-19 specific geotagged tweets dataset, MegaGeoCOV1, to the public. The dataset was curated for this study, and as per Twitter’s terms of use [41], we only release the tweet identifiers, which can be hydrated using tools such as Hydrator2 (desktop application) or twarc3 (Python library) to rebuild the dataset locally.

Time series

We implement the methodology illustrated in Fig. 2 for our time series analysis. In this section, we discuss the data collection procedure and the time series formulation approach in detail, and in the next section (Section 4), we design the forecasting models on a set of influential time series and experiment with social media variables in the design of pandemic related forecasting models to address our research questions.
Fig. 2

The overall view of the Twitter-based COVID-19 cases forecast methodology.

Data collection

We considered Twitter as the primary data source since it acts as an instant, short, and frequent basis of communication and, most importantly, allows researchers to access the publicly available data on its platform through a wide range of API endpoints [42]. Some of the widely used Twitter endpoints include the Tweet lookup endpoint for looking up tweets using tweet identifiers, the Search endpoint for searching the most recent 7 days or the full archive of tweets, the Tweet counts endpoint for retrieving a count of tweets that match a given query, the Filtered stream endpoint for retrieving real-time public tweets, and the Sampled stream endpoint for retrieving approximately 1% of all real-time public tweets. In this study, we used Twitter’s new academic track endpoint, the Full-archive search endpoint,4 to collect COVID-19 specific tweets created between January 1, 2020, and September 9, 2021. The following keywords (plain words) and hashtags (words preceded by the # symbol) were considered while searching for and collecting COVID-19 specific tweets: coronavirus, #coronavirus, covid, #covid, covid19, #covid19, covid-19, #covid-19, pandemic, #pandemic, quarantine, #quarantine, #lockdown, lockdown, ppe, n95, #ppe, #n95, pneumonia, #pneumonia, virus, #virus, mask, #mask, vaccine, vaccines, #vaccine, #vaccines, lungs, and flu. The keyword selection was based on previously proposed sets of keywords [43], [44]. Additionally, we used the Full-archive count API to obtain the descriptive statistics (presented in Table 1) of the daily COVID-19 public discourse on Twitter.
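The keyword list above can be assembled into a single OR-joined query for the Full-archive search endpoint. A minimal sketch is given below; the helper name is illustrative, and `has:geo` is the Twitter API v2 operator for restricting results to geotagged tweets (consult the v2 query documentation for the full operator grammar):

```python
# Build an OR-joined query string for Twitter's Full-archive search endpoint (API v2).
KEYWORDS = [
    "coronavirus", "#coronavirus", "covid", "#covid", "covid19", "#covid19",
    "covid-19", "#covid-19", "pandemic", "#pandemic", "quarantine", "#quarantine",
    "#lockdown", "lockdown", "ppe", "n95", "#ppe", "#n95", "pneumonia",
    "#pneumonia", "virus", "#virus", "mask", "#mask", "vaccine", "vaccines",
    "#vaccine", "#vaccines", "lungs", "flu",
]

def build_query(keywords, geotagged_only=False):
    """Join keywords with OR; quote multi-word terms; optionally keep only geotagged tweets."""
    terms = " OR ".join(f'"{k}"' if " " in k else k for k in keywords)
    query = f"({terms})"
    if geotagged_only:
        query += " has:geo"  # v2 operator: only tweets carrying geo metadata
    return query

query = build_query(KEYWORDS, geotagged_only=True)
```

The resulting string is passed as the `query` parameter of the search request; the count endpoint accepts the same query syntax.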
Table 1

Descriptive statistics of the daily COVID-19 public discourse on Twitter.

         All tweets    Geotagged tweets (global)    Tweets from Australia    % of tweets geotagged (global)
mean     4.62 million  33.2 k                       493                      0.694538
std      3.74 million  30.8 k                       337                      0.129555
minimum  59.6 k        615                          7                        0.449497
25%      3.06 million  18.9 k                       272                      0.595030
median   3.77 million  24.4 k                       408                      0.682665
75%      4.64 million  33.7 k                       660                      0.781970
maximum  25.8 million  183 k                        2297                     1.439504
Generally, there are two classes of geographical metadata available with tweets. The first class is related to “tweet location” in which a location is shared by a Twitter user while creating a tweet. The location data is attached with the tweet either as exact geocoordinates (a point location) or as a bounding box (a general location). The second class is related to “account location” which is based on the location provided by a user on his/her public profile. Since the account location field is not validated by Twitter, we only considered the tweets having exact geocoordinates or bounding boxes while designing the forecasting models. In total, 21.36 million geotagged COVID-19 specific tweets were retrieved from the API endpoint. We name this large-scale geotagged global tweets dataset MegaGeoCOV. The dataset is briefly explored in terms of numbers across multiple attributes (general overview given in Table 2)—countries, cities and states (in Table 3), languages (in Table 4), and frequency distribution (in Fig. 3).
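The “tweet location” filter described above can be sketched as follows. The dictionary layout below is a simplified assumption loosely mirroring Twitter API v2 geo objects, not the exact payload shape:

```python
def has_precise_geo(tweet):
    """Keep a tweet only if it carries exact geocoordinates (a point) or a place
    bounding box ("tweet location"); the unvalidated "account location" is ignored."""
    geo = tweet.get("geo") or {}
    has_point = "coordinates" in geo                 # exact geocoordinates
    has_bbox = "bbox" in geo.get("place", {})        # bounding box (general location)
    return has_point or has_bbox

# Illustrative tweet records
tweets = [
    {"id": 1, "geo": {"coordinates": {"type": "Point", "coordinates": [144.96, -37.81]}}},
    {"id": 2, "geo": {"place": {"bbox": [144.5, -38.4, 145.5, -37.4]}}},
    {"id": 3, "profile_location": "Milky Way Galaxy"},  # account location only: dropped
]
geotagged = [t for t in tweets if has_precise_geo(t)]
```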
Table 2

Overview of MegaGeoCOV.

Total tweets (unique ids)        21,305,691
Duplicate tweets (exact copy)    137,836
Countries and territories        245
Cities and states                260,732
Languages                        64 (and undefined)
Table 3

Top 15 global locations in MegaGeoCOV.

(a) Top countries/territories (N = 245).

Country/territory    Tweets

United States        7,375,997
United Kingdom       2,279,064
India                1,563,017
Brazil               1,379,733
Canada               756,466
Spain                625,599
Indonesia            509,498
Argentina            434,454
Mexico               430,478
Philippines          383,215
Australia            366,033
South Africa         357,674
France               339,001
Italy                324,028
Nigeria              293,242

(b) Top cities and states (N = 260,732).

City/state        Tweets

Los Angeles       240,374
Rio de Janeiro    192,986
Manhattan         185,021
New Delhi         173,854
Mumbai            155,855
Sao Paulo         148,202
Toronto           141,963
Florida           122,370
Chicago           120,930
Brooklyn          112,231
Houston           111,836
Melbourne         111,038
Washington        98,907
Madrid            96,592
Buenos Aires      95,759
Table 4

Most frequent languages (N = 64) in MegaGeoCOV.

Language      ISO(a)    Tweets
English       en        13,854,642
Spanish       es        2,545,726
Portuguese    pt        1,389,951
Indonesian    in        708,023
Undefined     n/a       689,301
French        fr        415,434
Italian       it        280,087
Tagalog       tl        274,845
Hindi         hi        221,280
Turkish       tr        157,962
German        de        143,874

Other languages(a) in order of their frequencies:
nl, ca, ja, th, ar, pl, et, ru, sv, ht, lt, mr, ro, cs, fi, da, el, ur, ta,
zh, sl, ne, gu, bn, lv, no, vi, cy, te, kn, uk, hu, ko, or, fa, is, eu,
si, ml, iw, bg, sr, pa, dv, km, my, am, sd, ckb, ps, lo, hy, ka, bo

a. ISO 639-1 language code.

Fig. 3

Daily distribution of COVID-19 specific tweets between January 1, 2020, and September 9, 2021.


Australian tweets

MegaGeoCOV carries more than 90 tweet objects, each object representing a piece of tweet metadata. From MegaGeoCOV, we extracted tweets originating from Australia (by conditioning on the geo.country object) and considered only the created_at (date and time), text (tweet), and geo.full_name (geolocation) objects for curating the Australia-specific COVID-19 tweets dataset, hereafter termed dataset D. Since the geo.full_name object follows a [city, state] data structure, all other geolocation-specific objects were ignored, as this object was sufficient for extracting both city- and state-level information.

Tweets selection. Out of the 366,033 tweets originating from Australia, only the tweets geotagged with exact geocoordinates or bounding box coordinates were considered. Twitter does not validate the account location field; entries such as “My Home”, “My Dream”, “Solar System”, and “Milky Way Galaxy” are equally valid. Further, some users may list one location on their public profile while creating tweets from another. Therefore, we considered only the tweets whose geolocation was shared by the user while creating the tweet. Next, we filtered out tweets containing fewer than a minimum number of terms within the text body. After applying these selection criteria, the dataset dropped to 305,418 unique tweet identifiers and 304,885 unique tweets.

The geo.full_name object was split into two subparts based on its [small region, larger region] data structure. This structure was not uniform across all tweets in the dataset—some carried only a single location detail, such as just “Melbourne” or “New South Wales”. In such cases, the single location detail was treated as the small region. Following this step, dataset D contained 3724 unique small-region entries and 125 unique larger-region entries (both shown in Table 5).
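The [small region, larger region] split of geo.full_name can be sketched as below; this is a simplified illustration of the parsing logic, not the authors' exact code:

```python
def split_full_name(full_name):
    """Split a geo.full_name value such as "Melbourne, Victoria" into
    (small_region, larger_region); single-part values become the small region."""
    parts = [p.strip() for p in full_name.split(",")]
    if len(parts) >= 2:
        return parts[0], parts[1]
    return parts[0], None  # e.g. just "Melbourne" or "New South Wales"

city, state = split_full_name("Melbourne, Victoria")
```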
As a general overview of the dataset, Table 5 lists the top Australian locations (cities/towns/states) participating in the COVID-19 Twitter discourse, and Table 6 lists the most frequent unigrams and bigrams used by Australian Twitter users during the discourse.
Table 5

Top Australian locations in MegaGeoCOV.

(a) Top small regions (N = 3724).

Small regions     Tweets

Melbourne         94,330
Sydney            70,118
Brisbane          21,298
Perth             18,143
Adelaide          13,372
Canberra          8,366
Gold Coast        6,483
Newcastle         3,574
Sunshine Coast    2,843
Central Coast     2,190

(b) Top larger regions (N = 125).

Larger regions                  Tweets

Victoria                        107,560
New South Wales                 89,142
Queensland                      37,107
Western Australia               20,259
South Australia                 15,301
Australia                       14,703
Australian Capital Territory    8,373
Not Available                   4,678
Tasmania                        3,280
Northern Territory              2,141
Table 6

20 most frequent unigrams and bigrams used by Australian Twitter users in the COVID-19 discourse.

(a) Unigrams

Unigram         Frequency

covid           46,942
lockdown        34,016
people          30,936
virus           21,844
vaccine         19,380
covid-19        18,132
#covid-19       17,231
quarantine      16,618
pandemic        15,842
australia       13,456
mask            12,936
time            12,891
coronavirus     12,602
health          12,111
cases           11,444
#coronavirus    8,859
government      8,679
nsw             8,481
home            8,202
work            8,104

(b) Bigrams

Bigram                             Frequency

(‘hotel’, ‘quarantine’)            3,161
(‘wear’, ‘mask’)                   2,037
(‘2’, ‘weeks’) / (’14’, ’days’)    1,974
(‘aged’, ‘care’)                   1,970
(‘wearing’, ‘mask’)                1,517
(‘new’, ‘cases’)                   1,463
(‘social’, ‘distancing’)           1,424
(‘public’, ‘health’)               1,303
(‘new’, ‘daily’)                   1,302
(‘many’, ‘people’)                 1,239
(‘mental’, ‘health’)               1,221
(‘federal’, ‘government’)          1,154
(‘covid’, ‘cases’)                 1,098
(‘last’, ‘year’)                   1,077
(‘vaccine’, ‘rollout’)             1,074
(‘stay’, ‘home’)                   1,037
(‘face’, ‘mask’)                   1,002
(‘tested’, ‘positive’)             907
(‘covid’, ‘vaccine’)               858
(‘covid’, ‘test’)                  805
The dataset at this stage is D = {(d_i, t_i, g_i)}, where the first component, d_i, represents the date/time attribute; the second component, t_i, represents the individual tweet; and the third component, g_i, represents the geolocation information of the individual tweet.

Sentiment analysis with BERT

There exists a plethora of pre-trained sentiment analysis models and libraries suitable for sentiment analysis of short texts. However, short text lengths and the common use of informal grammar, abbreviations, spelling errors, and hashtags make it difficult to use pre-trained sentiment analyzers, trained on formally written and typographically error-free large-scale text corpora, for sentiment analysis of Twitter data. Further, in our case, we required a sentiment analyzer capable of understanding COVID-19 specific tweets. Therefore, we fine-tuned a pre-trained language model, BERTweet [45], for our sentiment analysis task. The language model has been reported to outperform existing state-of-the-art models across multiple NLP tasks, including text classification. BERTweet has the same architecture as BERT [46] and is trained on 850 million English tweets (cased) and an additional 23 million COVID-19 English tweets (cased) using the RoBERTa [47] pre-training procedure. We fine-tuned the pre-trained BERTweet (bertweet-covid19-base-cased) model using the transformers library [48] on the SemEval-2017 Task 4A dataset5 and achieved an accuracy of 0.7231 on a validation set built using scikit-learn’s train_test_split function [49] with parameters (given for reproducibility) test_size=0.2, random_state=41, and stratify set on the sentiment column. The fine-tuned model (hereafter termed BERTsent) outputs three labels, each with a probability score: 0 representing “negative” sentiment, 1 representing “neutral” sentiment, and 2 representing “positive” sentiment. The model effectively classifies sentences such as “I had covid” as negative and just the word “covid” as neutral by a significant probabilistic margin. We release both the PyTorch and TensorFlow versions of BERTsent on the Hugging Face Hub.6 Next, we used BERTsent to compute sentiment probabilities for each tweet in dataset D.
Dataset D at this stage gains a new component, s_i, representing the sentiment of each individual tweet; the output label with the highest probability was taken as the sentiment of a tweet. The dataset at this stage is: D = {(d_i, t_i, g_i, s_i)}.
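Selecting the highest-probability label from BERTsent's three-way output amounts to a softmax followed by an argmax. A minimal stdlib sketch (the logit values are illustrative, not model outputs):

```python
import math

LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max improves stability."""
    exps = [math.exp(v - max(logits)) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_sentiment(logits):
    """Map the model's three raw scores to the label with the highest probability."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))], probs

label, probs = predict_sentiment([2.1, 0.3, -1.2])  # illustrative logits
```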

Topic modeling

Next, we identify the topics that best describe all the tweets in dataset D. We implemented one of the most commonly preferred topic modeling techniques, Latent Dirichlet Allocation (LDA) [50], using Gensim’s LdaMallet module [51], a Python wrapper for LDA from MALLET [52]. LDA maps all the tweets in dataset D to topics such that the terms in each tweet are mostly captured by those topics. A “topic” represents a group of words that often occur together. Algorithm 1 briefly summarizes the steps taken in implementing LDA on the tweets in dataset D. Steps (iv), (v), (vii), and (viii) of Algorithm 1 were implemented using the Gensim Python library. For both unigrams and bigrams, the minimum term frequency was set to 500 to ignore sparsely appearing terms. For lemmatization, we used the spaCy Python library and considered only the noun part-of-speech for building the topic models. Gensim’s LdaMallet module was employed for building LDA models with a varying number of topics k. Having the “right” k solely based on mathematical goodness-of-fit does not necessarily mean that the topics have the best interpretability [54]. Therefore, the best k was identified based on both the average topic coherence score [53] and the human interpretability of the produced topics. The value of k was varied in the range 5–50; an LDA model was created for each k, and for each model the topic coherence scores were averaged. The highest average topic coherence scores were observed at two values of k; however, the interpretability of topics was relatively better at k = 18, so the final LDA model with k = 18 was used for assigning topics to each tweet in dataset D. The Appendix presents the LDA results obtained on D. Tweets were assigned topics based on the probability distribution generated by the model—a tweet is assigned to the topic whose probability score is highest in the distribution. With the addition of the topic component, p_i, dataset D becomes: D = {(d_i, t_i, g_i, s_i, p_i)}.
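Assigning each tweet to its highest-probability topic from the LDA output can be sketched as follows; the topic indices and probabilities are illustrative, not values from the paper's model:

```python
def assign_topic(topic_distribution):
    """Pick the topic with the highest probability from an LDA output of
    (topic_id, probability) pairs for one tweet."""
    topic_id, _prob = max(topic_distribution, key=lambda pair: pair[1])
    return topic_id

# Illustrative per-tweet LDA distribution over a k = 18 topic model
dist = [(0, 0.02), (3, 0.11), (7, 0.61), (12, 0.26)]
best_topic = assign_topic(dist)
```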

Design of time series

Next, a time series dataset T was created from dataset D for the period January 1, 2020, to September 9, 2021. Dataset D was grouped by the date/time component, d_i, and the frequency of tweets for each day was summed to compute the volume of tweets over the different topics and sentiments. From here, an additional (18 topics × 3 sentiments) = 54 components were generated, where each component represents a combined topic-and-sentiment form. T can be represented as a tensor of the following form: T = [x_{t,p,s}], where the index t associates with the date component, the index p associates with the topic component, and the index s associates with the sentiment component. These indices take the values: t ∈ {1, …, 618} (t = 1 representing January 1, 2020, and t = 618 representing September 9, 2021); p ∈ {1, …, 18}; and s ∈ {0, 1, 2}.
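The grouping step above (daily counts per topic and sentiment) can be sketched with a Counter keyed by (date, topic, sentiment); the records below are illustrative:

```python
from collections import Counter
from datetime import date

# Illustrative (date, topic, sentiment) records derived from individual tweets
records = [
    (date(2020, 1, 1), 7, 0),
    (date(2020, 1, 1), 7, 0),
    (date(2020, 1, 1), 3, 2),
    (date(2020, 1, 2), 7, 1),
]

# x[(t, p, s)] = number of tweets on day t assigned topic p with sentiment s
x = Counter(records)
```

A missing (date, topic, sentiment) key simply counts as zero, which matches the tensor view where empty cells are zero-volume days.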

Lagged time series

For the topic and sentiment components in T, an additional 14 days of lagged components were generated to create a new time series dataset T_lag, which takes the following tensor form: T_lag = [x_{t−l,p,s}], l ∈ {0, 1, …, 14}. Generating the lagged components introduces NULL values in the first 14 samples of T_lag; therefore, T_lag consists of tweet time series data for the period January 15, 2020–September 9, 2021, after the loss of 14 days’ data. T_lag is created so that forecasting models trained on it can regress on the lagged variables present in T_lag, looking up to 14 days back when making forecasts. The maximum lag of 14 was chosen because of: (a) the incubation period of the virus and the suggested quarantine period [55], and (b) research works confirming the correlation between social media activities and future trends in the evolution of the virus [37].
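Generating the lagged columns and dropping the first 14 (now-incomplete) rows can be sketched in pure Python; a pandas `shift` would accomplish the same:

```python
MAX_LAG = 14

def build_lagged(series, max_lag=MAX_LAG):
    """Return rows [x_t, x_{t-1}, ..., x_{t-max_lag}] for each day t; the first
    max_lag rows are dropped because their lagged values would be NULL."""
    rows = []
    for t in range(max_lag, len(series)):
        rows.append([series[t - l] for l in range(max_lag + 1)])
    return rows

series = list(range(20))  # illustrative daily counts for 20 days
lagged = build_lagged(series)
```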

Experimentation and discussion

Feature selection

At this stage, there are 54 components in T. We performed feature selection based on Granger causality [56] to identify the set of features that are better predictors of daily confirmed COVID-19 cases. Tests were performed for all the variables in T to check whether X Granger-causes Y, where X is each of the 54 components and Y is the daily COVID-19 confirmed cases series. The data source for Y was OWID [57]. Granger causality is a statistical concept that determines whether one time series helps forecast another. A time series X is said to “Granger-cause” a time series Y if the lagged values of X contain information that helps predict Y beyond the predictive ability carried by the lagged values of Y alone.
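The intuition behind the test (lagged X improving on lagged Y alone) can be demonstrated on synthetic data where x leads y by one day. This sketch illustrates the concept with ordinary least squares and assumes NumPy; it is not the statistical test itself, which additionally applies t- and F-tests to the two fits:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    # y is driven by yesterday's x, so lagged x should improve the forecast
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

Y = y[1:]
ones = np.ones(n - 1)

# Restricted model: y_t ~ intercept + y_{t-1}
A_r = np.column_stack([ones, y[:-1]])
resid_r = Y - A_r @ np.linalg.lstsq(A_r, Y, rcond=None)[0]

# Unrestricted model: y_t ~ intercept + y_{t-1} + x_{t-1}
A_u = np.column_stack([ones, y[:-1], x[:-1]])
resid_u = Y - A_u @ np.linalg.lstsq(A_u, Y, rcond=None)[0]

rmse_restricted = float(np.sqrt(np.mean(resid_r ** 2)))
rmse_unrestricted = float(np.sqrt(np.mean(resid_u ** 2)))
```

In practice, a library routine such as statsmodels' Granger causality test packages this comparison with the appropriate significance tests across a range of lags.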
Table 7

Variables in T that Granger-cause Y at the most lags (only variables significant at 10 or more lags listed).

Variable             Variable definition    sig. p-values
X(tp16, sn0, ti)     topic 16, negative     14
X(tp1, sn1, ti)      topic 1, neutral       14
X(tp10, sn1, ti)     topic 10, neutral      14
X(tp11, sn1, ti)     topic 11, neutral      14
X(tp12, sn2, ti)     topic 12, positive     14
X(tp7, sn2, ti)      topic 7, positive      13
X(tp9, sn2, ti)      topic 9, positive      13
X(tp6, sn1, ti)      topic 6, neutral       12
X(tp7, sn1, ti)      topic 7, neutral       12
X(tp13, sn2, ti)     topic 13, positive     12
X(tp7, sn0, ti)      topic 7, negative      11
X(tp8, sn1, ti)      topic 8, neutral       11
X(tp16, sn2, ti)     topic 16, positive     11
X(tp3, sn2, ti)      topic 3, positive      10
Mathematical statement: Granger causality supposes the following hypotheses—H0 (null): X does not Granger-cause Y; H1 (alternative): X Granger-causes Y. Both time series need to be stationary, i.e., parameters such as the mean and variance should remain constant over time. To test H0, the proper number of lags p of Y to be included in a univariate autoregressive model of Y (Eq. (3)),

y_t = a_0 + a_1 y_{t−1} + a_2 y_{t−2} + ⋯ + a_p y_{t−p} + ε_t,   (3)

is identified using information criteria such as the Akaike information criterion (AIC) [58] and the Bayesian information criterion (BIC, also known as the Schwarz information criterion) [59]. AIC and BIC are formally defined as:

AIC = 2k − 2 ln(L̂),    BIC = k ln(n) − 2 ln(L̂),

where k is the number of estimated parameters (the variables in the model and the intercept), L̂ is a measure of model fit (the maximized likelihood), and n is the sample size. We start with the autoregressive model that has the lowest AIC or BIC value. Next, the lagged values of X are included in the model (Eq. (4)):

y_t = a_0 + Σ_{i=1}^{p} a_i y_{t−i} + Σ_{j=q}^{r} b_j x_{t−j} + ε_t.   (4)

The q and r parameters in Eq. (4) are the shortest and longest lag lengths for which the lagged values of X are significant. H0 is accepted if and only if no lagged values of X are significant in Eq. (4). The significance of the individual variables and of their collective explanatory power is assessed with the t-test and the F-test, respectively. The causality test was performed between Y and each X for a maximum lag of 14 at the 5% significance level. We used Statsmodels’ adfuller module [60] to implement the Augmented Dickey–Fuller (ADF) test [61] to check the variables for stationarity. The test supposes the following hypotheses—H0: non-stationarity exists in the series; H1: stationarity exists in the series. Second-level differencing was required to make Y and all variables in T stationary. Table 7 lists the set of variables sorted by the count of significant p-values, i.e., the count of lags at which a variable was observed to Granger-cause Y. The respective plots of these variables are shown in Fig. 4.
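The information criteria and the second-level differencing step can be sketched as below; the `log_likelihood` value is an illustrative number, not one from the paper:

```python
import math

def aic(k, log_likelihood):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(k, log_likelihood, n):
    """Bayesian (Schwarz) information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

def second_difference(series):
    """Apply first-order differencing twice, as required here for stationarity."""
    d1 = [b - a for a, b in zip(series, series[1:])]
    return [b - a for a, b in zip(d1, d1[1:])]

# A quadratic trend becomes constant after second-level differencing
diffed = second_difference([1, 4, 9, 16])
```

Note that a model fitted on the 618-day sample would supply k, ln(L̂), and n = 618 to these functions; the candidate lag order with the lowest criterion value is kept.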
Fig. 4

Plots of the variables (listed in Table 7) that Granger-cause $y$ at the most lags (10). For each subplot, the vertical axis represents the count of tweets, and the horizontal axis represents the date.

Variables that Granger-cause $y$ at the most lags (only 10 listed).

Forecasting models

Autoregressive (AR), Moving Average (MA), ARMA, Integrated ARMA (ARIMA), ARIMA with exogenous variables (ARIMAX), seasonal observation- and error-based variants (SARIMA, SARIMAX), Prophet, neural net-based models (NeuralProphet, Long Short-Term Memory), and stochastic gradient boosting-based models (XGBoost) are some of the widely used time series forecasting models. Before starting the experiments that address the research questions (RQ2, RQ3, and RQ4) of this study, we fit the dependent variable $y$ to multiple time series forecasting models to identify the model that best explains the variable's trend. This way, going forward, it is justifiable to continue with the best model and introduce the social media context into it. We used a machine learning Python library, Auto TS,7 to build multiple traditional, FB Prophet, and XGBoost models on $y$ and identified the best model based on the reported Root Mean Square Error (RMSE) (Eq. (12)) scores. Training and testing were performed using expanding-window cross-validation (with the library's default parameters). Table 8 shows the results provided by Auto TS.
Table 8

Best forecasting model for $y$.

Approach                                            Avg. RMSE
Traditional model^a (ARIMA with p=1, d=1, q=3)^b    135.387
Additive model (FB Prophet)                         236.427
Machine learning model (XGBoost)                    341.8

a. Also involves the participation of models such as AR, MA, ARIMA, and SARIMA.

b. The traditional models and their mathematical structures are discussed later in Section 4.2.1.

We also experimented with neural network models, but the results were not encouraging; possibly the amount of data (this study uses 618 days of data) is not sufficient to fully exploit the forecasting capabilities of neural models. The results reported in Table 8 suggest that the traditional models explain the cases trend significantly better than the additive approach-based FB Prophet and the gradient boosting-based XGBoost models. From here, to address research questions RQ2, RQ3, and RQ4, the design of forecasting models proceeds in two phases. First, we design ARIMA with exogenous variables (ARIMAX) models to show that the inclusion of social media data provides additional forecasting capability. Second, we design Vector Autoregressive (VAR) models to forecast the number of COVID-19 cases, seven days into the future, using the same set of variables.
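The expanding-window cross-validation used for the model comparison above can be illustrated with a minimal splitting sketch; the fold count and test size here are illustrative, since the study relies on Auto TS's default parameters:

```python
# Minimal illustration of expanding-window cross-validation: each fold keeps
# the test size fixed while the training window grows over earlier folds.
def expanding_window_splits(n_obs, n_folds, test_size):
    """Return (train_idx, test_idx) pairs with a growing training window."""
    splits = []
    for k in range(n_folds):
        test_end = n_obs - (n_folds - 1 - k) * test_size
        test_start = test_end - test_size
        splits.append((list(range(test_start)),
                       list(range(test_start, test_end))))
    return splits

splits = expanding_window_splits(n_obs=618, n_folds=3, test_size=14)
for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")
```

The key property is that test observations always come strictly after the training window, so no future data leaks into model fitting.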

ARIMAX models

Mathematical definition. Given a time series $y_t$, the autoregressive part, AR($p$), can be defined as: $y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$ (Eq. (5)), where $c$ is a constant, $\varepsilon_t$ is the error at time $t$, and $p$ is the number of lags of the prior values of $y$ considered for regression. Eq. (5) can be made more concise (shown in Eq. (6)) by introducing the back-shift operator (a.k.a. lag operator) $L$, as $L^i y_t = y_{t-i}$: $\phi_p(L)\, y_t = c + \varepsilon_t$ (Eq. (6)), where $\phi_p(L) = 1 - \phi_1 L - \dots - \phi_p L^p$ is a polynomial function of $L$ of order $p$. Similarly, for the same time series $y_t$, the moving average part, MA($q$), can be defined as: $y_t = c + \theta_q(L)\, \varepsilon_t$ (Eq. (7)), where $q$ is the number of lags of the prior error values considered for regression, and $\theta_q(L)$ is defined similarly to $\phi_p(L)$. The sum of the AR and MA models forms the ARMA model, which is defined as: $\phi_p(L)\, y_t = c + \theta_q(L)\, \varepsilon_t$ (Eq. (8)). Further, to deal with non-stationary time series, an integration operator is introduced and defined as $\Delta^d y_t = (1 - L)^d y_t$ (Eq. (9)), where $d$ is the order of differencing required to make the non-stationary time series stationary. When an ARMA model is fitted on the integrated time series, the model is termed ARIMA and represented as: $\phi_p(L)\, \Delta^d y_t = c + \theta_q(L)\, \varepsilon_t$ (Eq. (10)). When ARIMA models take exogenous variables into account, they are termed ARIMAX models and represented as: $\phi_p(L)\, \Delta^d y_t = c + \sum_{i=1}^{r} \beta_i X_t^{(i)} + \theta_q(L)\, \varepsilon_t$ (Eq. (11)), where $r$ is the number of exogenous variables $X^{(i)}$ with $\beta_i$ as their respective coefficients. Exogenous variables at time $t$ are independent variables that influence the dependent variable at $t$. ARIMAX models do not regress on the lagged values of such variables; instead, those values are computed outside the system and used for predicting the dependent variable. In our case, the social media variables are the exogenous ones; however, our lagged time series dataset also incorporates lagged values, so the models can look back up to 14 days and make forecasts accordingly. Results from training. We use Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination (R2) as measures for assessing the quality of the predictions made by the forecasting models.
For $n$ observations, with $y_i$ representing true values and $\hat{y}_i$ representing predicted values, RMSE, MAPE, and R2 are mathematically defined as: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ (Eq. (12)), $\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$ (Eq. (13)), and $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ (Eq. (14)). We fit ARIMA models on $y$, and ARIMAX models on $y$ and the variables (alongside their lags available through the lagged dataset) that Granger-cause $y$ at all 14 lags. We mark the ARIMA models as baseline model candidates and the ARIMAX models as social media model candidates. All models were fitted on the data observed up to August 26, 2021, and tested on the data observed between August 27, 2021, and September 9, 2021. The best fit was determined based on the reported AIC scores: the lower the AIC, the better the fit. The training results are shown in Table 9 for both sets of models.
Table 9

Results from training. Models are ranked based on their AIC scores.

(a) Top 5 baseline models.

(p, d, q)    AIC       RMSE

(6, 2, 7)    6118.50   37.78
(5, 2, 8)    6118.80   37.81
(7, 2, 5)    6119.06   37.89
(7, 2, 8)    6120.12   37.70
(7, 2, 6)    6120.21   37.87

(b) Top 5 social media models.

(p, d, q)    AIC       RMSE

(2, 2, 3)    5941.08   32.97
(1, 2, 4)    5942.43   33.05
(2, 2, 2)    5945.95   33.26
(4, 2, 3)    5957.88   33.12
(4, 2, 2)    5960.05   33.41
Since all social media models report lower RMSE on the training data than the baseline models, it is evident that including the social media variables helps explain the dependent variable better (a 12.73% improvement over the best baseline model) than using just the lagged values of the dependent variable. It is also apparent that the best-fitted social media model requires lower lag parameters for both the autoregressive and moving-average processes than the best-fitted baseline model. For forecasting the daily COVID-19 cases over the test period, we selected the ARIMA(6, 2, 7) model as the baseline model and the ARIMAX(2, 2, 3) model as the social media model. The residuals from both models were further checked for the presence of any remaining patterns; for both models, the residual correlograms showed near-zero (insignificant) autocorrelations at all lags. Table 10 presents the forecasting results obtained using the baseline and social media models at 1% and 5% significance, and Fig. 5 plots the models' forecasts from both the training and testing phases.
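The residual-whiteness check mentioned above can be sketched by computing residual autocorrelations and comparing them against the approximate 95% white-noise band; the residuals here are synthetic stand-ins for a fitted model's residuals:

```python
import numpy as np

rng = np.random.default_rng(2)
residuals = rng.normal(size=300)   # stand-in for a fitted model's residuals

def sample_acf(x, nlags):
    """Sample autocorrelations at lags 1..nlags."""
    x = x - x.mean()
    denom = float(np.dot(x, x))
    return [float(np.dot(x[:-k], x[k:]) / denom) for k in range(1, nlags + 1)]

acfs = sample_acf(residuals, nlags=14)
band = 1.96 / np.sqrt(len(residuals))  # approximate 95% band for white noise
exceedances = sum(abs(r) > band for r in acfs)
print("lags outside the band:", exceedances)
```

If nearly all autocorrelations fall inside the band, the residuals are consistent with white noise and the fitted model has captured the series' structure.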
Table 10

Results (upper values) from test data. Baseline model versus Social media model at 1% and 5% significance.

Date          Cases   Baseline (5%)   Baseline (1%)   Social media (5%)   Social media (1%)
2021-08-27    1119    1068            1092            1116                1138
2021-08-28    1321    1090            1114            1143                1166
2021-08-29    1355    1074            1099            1171                1195
2021-08-30    1257    1114            1144            1219                1244
2021-08-31    1225    1161            1195            1242                1272
2021-09-01    1467    1120            1159            1289                1325
2021-09-02    1648    1194            1240            1358                1399
2021-09-03    1741    1230            1280            1413                1459
2021-09-04    1670    1221            1276            1447                1496
2021-09-05    1536    1261            1320            1472                1525
2021-09-06    1466    1279            1342            1529                1586
2021-09-07    1696    1326            1393            1572                1634
2021-09-08    1725    1323            1394            1568                1635
2021-09-09    1870    1334            1410            1661                1731
RMSE                  342.58          295.68          175.31              143.76
MAPE                  19.36%          16.29%          9.24%               7.61%
R2                    0.67            0.68            0.75                0.75
Fig. 5

COVID-19 confirmed cases versus the cases predicted by the baseline and social media models at 1% and 5% significance levels.

On the testing data, the social media models introduce 48.83% and 51.38% improvements on RMSE over the baseline models at 5% and 1% significance, respectively. These significant improvements confirm that social media discourse is indeed a good predictor for pandemic-related forecasting models. In Table 10, for the data observed after September 1, 2021, the baseline model's forecasts begin to be off by significant margins, while the social media model keeps up with the trend of the daily cases with small errors. The testing timeline in this study, a steep-hill curve (also shown in Fig. 5), was the most suitable region (compared to monotonically ascending regions) for examining the effect of exogenous variables that might influence the variable being forecasted. Based on the results presented in this section, we conclude that the latent variables extracted from the COVID-19 specific social media discourse can be good predictors of the pandemic's daily cases, and that these variables are predictive of the steep-hill curve of COVID-19 cases during an ongoing wave. Continuing with the idea that the social media variables are predictive of our dependent variable, in the next section we fit VAR models to forecast the COVID-19 cases in Australia for the next 7 days.

VARMA models

Vector Autoregressive Moving-Average (VARMA) models are multivariate linear time series models generally used for simultaneously modeling multiple stationary time series and generating simultaneous forecasts of the variables in the system. Mathematically, a VARMA($p$, $q$) model is defined as: $y_t = c + A_1 y_{t-1} + \dots + A_p y_{t-p} + M_1 \varepsilon_{t-1} + \dots + M_q \varepsilon_{t-q} + \varepsilon_t$ (Eq. (15)), where $y_t$ is a $K \times 1$ vector of $K$ distinct dependent time series variables at time $t$, $c$ is a $K \times 1$ vector of constants (one per equation), each $A_i$ is a $K \times K$ matrix of autoregressive coefficients, each $M_j$ is a $K \times K$ matrix of moving-average coefficients, and $\varepsilon_t$ is a $K \times 1$ vector of error terms. From our experiments, we observed that including the moving-average part of the VARMA models did not improve the quality of the forecasts compared to using the autoregressive part alone. Therefore, we considered VAR models, defined as $y_t = c + A_1 y_{t-1} + \dots + A_p y_{t-p} + \varepsilon_t$ (Eq. (16)), for our multivariate time series forecasting. We fitted multiple VAR($p$) models on the variables (except for the lagged ones) used by the social media model in Section 4.2.1 to forecast the COVID-19 cases for the next 7 days. The results from the VAR order selection and the forecasts made by the best-fitted VAR model are shown in Table 11 and Fig. 6, respectively. We observed the lowest AIC score with the VAR(15) model. The social media model from Section 4.2.1 had an autoregressive process of lag order 2, implying that looking back up to 15 days best describes our dependent variable: we had designed the lagged time series dataset in such a way that lag order 1 included the past 14 days' data, lag order 2 the past 15 days' data, and so on. We observe the same mathematical implication here from the best-fitted VAR model.
Table 11

VAR order selection: fitting VAR($p$) models. The lowest AIC score is highlighted.

Parameter p    AIC
0              28.60
1              23.33
2              22.90
3              22.69
...            ...
15             22.50
16             22.52
Fig. 6

Forecast of COVID-19 cases for the next 7 days with the VAR model. RMSE 224.65, MAPE 9.08% (overall); RMSE 142.8, MAPE 6.74% (excluding September 10, 2021's sudden rise).

The VAR model was used for forecasting the COVID-19 cases in Australia one week in advance, from September 10, 2021, to September 16, 2021. The forecasts and their deviations from the actual cases are illustrated in Fig. 6. The RMSE and MAPE of the overall forecasts were 224.65 and 9.08%, respectively. Excluding September 10, 2021's sudden rise, the model reported an RMSE of 142.8 and a MAPE of 6.74%. Out of the 7 days' forecasts, the model forecast the cases almost perfectly for 3 days and with small margins of error for another 3 days. The VAR model can be deployed for making forecasts using unseen tweets. It depends only on the lagged time series dataset, which is based on the outputs generated by BERTsent and the LDA-based topic model. After collecting a statistically significant number of social media conversations related to an event, a similar topic model can be trained and used alongside BERTsent to generate an identical time series dataset, as discussed in Section 3.4.

Comparison with the existing studies

In this study, we proposed a representation for microblog conversations that captures the volume of social media activity (conversations) at a more granular level, to reduce the intensity of possible forecast biases. In the existing literature, the "volume" feature includes social media search indexes, category-based counts, and overall-count strategies. Using the "volume" feature keeps computational complexity minimal, as only the counts of tweets under a given strategy need to be maintained; notably, such models can be deployed on small-scale infrastructure. However, those models are heavily affected by avalanches of auto-generated conversations. Therefore, this study breaks the "volume" feature into more granular levels in order to decrease the models' dependency on one or a few thematic counts. From Table 8, it is evident that the traditional forecasting models explain the trend of the daily confirmed COVID-19 cases in Australia significantly better than the additive, machine learning, and neural models. This observation agrees with what has been reported in earlier studies [17], [18] involving forecasts of COVID-19 cases. Moving on, in this section we compare the forecasting ability of our social media model with existing studies that use the social media "volume" feature for designing discourse-based forecasting models. To compare our methodology (identifying relevant exogenous variables through a latent variables search), we fit the various volumetric features considered by existing studies as exogenous variables to forecast our dependent variable $y$. Social media-based volumetric features. The following volumetric features were considered as exogenous variables for comparison against the variables identified by our latent variables search methodology. (i) Search indexes: Google Trends8 was considered the data source for social media search indexes.
The platform provides the popularity of search queries on Google across various geographical regions. The popularity of a search query is given as a set of numbers (between 0 and 100) for each day, where the peak value "100" is the highest point on the graph for the given region and timeline. The platform gives daily search interests for a query only for a timeline of at most 9 months; beyond that range, week-level search interests are provided. For this study, we extracted search interests in three different blocks (search trend blocks) for the period January 1, 2020, to September 9, 2021, for the following terms: dry cough, chest distress, coronavirus, fever, and pneumonia. The search trend blocks were created with overlaps so that the second and third blocks could be scaled relative to the first. The daily search interests in the second and third blocks were re-scaled by the blocks' respective scaling factors, computed from the values over the overlapping periods. Fig. 7 plots the daily Google search interests for the search terms. The term "chest distress" was excluded since it did not have significant search interest in Australia. Fig. 7(e) is the plot of all search terms relative to each other; it is evident from the plot that the search interest for the term "coronavirus" was significantly higher than for the other terms.
Fig. 7

Search interests data retrieved from Google Trends for the period January 1, 2020, to September 9, 2021.
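The block-chaining rescaling described above can be sketched as follows; using the ratio of means over the overlap window as the scaling factor is one plausible implementation, not necessarily the exact formula used in the study:

```python
import numpy as np

def chain_blocks(blocks, overlap):
    """Rescale each later 0-100 block onto the first block's scale using the
    shared overlap window, then concatenate without duplicating the overlap."""
    scaled = [np.asarray(blocks[0], dtype=float)]
    for cur in blocks[1:]:
        cur = np.asarray(cur, dtype=float)
        factor = scaled[-1][-overlap:].mean() / cur[:overlap].mean()
        scaled.append(cur * factor)
    return np.concatenate([scaled[0]] + [b[overlap:] for b in scaled[1:]])

# toy check: one underlying series cut into two blocks, each renormalised
# to a maximum of 100 (as Google Trends does within a block)
true = np.linspace(10, 60, 30)
b1 = true[:18] / true[:18].max() * 100
b2 = true[12:] / true[12:].max() * 100   # 6-day overlap with b1
rebuilt = chain_blocks([b1, b2], overlap=6)
print(len(rebuilt))
```

Because both blocks are renormalised versions of the same underlying series, the chained result recovers the original series up to a constant factor.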

(ii) Sick posts: We processed all the Twitter conversations in the dataset through the LDA model designed in Section 3.3 to create the "sick" related posts' time series. Tweets with the highest score in the probability distribution for topic "6" were considered "sick" related posts. Some salient words in topic "6" (sorted by influence) include: test, case, testing, isolation, symptom, clinic, lab, isolate, swab, fever, throat, trace, temperature, quarantine, positive, tracer, carrier, diagnosis, pathology, vitamin. (iii) Overall posts: A daily distribution of the Twitter conversations in the dataset was maintained to create the "overall" posts' time series. Next, we created 14 additional lagged variables for each time series so the models could look back up to 14 days when making forecasts (the lagged dataset followed the same implementation). Table 12 and Table 13 summarize the results from fitting ARIMAX models on the different sets of exogenous variables considered in the existing studies. We use the same training and testing timeline as the social media model designed in Section 4.2.1.
Table 12

Comparison of our latent variables search methodology with existing studies that use social media-based volumetric features.

                                          At 5% significance            At 1% significance
                                          RMSE      MAPE      R2        RMSE      MAPE      R2
Baseline^a                                342.58    19.36%    0.67      295.68    16.29%    0.68
Search index (dry cough)^b                326.93    17.76%    0.68      277.22    14.74%    0.7
Search index (coronavirus)^b              307.48    16.98%    0.7       258.716   13.75%    0.7
Search index (fever)^b                    344.15    19.49%    0.67      297.28    16.4%     0.68
Search index (pneumonia)^b                266.13    14.51%    0.67      223.15    11.79%    0.68
Search indexes combined^b                 241.23    13.1%     0.66      200.40    10.62%    0.67
Sick posts^c                              283.44    15.71%    0.68      239.68    12.72%    0.69
Sick posts + Search indexes combined      198.52    10.29%    0.7       160.62    8.52%     0.70
Overall posts^d                           289.16    16.07%    0.73      241.44    12.84%    0.73
Latent variables search^e                 175.31    9.24%     0.75      143.76    7.61%     0.75

a. Fitted solely on $y$. b–d. Exogenous variables: [36], [37], [38], [40], [6]. e. This study.

Table 13

Results from fitting the exogenous variables listed in Table 12 and their respective 14 days’ lags against 84 weeks of data (January 15, 2020, to August 26, 2021).

Feature set                               Best fitted model   Exo. variables count        AIC       RMSE
Baseline                                  ARIMA(6,2,7)        –                           6118.50   37.78
Search index (dry cough)                  ARIMAX(9,2,9)       1 and its 14 lags           6019.93   37.46
Search index (coronavirus)                ARIMAX(7,2,5)       1 and its 14 lags           6013.5    37.51
Search index (fever)                      ARIMAX(5,2,8)       1 and its 14 lags           5993.47   37.55
Search index (pneumonia)                  ARIMAX(6,2,9)       1 and its 14 lags           6001.28   37.52
Search indexes combined                   ARIMAX(7,2,8)       4 and respective 14 lags    6085.15   36.53
Sick posts                                ARIMAX(8,2,7)       1 and its 14 lags           5989.78   37.12
Sick posts + Search indexes combined      ARIMAX(3,2,9)       5 and respective 14 lags    6069.28   35.77
Overall posts                             ARIMAX(4,2,5)       1 and its 14 lags           5991.94   37.34
Latent variables search                   ARIMAX(2,2,3)       14 and respective 14 lags   5941.08   32.97
Table 12 reports the RMSE, MAPE, and R2 of the baseline model, the existing studies, and this study at both 5% and 1% significance. The results show that our methodology outperforms the existing studies that use social media-based volumetric features to forecast the daily confirmed COVID-19 cases. Except for the search term fever, the search interests of the other three terms included in the experimentation, i.e., dry cough, coronavirus, and pneumonia, provide additional forecasting ability (compared to the baseline model regressed only on $y$). When all search terms were combined and fitted, further improvements were observed in both RMSE and MAPE. The best-fitted model for the "sick" related posts performed poorly compared to the combined search indexes model. We performed an additional experiment combining and fitting the exogenous variables associated with the sick posts and all search indexes, and observed significant improvements in RMSE and MAPE; the metrics improved to 198.52 and 10.29% at 5% significance, and 160.62 and 8.52% at 1%. The overall posts model performed on par with the sick posts model, providing evidence that the count strategy, be it category-based or general, offers limited forecast capability. Overall, our latent variables search methodology achieves the lowest RMSE and MAPE and the highest R2 at both significance levels.
To demonstrate the robustness of our methodology, Table 13 provides the results (the p, d, and q parameters of the best-fitted models, their respective exogenous variable counts, and AIC/RMSE scores) obtained by fitting the exogenous variables listed in Table 12 and their respective 14 days' lags against 84 weeks of data, i.e., January 15, 2020, to August 26, 2021. The results show that the exogenous variables identified by our latent variables search methodology explain the dependent variable better than those used in the existing literature. For the 84 weeks of data, our social media model benchmarks the lowest RMSE of 32.97, followed by the Sick posts + Search indexes combined model with an RMSE of 35.77. All models with exogenous variables achieved better RMSE scores than the ARIMA-based baseline model. Issue with search interests. Search interests are "broad" in nature: a search for "coronavirus" can relate to multiple use cases, such as checking top stories, querying updates and local information, and accessing health information (symptoms, prevention, treatments). Search interests do not provide a granular-level distinction of the use case unless the search terms are more specific, such as "melbourne covid hotspots today", "coronavirus symptoms", and "covid hotline melbourne". Therefore, while designing interpretable forecasting models, it is critical to exploit the public conversations to search for latent variables that carry granular-level details regarding an event. Besides, services such as Google Trends can be retired, or data extraction can become limited as the platforms move to different versions; discourse-based models, in contrast, rely entirely on the conversations themselves and can have applications outside of the Twitter-verse.

The research questions

In this section, we address the four research questions (RQ1–RQ4) that this study set out to answer. Modeling Twitter data for region-specific analyses requires a large number of geotagged tweets. To address RQ1, we curated a large-scale geotagged tweets dataset – MegaGeoCOV – targeting the public COVID-19 discourse. We used Twitter's Academic Track-based full-archive search and count APIs to access the numbers presented in Table 2. Between January 1, 2020, and September 9, 2021, the minimum number of tweets (for the specific set of keywords and hashtags mentioned in Section 3.1) was 59.6k and the maximum was 25.8 million, with a mean of 4.62 million. Among those numbers, the volume of geotagged tweets was observed between 0.449% and 1.43%. Although the geotagged volume is considerably limited, the experiments from this study suggest that "what is currently available" is satisfactory for designing similar discourse-based forecasting models. We addressed RQ2 by performing Granger causality tests on the time series created from the geotagged Twitter data. We observed the presence of latent variables within the data that Granger-caused the daily COVID-19 confirmed cases time series. Some such variables (Granger-causing at 10 or more of the 14 lags) are listed in Table 7. The methodology for identifying such variables is discussed in Section 4.1. We also observed that the identified Granger-causing latent variables provide additional prediction capability to time series forecasting models (this observation addresses RQ3). The inclusion of social media variables in the modeling introduced a 12.73% improvement on the training data and above 48% improvements (at 1% and 5% significance) on the testing data over the baseline model (discussed in Section 4.2.1). Furthermore, "the volume of public discourse in the last few days" being predictive of the steep-hill curve of COVID-19 cases during an ongoing wave addresses our RQ4.
The latent variables are derived from each day's tweet volume. The forecasts produced by the ARIMAX and VAR models designed in this study verify that the volume of public discourse is predictive of the COVID-19 cases' steep-hill trend.

Conclusion

In this paper, a sentiment-involved topic-based latent variables search methodology was proposed for time series analysis of publicly available COVID-19 related Twitter conversations. A language model trained on 850 million English tweets (cased) and an additional 23 million COVID-19 English tweets (cased) using the RoBERTa pre-training procedure was fine-tuned for the sentiment analysis task, and LDA was performed to identify the hidden topics within the conversations. The proposed methodology was implemented on the COVID-19 cases in Australia and the Twitter conversations generated within the country between January 1, 2020, and September 9, 2021. ARIMA models (baseline model candidates) were fitted on the Australian COVID-19 cases, and ARIMAX models (social media model candidates) were fitted on the cases and the social media variables (alongside their lags) that Granger-cause them the most (at all 14 lags). Experimental results from training showed that including the social media variables brings a 12.73% improvement over the baseline model compared to using just the lagged values of the dependent variable, while on the testing data the social media models introduced 48.83% and 51.38% improvements on RMSE over the baseline models at 5% and 1% significance, respectively. Considering the same set of variables used by the social media model, a VAR model was used to forecast the COVID-19 cases in Australia one week in advance, from September 10, 2021, to September 16, 2021. Out of the 7 forecasts, the model predicted the cases almost perfectly for 3 days and with small margins of error for another 3 days, with RMSE and MAPE ranging between 142.8–224.65 and 6.74%–9.08%, respectively (the upper values of these metrics are the outcome of September 10, 2021's sudden rise in cases relative to the following 6 days). This study confirms the presence of a relationship between latent social media variables and daily COVID-19 cases.
The literature seems to have overlooked the social media perspective of COVID-19 time series analyses. Including social media variables alongside native epidemiological data (causes, risk factors, population descriptors, etc.) can be beneficial for early forecasts of an epidemic's or pandemic's future course. One limitation of this study is that only social media variables are included in the time series analysis. Social media, and especially microblogging platforms, skew towards tech-savvy users and younger populations. Further, the study considers only three categories of sentiment – positive, neutral, and negative – and does not consider other possible categories such as hate, offense, and irony. Furthermore, filtering out misinformative tweets could be an additional tweet selection procedure before the time series are constructed. These limitations could be an important research avenue when designing the next generation of discourse-based forecasting models.

CRediT authorship contribution statement

Rabindra Lamsal: Conceptualization, Data curation, Methodology, Software, Visualization, Writing – original draft. Aaron Harwood: Conceptualization, Supervision, Writing – review & editing. Maria Rodriguez Read: Conceptualization, Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
TopicSalient words
0lockdown, pm, rule, idea, message, panic, detail, move, meeting, announcement, restriction, location, gathering, situation, detention, notice, prime_minister, mess, looking_forward, regulation

1food, order, book, shop, market, price, water, store, delivery, supermarket, restaurant, paper, supply, stock, demand, sale, stuff, trade, cafe, list, product, customer, shortage, grocery

2family, friend, hope, love, mate, house, shit, thought, heart, member, visit, guy, daughter, wife, mother, girl, party, partner, son, movie, dad, anxiety, brother, memory, sister, colleague, loved_one, neighbor, kind, hug, spirit, song, prayer, soul, sunshine

3time, thing, moment, ship, air, lung, fire, cruise, trip, passenger, winter, crew, plane, hell, summer, pain, island, tip, weather, get_back, spring, ruby_princess, hope, quality, doubt, trouble, board, tour, track, smoke, breathe, omg, port, storm, boat

4year, team, game, event, show, season, video, player, sport, club, tv, challenge, watch, fan, race, art, music, crowd, training, play, session, ground, tennis, football, ticket, court, venue, ball, goal, episode, win, series, cricket, artist, film, star, host, horse, content, performance, league, competition, song, entertainment, gig

5death, people, number, rate, infection, risk, population, datum, transmission, protest, freedom, idiot, conspiracy, spread, exposure, control, theory, toll, site, suicide, factor, stat, mortality, prevent, evidence, confirmed_case, statistic, count, analysis, victim, every_day, protester, survivor, fatality, cases_death, surge

6day, today, test, case, person, hour, testing, isolation, yesterday, symptom, contact, tomorrow, area, period, week, clinic, morning, site, line, act, last_week, lab, drive, delay, queue, household, isolate, swab, fever, throat, trace, temperature, quarantine, positive, tracer, carrier, diagnosis, pathology, caution, vitamin

7people, world, country, life, rest, war, leader, million, threat, earth, around_world, citizen, pressure, stop, die, moron, stupidity, worry, shit, covidiot, problem, spanish_flu, kill, narrative, planet, prison, mentalhealth, years_ago, faith, enemy, weapon, danger, livelihood, estate, liberty, bullet, fighting, destruction, frustration

8health, issue, system, advice, problem, expert, science, effect, research, safety, emergency, condition, treatment, mental_health, concern, evidence, disease, management, trial, effort, scientist, solution, trust, officer, report, process, authority, drug, brain, damage

9news, media, story, fact, article, election, app, information, fear, comment, answer, view, truth, tweet, info, twitter, vote, source, journalist, ad, opinion, page, claim, president, statement

10mask, hospital, hand, patient, doctor, ace, staff, care, nurse, centre, phone, distance, shopping, icu, eye, ppe, pace, practice, bed, work, line, nose, capacity, folk, guideline, mouth, limit, nursing, lady

11work, job, business, worker, support, service, money, company, industry, cost, office, pay, healthcare, access, economy, leave, loss, bill, payment, driver, bus, sector, university, frontline, income, tax,

12state, case, border, vic, nsw, travel, restriction, outbreak, premier, flight, new_case, control, record, sa, victorian, wave, gladys, hotel_quarantine, cluster, traveler, arrival, closure, bubble, community_transmission, update, ban, region, hotspot, territory, exemption

13community, response, measure, part, change, decision, level, impact, economy, nation, point, situation, law, recovery, strategy, opportunity, crisis, sense, society, step, term, history, experience, reality, role, behavior, contract, thread, lock, model

14vaccine, flu, vaccination, pfizer, group, risk, age, jab, dose, study, delta, choice, blood, chance, strain, shot, astrazeneca, variant, type, appointment, get_vaccine, protection, immunity, pfizer_vaccine, reaction, virus, clot, target, gp, supply

15lockdown, week, month, melbourne, sydney, end, city, weekend, beach, stage, road, street, exercise, adelaide, town, restriction, suburb, stayhome, start, curfew, first_time, half, last_year, staysafe, apartment, rock, melb, sight, stage_lockdown

Topic 16: quarantine, home, hotel, police, student, security, parent, facility, place, hotel_quarantine, room, care, program, staff, purpose, force, airport, guard, adult, member, breach, station, protocol, inquiry, two_week, standard, fine, requirement

Topic 17: government, auspol, morrison, plan, australian, govt, failure, policy, leadership, power, responsibility, labor, disaster, leader, action, blame, federal_government, politician, deal, attack, excuse, crisis, insider, lack, lie, climate, minister, vaccine_rollout, credit, recession
(28 topics in total.)

