News Sentiment Informed Time-series Analyzing AI (SITALA) to curb the spread of COVID-19 in Houston.

Prathamesh S Desai

Abstract

Coronavirus disease (COVID-19) has evolved into a pandemic with many unknowns. Houston, located in Harris County, Texas, is becoming the next hotspot of this pandemic. With a severe decline in international and inter-state travel, a model at the county level is needed rather than one at the state or country level. Existing approaches have a few drawbacks. Firstly, the data used is the number of COVID-19 positive cases instead of positivity. The former depends on the number of tests carried out, whereas the latter is normalized by the number of tests. Positivity therefore gives a better picture of the spread of the pandemic as more tests are administered over time. Positivity under 5% has been desired for the reopening of businesses to almost 100% capacity. Secondly, the data used by models like SEIRD (Susceptible, Exposed, Infectious, Recovered, and Deceased) lacks information about the sentiment of people concerning coronavirus. Thirdly, models that make use of social media posts might have too much noise and misinformation. News sentiment, on the other hand, can capture long-term effects of hidden variables like public policy, opinions of local doctors, and disobedience of state-wide mandates. The present study introduces a new artificial intelligence (AI) model, Sentiment Informed Time-series Analyzing AI (SITALA), trained on COVID-19 test positivity data and news sentiment from over 2750 news articles for Harris County. The news sentiment was obtained using IBM Watson Discovery News. SITALA is inspired by the Google-Wavenet architecture and makes use of TensorFlow. The mean absolute error for the training dataset of 66 consecutive days is 2.76, and that for the test dataset of 22 consecutive days is 9.6. A cone of uncertainty is provided within which future COVID-19 test positivity has been shown to fall with high accuracy. The model predictions fare better than a published Bayesian-based SEIRD model. 
The model forecasts that in order to curb the spread of coronavirus in Houston, a sustained negative news sentiment (e.g., death count for COVID-19 will grow at an alarming rate in Houston if mask orders are not followed) will be desirable. Public policymakers may use SITALA to set the tone of the local policies and mandates.
© 2021 Elsevier Ltd. All rights reserved.

Keywords:  Artificial intelligence; COVID-19 model; Deep learning; News sentiment; Pandemic forecast; Public policy

Year:  2021        PMID: 33942002      PMCID: PMC8081574          DOI: 10.1016/j.eswa.2021.115104

Source DB:  PubMed          Journal:  Expert Syst Appl        ISSN: 0957-4174            Impact factor:   6.954


Introduction

In the present-day USA, pandemic models of COVID-19 at the level of state or country (Giuliani et al., 2020, Yang et al., 2020, Benvenuto et al., 2020, Dandekar and Barbastathis, 2020, Chimmula and Zhang, 2020, Mohamadou et al., 2020) are not of much use due to a severe decline in air travel (USDoT, 2020, Chinazzi et al., 2020). The spread of the virus is predominantly restricted to a county or a couple of neighboring counties. National, and more so local, news articles provide a fairly accurate picture of the ongoing situation in a crisis-stricken county, Harris County in the case of the present study. Local public policymakers can thus play an important role in setting the tone or sentiment of the news at the county level (Adolph, Amano, Bang-Jensen, Fullman, & Wilkerson, 2020). In the short term, the spread of coronavirus is a function of test positivity data; over longer periods, however, news sentiment starts playing an important role. For example, a stay-at-home order with a strong negative sentiment issued today may only lead to a decline in test positivity after 10–14 days because of the coronavirus’s incubation period. Existing approaches have a few drawbacks (Yang et al., 2020, Shinde et al., 2020, Karisani and Karisani, 2020, Ayyoubzadeh et al., 2020, Alamoodi et al., 2020, Jha et al., 2020): Firstly, the data used is the number of COVID-19 positive cases instead of positivity. Secondly, the data used by models like SEIRD (Susceptible, Exposed, Infectious, Recovered, and Deceased) lacks information about the sentiment of people concerning coronavirus. Thirdly, models that make use of social media posts might have too much noise. This first-of-a-kind study attempts to develop a multivariate artificial intelligence (AI) model to analyze the time series of COVID-19 positivity and news sentiment. 
The AI model is inspired by the Google-Wavenet (Oord et al., 2016) architecture and uses IBM Watson Discovery News (High, 2012) to mine COVID-19 sentiment in news articles. To the best of the author’s knowledge, this is the first AI study to combine spatial information via news sentiment at the county level with COVID-19 positivity data (Nguyen, 2020). The methodology is described in Section 2. Predictions of the model are presented in Section 3 and compared with a published SEIRD temporal model from the literature; the current model fares better than this published Bayesian-based SEIRD model (Jha et al., 2020). Discussion of the modeling results is presented in Section 4.

Methodology

Data

The COVID-19 test positivity data for Harris County was obtained from the website of the Texas Department of State Health Services. A couple of instances of wrong or missing data were filled using linear interpolation. IBM Watson Discovery was used to mine the news sentiment in 2867 news articles published over three months. The tool provides 200 free queries per user per month. Discovery News employs natural language processing to return answers to the queries. It also analyzes the sentiment of the news articles. The query used in this study was:

Query: Is the spread of coronavirus or covid-19 or 2019-nCoV under control?
Include analysis of your results: average(enriched_text.sentiment.document.score)
Filter which documents you query: publication_date::"2020-05-29",url:"houston",(enriched_text.keywords.text:"coronavirus"|enriched_text.keywords.text:"COVID-19"|enriched_text.keywords.text:"2019-nCoV")

A sample query along with the output from Watson Discovery News is shown in Fig. 1. The entire dataset is also provided in Appendix A.
Fig. 1

Title: Sample query output from IBM Watson News Discovery. Description: The present study used this exact same query to determine the sentiment of news (-1 implies maximum negative sentiment and +1 implies maximum positive sentiment) over varying publication dates.
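The Data subsection states that wrong or missing positivity values were filled using linear interpolation. A minimal sketch of that step is given below in plain Python. The function name `fill_gaps_linear`, the use of `None` to mark a gap, and the handling of gaps at the edges of the series are assumptions made for illustration; the source does not describe the author's actual implementation.

```python
from typing import List, Optional


def fill_gaps_linear(values: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between known neighbours.

    Gaps at the start or end are filled with the nearest known value
    (an assumption; the source does not say how edge gaps were handled).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    filled = list(values)
    # Edge gaps: copy the nearest known value.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: straight line between the bracketing known points.
    for left, right in zip(known, known[1:]):
        span = right - left
        rise = values[right] - values[left]
        for i in range(left + 1, right):
            filled[i] = values[left] + rise * (i - left) / span
    return filled


# Hypothetical daily positivity series (%) with two missing days.
series = [4.0, None, None, 10.0, 8.0]
print(fill_gaps_linear(series))
```

For the hypothetical series above, the two missing days are placed on the straight line joining 4.0 and 10.0.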

Model (Oord et al., 2016, High, 2012, Géron, 2019, Ting et al., 2020, Wan et al., 2019, Borovykh et al., 2017)

Daily COVID-19 positivity rate and news sentiment are passed to a Wavenet-inspired multivariate convolutional neural network (CNN) to predict the future COVID-19 positivity rate. The AI is named SITALA, or Sentiment Informed Time-series Analyzing AI. A 16-day window, based on the coronavirus incubation period of 11–16 days (Lauer et al., 2020), with a stride of 1 was used to generate the training and test datasets. The architecture of SITALA is shown in Fig. 2. Dilated causal convolutions help with the transmission of long-term effects. Dilation rates of 1, 2, 4, and 8 days were used. The output is a single-point prediction of COVID-19 test positivity at the next time step. 10% of the training data was reserved for validation. Jupyter notebook code for the neural network architecture and the chosen hyper-parameters is provided in Appendix B. Cross-validation for time series data may result in data leakage if necessary precautions are not taken (Bergmeir, Hyndman, & Koo, 2018). Additionally, cross-validation, in general, may result in overfitting to the training data (Rao, Fung, & Rosales, 2008). Thus, cross-validation was not performed, considering the limited amount of data and the unknown final desired MAE.
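The windowing described above can be sketched in plain Python as follows. The helper names (`receptive_field`, `make_windows`, `chronological_split`) and the synthetic data are hypothetical. How the author handled windows that straddle the train/test boundary is not stated in the source, so the split shown here is only one possible choice. The sketch also illustrates a general property of dilated causal convolutions: with a kernel size of 2 and dilations of 1, 2, 4, and 8, the stack sees exactly 16 past time steps, which matches the chosen window.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, float]  # (test positivity, news sentiment) for one day


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Time steps seen by a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def make_windows(rows: List[Row], window: int = 16) -> Tuple[List[List[Row]], List[float]]:
    """Slide a window with stride 1; the target is the next day's positivity."""
    inputs: List[List[Row]] = []
    targets: List[float] = []
    for start in range(len(rows) - window):
        inputs.append(rows[start:start + window])
        targets.append(rows[start + window][0])
    return inputs, targets


def chronological_split(rows: List[Row], test_days: int = 22) -> Tuple[List[Row], List[Row]]:
    """Hold out the last `test_days` rows without shuffling."""
    return rows[:-test_days], rows[-test_days:]


# Synthetic stand-in for 88 days of (positivity, sentiment) observations.
rows = [(float(day), 0.0) for day in range(88)]
train_rows, test_rows = chronological_split(rows, test_days=22)
X_train, Y_train = make_windows(train_rows, window=16)

print(receptive_field(2, (1, 2, 4, 8)))   # 16
print(len(train_rows), len(test_rows))    # 66 22
print(len(X_train), len(X_train[0]))      # 50 16
```

Keeping the split chronological, rather than shuffled, is what prevents future observations from leaking into the training windows.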

Results

The data used in the present study and the predictions of SITALA are shown in Fig. 3. Data for Harris county from 04/21 to 07/18 has been used in this study. The number of news articles returned by IBM Watson Discovery News is shown with bars that are multiplied by the sign of the average daily news sentiment. The average daily news sentiment (connected blue squares) can range from +1.0 for maximum positive sentiment to −1.0 for maximum negative sentiment. On most days, the news sentiment about the spread of coronavirus in Houston, Harris County, Texas, has been negative. A significant positive spike in news sentiment is seen around the time of social unrest in Houston (05/30 to 06/02). The focus of news might have shifted away from COVID-19 during this time-frame. Overall, an upward trend is visible in the COVID-19 positivity data (connected red dots).
Fig. 2

Title: Architecture of SITALA. Description: It is a sequential model that takes a window of multivariate (viz., COVID test positivity and news sentiment from IBM Watson News Discovery) timeseries and outputs the COVID test positivity at the next timestep. This architecture is inspired by Google-Wavenet (Oord et al., 2016) architecture (viz., dilated causal convolutions). In the present study, a window size of 16 was used and dilations of 1, 2, 4, 8 were used. The window size was chosen to be 16 based on the incubation period of coronavirus as reported by Lauer et al. (2020).

Fig. 3

Title: SITALA predictions and relevant COVID-19 data for the Harris county, Texas. Description: Number of news articles returned by IBM Watson discovery, shown on the right axis, started increasing from around mid-May, 2020. COVID-19 test positivity and news sentiment are shown on the left axis. Around 75% of the data (green window) was used for training SITALA, of which 10% was reserved for validation. SITALA was tested on remaining 25% of the data (blue window) for which the mean absolute error (MAE) was 9.6. SITALA forecast (gray window) shows how maintaining a negative sentiment in the news about the spread of COVID-19 can be beneficial to control and eventually decrease test positivity.

As the window size based on the virus’s incubation period is 16 days (Lauer et al., 2020), the test data set should have at least 16 days’ worth of data. The model is expected to predict at least a week into the future beyond these 16 days. Thus, the last 22 days’ worth of data was reserved for testing. This 22 days’ worth of data amounted to about 25% data for testing, and the remaining 75% of data was used for training. 10% of the training data was used for validation. The continuous black line with shadow shows the predictions of trained SITALA over the entire dataset. 
SITALA can capture the COVID-19 positivity data response with a mean absolute error (MAE) of 2.76 for the training dataset and 9.6 for the test dataset. SITALA is unable to capture the highest spikes encountered in both datasets of COVID-19 positivity. This prediction error may be due to the small number of observations in the total dataset (88 days’ worth of observations is far from big-data scale) and the smoothing effect introduced by the 16-day time window. The theoretical limits of news sentiment are +1.0 and −1.0. However, it is not feasible to expect that all news outlets would maintain such an extreme sustained positive or negative sentiment. The observed minimum news sentiment in this study was about −0.62 (refer to Table A.1), which is comparable to another study on news sentiment during COVID-19 (Buckman, Shapiro, Sudhof, & Wilson, 2020). Thus, the limits for future news sentiment were assumed to be +0.7 and −0.7. With these as the inputs to SITALA, two extreme forecasts were obtained through 08/07. These are also shown in Fig. 3 with dotted black and dotted green lines with shadow. SITALA forecasts the positivity of COVID-19 in Houston to lie within this cone of uncertainty. The model forecasts that a sustained positive sentiment, e.g., “masks are optional”, may prove disastrous for the spread of coronavirus in Houston: COVID-19 test positivity can reach 60% in this case. On the contrary, a sustained negative sentiment, e.g., “death count for COVID-19 will grow at an alarming rate in Houston if mask orders are not followed”, may help to discourage social gatherings and keep COVID-19 positivity in check. COVID-19 positivity, as forecasted by SITALA, will stay under 20% in this case. The true COVID-19 test positivity from 07/19 to 08/05 is also shown in Fig. 3. 
The ground truth in the forecast falls on or inside the predicted boundaries except for 07/21, where SITALA under-predicts, and 08/02 through 08/04, where SITALA over-predicts. The former error of under-prediction may be problematic, but the latter error of over-prediction is not harmful. Note that SITALA was only trained on the data until 06/26. Also shown in Fig. 3 is the prediction by another model published in the literature (Jha et al., 2020). The model developed by Jha et al. is a SEIRD (i.e., Susceptible, Exposed, Infectious, Recovered, and Deceased) temporal model. Bayesian learning was used to find the optimal model parameters. The model was calibrated on data for the entire state of Texas. The positivity predictions of this model for Harris county are on the low side. This prediction error highlights the drawback of SEIRD models that do not account for any spatial data. Local news articles capture the spatial information of the COVID-19 spread, and SITALA, which made use of this additional spatial information, could better predict the COVID-19 positivity.
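The mean absolute error quoted in the Results, and the Huber loss that the Appendix B code passes to `model.compile`, can be written out explicitly. The sketch below is plain Python and is not the author's evaluation code. The function names are hypothetical, `delta = 1.0` is the default of `tf.keras.losses.Huber`, and the three data points are the first three test rows of Table A.1, used only as an illustration (they are not expected to reproduce the reported MAE of 9.6).

```python
from typing import Sequence


def mean_absolute_error(truth: Sequence[float], pred: Sequence[float]) -> float:
    """Average of |truth - prediction| over all samples."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def huber_loss(truth: Sequence[float], pred: Sequence[float], delta: float = 1.0) -> float:
    """Quadratic for small errors, linear for large ones (less sensitive to spikes)."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, p in zip(truth, pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(truth)


# First three test rows of Table A.1: measured positivity vs. SITALA prediction (%).
truth = [9.05, 9.18, 1.64]
pred = [14.38, 14.08, 17.87]
print(round(mean_absolute_error(truth, pred), 2))  # 8.82
print(round(huber_loss(truth, pred), 2))           # 8.32
```

Because the Huber loss grows only linearly for large errors, isolated positivity spikes pull the fit less strongly than they would under a squared-error loss.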

Discussion

This study highlights the multivariate nature of COVID-19 positivity. The unknowns about the disease have not yet been thoroughly understood. However, public policymakers can benefit from models like SITALA, which add a news-sentiment dimension to the COVID-19 test positivity data to make forecasts. The long-term effect of sentiment due to the virus incubation period of 14 days can be captured by an AI that makes use of dilated causal convolutions. In the coming weeks or even months, news publishers would have a more significant role in curbing the spread of coronavirus in Houston. SITALA is a continually evolving AI and should be enhanced with newer data, as and when available, using transfer learning. SITALA may also be deployed in other similarly crisis-stricken counties in New York, Florida, and California.

Limitations

The query searched only for articles with ‘houston’ in the URL, which may have caused the omission of a few relevant articles that did not have Houston in the URL. During the initial few days of the training dataset, there were hardly any articles relevant to the IBM Watson query, and thus the sentiment during this period was assumed to be neutral, i.e., a value of 0.
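The neutral-sentiment assumption described above can be made concrete with a short sketch. It averages per-article sentiment scores for each day and falls back to 0 when no article was returned. The function name, the dictionary layout, and the sample scores are hypothetical; in the study the daily average came directly from the Watson Discovery aggregation rather than from code like this.

```python
from typing import Dict, List


def daily_sentiment(scores_by_day: Dict[str, List[float]], days: List[str]) -> Dict[str, float]:
    """Average article scores per day; days without articles are neutral (0.0)."""
    result: Dict[str, float] = {}
    for day in days:
        scores = scores_by_day.get(day, [])
        result[day] = sum(scores) / len(scores) if scores else 0.0
    return result


# Hypothetical per-article scores in [-1, +1]; 4/24 has no matching articles.
scores = {"4/23": [0.5, -0.25], "4/25": [-0.5]}
print(daily_sentiment(scores, ["4/23", "4/24", "4/25"]))
```

A side effect of this choice is that a day with no coverage is indistinguishable from a day with balanced coverage, which is worth keeping in mind when reading the early rows of Table A.1.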

Ethics (Cutler et al., 2019, Ienca and Vayena, 2020)

SITALA may only be used as an ethical guide by public policymakers to set policies. The author does not support any unethical use of SITALA.

Disclaimers

Funding: No specific funding was received for this work; Data and materials availability: All data is available in the Appendix A.

CRediT authorship contribution statement

Prathamesh S. Desai: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization.

Declaration of Competing Interest

The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
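Table A.1

Training dataset with 10% reserved for validation

| Date | Input: percent cleaned test positivity | Input: percent COVID news sentiment | Output: percent cleaned test positivity | SITALA positivity predictions |
|---|---|---|---|---|
| 4/23 | 2.24 | 0.00 | 2.24 | |
| 4/24 | 3.56 | 0.00 | 3.56 | |
| 4/25 | 4.38 | 0.00 | 4.38 | |
| 4/26 | 21.19 | 0.00 | 21.19 | |
| 4/27 | 3.93 | −61.81 | 3.93 | |
| 4/28 | 14.83 | 0.00 | 14.83 | |
| 4/29 | 4.24 | 0.00 | 4.24 | |
| 4/30 | 7.58 | 0.00 | 7.58 | |
| 5/1 | 6.67 | 0.00 | 6.67 | |
| 5/2 | 5.76 | 0.00 | 5.76 | |
| 5/3 | 4.89 | 0.00 | 4.89 | |
| 5/4 | 4.29 | −52.66 | 4.29 | |
| 5/5 | 3.73 | 0.00 | 3.73 | |
| 5/6 | 3.18 | 0.00 | 3.18 | |
| 5/7 | 2.62 | 0.00 | 2.62 | 2.60 |
| 5/8 | 2.36 | 0.00 | 2.36 | 2.49 |
| 5/9 | 6.29 | 0.00 | 6.29 | 6.08 |
| 5/10 | 4.20 | 0.00 | 4.20 | 4.16 |
| 5/11 | 2.23 | 0.00 | 2.23 | 2.06 |
| 5/12 | 3.93 | 0.00 | 3.93 | 3.67 |
| 5/13 | 4.50 | 0.00 | 4.50 | 4.32 |
| 5/14 | 4.29 | 0.00 | 4.29 | 4.17 |
| 5/15 | 9.17 | 5.43 | 9.17 | 8.85 |
| 5/16 | 4.78 | 1.36 | 4.78 | 4.59 |
| 5/17 | 1.81 | −26.88 | 1.81 | 1.76 |
| 5/18 | 7.66 | 13.96 | 7.66 | 7.32 |
| 5/19 | 2.38 | −10.04 | 2.38 | 2.19 |
| 5/20 | 5.00 | 7.82 | 5.00 | 4.88 |
| 5/21 | 6.64 | 1.89 | 6.64 | 6.49 |
| 5/22 | 8.48 | 11.79 | 8.48 | 8.32 |
| 5/23 | 6.74 | −54.95 | 6.74 | 6.56 |
| 5/24 | 4.99 | −11.66 | 4.99 | 4.92 |
| 5/25 | 3.25 | −30.84 | 3.25 | 3.19 |
| 5/26 | 1.46 | 6.23 | 1.46 | 1.52 |
| 5/27 | 14.83 | 11.26 | 14.83 | 14.78 |
| 5/28 | 4.27 | 11.29 | 4.27 | 4.35 |
| 5/29 | 6.20 | 5.52 | 6.20 | 5.92 |
| 5/30 | 6.06 | 39.18 | 6.06 | 6.00 |
| 5/31 | 21.91 | 25.26 | 21.91 | 21.21 |
| 6/1 | 0.96 | 16.38 | 0.96 | 0.68 |
| 6/2 | 8.34 | 0.33 | 8.34 | 7.94 |
| 6/3 | 12.92 | −3.21 | 12.92 | 12.45 |
| 6/4 | 17.55 | 15.25 | 17.55 | 17.02 |
| 6/5 | 12.49 | 9.44 | 12.49 | 11.88 |
| 6/6 | 7.89 | −26.27 | 7.89 | 6.94 |
| 6/7 | 5.57 | −15.18 | 5.57 | 5.21 |
| 6/8 | 11.63 | −19.69 | 11.63 | 11.01 |
| 6/9 | 16.62 | −14.70 | 16.62 | 16.26 |
| 6/10 | 15.01 | 2.63 | 15.01 | 14.11 |
| 6/11 | 13.40 | −12.09 | 13.40 | 13.22 |
| 6/12 | 11.79 | −18.90 | 11.79 | 11.28 |
| 6/13 | 10.19 | −28.92 | 10.19 | 9.99 |
| 6/14 | 7.46 | 2.87 | 7.46 | 7.58 |
| 6/15 | 7.16 | −4.26 | 7.16 | 7.27 |
| 6/16 | 5.53 | 2.51 | 5.53 | 5.57 |
| 6/17 | 10.04 | 14.69 | 10.04 | 10.10 |
| 6/18 | 7.08 | −0.39 | 7.08 | 6.87 |
| 6/19 | 5.75 | 15.41 | 5.75 | 5.68 |
| 6/20 | 17.55 | 6.68 | 17.55 | 16.89 |
| 6/21 | 27.71 | −27.42 | 27.71 | 6.39 |
| 6/22 | 3.71 | 4.48 | 3.71 | 6.63 |
| 6/23 | 60.35 | −17.23 | 60.35 | 12.43 |
| 6/24 | 45.44 | −10.17 | 45.44 | 9.54 |
| 6/25 | 30.53 | −0.26 | 30.53 | 13.65 |
| 6/26 | 15.62 | −16.69 | 15.62 | 11.36 |

Test dataset

| Date | Input: percent cleaned test positivity | Input: percent COVID news sentiment | Output: percent cleaned test positivity | SITALA positivity predictions |
|---|---|---|---|---|
| 6/27 | 9.05 | −10.33 | 9.05 | 14.38 |
| 6/28 | 9.18 | −48.70 | 9.18 | 14.08 |
| 6/29 | 1.64 | −20.83 | 1.64 | 17.87 |
| 6/30 | 19.83 | −15.70 | 19.83 | 11.95 |
| 7/1 | 14.38 | −20.46 | 14.38 | 9.54 |
| 7/2 | 20.69 | −30.37 | 20.69 | 17.72 |
| 7/3 | 23.57 | −11.68 | 23.57 | 8.66 |
| 7/4 | 15.27 | −24.72 | 15.27 | 12.23 |
| 7/5 | 6.32 | −35.68 | 6.32 | 16.23 |
| 7/6 | 7.38 | 1.79 | 7.38 | 11.14 |
| 7/7 | 13.42 | 3.53 | 13.42 | 12.01 |
| 7/8 | 15.12 | −41.36 | 15.12 | 11.32 |
| 7/9 | 13.59 | −50.11 | 13.59 | 10.55 |
| 7/10 | 11.44 | −27.35 | 11.44 | 12.04 |
| 7/11 | 10.81 | −21.80 | 10.81 | 5.45 |
| 7/12 | 25.92 | −51.09 | 25.92 | 11.06 |
| 7/13 | 18.76 | −4.39 | 18.76 | 6.50 |
| 7/14 | 49.02 | 0.49 | 49.02 | 6.91 |
| 7/15 | 28.24 | −11.31 | 28.24 | 7.88 |
| 7/16 | 27.44 | 1.59 | 27.44 | 16.13 |
| 7/17 | 26.65 | −9.15 | 26.65 | 9.96 |
| 7/18 | 20.28 | −31.79 | 20.28 | 20.28 |

Forecast

| Date | Percent cleaned test positivity (where available) | Forecast with positive sentiment (70) | Forecast with negative sentiment (−70) |
|---|---|---|---|
| 7/19 | 12.27 | 9.23 | 9.23 |
| 7/20 | 14.15 | 17.31 | 13.78 |
| 7/21 | 20.52 | 26.54 | 17.34 |
| 7/22 | 27.54 | 14.89 | 8.86 |
| 7/23 | 24.79 | 22.31 | 11.32 |
| 7/24 | 19.83 | 32.09 | 16.35 |
| 7/25 | 16.16 | 26.15 | 8.79 |
| 7/26 | 21.30 | 29.99 | 10.60 |
| 7/27 | 21.00 | 37.56 | 14.11 |
| 7/28 | 30.01 | 33.28 | 12.61 |
| 7/29 | 39.18 | 41.10 | 9.85 |
| 7/30 | 23.33 | 44.90 | 15.87 |
| 7/31 | 43.18 | 45.41 | 14.24 |
| 8/1 | 37.56 | 49.85 | 16.57 |
| 8/2 | 12.00 | 57.47 | 18.08 |
| 8/3 | 5.75 | 56.75 | 19.36 |
| 8/4 | 15.04 | 55.62 | 16.81 |
| 8/5 | 21.09 | 56.75 | 17.65 |
| 8/6 | | 59.65 | 17.22 |
| 8/7 | | 60.82 | 17.17 |
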
# SITALA: Sentiment Informed Timeseries Analyzing AI
# Location: Harris county, TX
# Purpose: Forecast spread of coronavirus
# Author: Prathamesh S. Desai
import numpy as np
import tensorflow as tf

# X_train (samples, window, n_features) and Y_train (samples,) are assumed
# to have been built beforehand from the windowed data in Table A.1.
window = 16     # days per input window (see Methodology)
n_features = 2  # test positivity and news sentiment

# Reset Keras state and fix the seeds for reproducibility
tf.keras.backend.clear_session()
tf.random.set_seed(40)
np.random.seed(40)

# Model definition: dilated causal convolutions followed by a dense head
model = tf.keras.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=(window, n_features)))
for dilation_rate in (1, 2, 4, 8):
    model.add(tf.keras.layers.Conv1D(filters=64, kernel_size=2, strides=1,
                                     dilation_rate=dilation_rate,
                                     padding="causal", activation="relu"))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation="relu"))
model.add(tf.keras.layers.Dense(1))

# Model compilation (learning_rate replaces the deprecated lr argument)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)
model.compile(loss=tf.keras.losses.Huber(), optimizer=optimizer, metrics=["mae"])
early_stopping = tf.keras.callbacks.EarlyStopping(patience=200)
model.summary()

# Training: Keras holds out the last 10% of the training samples for validation
history = model.fit(X_train, Y_train, epochs=500, verbose=True,
                    validation_split=0.1, callbacks=[early_stopping])
print("SITALA has been trained")
References

1.  The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application.

Authors:  Stephen A Lauer; Kyra H Grantz; Qifang Bi; Forrest K Jones; Qulu Zheng; Hannah R Meredith; Andrew S Azman; Nicholas G Reich; Justin Lessler
Journal:  Ann Intern Med       Date:  2020-03-10       Impact factor: 25.391

2.  Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions.

Authors:  Zifeng Yang; Zhiqi Zeng; Ke Wang; Sook-San Wong; Wenhua Liang; Mark Zanin; Peng Liu; Xudong Cao; Zhongqiang Gao; Zhitong Mai; Jingyi Liang; Xiaoqing Liu; Shiyue Li; Yimin Li; Feng Ye; Weijie Guan; Yifan Yang; Fei Li; Shengmei Luo; Yuqi Xie; Bin Liu; Zhoulang Wang; Shaobo Zhang; Yaonan Wang; Nanshan Zhong; Jianxing He
Journal:  J Thorac Dis       Date:  2020-03       Impact factor: 3.005
