Yiming Zhang1,2, Ke Chen1,2, Ying Weng1, Zhuo Chen3,4, Juntao Zhang1,2, Richard Hubbard2.
Abstract
The World Health Organization (WHO) declared the spread of coronavirus disease 2019 (COVID-19) a pandemic on 11 March 2020. Traditional infectious disease surveillance failed to alert public health authorities in time to intervene, mitigate, and control COVID-19 before it became a pandemic. Compared with traditional public health surveillance, the rich data available from social media, including Twitter, are considered a useful tool that can overcome the limitations of the traditional surveillance system. This paper proposes an intelligent COVID-19 early warning system that uses Twitter data with novel machine learning methods. We use a natural language processing (NLP) pre-training technique, i.e., fine-tuning BERT, as the Twitter classification method. Moreover, we implement a COVID-19 forecasting model, a Twitter-based linear regression model, to detect early signs of a COVID-19 outbreak. Furthermore, we develop an expert system, an early warning web application based on the proposed methods. The experimental results suggest that it is feasible to use Twitter data to provide COVID-19 surveillance and prediction in the US to support health departments' decision-making.
Keywords: BERT; COVID-19 surveillance; Early warning system; Epidemic intelligence; Text classification
Year: 2022 PMID: 35308584 PMCID: PMC8920081 DOI: 10.1016/j.eswa.2022.116882
Source DB: PubMed Journal: Expert Syst Appl ISSN: 0957-4174 Impact factor: 6.954
Fig. 1. The proposed COVID-19 early warning system framework.
Fig. 2. Architecture of fine-tuning BERT for COVID-19 Twitter classification.
Fig. 3. Illustration of the proposed linear regression model: use day 1 to day 5 data to predict the confirmed cases of day n.
Fig. 4. Early warning system web application architecture.
Fig. 5. Dataset collection and pre-processing flowchart.
Annotation Guidelines with Examples.
| Annotation Guidelines | Example |
|---|---|
| | “Not a hoax as #Trump said in SC rally. Illinois officials say patient has tested positive for #coronavirus |
| | “A letter sent to the @PlymouthSch community warns that a student who just got back from Italy last month was hospitalized with flu-like symptoms. We’re tracking this potential case of #coronavirus on @boston25 at 5 and 6:30 |
| | “I'm sure I've got that #Coronavirus. Been in absolute tatters since about 4 pm yesterday. Hardly slept all night, coughing and sweating, and my head is totally throbbing. Not my most productive day so far. Literally getting up now to see if I can eat anything. Thanks, China!” |
The Attribute Description of the Dataset.
| Attribute | Description |
|---|---|
| created_at | the creation time of the tweet |
| id_str | the unique identifier (id) of the tweet |
| state | the tweet’s geo-location at US state level |
| full_text | the text content of the tweet |
| label | the result of text classification |
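As an illustration of how records with these five attributes might be handled downstream, the following is a minimal Python sketch. The class name `TweetRecord`, the helper `positive_counts_by_state`, the field types, and the label encoding (1 for the positive class, 0 for the negative class) are assumptions made for this example and are not specified in the source. The sample rows are made-up placeholders, not data from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TweetRecord:
    """One row of the dataset (hypothetical container; types are assumed)."""
    created_at: str  # creation time of the tweet (string format assumed)
    id_str: str      # unique identifier (id) of the tweet
    state: str       # geo-location at US state level
    full_text: str   # text content of the tweet
    label: int       # classification result; assumed 1 = positive, 0 = negative


def positive_counts_by_state(records):
    """Count tweets labelled positive (label == 1) for each state."""
    return Counter(r.state for r in records if r.label == 1)


# Made-up placeholder rows, used only to exercise the helper.
sample_records = [
    TweetRecord("2020-02-01", "1", "California", "placeholder text A", 1),
    TweetRecord("2020-02-01", "2", "California", "placeholder text B", 0),
    TweetRecord("2020-02-02", "3", "Oregon", "placeholder text C", 1),
    TweetRecord("2020-02-02", "4", "California", "placeholder text D", 1),
]
state_counts = positive_counts_by_state(sample_records)
```

A per-state count of positive tweets like `state_counts` is one plausible input signal for a state-level forecasting model, although the exact features used by the authors are not given here.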
Fig. 6. Word cloud of the processed Twitter dataset.
Binary Classification Confusion Matrix.
| | Actual Positive Class | Actual Negative Class |
|---|---|---|
| Predicted Positive Class | True positive (TP) | False positive (FP) |
| Predicted Negative Class | False negative (FN) | True negative (TN) |
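The four cells of the confusion matrix are enough to compute the standard metrics reported in the classification tables. The sketch below uses the usual textbook definitions of precision, recall, F1 score, and accuracy. The function name `classification_metrics` and the example counts are illustrative assumptions, and the source does not state whether its reported scores are per-class or averaged, so this is not necessarily how the paper's numbers were produced.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    Ratios with a zero denominator are returned as 0.0 by convention.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only (not taken from the paper).
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=86)
```

With a rare positive class, as in this illustrative example, accuracy can stay high even when recall is modest, which is why precision, recall, and F1 score are reported alongside accuracy.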
Twitter classification results (The best performance is marked in bold font).
| Algorithm | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| KNN (k = 10) | 0.83 | 0.61 | 0.66 | 0.95 |
| SVM (linear kernel) | 0.82 | 0.67 | 0.72 | 0.95 |
| DPCNN | 0.20 | 0.75 | 0.32 | 0.94 |
The number of total parameters, model size, and run time of the four models. Run time is the inference time measured over 2120 samples.
| Model | Total parameters | Model size (memory) | Run time (ms) |
|---|---|---|---|
| KNN (k = 10) | N/A | 1.43 MB | 364.4 |
| SVM (linear kernel) | N/A | 0.75 MB | 40.8 |
| DPCNN | 0.98 M | 3.2 MB | 29494.0 |
| BERT (Base, Uncased) | 110 M | 1.22 GB | 81957.7 |
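To compare the models on a per-tweet basis, the run times above can be divided by the number of samples. The short sketch below does this arithmetic, assuming that each reported run time is the total inference time for all 2120 samples (the caption suggests this but does not state it explicitly). The variable names are illustrative.

```python
N_SAMPLES = 2120  # number of samples stated in the table caption

# Run times in milliseconds, copied from the table above.
run_time_ms = {
    "KNN (k = 10)": 364.4,
    "SVM (linear kernel)": 40.8,
    "DPCNN": 29494.0,
    "BERT (Base, Uncased)": 81957.7,
}

# Assumed interpretation: total time divided by sample count.
per_sample_ms = {model: t / N_SAMPLES for model, t in run_time_ms.items()}

for model, ms in per_sample_ms.items():
    print(f"{model}: {ms:.3f} ms per sample")
```

Under this assumption, BERT takes roughly 39 ms per tweet and the linear SVM well under 0.1 ms per tweet.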
State level prediction model with Twitter data (seven days prior) results.
| Prediction model (predict seven days prior) + Twitter data | |
|---|---|
| California Linear Regression Model | 0.842 |
| Oregon Linear Regression Model | 0.946 |
| Massachusetts Linear Regression Model | 0.889 |
US level prediction model with Twitter data results.
| Prediction Model | |
|---|---|
| US Linear Regression Model (predict one day prior) | 0.977 |
| US Linear Regression Model (predict two days prior) | 0.979 |
| US Linear Regression Model (predict three days prior) | 0.977 |
| US Linear Regression Model (predict four days prior) | 0.979 |
| US Linear Regression Model (predict five days prior) | 0.789 |
| US Linear Regression Model (predict six days prior) | 0.909 |
| US Linear Regression Model (predict seven days prior) | 0.621 |
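The idea in Fig. 3, using a five-day window to predict a later day, can be sketched with an ordinary least-squares fit on lagged windows. The following is a minimal sketch under stated assumptions: it uses a single daily series as input, the helper names (`make_windows`, `fit_linear`, `predict`) are hypothetical, and the synthetic series is made up for demonstration. The paper's actual feature set, which combines Twitter data with case counts, is not detailed here, so this is not the authors' implementation.

```python
import numpy as np


def make_windows(series, window=5, horizon=1):
    """Build (X, y): each row of X holds `window` consecutive days and
    y is the value `horizon` days after the last day of that window."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(series[start + window + horizon - 1])
    return np.asarray(X, dtype=float), np.asarray(y, dtype=float)


def fit_linear(X, y):
    """Ordinary least squares with an intercept column appended to X."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, window_values):
    """Predict the target for one window of daily values."""
    x = np.append(np.asarray(window_values, dtype=float), 1.0)
    return float(x @ coef)


# Synthetic, perfectly linear daily series (illustration only).
series = [10 + 3 * t for t in range(20)]
X, y = make_windows(series, window=5, horizon=7)
coef = fit_linear(X, y)
pred = predict(coef, series[3:8])  # forecast 7 days after day index 7
```

Changing `horizon` from 1 to 7 corresponds to the "predict one day prior" through "predict seven days prior" settings in the tables, with a separate model fitted per horizon in this sketch.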
Fig. 7. Visualization of prediction for California: (a) one day prior, (b) two days prior, (c) three days prior, (d) four days prior, (e) five days prior, (f) six days prior, (g) seven days prior. Red denotes prediction, and blue denotes the true values.
Twitter classification results with over-sampling.
| Algorithm | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| KNN (k = 10) | 0.73 | 0.91 | 0.79 | 0.93 |
| SVM (linear kernel) | 0.92 | 0.91 | 0.92 | 0.98 |
| DPCNN | 0.30 | 0.70 | 0.42 | 0.95 |
| Fine-tuning BERT | 0.96 | 0.99 | 0.98 | 0.99 |
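The source does not say which over-sampling method was used. One common and simple option is random over-sampling, which duplicates randomly chosen minority-class examples until every class matches the size of the largest class. The sketch below illustrates that technique only. The function name `random_oversample` and the toy data are assumptions for this example.

```python
import random


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples of smaller classes until every
    class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy imbalanced data: 2 positive and 8 negative examples.
toy_texts = [f"tweet {i}" for i in range(10)]
toy_labels = [1, 1] + [0] * 8
bal_texts, bal_labels = random_oversample(toy_texts, toy_labels)
```

Over-sampling is typically applied to the training split only, so that duplicated examples do not leak into the evaluation data.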
State-level prediction model without Twitter data (seven days prior) results.
| Prediction model (predict seven days prior) | |
|---|---|
| California Linear Regression Model | 0.976 |
| Oregon Linear Regression Model | 0.694 |
| Massachusetts Linear Regression Model | 0.694 |
Early warning detection results. The early warning detection date is the date on which the system sends an early warning message (six days before the predicted outbreak date). The predicted outbreak date is the date on which the number of COVID-19 confirmed cases forecast by the linear regression model exceeds the outbreak threshold.
| State | Early warning detection date | Predicted outbreak date |
|---|---|---|
| California | 2020.01.30 | 2020.02.05 |
| Colorado | 2020.02.24 | 2020.03.01 |
| Washington | 2020.01.30 | 2020.02.05 |
| New York | 2020.02.04 | 2020.02.10 |
| Florida | 2020.02.04 | 2020.02.10 |
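The relationship between the two date columns can be sketched in a few lines: find the first date on which the forecast exceeds the outbreak threshold, then subtract six days to obtain the early warning date. In the sketch below, the function names, the threshold value, and the forecast numbers are hypothetical. The source states only the six-day lead and the threshold rule, not the threshold itself.

```python
from datetime import date, timedelta

LEAD_DAYS = 6  # warning is sent six days before the predicted outbreak date


def predicted_outbreak_date(forecast, threshold):
    """Return the earliest date whose forecast exceeds the threshold,
    or None if the threshold is never exceeded."""
    for day in sorted(forecast):
        if forecast[day] > threshold:
            return day
    return None


def early_warning_date(outbreak_date, lead_days=LEAD_DAYS):
    """Date on which the early warning message would be sent."""
    return outbreak_date - timedelta(days=lead_days)


# Hypothetical forecast values and threshold (illustration only).
example_forecast = {
    date(2020, 2, 28): 3.0,
    date(2020, 2, 29): 7.5,
    date(2020, 3, 1): 12.0,
    date(2020, 3, 2): 20.0,
}
outbreak = predicted_outbreak_date(example_forecast, threshold=10.0)
warning = early_warning_date(outbreak)
```

Because 2020 was a leap year, six days before 2020.03.01 is 2020.02.24, which matches the Colorado row in the table.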
Fig. 8. Table component UI in the early warning system.
Fig. 9. Map component UI in the early warning system.
Fig. 10. Prediction chart UI in the early warning system.