Beakcheol Jang, Inhwan Kim, Jong Wook Kim.
Abstract
BACKGROUND: Each year, influenza affects 3 to 5 million people and causes 290,000 to 650,000 fatalities worldwide. To reduce the fatalities caused by influenza, several countries have established influenza surveillance systems to collect early warning data. However, proper and timely warnings are hindered by a 1- to 2-week delay between the actual disease outbreaks and the publication of surveillance data. To address the issue, novel methods for influenza surveillance and prediction using real-time internet data (such as search queries, microblogging, and news) have been proposed. Some of the currently popular approaches extract online data and use machine learning to predict influenza occurrences in a classification mode. However, many of these methods extract training data subjectively, and it is difficult to capture the latent characteristics of the data correctly. There is a critical need to devise new approaches that focus on extracting training data by reflecting the latent characteristics of the data.
Keywords: Pearson correlation coefficient; influenza; infodemiology; infoveillance; keyword; long short-term memory; model; sorting; surveillance; training data extraction; word embedding
Year: 2021 PMID: 34032577 PMCID: PMC8188311 DOI: 10.2196/23305
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. System architecture. LSTM: long short-term memory.
Summary of news data for word embeddings.
| Parameter | Value |
| Time period | September 11, 2017, to September 15, 2019 |
| Total articles | 2,093,120 |
| Total bytes | 761,233,009 |
| Total terms | 142,651 |
Hyperparameters for word embeddings and long short-term memory model training.
| Hyperparameter | Word embeddings | Long short-term memory model |
| Epoch | 10 | 200 |
| Dimension | 300 | 64 |
| Window size | 5 | – |
| Min count | 100 | – |
| Time step | – | 5 weeks |
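The 5-week time step listed above suggests that each long short-term memory input is a window of five consecutive weekly observations. The following is a minimal sketch of that windowing, assuming the model predicts the value of the week immediately after each window; the function name `make_windows` and the sample values are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

TIME_STEP = 5  # weeks, taken from the hyperparameter table


def make_windows(
    series: Sequence[float], time_step: int = TIME_STEP
) -> Tuple[List[List[float]], List[float]]:
    """Split a weekly series into (input window, next-week target) pairs.

    Assumption: the target is the single value that follows each window.
    """
    if time_step < 1:
        raise ValueError("time_step must be at least 1")
    inputs: List[List[float]] = []
    targets: List[float] = []
    # Slide the window one week at a time until no target week remains.
    for start in range(len(series) - time_step):
        inputs.append(list(series[start:start + time_step]))
        targets.append(series[start + time_step])
    return inputs, targets


# Made-up weekly values, for illustration only.
weekly_values = [0.1, 0.2, 0.4, 0.7, 0.9, 0.6, 0.3]
X, y = make_windows(weekly_values)
```

With seven weekly values and a 5-week window, this yields two training pairs; a series shorter than six weeks yields none.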
Figure 2. Pearson correlation coefficient (PCC) (A) and root-mean-square error (RMSE) (B) of long short-term memory models using Word2Vec continuous bag-of-words.
Figure 3. Pearson correlation coefficient (PCC) (A) and root-mean-square error (RMSE) (B) of long short-term memory models using Word2Vec skip-gram.
Figure 4. Pearson correlation coefficient (PCC) (A) and root-mean-square error (RMSE) (B) of long short-term memory models using GloVe.
Figure 5. Pearson correlation coefficient (PCC) (A) and root-mean-square error (RMSE) (B) of long short-term memory models using FastText continuous bag-of-words.
Figure 6. Pearson correlation coefficient (PCC) (A) and root-mean-square error (RMSE) (B) of long short-term memory models using FastText skip-gram.
Pearson correlation coefficient (PCC) and root-mean-square error (RMSE) for influenza prediction models using different word embedding techniques.
| Prediction model | PCC, unsorted (number of keywords) | PCC, sorted (number of keywords) | RMSE, unsorted (number of keywords) | RMSE, sorted (number of keywords) |
| Word2Vec CBOW | 0.8784 (59) | 0.8951 (22) | 0.0095 (19) | 0.0082 (22) |
| Word2Vec skip-gram | 0.8755 (50) | 0.8942 (8) | 0.0089 (9) | 0.0080 (8) |
| GloVe | 0.8467 (14) | 0.8783 (29) | 0.0095 (14) | 0.0090 (22) |
| FastText CBOW | 0.8845 (42) | 0.8986 (34) | 0.0095 (11) | 0.0090 (34) |
| FastText skip-gram | 0.8676 (86) | 0.8679 (10) | 0.0095 (87) | 0.0090 (10) |
| Mean | 0.8705 (50) | 0.8868 (21) | 0.0094 (28) | 0.0086 (19) |
CBOW: continuous bag-of-words.
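The two metrics in the table above can be computed from paired series of actual and predicted values. The sketch below uses the standard definitions of the Pearson correlation coefficient and the root-mean-square error; the function names are illustrative and the code is not taken from the paper.

```python
import math
from typing import Sequence


def pearson_correlation(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_p)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between two equal-length series."""
    n = len(actual)
    if n != len(predicted) or n == 0:
        raise ValueError("series must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
```

A higher PCC means the predicted curve tracks the shape of the actual curve more closely, while a lower RMSE means the predicted values are closer in magnitude. This is why the table reports both.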
Figure 7. Comparison of actual influenza outbreaks and influenza prediction results from prediction models. CBOW: continuous bag-of-words; ILI: influenza-like illness; KCDC: Korea Centers for Disease Control and Prevention; LSTM: long short-term memory.