Shahan Ali Memon, Saquib Razak, Ingmar Weber.
Abstract
BACKGROUND: As the process of producing official health statistics for lifestyle diseases is slow, researchers have explored using Web search data as a proxy for lifestyle disease surveillance. Existing studies, however, are prone to at least one of the following issues: ad-hoc keyword selection, overfitting, insufficient predictive evaluation, lack of generalization, and failure to compare against trivial baselines.
Keywords: Google Trends; Web search; digital epidemiology; infodemiology; infoveillance; lifestyle disease surveillance; noncommunicable diseases; nowcasting; public health
Year: 2020 PMID: 32012050 PMCID: PMC7011125 DOI: 10.2196/13347
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
A survey and comparison of previous literature across different metrics.
| Studies | Bootstrapping keyword selection | GT data type | Data denormalization | Predictive evaluation | Comparison to trivial baseline | Geographical setting | Generalizability to other geographical setting |
| Leffler et al [ | xa | Temporal | N/Ab | In-sample | N/A | United States, the United Kingdom, Canada, and Australia | ✓c |
| Yang et al [ | x | Temporal | N/A | In-sample | N/A | Worldwide | x |
| McCarthy [ | x | Temporal | N/A | In-sample | x | United States | x |
| Hagihara et al [ | x | Temporal | N/A | In-sample | N/A | Japan | x |
| Sueki [ | x | Temporal | N/A | In-sample | N/A | Japan | x |
| Walcott et al [ | x | State-level | N/A | In-sample | N/A | United States | x |
| Yang et al [ | x | Temporal | N/A | In-sample | N/A | Taipei City, Taiwan | x |
| Ayers [ | x | Temporal | N/A | In-sample | N/A | United States and Australia | ✓ |
| Bragazzi [ | x | Temporal | N/A | In-sample | N/A | Italy | x |
| Braun and Harréus [ | x | Temporal | N/A | In-sample | N/A | Germany | x |
| Breyer and Eisenberg [ | x | Temporal | N/A | In-sample | N/A | United States | x |
| Gunn III and Lester [ | x | State-level | N/A | In-sample | x | United States | x |
| Ingram and Plante [ | x | Temporal | N/A | In-sample | N/A | United States, Australia, Germany, the United Kingdom, and Canada | ✓ |
| Willard and Nguyen [ | x | State-level | N/A | In-sample | N/A | United States | x |
| Bragazzi [ | x | Temporal | N/A | In-sample | x | Italy | x |
| Brigo et al [ | x | Temporal | N/A | In-sample | N/A | Worldwide | x |
| Bruckner et al [ | x | Temporal | N/A | In-sample | N/A | England and Wales | x |
| Sarigul et al [ | x | State-level | x | In-sample | x | United States | x |
| Song et al [ | x | Temporal | N/A | In-sample | N/A | Korea | x |
| Nguyen et al [ | ✓ | State-level | x | Out-of-sample | x | United States | x |
| Wang et al [ | ✓ | Temporal | N/A | In-sample | x | Taiwan | x |
| Ma-Kellams et al [ | x | State-level | N/A | In-sample | x | United States | x |
| Parker et al [ | x | State-level | x | Out-of-sample | x | United States | x |
| Burns et al [ | x | Temporal | N/A | In-sample | N/A | United States | x |
| Cervellin et al [ | x | Temporal | N/A | In-sample | x | Italy | x |
| Hassid et al [ | x | Temporal | N/A | In-sample | N/A | United States | x |
| Lotto et al [ | ✓ | Temporal | x | In-sample | N/A | United States, United Kingdom, Australia, and Brazil | ✓ |
| Ojala et al [ | ✓ | State-level, temporal | N/A | Out-of-sample | x | United States | x |
| Ricketts and Silva [ | x | Temporal | N/A | In-sample | x | United States | x |
| Tran et al [ | ✓ | Temporal | N/A | In-sample | N/A | United States, Germany, Austria, and Switzerland | ✓ |
| Wehner et al [ | x | Temporal | N/A | In-sample | x | United States | x |
| Aguirre et al [ | ✓ | Temporal | x | In-sample | N/A | United States, United Kingdom, Germany, Brazil, France, India, Italy, Japan | ✓ |
| Arendt [ | x | Temporal | N/A | In-sample | x | Worldwide | x |
| Chandler [ | x | State-level | x | In-sample | x | United States | x |
| Coogan et al [ | x | Temporal | N/A | In-sample | x | Australia | x |
| Phillips et al [ | x | State-level | x | In-sample | x | United States | x |
| Cruvinel et al [ | ✓ | Temporal | x | In-sample | N/A | 10 South American Countries | ✓ |
|
| ✓ |
| ✓ |
| ✓ |
| ✓ |
a. x: missing.
b. N/A: not applicable.
c. ✓: available.
d. The values in italics signify how our study compares to those from the past literature across different metrics.
The subset of unpruned keywords for different target variables.
| Target variable | Google Correlate | Semantic Link | Related Queries |
| Diabetes | when i get up | insulin | diabetes symptoms |
| | sell avon | polyphagia | signs of diabetes |
| | medicine for dogs | ketoacidosis | prediabetes |
| | very weak | cholesterol | icd 10 |
| | sugar level | hypertension | icd 10 type 2 diabetes |
| Obesity | catherines.com | abdominal | food delivery near me |
| | dresses plus size | anorexia | lose fat |
| | sims 3 games | BMI | myfitnesspal |
| | lose 100 pounds | appetite | indeed.com |
| | dresses plus | ADHD | pizza delivery |
| Exercise | transportation options | exercises | my fitness pal |
| | best bike | aerobic | workout |
| | bike laws | jogging | iPod |
| | bike repair | gyms | quinoa gluten free |
| | bike frame size | muscles | how to exercise |
Figure 1. The equations for the proposed denormalization framework.
A detailed evaluation of 5 different experiments across the 3 target variables for the region of the United States.
| Target variable | Trivial baseline | Spatial model | Spatio-temporal model | Multivariate spatio-temporal model | Lagged multivariate spatio-temporal model | Hierarchical lagged multivariate spatio-temporal model |
| | | | | | | |
| MAEa | 0.72 | 0.81 | | | | |
| RMSEd | 0.92 | 1.0 | | | | |
| SMAPEe | 6.94 | 7.63 | | | | |
| Spearman rho | 0.87 | | 0.87 | | | |
| Pearson R | 0.90 | 0.90 | 0.88 | | 0.91c | |
| | | | | | | |
| MAE | 1.20 | 2.81 | | | | 1.08c |
| RMSE | 1.55 | 3.28 | | | | 1.40c |
| SMAPE | 3.88 | 9.31 | | | | 3.51c |
| Spearman rho | 0.93 | 0.87 | 0.85 | | 0.93 | |
| Pearson | 0.93 | 0.86 | 0.86 | | 0.94c | |
| | | | | | | |
| MAE | 2.89 | | 3.12 | | | |
| RMSE | 3.32 | | 3.75 | | | |
| SMAPE | 3.85 | | 4.11 | | | |
| Spearman rho | 0.68 | | | 0.71c | | |
| Pearson R | 0.69 | | | 0.72c | | |
a. MAE: mean absolute error.
b. The values in italics signify an improvement in the performance in comparison to the previous method.
c. The method beat the trivial baseline.
d. RMSE: root mean squared error.
e. SMAPE: symmetric mean absolute percentage error.
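The three error metrics named in the footnotes can be sketched as follows. This is a minimal illustration, not the study's own code: the function names and the input values are hypothetical, and SMAPE has several variants in the literature, so the one used here (absolute error divided by the mean of the absolute true and predicted values, expressed as a percentage) is an assumption.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error (assumed variant).

    Each term is |t - p| / ((|t| + |p|) / 2); the mean is scaled to percent.
    This variant is undefined when a true value and its prediction are both 0.
    """
    terms = (
        abs(t - p) / ((abs(t) + abs(p)) / 2.0)
        for t, p in zip(y_true, y_pred)
    )
    return 100.0 * sum(terms) / len(y_true)


# Hypothetical prevalence values in percent, for illustration only.
truth = [10.0, 12.0, 8.0]
pred = [9.0, 12.5, 9.0]

print(mae(truth, pred), rmse(truth, pred), smape(truth, pred))
```

MAE and RMSE are in the units of the target variable, while SMAPE is a percentage, which is why the SMAPE rows in the table are on a different scale from the other error rows.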
Results for the transfer learning framework across the 2 target variables for Canada for generalizability of the method trained over the years 2008-2012, and tested on the years 2013 and 2014.
| Target variable | Trivial baseline | Spatial | Spatio-temporal model | Multivariate spatio-temporal model | Lagged multivariate spatio-temporal model | Hierarchical lagged multivariate spatio-temporal model |
| | | | | | | |
| MAEa | 0.54 | 0.80 | 0.83 | | | |
| RMSEc | 0.66 | 0.96 | 1.00 | | | |
| SMAPEd | 7.64 | 11.57 | 12.10 | | | |
| Spearman | 0.86 | 0.68 | 0.62 | | | |
| Pearson | 0.85 | 0.64 | 0.59 | | | |
| | | | | | | |
| MAE | 1.31 | 1.68 | 1.74 | | 1.59 | 1.81 |
| RMSE | 1.66 | 2.56 | 2.80 | | | 2.45 |
| SMAPE | 5.81 | 7.31 | 8.05 | | | 7.88 |
| Spearman | 0.89 | 0.82 | | | | 0.85 |
| Pearson | 0.95 | 0.86 | 0.86 | | | 0.91 |
a. MAE: mean absolute error.
b. The values in italics signify an improvement in the performance in comparison to the previous method.
c. RMSE: root mean squared error.
d. SMAPE: symmetric mean absolute percentage error.
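The Spearman and Pearson rows in the table report rank and linear correlation between predicted and ground-truth values. A minimal, dependency-free sketch of both coefficients is given below; the helper names and input values are hypothetical, and ties are handled with average ranks, which is the usual convention but an assumption about what the study did.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: monotone but nonlinear relationship.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]
print(pearson(observed, predicted), spearman(observed, predicted))
```

Because Spearman rho depends only on ordering, a model can rank regions well while still having large absolute errors, which is one reason to read the correlation rows alongside the error rows rather than on their own.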
The results for the transfer learning framework across the 2 target variables for Canada for generalizability of the US trained model.
| Cross-country generalizability of the US-based model | Diabetes: trivial baseline | Diabetes: lagged multivariate model | Obesity: trivial baseline | Obesity: lagged multivariate model |
| | | | | |
| MAEa | 0.68 | 0.88 | 1.53 | 1.78 |
| RMSEb | 0.92 | 1.10 | 1.96 | 2.16 |
| SMAPEc | 9.9 | 12.65 | 6.91 | 8.20 |
| Spearman | 0.81 | 0.77 | 0.90 | 0.90 |
| Pearson | 0.76 | 0.74 | 0.91 | 0.91 |
| | | | | |
| MAE | 0.84 | 1.29 | 1.57 | |
| RMSE | 1.07 | 1.49 | 2.37 | |
| SMAPE | 11.16 | 16.04 | 4.99 | |
| Spearman | 0.69 | 0.59 | 0.93 | 0.92 |
| Pearson | 0.70 | 0.60 | 0.91 | 0.91 |
| | | | | |
| MAE | 0.54 | 0.91 | 1.31 | 1.31 |
| RMSE | 0.66 | 1.12 | 1.66 | |
| SMAPE | 7.64 | 13.10 | 5.81 | |
| Spearman | 0.86 | 0.74 | 0.89 | |
| Pearson | 0.85 | 0.74 | 0.95 | 0.95 |
a. MAE: mean absolute error.
b. RMSE: root mean squared error.
c. SMAPE: symmetric mean absolute percentage error.
d. The values in italics signify improvement in performance over the trivial baseline.
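Whether a model improves on the trivial baseline depends on the direction of each metric: lower is better for MAE, RMSE, and SMAPE, and higher is better for the correlations. The sketch below applies that rule to the diabetes values from the first block of rows in the table above. The helper name `beats_baseline` and the choice to treat ties as not beating the baseline are assumptions made for illustration.

```python
# Error metrics improve when they decrease; correlations when they increase.
LOWER_IS_BETTER = {"MAE", "RMSE", "SMAPE"}


def beats_baseline(metric, baseline, model):
    """Return True if `model` strictly improves on `baseline` for `metric`."""
    if metric in LOWER_IS_BETTER:
        return model < baseline
    return model > baseline


# (trivial baseline, lagged multivariate model) for diabetes,
# taken from the first block of rows in the table.
diabetes_first_block = {
    "MAE": (0.68, 0.88),
    "RMSE": (0.92, 1.10),
    "SMAPE": (9.9, 12.65),
    "Spearman": (0.81, 0.77),
    "Pearson": (0.76, 0.74),
}

wins = {
    metric: beats_baseline(metric, baseline, model)
    for metric, (baseline, model) in diabetes_first_block.items()
}
print(wins)
```

For this block, none of the five metrics favors the lagged multivariate model over the trivial baseline.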
Results for the transfer learning framework across the 2 target variables for Canada for generalizability of the method trained over the years 2008-2014, and tested over the years 2016-2018.
| Target variable | Trivial baseline | Spatial | Spatio-temporal model | Multivariate spatio-temporal model | Lagged multivariate spatio-temporal model | Hierarchical lagged multivariate spatio-temporal model |
| | | | | | | |
| MAEa | 0.84 | 0.82 | | | | 0.74c |
| RMSEd | 1.07 | | | 0.93c | | 0.99c |
| SMAPEe | 11.16 | | | | | 10.05c |
| Spearman | 0.69 | | | 0.76c | | 0.78c |
| Pearson | 0.7 | | | 0.76c | | |
| | | | | | | |
| MAE | 1.57 | 7.98 | 8.72 | | | 5.69 |
| RMSE | 2.37 | 8.4 | 9 | | | 5.92 |
| SMAPE | 4.99 | 29.59 | 33.59 | | | 20.71 |
| Spearman | 0.93 | 0.86 | | | | |
| Pearson | 0.91 | 0.88 | | | 0.91 | |
a. MAE: mean absolute error.
b. The values in italics signify an improvement in the performance in comparison to the previous method.
c. The method beat the trivial baseline.
d. RMSE: root mean squared error.
e. SMAPE: symmetric mean absolute percentage error.
Figure 2. The comparison of the errors of 6 different methods for the target variable obesity for the year 2018 for each state. (Note: The bars represent the simple error [ie, ground truth minus prediction] for each state, and not the predicted diabetes or obesity rates. Therefore, the height of the bars is only comparable within each state and not across states, as the scale is not fixed. Most bars are above zero, which indicates that in most cases the models underestimate the ground truth obesity rates.)
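The sign convention in the figure note can be illustrated with a short sketch. It assumes the simple error is the ground truth minus the prediction, which is the reading consistent with positive bars meaning underestimation. The state names, rates, and function names are hypothetical.

```python
def signed_errors(ground_truth, predictions):
    """Per-state simple error: ground truth minus prediction (assumed)."""
    return {state: ground_truth[state] - predictions[state]
            for state in ground_truth}


def share_underestimated(errors):
    """Fraction of states where the model predicts below the ground truth."""
    return sum(1 for e in errors.values() if e > 0) / len(errors)


# Hypothetical obesity rates in percent, for illustration only.
truth = {"State A": 30.0, "State B": 25.0, "State C": 35.0}
pred = {"State A": 28.5, "State B": 26.0, "State C": 33.0}

errors = signed_errors(truth, pred)
print(errors, share_underestimated(errors))
```

A positive error means the prediction fell short of the ground truth, so a share above one half corresponds to the pattern the note describes, with most bars above zero.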