Florian Rohart, Gabriel J Milinovich, Simon M R Avril, Kim-Anh Lê Cao, Shilu Tong, Wenbiao Hu.
Abstract
Effective disease surveillance is critical to the functioning of health systems. Traditional approaches are, however, limited in their ability to deliver timely information. Internet-based surveillance systems are a promising approach that may circumvent many of the limitations of traditional health surveillance systems and provide more intelligence on cases of infection, including cases among people who do not use the healthcare system. Infectious disease surveillance systems built on Internet search metrics have been shown to produce accurate estimates of disease weeks before traditional systems and are an economically attractive approach to surveillance; they are, however, also prone to error under certain circumstances. This study sought to explore previously unmodeled diseases by investigating the link between Google Trends search metrics and Australian weekly notification data. We propose four alternative disease modelling strategies based on linear models. These strategies examined the length of the training period used for model construction, determined the most appropriate lag for search metrics, used wavelet transformation to denoise the data, and enabled the identification of key search queries for each disease. Of the twenty-four diseases assessed with Australian data, our nowcasting results showed promise for two diseases of international concern: Ross River virus and pneumococcal disease.
Year: 2016 PMID: 27994231 PMCID: PMC5172376 DOI: 10.1038/srep38522
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of the number of search terms identified and used in this study for each disease.
| Pneumococcal disease (invasive) | 777 | 304 | 54 | 35 | 69 | 34 |
| Varicella zoster (Chickenpox) | 115 | 115 | 9 | 8 | 15 | 13 |
| Varicella zoster (Shingles) | 953 | 710 | 6 | 8 | 14 | 11 |
| Influenza (laboratory confirmed) | 1799 | 701 | 16 | 14 | 20 | 8 |
| Varicella zoster (unspecified) | 637 | 532 | 2 | 8 | 9 | 7 |
| Gonococcal infection | 909 | 663 | 2 | 6 | 8 | 6 |
| Ross River virus infection | 1316 | 931 | 19 | 6 | 19 | 6 |
| Barmah Forest virus infection | 420 | 271 | 11 | 2 | 12 | 6 |
| Dengue virus infection | 803 | 505 | 8 | 6 | 11 | 5 |
| Hepatitis B (unspecified) | 24 | 24 | 0 | 5 | 5 | 5 |
| Hepatitis A | 644 | 428 | 2 | 4 | 6 | 5 |
| Hepatitis B (newly acquired) | 0 | 0 | 0 | 5 | 5 | 5 |
| Pertussis | 1629 | 1287 | 5 | 5 | 6 | 4 |
| Hepatitis C (unspecified) | 144 | 120 | 0 | 4 | 4 | 4 |
| Meningococcal disease (invasive) | 0 | 0 | 0 | 9 | 9 | 4 |
| Chlamydial infection | 1261 | 431 | 2 | 2 | 3 | 3 |
| Leptospirosis | 162 | 130 | 3 | 6 | 8 | 3 |
| Murray Valley encephalitis virus infection | 913 | 538 | 0 | 2 | 2 | 2 |
| Cryptosporidiosis | 795 | 424 | 2 | 2 | 4 | 2 |
| Chikungunya virus infection | 414 | 253 | 0 | 2 | 2 | 2 |
| Listeriosis | 216 | 127 | 2 | 2 | 4 | 2 |
| Measles | 795 | 556 | 1 | 2 | 3 | 1 |
| Botulism | 464 | 347 | 0 | 2 | 2 | 1 |
| Legionellosis | 0 | 0 | 0 | 3 | 3 | 1 |
Spearman’s rho correlation coefficients between disease notifications and search metrics for the period 2009–13.
| Disease | Top ranked search term | ACT | NSW | NT | QLD | SA | TAS | VIC | WA | AUS | p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gonococcal infection | discharge | 0.795 | −0.028 | 0.298 | 0.205 | 0.556 | 0.265 | 0.786 | <0.00001 | ||
| Varicella zoster (Shingles) | diarrhea | −0.400 | 0.546 | 0.578 | 0.633 | 0.762 | <0.00001 | ||||
| Pneumococcal disease (invasive) | bronchitis | 0.558 | 0.338 | 0.402 | 0.753 | <0.00001 | |||||
| Ross River virus infection | "ross river" | 0.456 | 0.082 | 0.742 | <0.00001 | ||||||
| Pertussis | whooping | 0.661 | 0.489 | 0.525 | 0.700 | 0.651 | <0.00001 | ||||
| Chlamydial infection | blood test | 0.540 | 0.038 | 0.208 | 0.433 | 0.390 | 0.557 | 0.634 | <0.00001 | ||
| Varicella zoster (unspecified) | blood test | −0.023 | 0.297 | −0.404 | 0.316 | 0.400 | 0.628 | <0.00001 | |||
| Varicella zoster (Chickenpox) | conjunctivitis | 0.071 | 0.380 | 0.624 | <0.00001 | ||||||
| Cryptosporidiosis | ross river virus | 0.569 | <0.00001 | ||||||||
| Barmah Forest virus infection | ross river virus | 0.539 | <0.00001 | ||||||||
| Dengue virus infection | dengue | −0.036 | 0.337 | 0.508 | 0.569 | 0.507 | <0.00001 | ||||
| Influenza (laboratory confirmed) | flu symptoms | 0.344 | 0.290 | 0.589 | 0.485 | 0.423 | <0.00001 | ||||
| Leptospirosis | ross river | 0.110 | 0.059 | 0.405 | <0.00001 | ||||||
| Measles | measles | 0.198 | 0.263 | 0.119 | 0.367 | <0.00001 | |||||
| Hepatitis C (unspecified) | hepatitis | 0.193 | 0.116 | −0.149 | 0.142 | −0.035 | 0.297 | <0.00001 | |||
| Hepatitis A | hepatitis a | −0.167 | 0.293 | <0.00001 | |||||||
| Murray Valley encephalitis virus infection | murray valley encephalitis | 0.265 | <0.0001 | ||||||||
| Legionellosis | legionnaires | 0.108 | 0.237 | <0.0001 | |||||||
| Hepatitis B | hepatitis | 0.183 | 0.097 | −0.060 | 0.067 | −0.007 | 0.230 | <0.001 | |||
| Meningococcal disease (invasive) | rulide | 0.222 | <0.001 | ||||||||
| Chikungunya virus infection | dengue | −0.023 | 0.104 | 0.140 | 0.194 | 0.0013 | |||||
| Hepatitis B (newly acquired) | hepatitis | −0.116 | −0.014 | −0.064 | 0.157 | −0.037 | 0.123 | 0.0333 | |||
| Listeriosis | listeria | −0.010 | 0.091 | 0.0906 | |||||||
| Botulism | botulism | 0.053 | 0.2299 |
The table only contains the search term with the highest degree of correlation for each disease; see supplementary material for a full list of search terms and correlation coefficients. p-values relate to correlations at national level. Empty cells indicate that Google Trends data were not available.
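As an illustration of the statistic reported in the table above, the following is a minimal sketch of Spearman's rho computed as the Pearson correlation of average ranks. It is not the authors' code: the function names are hypothetical, the weekly series are made-up numbers, and the p-value calculation is omitted.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up weekly values, for illustration only (not data from the study).
weekly_search_index = [12, 15, 15, 30, 44, 41, 28]
weekly_notifications = [3, 4, 6, 9, 14, 13, 8]
print(spearman_rho(weekly_search_index, weekly_notifications))
```

Because the statistic depends only on ranks, any monotonic relationship between search volume and notifications scores as a perfect correlation, whether or not it is linear.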
Model performance for 1-week (top) and 2-week (bottom) estimates, as assessed by Mean Square Error of Prediction.
| 1 Week estimate | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gonococcal infection | 1.183 | 1.237 | 1.212 | 1.229 | 1.301 | 1.274 | 1.368 | 1.268 | 1.340 | 1.356 | 1.358 | |
| Varicella zoster (Shingles) | 0.970 | 1.105 | 0.850 | 0.907 | 0.964 | 0.946 | 0.937 | 0.954 | 1.008 | 0.967 | 0.976 | |
| Pneumococcal disease (invasive) | 0.478 | 0.510 | 0.523 | 0.548 | 0.347 | 0.564 | 0.376 | 0.420 | 0.396 | 0.435 | 0.437 | |
| Ross River virus infection | 0.394 | 0.542 | 0.465 | 0.514 | 0.365 | 0.537 | 0.351 | 0.478 | 0.289 | 0.290 | ||
| Pertussis | 1.418 | 1.447 | 1.416 | 2.053 | 1.659 | 2.206 | 2.118 | 2.299 | 2.265 | 2.240 | 2.189 | |
| Chlamydial infection | 1.089 | 1.019 | 0.983 | 0.968 | 0.832 | 0.832 | 0.847 | 0.826 | 0.847 | 0.826 | ||
| Varicella zoster (unspecified) | 0.771 | 0.830 | 0.845 | 0.892 | 0.942 | 0.883 | 0.909 | 0.994 | 0.957 | 1.014 | 1.014 | |
| Varicella zoster (Chickenpox) | 0.862 | 0.879 | 0.758 | 0.773 | 0.805 | 0.760 | 0.760 | 0.875 | 0.926 | 0.884 | 0.861 | |
| Cryptosporidiosis | 1.050 | 1.071 | 1.089 | 1.048 | 1.049 | |||||||
| Barmah Forest virus infection | 1.527 | 1.520 | 1.249 | 1.188 | 1.214 | 1.201 | 1.267 | 1.180 | 1.189 | 1.197 | 1.203 | |
| Dengue virus infection | 1.362 | 1.748 | 1.355 | 1.417 | 1.220 | 1.263 | 1.149 | 1.177 | 1.110 | 1.334 | 1.342 | |
| Influenza (laboratory confirmed) | 0.425 | 0.463 | 0.438 | 0.461 | 0.485 | 0.461 | 0.490 | 0.718 | 0.681 | 0.915 | 0.922 | |
| 2 Week estimate | ||||||||||||
| Gonococcal infection | 1.213 | 1.240 | 1.217 | 1.231 | 1.235 | 1.306 | 1.181 | 1.317 | 1.266 | 1.367 | 1.267 | |
| Varicella zoster (Shingles) | 1.015 | 1.071 | 0.846 | 0.903 | 0.907 | 0.950 | 0.918 | 0.918 | 1.020 | 0.979 | 0.966 | |
| Pneumococcal disease (invasive) | 0.388 | 0.496 | 0.529 | 0.473 | 0.374 | 0.577 | 0.454 | 0.403 | 0.462 | 0.446 | 0.495 | |
| Ross River virus infection | 0.425 | 0.449 | 0.439 | 0.452 | 0.364 | 0.465 | 0.350 | 0.378 | 0.303 | 0.295 | 0.304 | |
| Pertussis | 1.575 | 1.594 | 1.571 | 2.166 | 1.922 | 2.399 | 2.313 | 2.453 | 2.436 | 2.397 | 2.378 | |
| Chlamydial infection | 1.052 | 1.103 | 0.992 | 1.060 | 0.991 | 0.991 | 0.884 | 0.948 | 0.884 | 0.948 | ||
| Varicella zoster (unspecified) | 0.886 | 0.869 | 0.878 | 0.951 | 0.854 | 0.929 | 0.854 | 1.034 | 0.980 | 1.046 | 0.995 | |
| Varicella zoster (Chickenpox) | 0.862 | 0.771 | 0.789 | 0.832 | 0.930 | 0.769 | 0.781 | 0.969 | 1.060 | 0.923 | 0.908 | |
| Cryptosporidiosis | 1.092 | 1.098 | 1.089 | 1.096 | ||||||||
| Barmah Forest virus infection | 1.643 | 1.677 | 1.257 | 1.202 | 1.265 | 1.221 | 1.249 | 1.204 | 1.200 | 1.221 | 1.222 | |
| Dengue virus infection | 1.373 | 1.527 | 1.335 | 1.344 | 1.269 | 1.277 | 1.178 | 1.175 | 1.172 | 1.341 | 1.309 | |
| Influenza (laboratory confirmed) | 0.473 | 0.464 | 0.428 | 0.490 | 0.473 | 0.501 | 0.481 | 0.769 | 0.734 | 0.966 | 0.966 | |
The highest performing models for each disease are indicated in bold. Model characteristics are described in Table 4.
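For readers unfamiliar with the metric, here is a minimal sketch of a mean square error of prediction and of picking the lowest-scoring model. It assumes the usual definition (mean of squared differences between held-out observations and model estimates); this excerpt does not state whether the authors rescaled the data first. The function names and the numbers are hypothetical.

```python
def msep(observed, predicted):
    """Mean of squared differences between held-out values and estimates."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def best_model(scores):
    """Return the name of the model with the lowest MSEP."""
    return min(scores, key=scores.get)


# Made-up holdout values and estimates, for illustration only.
holdout = [10.0, 12.0, 15.0, 11.0]
estimates = {
    "model_a": [9.0, 12.5, 14.0, 11.0],   # hypothetical model
    "model_b": [12.0, 10.0, 18.0, 9.0],   # hypothetical model
}
scores = {name: msep(holdout, est) for name, est in estimates.items()}
print(best_model(scores), scores)
```

Lower values indicate estimates closer to the held-out notifications, which is why the best model per disease is the row minimum.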
Figure 1. Boxplots of cross-correlation results for search terms and pneumococcal disease or Ross River virus infection.
Cross-correlations were estimated using a shifting 52- or 104-week window over a 156-week (2009–11) period, or over the entire 156-week period. Red, green and blue dots indicate the mean best cross-correlation for the 52-, 104- and 156-week periods, respectively; dark lines indicate the median.
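The shifting-window procedure described in this caption can be sketched as follows. This is an illustrative reading, not the authors' implementation: it uses Pearson correlation at each lag, assumes that a positive lag means searches lead notifications, and takes the maximum lag as a free parameter, none of which is specified in this excerpt. All function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlation(search, notifications, lag):
    """Correlate search[t - lag] with notifications[t] for lag >= 0."""
    if lag > 0:
        return pearson(search[:-lag], notifications[lag:])
    return pearson(search, notifications)


def best_lag(search, notifications, max_lag):
    """Return (lag, correlation) for the lag with the highest correlation."""
    scores = {
        lag: lagged_correlation(search, notifications, lag)
        for lag in range(max_lag + 1)
    }
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


def windowed_best_lags(search, notifications, window, max_lag):
    """Slide a fixed-length window one week at a time and find the best lag in each."""
    results = []
    for start in range(len(search) - window + 1):
        s = search[start:start + window]
        n = notifications[start:start + window]
        results.append(best_lag(s, n, max_lag))
    return results


# Made-up series: notifications echo the search index two weeks later.
search = [3, 7, 1, 9, 4, 8, 2, 6, 5, 10, 3, 12, 1, 7, 9, 2, 11, 4, 6, 8]
notifications = [0, 0] + search[:-2]
print(best_lag(search, notifications, max_lag=3))
```

With 156 weeks of data, `windowed_best_lags(search, notifications, window=52, max_lag=...)` would yield one best correlation per window position, which is the kind of distribution a boxplot can summarise.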
Figure 2. One-week (left) and two-week (right) models for pneumococcal disease (top) and Ross River virus infection (bottom).
Solid blue line indicates notifications; broken red line indicates the model estimate; grey shading indicates the 95% confidence interval; and the green shading at the bottom indicates the number of keywords used in the model to create the estimate.
Summary of model characteristics.
| Model | Training period | Google Trends data | Keyword selection | Model Name |
|---|---|---|---|---|
| 1 | 52 weeks | Raw data | Continuous | 52RC |
| 2 | 52 weeks | Wavelet transformed | Continuous | 52WC |
| 3 | 104 weeks | Raw data | Continuous | 104RC |
| 4 | 104 weeks | Wavelet transformed | Continuous | 104WC |
| 5 | 156 weeks | Raw data | Continuous | 156RC |
| 6 | 156 weeks | Wavelet transformed | Continuous | 156WC |
| 7 | 52 weeks | Raw data | Set | 52RS |
| 8 | 52 weeks | Wavelet transformed | Set | 52WS |
| 9 | 104 weeks | Raw data | Set | 104RS |
| 10 | 104 weeks | Wavelet transformed | Set | 104WS |
| 11 | 156 weeks | Raw data | Set | 156RS |
| 12 | 156 weeks | Wavelet transformed | Set | 156WS |
1. The training period denotes how many weeks of data are available to the model for fitting, keyword selection and wavelet construction. This period was also used to determine the best lag for keywords used in these models (but was restricted to the 2009–2011 data).
2. Indicates the search metric data available to the model.
3. In producing forecasts for holdout data (2012–2013), continuous models are able to reselect keywords at each time point using the previous 52, 104 or 156 weeks of data; set models use a selection of keywords determined using only the 2009–2011 data.
4. Models are named using a combination of the number of weeks of data visible to them (52/104/156), the format of the search metric data (raw/wavelet transformed; R/W) and the method of keyword selection (continuous/set; C/S).
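The naming scheme can be reproduced mechanically. The sketch below enumerates the twelve combinations in the order of the table and decodes a name back into its settings; the function and dictionary names are hypothetical and are not taken from the authors' code.

```python
WEEKS = (52, 104, 156)
FORMATS = {"R": "Raw data", "W": "Wavelet transformed"}
SELECTION = {"C": "Continuous", "S": "Set"}


def model_names():
    """List model names in table order: continuous models first, then set."""
    names = []
    for selection_code in SELECTION:
        for weeks in WEEKS:
            for format_code in FORMATS:
                names.append(f"{weeks}{format_code}{selection_code}")
    return names


def parse_model_name(name):
    """Decode a name such as '104WS' into its three settings."""
    weeks, format_code, selection_code = int(name[:-2]), name[-2], name[-1]
    return {
        "training_weeks": weeks,
        "data": FORMATS[format_code],
        "selection": SELECTION[selection_code],
    }


for number, name in enumerate(model_names(), start=1):
    print(number, name, parse_model_name(name))
```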