Nicholas Generous, Geoffrey Fairchild, Alina Deshpande, Sara Y Del Valle, Reid Priedhorsky.
Abstract
Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
Year: 2014 PMID: 25392913 PMCID: PMC4231164 DOI: 10.1371/journal.pcbi.1003892
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Diseases-location contexts analyzed, with data sources.
| Disease | Country | Language | Start | End | Resolution | Sources |
| Cholera | Haiti | French | 2010-12-05 | 2013-12-05 | daily | |
| Dengue | Brazil | Portuguese | 2010-03-07 | 2013-03-16 | weekly | |
| Dengue | Thailand | Thai | 2011-01-01 | 2014-01-31 | monthly | |
| Ebola | Uganda/DRC | English | 2011-01-01 | 2013-12-31 | daily | |
| HIV/AIDS | China (PRC) | Chinese | 2011-01-01 | 2013-12-31 | monthly | |
| HIV/AIDS | Japan | Japanese | 2010-10-09 | 2013-10-18 | weekly | |
| Influenza | Japan | Japanese | 2010-06-26 | 2013-07-05 | weekly | |
| Influenza | Poland | Polish | 2010-10-17 | 2013-10-23 | weekly | |
| Influenza | Thailand | Thai | 2011-01-23 | 2014-02-01 | weekly | |
| Influenza | United States | English | 2011-01-01 | 2014-01-10 | weekly | |
| Plague | United States | English | 2011-01-22 | 2014-01-31 | weekly | |
| Tuberculosis | China (PRC) | Chinese | 2010-12-01 | 2013-12-31 | monthly | |
| Tuberculosis | Norway | Norwegian | 2010-12-01 | 2013-12-31 | monthly | |
| Tuberculosis | Thailand | Thai | 2010-12-01 | 2013-12-31 | monthly | |
This table lists the 7 diseases in 9 locations analyzed, for a total of 14 disease-location contexts. For each context, we list the language used as a location proxy, the inclusive start and end dates of analysis, the resolution of the disease incidence data, and one or more citations for those data.
Transferability r_t example.
| Article | Japanese | Thai |
| Fever | 0.23 | 0.21 |
| Chills | 0.59 | n/a |
| Headache | −0.10 | 0.15 |
| Influenza | 0.85 | 0.77 |
This table shows simplified models for influenza in two locations: Japan, where Japanese is spoken, and Thailand, where Thai is spoken. The Japanese model yielded correlations for Japanese versions of the articles “Fever”, “Chills”, “Headache”, and “Influenza” of 0.23, 0.59, −0.10, and 0.85, respectively. The Thai model yielded correlations of 0.21, 0.15, and 0.77 for “Fever”, “Headache”, and “Influenza”, respectively. Note that the article “Chills” is not currently present in the Thai Wikipedia, so only the three shared articles enter the comparison. Therefore, the correlation vectors are (0.23, −0.10, 0.85) and (0.21, 0.15, 0.77) for the two languages. The meta-correlation, r_t, for these two vectors, which provides a gross estimate of how similar the models are, is 0.97. Extending this computation to the full models yields r_t = 0.81, as noted below in Table 4.
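The meta-correlation in this example can be reproduced in a few lines. The sketch below assumes that r_t is the Pearson correlation taken over the articles present in both models (an assumption: the caption says only "meta-correlation", although Pearson does reproduce the quoted 0.97). The dictionary layout and the function names `pearson` and `transferability` are illustrative, not taken from the paper.

```python
import math

# Per-article correlations quoted in the caption above.
japanese = {"Fever": 0.23, "Chills": 0.59, "Headache": -0.10, "Influenza": 0.85}
thai = {"Fever": 0.21, "Headache": 0.15, "Influenza": 0.77}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def transferability(model_a, model_b):
    """Assumed r_t: correlate the two models over their shared articles only."""
    shared = sorted(set(model_a) & set(model_b))  # "Chills" drops out here
    return pearson([model_a[k] for k in shared], [model_b[k] for k in shared])


r_t = transferability(japanese, thai)
print(round(r_t, 2))  # 0.97
```

Restricting the comparison to shared articles also shows why a pair of languages with too few articles in common cannot yield a meaningful score.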
Transferability scores r_t for paired models.
| Disease | Location 1 | Location 2 | r_t |
| Dengue | Brazil | Thailand | 0.39 |
| HIV/AIDS | China (PRC) | Japan | −0.06 |
| Influenza | Japan | Poland | 0.45 |
| Influenza | Japan | Thailand | 0.81 |
| Influenza | Japan | United States | 0.62 |
| Influenza | Poland | Thailand | 0.48 |
| Influenza | Poland | United States | 0.44 |
| Influenza | Thailand | United States | 0.76 |
| Tuberculosis | China (PRC) | Norway | 0.19 |
| Tuberculosis | China (PRC) | Thailand | −0.20 |
| Tuberculosis | Norway | Thailand | n/a |
This table lists the transferability scores r_t for each tested pair of countries within a disease. Countries that did not share enough articles to compute a meaningful r_t are indicated with n/a.
Figure 1. Selected successful model nowcasts.
These graphs show official epidemiological data and the nowcast model estimates (left Y axis), together with traffic to the five most-correlated Wikipedia articles (right Y axis), over the 3-year study periods. The Wikipedia time series are individually self-normalized. Graphs for the four remaining successful contexts (dengue in Thailand, influenza in Japan, influenza in Thailand, and tuberculosis in Thailand) are included in the supplemental data file S1.
Figure 2. Forecasting effectiveness for selected successful models.
This figure shows model r² compared to temporal offset in days: positive offsets are forecasting, zero is nowcasting (marked with a dotted line), and negative offsets are anti-forecasting. As above, figures for the four successful contexts not included here are in the supplemental data S1.
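The offset convention can be made concrete with a small sketch. It is a simplification under stated assumptions: it fits an ordinary least-squares line to a single article's traffic, whereas the models described here combine several articles, and the series are synthetic (traffic is constructed to lead incidence by exactly 7 days). The names `r_squared`, `r2_at_offset`, and `curve` are hypothetical.

```python
import math


def r_squared(xs, ys):
    """r^2 of a least-squares line fit of ys on xs (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def r2_at_offset(traffic, incidence, offset):
    """Pair traffic on day t with incidence on day t + offset.

    offset > 0 is forecasting, 0 is nowcasting, < 0 is anti-forecasting.
    """
    n = len(incidence)
    if offset >= 0:
        return r_squared(traffic[: n - offset], incidence[offset:])
    return r_squared(traffic[-offset:], incidence[: n + offset])


def curve(day):
    """Synthetic seasonal signal used for both series (illustrative only)."""
    return 10 + 5 * math.sin(2 * math.pi * day / 180)


incidence = [curve(d) for d in range(400)]
traffic = [curve(d + 7) for d in range(400)]  # traffic leads by 7 days

scores = {k: r2_at_offset(traffic, incidence, k) for k in range(-28, 29)}
best_offset = max(scores, key=scores.get)
print(best_offset)  # 7 for this synthetic data
```

Scanning the offsets from -28 to 28 and plotting `scores` would give a curve of the same kind as the one in this figure, with its peak at the lead time built into the synthetic data.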
Figure 3. Nowcast attempts where the model was unable to capture a meaningful pattern in official data.
Figure 4. Nowcast attempts with poor performance due to unfavorable signal-to-noise ratio.
Model performance summary.
| Disease | Location | Result | r² (0 days) | r² (7 days) | r² (14 days) | r² (28 days) | Best forecast offset (days) | Best forecast r² |
| Cholera | Haiti | Failure (SNR) | 0.45 | 0.39 | 0.41 | 0.48 | 26 | 0.50 |
| Dengue | Brazil | Success | 0.85 | 0.81 | 0.77 | 0.65 | −3 | 0.86 |
| Dengue | Thailand | Success | 0.55 | 0.54 | 0.57 | 0.74 | 28 | 0.74 |
| Ebola | Uganda/DRC | Failure (SNR) | 0.02 | 0.01 | 0.02 | 0.02 | 5 | 0.14 |
| HIV/AIDS | China (PRC) | Failure (Official data) | 0.62 | 0.48 | 0.34 | 0.31 | −1 | 0.63 |
| HIV/AIDS | Japan | Failure (Official data) | 0.15 | 0.19 | 0.15 | 0.05 | 9 | 0.22 |
| Influenza | Japan | Success | 0.82 | 0.92 | 0.86 | 0.52 | 8 | 0.92 |
| Influenza | Poland | Success | 0.81 | 0.86 | 0.88 | 0.72 | 12 | 0.89 |
| Influenza | Thailand | Success | 0.79 | 0.76 | 0.67 | 0.48 | −2 | 0.80 |
| Influenza | United States | Success | 0.89 | 0.90 | 0.85 | 0.66 | 5 | 0.91 |
| Plague | United States | Failure (SNR) | 0.23 | 0.03 | 0.05 | 0.07 | 0 | 0.23 |
| Tuberculosis | China (PRC) | Success | 0.66 | 0.66 | 0.52 | 0.25 | −9 | 0.78 |
| Tuberculosis | Norway | Failure (Official data) | 0.31 | 0.41 | 0.40 | 0.42 | 20 | 0.48 |
| Tuberculosis | Thailand | Success | 0.68 | 0.68 | 0.69 | 0.69 | 9 | 0.69 |
This table summarizes the performance of our estimation models. For each disease and location, we list the subjective success/failure classification as well as model r² at nowcasting (0-day forecast) and 7-, 14-, and 28-day forecasts. We also list the temporal offset in days of the best model (again, a positive offset indicates forecasting) along with that model's r².