Christopher H Arehart, Michael Z David, Vanja Dukic.
Abstract
The Morbidity and Mortality Weekly Reports of the U.S. Centers for Disease Control and Prevention document a raw proxy for counts of pertussis cases in the U.S., and the Project Tycho (PT) database provides an improved source of these weekly data. These data are limited because of reporting delays, variation in state-level surveillance practices, and changes over time in diagnosis methods. We aim to assess whether Google Trends (GT) search data track pertussis incidence relative to PT data and whether sociodemographic characteristics explain some variation in the accuracy of state-level models. GT and PT data were used to construct autocorrelation-corrected linear models for pertussis incidence in 2004–2011 for the entire U.S. and for each individual state. The national model resulted in a moderate correlation (adjusted R2 = 0.2369, p < 0.05), and state models tracked PT data for some but not all states. Sociodemographic variables explained approximately 30% of the variation in performance of individual state-level models. The significant correlation between GT models and public health data suggests that GT is a potentially useful pertussis surveillance tool. However, the variable accuracy of this tool by state suggests GT surveillance cannot be applied in a uniform manner across geographic sub-regions.
Year: 2019 PMID: 31875051 PMCID: PMC6930253 DOI: 10.1038/s41598-019-56385-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. U.S. time series showing pertussis incidence per 100,000 people, categorized by age group, for 1990–2017. After introduction of an acellular vaccine, there was an increase in incidence among school-age and adolescent age groups. Data from the National Notifiable Diseases Surveillance System[9].
Terms describing selection of the 14 GT searches used for modeling incidence.
| # | Preliminary Word Bank | Source | Final Word Bank | Exclusion Justification |
|---|---|---|---|---|
| 1 | “bordatella” | health information websites | yes | |
| 2 | “bordetella” | common misspelling | yes | |
| 3 | “CDC pertussis” | Pollet | no | |
| 4 | “chronic cough” | health information websites | yes | |
| 5 | “coqueluche” | Spanish term | yes | |
| 6 | “coughing fits” | health information websites | yes | |
| 7 | “coughing spell” | health information websites | no | |
| 8 | “exhaustion after cough” | health information websites | no | |
| 9 | “infant pertussis” | health information websites | no | |
| 10 | “infant whooping cough” | health information websites | no | |
| 11 | “pertusis” | common misspelling | yes | |
| 12 | “pertussis” | Pollet | yes | |
| 13 | “pertussis kids” | health information websites | no | |
| 14 | “pertussis symptoms” | Pollet | yes | |
| 15 | “pertussis treatment” | health information websites | yes | |
| 16 | “prolonged cough” | health information websites | no | |
| 17 | “puking after cough” | health information websites | no | |
| 18 | “symptoms whooping cough” | Pollet | no | |
| 19 | “tired after cough” | health information websites | no | |
| 20 | “tos ferina” | Spanish term | yes | |
| 21 | “uncontrollable cough” | health information websites | no | |
| 22 | “vomiting after cough” | health information websites | no | |
| 23 | “whooping cough adults” | Pollet | yes | |
| 24 | “whooping cough pertussis” | Pollet | no | |
| 25 | “whooping cough symptoms” | health information websites | yes | |
| 26 | “whooping cough treatment” | Pollet | yes | |
| 27 | “whooping cough” | Pollet | yes | |
| 28 | “whooping” | Pollet | no | Collinear with “whooping cough” |
Specific GT links are provided to illustrate how many terms in the preliminary word bank were excluded because they failed to return nonzero results above the privacy threshold – even in the most populated states such as California and New York.
Table 2. Abbreviated names, data sources, and descriptions of the state-level sociodemographic variables used in the explanatory linear model.
| Variable Name | Data Source | Description |
|---|---|---|
| ACEP | 2014 American College of Emergency Physicians (ACEP) Report Card | Scores based on access to care, quality of patient safety, public health, medical liability, disaster preparedness |
| Age | 2010 Census | Percent of population between 20–49 years of age |
| Poverty | 2010 Census | Percent of population in poverty |
| Internet | 2010 Census | Percent of individuals living in a household with internet access |
| Education | 2010 Census | Percent of population with bachelor’s degree or higher |
| Urban | 2010 Census | Percent of individuals living in urban areas |
| Vaccinated | 2014 CDC Childhood Diphtheria toxoid, Tetanus toxoid, acellular Pertussis (DTaP) Vaccination Coverage Report | Percent DTaP vaccination coverage among children aged 19–35 months |
| Republican | Federal Elections 2012: Election Results for the U.S. President, the U.S. Senate, and the U.S. House of Representatives | Percent of people who voted for Mitt Romney (Republican) in the 2012 presidential election |
| Job | U.S. Department of Labor, Bureau of Labor Statistics: May 2017 State Occupational Employment and Wage Estimates | Percent of employed population working in Healthcare Practitioners/Technical Occupations and Healthcare Support Occupations (occupation codes 29–0000 and 31–0000) |
| Population | US Census Bureau Annual Estimates of the Resident Population | 2010 census population |
| Household | 2010 Census | Average number of individuals per household |
| Birth | 2010 CDC births by race of mother, United States, each state and territory | Births per 100,000 individuals |
| Immigration | Department of Homeland Security: Persons Obtaining Lawful Permanent Resident Status by State or Territory Of Residence: Fiscal Year 2012 | Number of people obtaining permanent residence in the United States. |
Modeling results for each method described by the 52-week forecasting RMSE and adjusted R2 values for the U.S. overall and the average for 51 U.S. regions.
| Method | United States overall: 52-Week Forecasting RMSE | United States overall: 2004–2011 Mean Adjusted R2 | Average for 50 states and Washington, D.C.: Mean 52-Week Forecasting RMSE | Average for 50 states and Washington, D.C.: 2004–2011 Adjusted R2 |
|---|---|---|---|---|
| AIC(i*) | 2.3342 | 0.2682 | 0.1823 | 0.0593 |
| All Models Average | 2.5345 | 0.2560 | 0.1859 | 0.0577 |
| Top Models Average | 2.5453 | 0.2543 | 0.1861 | 0.0567 |
| AR(1) AIC(i*) | 1.9788 | 0.2369 | 0.1808 | 0.0735 |
| AR(1) All Models Average | 1.8954 | 0.2249 | 0.1785 | 0.0713 |
| AR(1) Top Models Average | 1.8982 | 0.2682 | 0.1786 | 0.0707 |
Abbreviations: AIC(i*): lowest AIC model, All Models Average: average of 2^n − 1 models using posterior probabilities, Top Models Average: average of the few most probable models using posterior probabilities, AR(1): models using the simple AR(1) Cochrane-Orcutt correction, RMSE: root-mean-square error.
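The AR(1) Cochrane-Orcutt correction and the RMSE named in the abbreviations can be illustrated with a minimal sketch. It assumes a single predictor and a plain iterative scheme, whereas the models summarized here combine several GT search terms, and the source does not give the implementation. The function names (`simple_ols`, `cochrane_orcutt`, `rmse`) are hypothetical and chosen only for illustration.

```python
from math import sqrt

def simple_ols(x, y):
    """Closed-form least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def cochrane_orcutt(x, y, max_iter=50, tol=1e-8):
    """Fit y_t = a + b*x_t + e_t with AR(1) errors e_t = rho*e_{t-1} + u_t.

    Assumption: one predictor and |rho| < 1 (the intercept is recovered
    by dividing by 1 - rho). Returns (a, b, rho).
    """
    a, b = simple_ols(x, y)
    rho = 0.0
    for _ in range(max_iter):
        resid = [yi - a - b * xi for xi, yi in zip(x, y)]
        # Estimate rho by regressing e_t on e_{t-1} without an intercept.
        num = sum(e1 * e0 for e0, e1 in zip(resid[:-1], resid[1:]))
        den = sum(e0 * e0 for e0 in resid[:-1])
        new_rho = num / den
        # Quasi-difference both series, then refit by ordinary least squares.
        x_star = [x1 - new_rho * x0 for x0, x1 in zip(x[:-1], x[1:])]
        y_star = [y1 - new_rho * y0 for y0, y1 in zip(y[:-1], y[1:])]
        a_star, b = simple_ols(x_star, y_star)
        a = a_star / (1.0 - new_rho)  # intercept on the original scale
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return a, b, rho

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(actual, predicted)) / len(actual))
```

In this sketch, a 52-week forecasting RMSE would be `rmse(observed_weeks, forecast_weeks)` evaluated over the 52 held-out weeks, where both argument names are placeholders.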
Figure 2. Time-series graphs showing PT pertussis incidence (black) per 100,000 people by year for national U.S. data, 2004–2011. The left panel shows the results of all 6 modeling methods (see text), and the right panel shows the optimized AR(1) AIC(i*) model. The accuracy of this model supports previous findings that GT models can track incidence in larger geographic regions such as California[23] and Australia[24]. Some state-level models may be less accurate because they expose sources of cultural and sociodemographic variability that have little effect when combined in the national model. Abbreviations: PT: Project Tycho, AIC(i*): lowest AIC model, All Models Average: average of 2^n − 1 models using posterior probabilities, Top Models Average: average of the few most probable models using posterior probabilities, AR(1): models using the simple AR(1) Cochrane-Orcutt correction.
Figure 3. Estimated pertussis incidence per 100,000 population from all modeling methods for the 52-week 2011 forecasting period, United States. Abbreviations: PT: Project Tycho, AIC(i*): lowest AIC model, All Models Average: average of 2^n − 1 models using posterior probabilities, Top Models Average: average of the few most probable models using posterior probabilities, AR(1): models using the simple AR(1) Cochrane-Orcutt correction.
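The size of the candidate model set behind "All Models Average" (every non-empty subset of n predictors, i.e. 2^n − 1 models) and the selection of a lowest-AIC model can be sketched as follows. This is an illustration only: it assumes n = 14, the size of the final word bank, and the term names and the caller-supplied `aic_of` scoring function are hypothetical placeholders, not the authors' code.

```python
from itertools import combinations

def nonempty_subsets(terms):
    """Yield every non-empty subset of `terms` as a tuple."""
    for size in range(1, len(terms) + 1):
        yield from combinations(terms, size)

def lowest_aic_subset(terms, aic_of):
    """Return the subset with the smallest AIC.

    `aic_of` is a caller-supplied function mapping a tuple of terms to an
    AIC value (hypothetical; fitting the model is outside this sketch).
    """
    return min(nonempty_subsets(terms), key=aic_of)

# Assumption: 14 placeholder predictors standing in for the GT search terms.
terms = [f"term_{i}" for i in range(1, 15)]
n_models = sum(1 for _ in nonempty_subsets(terms))
print(n_models)  # 16383, i.e. 2**14 - 1
```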
Figure 4. Time-series data showing recorded incidence from PT (black) and AR(1) AIC(i*) modeled incidence (blue) for 2004–2011. The top panels show 2 states (North Dakota and New York) with well-performing models, and the bottom panels show 2 states (Connecticut and Alabama) with poorly performing models. The variability in accuracy among state models suggests that GT surveillance approaches cannot be applied uniformly across regions of the U.S.
Figure 5. (a) Heat map displaying the percentage of unexplained variation (1 − adjusted R2) in the AR(1) AIC(i*) models spanning the 2004–2010 timeframe in the U.S. Lighter shading indicates greater explanatory power, measured as R2 adjusted for the number of predictors (adjusted R2). (b) Heat map illustrating the state models’ predictive accuracy (52-week forecasting RMSE in 2011), where lighter shading represents a lower RMSE value and a better-performing state AR(1) AIC(i*) model.
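For reference, the quantity mapped in panel (a) can be computed as below. The sketch uses the standard textbook definition of adjusted R2 and assumes, without confirmation from the source, that the models follow it. The function names are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def unexplained_fraction(adj_r2):
    """Share of variation left unexplained, as shaded in a map like panel (a)."""
    return 1.0 - adj_r2

# Example with the national adjusted R2 reported in the abstract (0.2369).
national_unexplained = unexplained_fraction(0.2369)
```

Here `national_unexplained` is about 0.763, meaning roughly 76% of the variation in national incidence is not captured by that model.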
Summary of coefficients, standard error, p values and a brief interpretation of each variable’s effect on the 52-week forecast RMSE in the exploratory sociodemographic model (see Table 2 for definitions of variables).
| Variable | Coefficient | Standard Error | p | Interpretation |
|---|---|---|---|---|
| Intercept | 0.1258 | 0.0262 | 2.57E-05 | |
| ACEP | 0.0196 | 0.0224 | 0.3869 | Inconclusive |
| Age | −0.0916 | 0.0360 | 0.0153 | States/regions with a higher population of younger adults (age 20–49 years) may produce more accurate GT data with their search patterns. |
| Poverty | −0.0099 | 0.0311 | 0.7519 | Inconclusive |
| Internet | 0.0332 | 0.0331 | 0.3233 | More people with household internet access in a state/region corresponds to less accurate model forecasting. |
| Education | 0.0646 | 0.0427 | 0.1393 | More people with a bachelor’s degree in a state/region corresponds to less accurate model forecasting. |
| Urban | −0.0526 | 0.0246 | 0.0393 | People might produce more accurate GT data (i.e., better model forecasting) in urban settings. |
| Vaccinated | −0.0234 | 0.0250 | 0.3557 | Inconclusive |
| Republican | −0.0239 | 0.0324 | 0.4657 | Inconclusive |
| Job | −0.0001 | 0.0260 | 0.9984 | Inconclusive |
| Population | −0.2627 | 0.2991 | 0.3854 | Inconclusive; however, we note that (because of the GT privacy threshold) larger populations in a state/region are associated with more available data from GT. |
| Household | −0.0208 | 0.0178 | 0.2509 | States/regions with more people per household may have more accurate GT data. |
| Birth | 0.0903 | 0.0301 | 0.0048 | A higher birth rate might correspond with less accurate GT model predictions. |
| Immigration | 0.1272 | 0.1823 | 0.4896 | Inconclusive |
Interpretations were deemed inconclusive if the standard error was larger than the magnitude of the coefficient; otherwise, the interpretation explains the coefficient’s direction (how an increase in the variable relates to the forecasting RMSE).
Abbreviations: GT: Google Trends. See Table 2 for variable descriptions.
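The rule in the table note (inconclusive when the standard error exceeds the magnitude of the coefficient, otherwise read the sign) can be checked against the reported values with a short sketch. The coefficients and standard errors are copied from the table; the function name `interpret` and the wording of the returned labels are assumptions made for illustration.

```python
# (variable, coefficient, standard error) as reported in the table above.
rows = [
    ("ACEP", 0.0196, 0.0224),
    ("Age", -0.0916, 0.0360),
    ("Poverty", -0.0099, 0.0311),
    ("Internet", 0.0332, 0.0331),
    ("Education", 0.0646, 0.0427),
    ("Urban", -0.0526, 0.0246),
    ("Vaccinated", -0.0234, 0.0250),
    ("Republican", -0.0239, 0.0324),
    ("Job", -0.0001, 0.0260),
    ("Population", -0.2627, 0.2991),
    ("Household", -0.0208, 0.0178),
    ("Birth", 0.0903, 0.0301),
    ("Immigration", 0.1272, 0.1823),
]

def interpret(coef, se):
    """Apply the table note: inconclusive if SE > |coefficient|, else read the sign."""
    if se > abs(coef):
        return "inconclusive"
    # A positive coefficient means a higher forecast RMSE, i.e. a less accurate model.
    return "less accurate forecasting" if coef > 0 else "more accurate forecasting"

labels = {name: interpret(coef, se) for name, coef, se in rows}
inconclusive = sorted(name for name, label in labels.items() if label == "inconclusive")
```

Applied to these values, the rule marks ACEP, Poverty, Vaccinated, Republican, Job, Population, and Immigration as inconclusive, which matches the Interpretation column.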