Wei Luo1, Thin Nguyen1, Melanie Nichols2, Truyen Tran1, Santu Rana1, Sunil Gupta1, Dinh Phung1, Svetha Venkatesh1, Steve Allender2.
Abstract
For years, we have relied on population surveys to keep track of regional public health statistics, including the prevalence of non-communicable diseases. Because of the cost and limitations of such surveys, we often do not have up-to-date data on the health outcomes of a region. In this paper, we examined the feasibility of inferring regional health outcomes from socio-demographic data that are widely available and updated in a timely manner through national censuses and community surveys. Using data for 50 American states (excluding Washington DC) from 2007 to 2012, we constructed a machine-learning model to predict the prevalence of six non-communicable disease (NCD) outcomes (four NCDs and two major clinical risk factors), based on population socio-demographic characteristics from the American Community Survey. We found that regional prevalence estimates for non-communicable diseases can be reasonably predicted. The predictions were highly correlated with the observed data, both in the states included in the derivation model (median correlation 0.88) and in those excluded from model development for use as a completely separate validation sample (median correlation 0.85), demonstrating that the model had sufficient external validity to make good predictions, based on demographics alone, for areas not included in the model development. This highlights both the utility of this sophisticated approach to model development and the vital importance of simple socio-demographic characteristics as both indicators and determinants of chronic disease.
Year: 2015 PMID: 25938675 PMCID: PMC4418831 DOI: 10.1371/journal.pone.0125602
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Data used for model derivation and validation.
The 50 states were randomly sampled into a derivation group of 30 states and a validation group of 20 states (see Fig 2). Models were derived from the 2007–2010 data of the derivation group and were validated using the 2011–2012 data of both the derivation group and the validation group.
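The split described in this caption can be sketched as follows. This is a minimal illustration only: the state labels, the `state` and `year` keys, and the random seed are hypothetical, and the authors' actual sampling procedure is not given here.

```python
import random

def split_states(states, n_derivation=30, seed=0):
    """Randomly partition states into a derivation and a validation group."""
    rng = random.Random(seed)  # the seed is an assumption, used for reproducibility
    shuffled = list(states)
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_derivation]), sorted(shuffled[n_derivation:])

def split_rows(rows, derivation_states):
    """Assign state-year rows (dicts with hypothetical 'state'/'year' keys) to sets."""
    train, test_same_states, test_new_states = [], [], []
    for row in rows:
        in_derivation = row["state"] in derivation_states
        if 2007 <= row["year"] <= 2010:
            if in_derivation:  # only derivation states are used for fitting
                train.append(row)
        elif row["year"] in (2011, 2012):
            (test_same_states if in_derivation else test_new_states).append(row)
    return train, test_same_states, test_new_states

states = [f"S{i:02d}" for i in range(50)]  # placeholder labels, not real state names
derivation, validation = split_states(states)
rows = [{"state": s, "year": y} for s in states for y in range(2007, 2013)]
train, test_same, test_new = split_rows(rows, set(derivation))
```

Under this reading, the 2007–2010 rows of the validation states are never used for fitting, so the 2011–2012 validation-group rows are out of sample in both time and geography.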
Fig 2. States allocated to the derivation (training) group and the validation (testing) group for model development.
The 30 states in the derivation group are coloured blue; the 20 states in the validation group are coloured red.
Independent and dependent variables included in the study, measuring proportions of population for different survey responses.
| Variable | Details and categories |
|---|---|
| Independent variables (American Community Survey) | |
| Age | Proportion (%) of population aged: 18–25 years |
| Sex | Proportion (%) of males in population |
| Race | Proportion (%) of population: White |
| Household income | Proportion (%) of population with annual total household income: <USD $15,000 |
| Employment status | Proportion (%) of population: Employed |
| Marital status | Proportion (%) of population: Married |
| Education | Proportion (%) of population with the following levels of formal education: Did not complete high school |
| Dependent variables (Behavioral Risk Factor Surveillance Survey) | |
| High blood pressure | State-level proportion of adults (%) who have ever been told they have high blood pressure. Self-reported. |
| Obese | State-level prevalence (%) of population with body mass index (BMI) greater than 30 kg/m². Based on self-reported weight and height. |
| Cardiovascular Disease—Angina or coronary heart disease | State-level prevalence (%) of self-reported chronic heart condition. “Has a doctor, nurse, or other health professional ever told you that you had angina or coronary heart disease?” |
| Cardiovascular Disease—Heart attack | State-level prevalence (%) of self-reported history of AMI. “Has a doctor, nurse, or other health professional ever told you that you had a heart attack, also called a myocardial infarction?” |
| Cardiovascular Disease—Stroke | State-level prevalence (%) of self-reported history of stroke. “Has a doctor, nurse, or other health professional ever told you that you had a stroke?” |
| Diabetes | State-level prevalence (%) of lifetime diabetes diagnosis, excluding gestational-diabetes-only diagnoses and borderline or pre-diabetes. “Has a doctor, nurse, or other health professional ever told you that you have diabetes?” |
*reference category.
Prevalence of NCD and risk factors for the derivation group states and the validation group states at year 2011, defined by proportion of positive responses to the corresponding BRFSS questions.
| State prevalence, mean (sd) | Derivation group | Validation group |
|---|---|---|
| High blood pressure | 31.1 (3.3) | 32.5 (4.1) |
| Obese | 27.6 (3.2) | 27.8 (2.7) |
| Cardiovascular Disease-Angina or CHD | 4.2 (0.9) | 4.3 (0.8) |
| Cardiovascular Disease-Heart attack | 4.4 (0.9) | 4.5 (0.9) |
| Cardiovascular Disease-Stroke | 2.9 (0.5) | 3.1 (0.7) |
| Diabetes | 9.4 (1.3) | 9.7 (1.4) |
Fig 3. Coefficients of independent demographic variables in models predicting different NCDs and risk factors.
The dashed lines mark zero. The center dots show the means of the coefficients. The thicker bars show ± sd and the thinner bars show ± 2sd, based on 1000 models fitted with resampling of the training data.
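The resampling summary in this caption can be illustrated with a small bootstrap. This is a sketch under stated assumptions: it swaps in a single-predictor ordinary least squares fit and synthetic data, since neither the paper's model nor its resampling scheme is specified here; all function names are hypothetical.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Slope of a one-predictor least-squares fit (stand-in for the real model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def bootstrap_slopes(xs, ys, n_models=1000, seed=0):
    """Refit the model on n_models resamples (with replacement) of the data."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:  # degenerate resample: slope undefined, draw again
            continue
        slopes.append(ols_slope(bx, by))
    return slopes

# Synthetic training data (assumption: 120 state-year observations).
rng = random.Random(1)
xs = [rng.uniform(10, 30) for _ in range(120)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

slopes = bootstrap_slopes(xs, ys)
mean = statistics.fmean(slopes)           # the center dot
sd = statistics.stdev(slopes)
thick_bar = (mean - sd, mean + sd)        # the thicker bar
thin_bar = (mean - 2 * sd, mean + 2 * sd) # the thinner bar
```

A coefficient whose thin bar does not cross zero is one whose sign is stable across resamples.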
Accuracy of risk factor and disease estimates evaluated using out-of-sample data from 2011 and 2012.
| | RMSE, derivation | RMSE, validation | Median absolute error (median relative absolute error), derivation | Median absolute error (median relative absolute error), validation | Pearson correlation (95% CI), derivation | Pearson correlation (95% CI), validation |
|---|---|---|---|---|---|---|
| High blood pressure | 3.436 (-2.901) | 4.159 (-3.601) | 2.672 (0.097) | 3.491 (0.12) | 0.823 (0.658–0.913) | 0.864 (0.683–0.945) |
| Obese | 1.723 (-0.7) | 2.084 (-0.089) | 1.196 (0.043) | 1.515 (0.052) | 0.872 (0.794–0.922) | 0.753 (0.577–0.862) |
| Angina or CHD | 0.528 (-0.299) | 0.72 (-0.457) | 0.298 (0.076) | 0.458 (0.131) | 0.906 (0.847–0.943) | 0.806 (0.698–0.907) |
| Heart attack | 0.541 (-0.361) | 0.552 (-0.304) | 0.398 (0.097) | 0.385 (0.103) | 0.9 (0.837–0.939) | 0.861 (0.751–0.924) |
| Stroke | 0.409 (-0.301) | 0.537 (-0.399) | 0.242 (0.091) | 0.463 (0.184) | 0.876 (0.794–0.922) | 0.866 (0.759–0.927) |
| Diabetes | 1.381 (-1.158) | 1.188 (-1.008) | 1.115 (0.126) | 0.908 (0.103) | 0.851 (0.762–0.909) | 0.911 (0.838–0.952) |
*RMSE stands for Root Mean Squared Error.
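The accuracy measures named in the table header can be computed as in the sketch below. The example values are made up, and the sign convention for bias (predicted minus observed) and the choice of the observed value as the denominator of the relative error are assumptions, not definitions taken from the paper.

```python
import math
import statistics

def rmse(pred, obs):
    """Root mean squared error."""
    return math.sqrt(statistics.fmean((p - o) ** 2 for p, o in zip(pred, obs)))

def bias(pred, obs):
    """Mean signed error; predicted minus observed is an assumed convention."""
    return statistics.fmean(p - o for p, o in zip(pred, obs))

def median_abs_error(pred, obs):
    return statistics.median(abs(p - o) for p, o in zip(pred, obs))

def median_rel_abs_error(pred, obs):
    """Absolute error relative to the observed value (assumed denominator)."""
    return statistics.median(abs(p - o) / o for p, o in zip(pred, obs))

def pearson(pred, obs):
    mp, mo = statistics.fmean(pred), statistics.fmean(obs)
    spo = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    spp = sum((p - mp) ** 2 for p in pred)
    soo = sum((o - mo) ** 2 for o in obs)
    return spo / math.sqrt(spp * soo)

# Made-up prevalence values (%) for four hypothetical states.
observed = [30.0, 32.0, 28.0, 35.0]
predicted = [29.0, 31.0, 28.5, 33.0]

print(rmse(predicted, observed))              # 1.25
print(bias(predicted, observed))              # -0.875
print(median_abs_error(predicted, observed))  # 1.0
```

Reporting both absolute and relative errors matters here because the outcomes differ widely in scale: an error of one percentage point is small for high blood pressure but large for stroke.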
Fig 4. Predicted vs observed values of state NCD and risk factor prevalence.
Agreement between the predicted and the observed values is reflected by the proximity of the points to the diagonal dashed lines.
Prediction performance of different models.
| RMSE on validation group (bias) | Stepwise regression | Group lasso | Random forest | Gaussian process regression |
|---|---|---|---|---|
| High blood pressure | 2.324 (-1.231) | 3.35 (-2.652) | 4.387 (-3.676) | 3.141 (-2.438) |
| Obese | 2.194 (-0.658) | 2.059 (-0.451) | 2.512 (-0.765) | 2.113 (-0.852) |
| Cardiovascular Disease - Angina or CHD | 0.818 (-0.467) | 0.657 (-0.376) | 0.678 (-0.308) | 0.489 (-0.121) |
| Cardiovascular Disease - Heart attack | 0.595 (-0.164) | 0.518 (-0.24) | 0.64 (-0.328) | 0.547 (-0.288) |
| Cardiovascular Disease - Stroke | 0.513 (-0.312) | 0.491 (-0.327) | 0.493 (-0.36) | 0.57 (-0.343) |
| Diabetes | 1.091 (-0.839) | 1.166 (-0.999) | 1.498 (-1.204) | 1.432 (-1.05) |
*The following software packages were used: the stepAIC function from the MASS R package [13] (for stepwise regression), the glmnet R package [14] (for group lasso), the randomForest R package [15] (for random forest), and the GPML Matlab toolbox [16] (for Gaussian process regression). For Gaussian process regression, feature selection was first performed with Hilbert-Schmidt Independence Criterion Lasso [17]. The mean function was a constant function of the mean prevalence in the training set. The covariance function was the squared exponential with a maximum allowable covariance of 10 and a length parameter of 10.
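For readers unfamiliar with the squared exponential covariance mentioned in this footnote, a minimal sketch follows. It assumes that the "maximum allowable covariance" is the value of the kernel at zero distance (the signal variance) and that the "length parameter" is the usual length-scale; the footnote does not confirm either reading, and this is not the GPML implementation.

```python
import math

def squared_exponential(x, y, max_cov=10.0, length=10.0):
    """k(x, y) = max_cov * exp(-||x - y||^2 / (2 * length^2)).

    max_cov and length default to the values quoted in the footnote;
    their interpretation here is an assumption.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return max_cov * math.exp(-sq_dist / (2.0 * length ** 2))

same = squared_exponential([20.0, 50.0], [20.0, 50.0])  # identical inputs
near = squared_exponential([20.0, 50.0], [22.0, 51.0])  # similar demographics
far = squared_exponential([20.0, 50.0], [60.0, 10.0])   # very different demographics
```

With this form, states with similar demographic profiles receive covariance close to the maximum, and the covariance decays smoothly toward zero as profiles diverge.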
Number of states with observed values differing from predicted values by greater than 10% of estimate, by outcome and year.
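| | NCD prevalence >10% better than predicted by demographic model, 2011 | NCD prevalence >10% better than predicted by demographic model, 2012 | NCD prevalence >10% worse than predicted by demographic model, 2011 | NCD prevalence >10% worse than predicted by demographic model, 2012 |
|---|---|---|---|---|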
| CVD—Angina/CHD | 6 | 0 | 9 | 23 |
| CVD—AMI | 5 | 2 | 17 | 24 |
| CVD—Stroke | 4 | 1 | 21 | 27 |
| Diabetes | 1 | 2 | 15 | 26 |
| High blood pressure | 0 | n/a | 24 | n/a |
| Obesity | 2 | 4 | 3 | 5 |
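The counts in this table could be produced along the following lines. This is a sketch that assumes "better" means an observed prevalence more than 10% below the predicted value and "worse" means more than 10% above it, with the predicted value as the denominator; the state labels and numbers are invented.

```python
def count_outliers(observed, predicted, threshold=0.10):
    """Count states whose observed prevalence deviates from the prediction
    by more than `threshold`, as a fraction of the predicted value."""
    better = worse = 0
    for state, pred in predicted.items():
        rel_diff = (observed[state] - pred) / pred
        if rel_diff > threshold:      # higher prevalence than expected
            worse += 1
        elif rel_diff < -threshold:   # lower prevalence than expected
            better += 1
    return better, worse

# Invented prevalence values (%) for four hypothetical states.
predicted = {"A": 10.0, "B": 10.0, "C": 10.0, "D": 10.0}
observed = {"A": 8.5, "B": 10.5, "C": 11.5, "D": 9.5}

better, worse = count_outliers(observed, predicted)
print(better, worse)  # 1 1
```

States flagged in either direction are candidates for closer study, since their outcomes are not explained by demographics alone.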