David Scheinker, Areli Valencia, Fatima Rodriguez.
Abstract
Importance: Obesity is a leading cause of high health care expenditures, disability, and premature mortality. Previous studies have documented geographic disparities in obesity prevalence.

Objective: To identify county-level factors associated with obesity using traditional epidemiologic and machine learning methods.

Design, Setting, and Participants: Cross-sectional study using linear regression models and machine learning models to evaluate the associations between county-level obesity and county-level demographic, socioeconomic, health care, and environmental factors from summarized statistical data extracted from the 2018 Robert Wood Johnson Foundation County Health Rankings and merged with US Census data from each of 3138 US counties. The explanatory power of the linear multivariate regression and the top performing machine learning model were compared using mean R² measured in 30-fold cross validation.

Exposures: County-level demographic factors (population; rural status; census region; and race/ethnicity, sex, and age composition), socioeconomic factors (median income, unemployment rate, and percentage of population with some college education), health care factors (rate of uninsured adults and primary care physicians), and environmental factors (access to healthy foods and access to exercise opportunities).

Main Outcomes and Measures: County-level obesity prevalence in 2018, its association with each county-level factor, and the percentage of variation in county-level obesity prevalence explained by linear multivariate and gradient boosting machine regression measured with R².
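The model comparison described in the abstract (mean R² across 30 cross-validation folds) can be sketched as follows. This is a minimal illustration rather than the authors' code: it uses synthetic data, a one-predictor least-squares fit as a stand-in for the multivariate linear model, and does not implement gradient boosting. The function names `fit_line`, `r_squared`, and `cross_validated_r2` are hypothetical.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(ys, preds):
    """R^2 = 1 - SS_res / SS_tot, evaluated on the observations passed in."""
    my = fmean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def cross_validated_r2(xs, ys, k=30, seed=0):
    """Return the k held-out R^2 values from k-fold cross validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [a + b * xs[i] for i in test_idx]
        scores.append(r_squared([ys[i] for i in test_idx], preds))
    return scores


# Synthetic stand-in data (NOT the study data): 3000 hypothetical counties.
rng = random.Random(42)
x = [rng.uniform(0, 100) for _ in range(3000)]
y = [35.0 - 0.07 * xi + rng.gauss(0, 4.0) for xi in x]

scores = cross_validated_r2(x, y, k=30)
mean_r2 = fmean(scores)
print(f"mean R^2 over {len(scores)} folds: {mean_r2:.3f}")
```

Each fold is held out once while the model is fitted on the remaining 29 folds, so every R² value reflects performance on data the model did not see during fitting. Running the same folds through a second model and comparing the two lists of scores gives the kind of comparison the abstract describes.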
Year: 2019 PMID: 31026030 PMCID: PMC6487629 DOI: 10.1001/jamanetworkopen.2019.2884
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Figure 1. Distribution of Obesity Prevalence by County and Census Region
A, Map of US counties by obesity prevalence. B, Density plot of county-level obesity prevalence in each US Census region.
Variables Included in the Regression Analysis With Summary Statistics and Univariate Regression Results for 2018 County-Level Obesity Prevalence
| Variable | Summary Statistics, Mean (SD) [Range], % | Univariate Regression Coefficient (SE) | Univariate Regression R² |
|---|---|---|---|
| Demographic factors | |||
| Population | 59.1 (10.4) [16.7-100] | −0.0721 (0.0076) | 0.0277 |
| Rural | 58.6 (31.5) [0-100] | 0.0345 (0.0025) | 0.0579 |
| Female | 49.9 (2.3) [27.8-56.5] | 0.1195 (0.0354) | 0.0036 |
| Aged <18 y | 22.3 (3.5) [0-40.9] | 0.2495 (0.0228) | 0.0369 |
| Aged ≥65 y | 18.4 (4.6) [4.6-56.3] | −0.0624 (0.0176) | 0.0040 |
| African American | 9.0 (14.3) [0-85.2] | 0.1042 (0.0053) | 0.1092 |
| Hispanic | 9.3 (13.7) [0.5-96.3] | −0.0946 (0.0057) | 0.0820 |
| Asian | 1.5 (2.9) [0-44.3] | −0.5008 (0.0268) | 0.1005 |
| American Indian/Alaskan Native | 2.3 (7.7) [0-93.1] | 0.0568 (0.0105) | 0.0093 |
| Native Hawaiian/other | 0.1 (1.0) [0-50] | −0.3945 (0.0815) | 0.0074 |
| Census region | | | 0.2377 |
| Midwest | NA | 32.2 (3.0) | NA |
| Northeast | NA | 28.6 (4.0) | NA |
| South | NA | 32.9 (4.2) | NA |
| West | NA | 32.4 (4.8) | NA |
| Socioeconomic factors | |||
| Household income | 91.3 (2.1) [84.7-100] | −1.0254 (0.0347) | 0.2179 |
| Some college | 57.2 (11.6) [15.5-94.0] | −0.1563 (0.0064) | 0.1597 |
| Food insecure | 14.1 (4.2) [3.4-37.9] | 0.4065 (0.0176) | 0.1455 |
| Unemployed | 5.3 (1.9) [1.7-23.5] | 0.6532 (0.0412) | 0.0743 |
| Severe housing problems | 14.5 (4.8) [2.7-70.1] | −0.1626 (0.0166) | 0.0297 |
| Health care factors | |||
| Uninsured | 12.0 (5.1) [2.1-37.4] | −0.0571 (0.0158) | 0.0041 |
| Primary care physician rate | 12.1 (7.7) [0-100] | −0.1769 (0.0102) | 0.0907 |
| Environmental factors | |||
| Access to exercise opportunities | 63.0 (23.2) [0-100] | −0.0694 (0.0033) | 0.1269 |
| Food environment index | 7.4 (1.2) [0-10.0] | −1.0379 (0.0657) | 0.0741 |
Abbreviation: NA, not applicable because variables were not used in the corresponding model.
P < .01.
Mean (SD) reported.
Variables were log normalized and scaled to have a maximum value of 100.
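The univariate columns in the table above report a coefficient, its standard error, and R² for a regression of obesity prevalence on one factor at a time. A minimal sketch of how those three quantities are obtained for a single predictor is shown below; the function name `univariate_ols` and the five data points are hypothetical and are not taken from the study.

```python
import math
from statistics import fmean


def univariate_ols(xs, ys):
    """Return (slope, standard error of the slope, R^2) for y = a + b*x."""
    n = len(xs)
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    se = math.sqrt(ss_res / (n - 2) / sxx)
    return slope, se, 1.0 - ss_res / syy


# Hypothetical toy data: a county-level factor (%) and obesity prevalence (%).
factor = [10, 20, 30, 40, 50]
obesity = [34, 33, 31, 30, 28]

slope, se, r2 = univariate_ols(factor, obesity)
print(f"{slope:.3f} ({se:.3f}), R^2 = {r2:.3f}")  # -0.150 (0.010), R^2 = 0.987
```

The printed line follows the table's "Coefficient (SE)" layout, with R² alongside.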
Multivariate Regression Results
| Variable | Demographic Factors | Socioeconomic Factors | Health Care Factors | Environmental Factors | Combined |
|---|---|---|---|---|---|
| Observations, No. | 3135 | 3137 | 3003 | 3117 | 2984 |
| R² | 0.452 | 0.331 | 0.092 | 0.156 | 0.603 |
| Adjusted R² | 0.449 | 0.330 | 0.091 | 0.155 | 0.600 |
Demographic factors include percentage of population, percentage rural, percentage female, percentage younger than 18 years, percentage 65 years and older, percentage African American, percentage Hispanic, percentage Asian, percentage American Indian/Alaskan Native, and percentage Native Hawaiian/other.
Socioeconomic factors include household income, percentage of children in poverty, percentage with some college, percentage food insecure, percentage unemployed, and percentage with severe housing problems.
Health care factors include percentage uninsured and primary care physician rate.
Environmental factors include percentage with access to exercise opportunities and food environment index.
Combined includes all factors.
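The adjusted value reported for each model penalizes R² for the number of fitted terms. A short sketch of the standard adjustment follows, applied to the combined model's reported figures. The function name `adjusted_r2` is hypothetical, and the predictor count of 22 is an assumption obtained by counting rows in the coefficient table (19 continuous factors plus 3 census-region indicators); the source does not state it.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Combined model as reported: R^2 = 0.603 with 2984 observations.
# ASSUMPTION: 22 fitted terms (19 continuous factors + 3 region indicators).
combined = adjusted_r2(0.603, 2984, 22)
print(f"{combined:.3f}")  # 0.600
```

With nearly 3000 observations and about two dozen predictors the penalty is small, which is consistent with the reported R² and adjusted values differing only in the third decimal place.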
Multivariate Regression
| Variable | Model 1, Coefficient (SE), % | Model 2, Coefficient (SE), % | Model 3, Coefficient (SE), % | Model 4, Coefficient (SE), % | Model 5, Coefficient (SE), % |
|---|---|---|---|---|---|
| Demographic factors | | | | | |
| Population | 0.007 (0.010) | | | | 0.004 (0.010) |
| Rural | 0.018 (0.003) | | | | −0.005 (0.003) |
| Female | −0.170 (0.034) | | | | 0.031 (0.034) |
| Aged <18 y | 0.351 (0.029) | | | | 0.297 (0.028) |
| Aged ≥65 y | 0.016 (0.023) | | | | −0.080 (0.021) |
| African American | 0.082 (0.005) | | | | 0.055 (0.006) |
| Hispanic | −0.073 (0.005) | | | | −0.071 (0.006) |
| Asian | −0.256 (0.025) | | | | −0.047 (0.027) |
| American Indian/Alaskan Native | 0.057 (0.009) | | | | 0.076 (0.010) |
| Native Hawaiian/other | 0.160 (0.064) | | | | 0.307 (0.145) |
| Census region | | | | | |
| Northeast | −1.876 (0.271) | | | | −1.777 (0.242) |
| South | 0.184 (0.163) | | | | −0.473 (0.166) |
| West | −4.390 (0.208) | | | | −3.899 (0.199) |
| Socioeconomic factors | | | | | |
| Household income | | −0.340 (0.052) | | | −0.667 (0.051) |
| Some college | | −0.073 (0.008) | | | −0.077 (0.008) |
| Food insecure | | 0.310 (0.023) | | | −0.017 (0.033) |
| Unemployed | | 0.151 (0.045) | | | 0.199 (0.039) |
| Severe housing problems | | −0.305 (0.016) | | | −0.156 (0.017) |
| Health care factors | | | | | |
| Uninsured | | | 0.003 (0.016) | | −0.139 (0.017) |
| Primary care physician rate | | | −0.177 (0.010) | | −0.044 (0.008) |
| Environmental factors | | | | | |
| Access to exercise opportunities | | | | −0.687 (0.066) | −0.009 (0.003) |
| Food environment index | | | | −0.058 (0.003) | 0.025 (0.088) |
| Observations, No. | 3135 | 3137 | 3003 | 3117 | 2984 |
| R² | 0.452 | 0.331 | 0.092 | 0.156 | 0.603 |
| Adjusted R² | 0.449 | 0.330 | 0.091 | 0.155 | 0.600 |
Abbreviation: SE, standard error.
Combined category includes all factors.
P < .01.
P < .05.
Figure 2. Comparison of Performance of Gradient Boosting Machine Regression and Linear Multivariate Regression Using 30-Fold Cross Validation
Violin plots of the distribution of the R² values for the gradient boosting machine and linear models. The box plots inside the violin plots show the following: the middle lines indicate the medians; the bottom and top of each box show the 25th and 75th percentiles, respectively; the bottom whiskers show the 25th percentile minus 1.5 × the interquartile range; the top whiskers show the 75th percentile plus 1.5 × the interquartile range; and the points beyond the whiskers are outliers, defined as data points that lie below the bottom whisker or above the top whisker.
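The box-plot quantities named in the caption can be computed directly from a list of per-fold R² values. The sketch below assumes Python's standard `statistics` module with the "inclusive" quartile method; the plotting software used for the figure may define quartiles slightly differently. The function name `box_plot_summary` and the nine R² values are hypothetical.

```python
from statistics import median, quantiles


def box_plot_summary(values):
    """Median, quartiles, whisker limits, and outliers as in the caption."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - 1.5 * iqr   # bottom whisker: 25th percentile - 1.5 x IQR
    high = q3 + 1.5 * iqr  # top whisker: 75th percentile + 1.5 x IQR
    return {
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "lower_whisker": low,
        "upper_whisker": high,
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical per-fold R^2 values, invented for illustration only.
fold_r2 = [0.58, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.30]
summary = box_plot_summary(fold_r2)
print(summary["median"], summary["outliers"])  # 0.62 [0.3]
```

Here the single low fold falls below the bottom whisker and is flagged as an outlier, while the remaining folds lie within the whisker limits.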