Gabriel M Knight, Gabriela Spencer-Bonilla, David M Maahs, Manuel R Blum, Areli Valencia, Bongeka Z Zuma, Priya Prahalad, Ashish Sarraju, Fatima Rodriguez, David Scheinker.
Abstract
INTRODUCTION: Population-level and individual-level analyses each have strengths and limitations, as do 'black-box' machine learning (ML) models and traditional, interpretable models. Diabetes mellitus (DM) is a leading cause of morbidity and mortality, and its complex sociodemographic dynamics have not been analyzed in a way that leverages both population-level and individual-level data and both traditional epidemiological and ML models. We analyzed complementary individual-level and county-level datasets with both regression and ML methods to study the association between sociodemographic factors and DM.
RESEARCH DESIGN AND METHODS: County-level DM prevalence, demographics, and socioeconomic status (SES) factors were extracted from the 2018 Robert Wood Johnson Foundation County Health Rankings and merged with US Census data. Analogous individual-level data were extracted from the 2007 to 2016 National Health and Nutrition Examination Survey studies and corrected for oversampling with survey weights. For county-level data, we used multivariate linear regression and ML regression models; for individual-level data, we used multivariate logistic regression and ML classification models. Regression and ML models were compared using measures of explained variation (area under the receiver operating characteristic curve (AUC) and R²).
Keywords: diabetes mellitus; ethnic groups; informatics; risk factors; type 2
Year: 2020; PMID: 33229378; PMCID: PMC7684662; DOI: 10.1136/bmjdrc-2020-001725
Source DB: PubMed; Journal: BMJ Open Diabetes Res Care; ISSN: 2052-4897
Figure 1. Map of DM prevalence by county. Map of US counties according to county-level DM prevalence rates obtained from CHR data. CHR, County Health Rankings; DM, diabetes mellitus.
Univariate regression results
Individual DM status – univariate logistic regression results
| Category | Factor | Mean (SD) (range) or % of sample | Coefficient (SE) | AUROC |
|---|---|---|---|---|
| Demographic | Age | 49.4 (15.9) (20.0–80.0) | 0.0472 (0.00210)* | 0.694 |
| Demographic | Female | 50.9 | −0.168 (0.0712)* | 0.520 |
| Demographic | African-American | 10.2 | 0.512 (0.0699)* | 0.535 |
| Demographic | Hispanic | 14.2 | 0.106 (0.0694)* | 0.512 |
| Socioeconomic | Household income | NA | −0.866 (0.143)* | 0.580 |
| Socioeconomic | Some college | 61.1 | −0.544 (0.0722)* | 0.565 |
Univariate regression results for models using factors shared between NHANES (individual level) and CHR (county level) datasets (ie, sociodemographic factors). For female gender, Hispanic and African-American race/ethnicity, and education level factors from the individual-level (NHANES) data, summary characteristics are expressed in terms of per cent of total sample; not all summary statistics could be calculated. Similarly, for household income from the individual-level data, values were collected and stored as ranges of income; summary statistics could not be calculated. Finally, for county-level median household income, variables were normalized and scaled to have a maximum value of 100.
*p<0.001.
AUROC, area under the receiver operating characteristic curve; CHR, County Health Rankings; DM, diabetes mellitus; NHANES, National Health and Nutrition Examination Survey.
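As an illustration of how a coefficient and AUROC of the kind tabulated above could be computed with survey weights, the following is a minimal sketch and not the authors' code. The function names `fit_weighted_logistic` and `weighted_auroc`, the synthetic ages, the weights, and the assumed true model are all hypothetical. The sketch gives weighted point estimates only; design-based standard errors for NHANES would also need the survey's strata and sampling units, which are ignored here.

```python
import math
import random


def fit_weighted_logistic(x, y, w, iters=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson, weighting each row by w."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)        # weighted gradient term
            v = wi * p * (1.0 - p)   # weighted curvature term
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def weighted_auroc(score, y, w):
    """Weighted probability that a positive outranks a negative (ties count 1/2)."""
    rows = sorted(zip(score, y, w))
    pos_total = sum(wi for _, yi, wi in rows if yi == 1)
    neg_total = sum(wi for _, yi, wi in rows if yi == 0)
    neg_below = 0.0   # total negative weight with a strictly lower score
    concordant = 0.0
    i = 0
    while i < len(rows):
        j = i
        pos_tied = neg_tied = 0.0
        while j < len(rows) and rows[j][0] == rows[i][0]:
            if rows[j][1] == 1:
                pos_tied += rows[j][2]
            else:
                neg_tied += rows[j][2]
            j += 1
        concordant += pos_tied * (neg_below + 0.5 * neg_tied)
        neg_below += neg_tied
        i = j
    return concordant / (pos_total * neg_total)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in data: ages 20-80, hypothetical weights, assumed true model.
    age = [random.uniform(20, 80) for _ in range(2000)]
    weight = [random.uniform(0.5, 2.0) for _ in range(2000)]
    dm = [int(random.random() < 1 / (1 + math.exp(5.0 - 0.05 * a))) for a in age]
    b0, b1 = fit_weighted_logistic(age, dm, weight)
    scores = [b0 + b1 * a for a in age]
    print(f"slope={b1:.4f}  weighted AUROC={weighted_auroc(scores, dm, weight):.3f}")
```

For a single predictor, the fitted probabilities are a monotone function of that predictor, so the AUROC depends only on how the predictor ranks individuals and not on the fitted coefficient's magnitude.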
Multivariate regression results
Individual DM status – multivariate logistic regression results
| Category | Factor | Mean (SD) (range) or % of sample | Coefficient (SE) | AUROC (category model) | AUROC (full model) |
|---|---|---|---|---|---|
| Demographic | Age | 49.4 (15.9) (20.0–80.0) | 0.0467 (0.00185)* | 0.695 | 0.711 |
| Demographic | Female | 50.9 | −0.221 (0.0544)* | | |
| Demographic | African-American | 10.2 | 0.633 (0.0688)* | | |
| Demographic | Hispanic | 14.2 | 0.445 (0.0672)* | | |
| Socioeconomic | Household income | NA | −0.474 (0.127)* | 0.653 | |
| Socioeconomic | Some college | 61.1 | −0.200 (0.0595)† | | |
Multiple regression results for models using factors shared between NHANES (individual level) and CHR (county level) datasets (ie, sociodemographic factors). For female gender, Hispanic and African-American race/ethnicity, and education level factors from the individual-level (NHANES) data, summary characteristics are expressed in terms of percent of total sample; not all summary statistics could be calculated. Similarly, for household income from the individual-level data, values were collected and stored as ranges of income; summary statistics could not be calculated. Finally, for county-level median household income, variables were normalized and scaled to have a maximum value of 100.
*P<0.05.
†P<0.001.
AUROC, area under the receiver operating characteristic curve; CHR, County Health Rankings; DM, diabetes mellitus; NHANES, National Health and Nutrition Examination Survey.
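If the coefficients in the multivariate table are read as log-odds per unit of each factor on its raw scale, they convert to adjusted odds ratios by exponentiation. This is an assumption: the record does not state whether predictors were rescaled before fitting. Household income is left out because its scale is not reported (NA). A worked sketch for the remaining demographic factors:

```latex
\begin{align*}
\mathrm{OR}_{\text{age, per year}} &= e^{0.0467} \approx 1.048 \\
\mathrm{OR}_{\text{age, per 10 years}} &= e^{10 \times 0.0467} = e^{0.467} \approx 1.60 \\
\mathrm{OR}_{\text{African-American}} &= e^{0.633} \approx 1.88 \\
\mathrm{OR}_{\text{female}} &= e^{-0.221} \approx 0.80
\end{align*}
```

Under that reading, each additional decade of age would correspond to roughly 60% higher odds of DM with the other listed factors held fixed, and female sex to roughly 20% lower odds.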