Liangyuan Hu, Bian Liu, Jiayi Ji, Yan Li.
Abstract
Background: Stroke is a major cardiovascular disease that causes significant health and economic burden in the United States. Neighborhood community-based interventions have been shown to be both effective and cost-effective in preventing cardiovascular disease. There is a dearth of robust studies identifying the key determinants of cardiovascular disease and the underlying effect mechanisms at the neighborhood level. We aim to contribute to the evidence base for neighborhood cardiovascular health research.

Methods and Results: We created a new neighborhood health data set at the census tract level by integrating 4 types of potential predictors, including unhealthy behaviors, prevention measures, sociodemographic factors, and environmental measures from multiple data sources. We used 4 tree-based machine learning techniques to identify the most critical neighborhood-level factors in predicting the neighborhood-level prevalence of stroke, and compared their predictive performance for variable selection. We further quantified the effects of the identified determinants on stroke prevalence using a Bayesian linear regression model. Of the 5 most important predictors identified by our method, higher prevalence of low physical activity, larger share of older adults, higher percentage of non-Hispanic Black people, and higher ozone levels were associated with higher prevalence of stroke at the neighborhood level. Higher median household income was linked to lower prevalence. The most important interaction term showed an exacerbated adverse effect of aging and low physical activity on the neighborhood-level prevalence of stroke.

Conclusions: Tree-based machine learning provides insights into underlying drivers of neighborhood cardiovascular health by discovering the most important determinants from a wide range of factors in an agnostic, data-driven, and reproducible way. The identified major determinants and the interactive mechanism can be used to prioritize and allocate resources to optimize community-level interventions for stroke prevention.
Keywords: cardiovascular health; neighborhood; prevention; tree‐based machine learning; variable selection
Year: 2020 PMID: 33140687 PMCID: PMC7763737 DOI: 10.1161/JAHA.120.016745
Source DB: PubMed Journal: J Am Heart Assoc ISSN: 2047-9980 Impact factor: 5.501
Distribution of 24 Potential Neighborhood‐Level Predictors and Prevalence of Stroke Across 26 697 Census Tracts in 500 Major US Cities
| Domain | Variable Name | Definition | (Min, Max) | Median (Q1–Q3) | Mean (SD) | Data Source |
|---|---|---|---|---|---|---|
| Health outcomes | STROKE | Stroke among adults aged ≥18 y | (0.30, 18.80) | 2.80 (2.20–3.60) | 3.11 (1.43) | CDC 500 Cities Data |
| Unhealthy behaviors | SMOKING | Current smoking among adults aged ≥18 y | (2.00, 48.70) | 18.30 (14.30–23.10) | 19.10 (6.42) | CDC 500 Cities Data |
| Unhealthy behaviors | NO_PA | No leisure‐time physical activity among adults aged ≥18 y | (7.90, 61.30) | 24.20 (18.30–31.60) | 25.30 (8.62) | CDC 500 Cities Data |
| Unhealthy behaviors | OBESITY | Obesity among adults aged ≥18 y | (8.70, 58.50) | 28.60 (23.70–34.90) | 29.76 (8.06) | CDC 500 Cities Data |
| Unhealthy behaviors | INSUF_SLEEP | Sleeping <7 h among adults aged ≥18 y | (18.50, 59.80) | 36.30 (32.50–41.20) | 37.10 (6.42) | CDC 500 Cities Data |
| Prevention | LACK_INSURANCE | Current lack of health insurance among adults aged 18–64 y | (2.50, 70.80) | 18.00 (11.70–27.40) | 20.58 (11.27) | CDC 500 Cities Data |
| Prevention | DENTAL | Visits to dentist or dental clinic among adults aged ≥18 y | (18.90, 87.10) | 61.30 (49.80–70.50) | 59.82 (13.16) | CDC 500 Cities Data |
| Prevention | COLON_SCREEN | Fecal occult blood test, sigmoidoscopy, or colonoscopy among adults aged 50–75 y | (23.40, 81.50) | 60.60 (52.60–66.60) | 59.29 (9.38) | CDC 500 Cities Data |
| Prevention | CORE_PREV_M | Older adults aged ≥65 y who are up to date on a core set of clinical preventive services (men: flu shot past year, pneumococcal polysaccharides vaccine [PPV] shot ever, colorectal cancer screening) | (13.10, 52.20) | 29.90 (24.80–34.60) | 29.88 (6.50) | CDC 500 Cities Data |
| Prevention | CORE_PREV_W | Older adults aged ≥65 y who are up to date on a core set of clinical preventive services (women: same as above and mammogram past 2 y) | (9.60, 53.80) | 28.60 (23.00–33.90) | 28.64 (7.32) | CDC 500 Cities Data |
| Sociodemographic status | AGE65_OVER | Population aged ≥65 | (0.00, 100.00) | 14.81 (10.61–19.79) | 15.81 (8.02) | ACS |
| Sociodemographic status | AGE18_34 | Population aged between 18 and 34 | (0.00, 100.00) | 33.68 (27.28–40.74) | 35.01 (12.63) | ACS |
| Sociodemographic status | COLLEGE_HIGHER | Bachelor's degree or higher | (0.00, 100.00) | 23.71 (12.27–40.33) | 28.28 (19.62) | ACS |
| Sociodemographic status | HS_COLLEGE | High school graduate or higher | (0.00, 100.00) | 85.51 (75.78–91.66) | 82.44 (11.87) | ACS |
| Sociodemographic status | FEMALE | Female | (0.00, 100.00) | 51.19 (48.82–53.60) | 51.04 (4.99) | ACS |
| Sociodemographic status | NON_HIS_ASIAN | Not Hispanic or Latino—Asian alone | (0.00, 91.32) | 3.08 (0.72–8.50) | 7.26 (11.36) | ACS |
| Sociodemographic status | NON_HIS_BLACK | Not Hispanic or Latino—Black or African American alone | (0.00, 100.00) | 7.37 (2.19–24.43) | 19.73 (26.69) | ACS |
| Sociodemographic status | NON_HIS_OTHER | Not Hispanic or Latino—Other | (0.00, 119.45) | 4.61 (2.07–8.06) | 6.02 (6.44) | ACS |
| Sociodemographic status | NON_HIS_WHITE | Not Hispanic or Latino—White alone | (0.00, 100.00) | 48.02 (17.24–72.15) | 45.65 (29.55) | ACS |
| Sociodemographic status | POVERTY | Below poverty level; estimate; families | (0.00, 100.00) | 12.10 (5.10–24.00) | 16.09 (13.91) | ACS |
| Sociodemographic status | MED_INCOME‡ | Median household income in the past 12 mo (in thousands) | (4.17, 250.00) | 49.58 (34.10–70.43) | 55.49 (29.17) | ACS |
| Environmental factors | HOUSE_PRE1960‡ | Pre‐1960 housing (lead paint indicator) (in thousands) | (0.00, 8.13) | 0.48 (0.10–0.92) | 0.59 (0.56) | EPA‐EJSCREEN |
| Environmental factors | TRAFFIC‡ | Traffic proximity and volume (average number of vehicles/distance) | (0.00, 62.11) | 0.39 (0.12–1.10) | 1.17 (2.75) | EPA‐EJSCREEN |
| Environmental factors | OZONE‡ | Ozone level in air (ppb) | (27.63, 73.67) | 48.74 (44.40–52.81) | 48.04 (8.13) | EPA‐EJSCREEN |
| Environmental factors | PM25‡ | PM2.5 level in air (μg/m³) | (4.97, 13.32) | 9.89 (8.54–10.66) | 9.71 (1.53) | EPA‐EJSCREEN |
Measures are in percentages for all variables except those marked with a double dagger (‡). PM indicates particulate matter; Q1, first quartile; and Q3, third quartile.
CDC 500 Cities Data: census tract level data from the Centers for Disease Control and Prevention (CDC), modeled from population‐based survey data from the Behavioral Risk Factor Surveillance System.
ACS: census tract level data from the 2011–2015 American Community Survey 5‐Year Estimates provided by the Census Bureau.
‡ Indicates variables with absolute measurements as opposed to percentages.
EPA‐EJSCREEN: To match the census tract unit used by the other 2 data sources, we aggregated the census block group level environmental measures to the census tract level by taking the mean for PM2.5 and ozone (O3), the sum for the housing data, and the sum of the block‐group‐level population‐weighted traffic proximity data. PM2.5 concentrations are the annual average of the daily ambient average, and ozone concentrations are the average of the daily maximum 8‐hour level for the summer season. Both PM2.5 and ozone came from a space‐time downscaling fusion model based on monitoring data and modeled data. Traffic data reflect the annual average daily traffic count of vehicles, that is, the count of vehicles at major roads within 500 meters divided by the distance in meters, and were calculated from US Department of Transportation traffic data. Pre‐1960 housing data were based on the American Community Survey from the US Census.
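The block-group-to-tract aggregation described in the note above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record keys (`tract`, `pm25`, `ozone`, `pre1960_units`, `traffic`, `population`) are hypothetical, and the traffic rule assumes that "population weighted" means each block group is weighted by its share of the tract population.

```python
from collections import defaultdict


def aggregate_to_tract(block_groups):
    """Aggregate block-group records (dicts with hypothetical keys) to tracts."""
    by_tract = defaultdict(list)
    for bg in block_groups:
        by_tract[bg["tract"]].append(bg)

    tracts = {}
    for tract, rows in by_tract.items():
        n = len(rows)
        total_pop = sum(r["population"] for r in rows)
        tracts[tract] = {
            # Mean across block groups for the two air-quality measures.
            "pm25": sum(r["pm25"] for r in rows) / n,
            "ozone": sum(r["ozone"] for r in rows) / n,
            # Sum across block groups for the pre-1960 housing count.
            "pre1960_units": sum(r["pre1960_units"] for r in rows),
            # Assumption: weight = block group's share of tract population.
            "traffic": (
                sum(r["traffic"] * r["population"] for r in rows) / total_pop
                if total_pop
                else 0.0
            ),
        }
    return tracts
```

Keeping the per-measure rule next to each output field makes it easy to swap in a different weighting if the assumed traffic rule turns out not to match the source data.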
Figure 1. Variable selection algorithm using BART‐Machine.
BART indicates Bayesian Additive Regression Trees; and VIP, variable inclusion proportions.
Figure 2. Comparison of cross‐validated (CV) root mean squared error (RMSE) for each of 4 tree‐based methods.
BART indicates Bayesian additive regression trees; gbm, gradient boosting machines; and RF, random forests.
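For readers unfamiliar with the metric in Figure 2, the sketch below shows how a k-fold cross-validated RMSE can be computed for any learner. It is an illustration only: the number of folds, the seed, and the `fit`/`predict` callables are assumptions, and the mean-only learner is a stand-in rather than one of the 4 tree-based methods.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def cv_rmse(X, y, fit, predict, k=5, seed=0):
    """Average held-out RMSE over k folds (k and seed are illustrative choices)."""
    if len(y) < k:
        raise ValueError("need at least k observations")
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in held_out])
        scores.append(rmse([y[i] for i in held_out], preds))
    return sum(scores) / k


# Stand-in learner: always predicts the training mean of the outcome.
def fit_mean(X, y):
    return sum(y) / len(y)


def predict_mean(model, X):
    return [model] * len(X)
```

Any model can be plugged in by supplying its own `fit` and `predict` pair, which keeps the fold logic identical across the methods being compared.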
Figure 3. Visualization of the variable selection procedures for stroke.
The blue lines are the threshold levels for the variable selection procedure described in Figure 1. The red line represents the cutoff determined by a more stringent rule. Variables passing the red cutoff are displayed as solid dots. Variables that exceed the blue lines but not the red line are represented as asterisks. We select variables with either an asterisk or a solid dot. Open dots correspond to variables that are not selected.
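The dot and asterisk coding in this caption can be illustrated with a small permutation-threshold sketch. The exact rules are not spelled out in this text, so the following is an assumption: the blue lines are taken to be per-variable quantiles of variable inclusion proportions (VIPs) from models fit to permuted outcomes, and the red line is taken to be the same quantile of the per-permutation maximum VIP. The function names, input layout, and `alpha` are hypothetical.

```python
import math


def quantile(values, q):
    """Empirical quantile: the ceil(q * n)-th smallest value."""
    ordered = sorted(values)
    pos = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[pos]


def classify(vip_observed, vip_null, alpha=0.05):
    """Label each variable as in the caption.

    vip_observed: {name: VIP} from the model fit to the real outcome.
    vip_null: list of {name: VIP} dicts, one per permuted-outcome fit.
    """
    names = list(vip_observed)
    # Assumed "blue" thresholds: one quantile per variable.
    local = {n: quantile([run[n] for run in vip_null], 1 - alpha) for n in names}
    # Assumed "red" cutoff: quantile of the largest VIP in each permutation.
    global_cut = quantile([max(run.values()) for run in vip_null], 1 - alpha)

    labels = {}
    for n in names:
        if vip_observed[n] > global_cut:
            labels[n] = "solid dot"
        elif vip_observed[n] > local[n]:
            labels[n] = "asterisk"
        else:
            labels[n] = "open dot"
    selected = [n for n in names if labels[n] != "open dot"]
    return labels, selected
```

Because the maximum over variables is at least as large as any single variable's VIP, the assumed red cutoff can never fall below a blue threshold, which matches the caption's description of it as the more stringent rule.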
Figure 4. The top 10 average interaction counts (termed relative importance) for the neighborhood‐level prevalence of stroke, averaged over 25 BART model constructions.
The segments atop the bars represent 95% confidence intervals. BART indicates Bayesian additive regression trees.
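As a rough illustration of how a bar height and its interval could be derived from repeated model constructions, the sketch below averages interaction counts across runs and attaches a normal-approximation 95% interval. The interval formula is an assumption; the caption does not state how the intervals were computed.

```python
import math
from statistics import mean, stdev


def mean_with_ci(counts, z=1.96):
    """Mean and normal-approximation 95% CI across repeated model fits.

    counts: interaction counts for one variable pair, one value per fit
            (for example, 25 values for 25 model constructions).
    """
    m = mean(counts)
    half_width = z * stdev(counts) / math.sqrt(len(counts))
    return m, m - half_width, m + half_width
```

With only 25 replicates, a t-based multiplier would give a slightly wider interval than the 1.96 used here.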
RMSE Reduction, Number of Selected Predictors, and Selected Predictors by Each of 4 Tree‐Based Methods
| Methods | RMSE | RMSE Reduction per Predictor | Number of Predictors | Selected Predictors |
|---|---|---|---|---|
| BART | 0.48 | 0.15 | 6 | NO_PA, NON_HIS_BLACK, AGE65_OVER, MED_INCOME, OZONE, NO_PA×AGE65_OVER |
| XGBoost | 0.48 | 0.11 | 8 | NO_PA, NON_HIS_BLACK, AGE65_OVER, MED_INCOME, OBESITY, SMOKING, INSUF_SLEEP, NO_PA×AGE65_OVER |
| gbm | 0.49 | 0.11 | 8 | NO_PA, NON_HIS_BLACK, AGE65_OVER, MED_INCOME, OBESITY, SMOKING, AGE18_34, NO_PA×AGE65_OVER |
| RFs | 0.47 | 0.09 | 11 | NO_PA, OBESITY, AGE65_OVER, NON_HIS_BLACK, DENTAL, INSUF_SLEEP, SMOKING, MED_INCOME, COLON_SCREEN, LACK_INSURANCE |
The Pearson correlation was −0.9 between DENTAL and LACK_INSURANCE (selected by RFs), 0.75 between SMOKING and INSUF_SLEEP (selected by XGBoost), and 0.84 between OBESITY and SMOKING (selected by gbm). LR‐StepWise retained 20 out of 24 predictors. Neither of the two LR methods could identify interactions. BART indicates Bayesian additive regression trees; gbm, gradient boosting machines; LR, linear regression; RFs, random forests; RMSE, root mean squared error; and XGBoost, Extreme Gradient Boosting.
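The correlations quoted in this note are a check for redundancy among selected predictors. A minimal sketch of that check follows; the 0.7 cutoff and the `columns` layout (predictor name mapped to a list of tract-level values) are assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged
```

Running `correlated_pairs` on the predictors chosen by each method gives a quick view of how many near-duplicate variables a selection contains.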
Figure 5. RMSE, number of predictors, and RMSE reduction per predictor for each of 6 methods: BART‐Machine, XGBoost, gbm, RF, stepwise LR variable selection, and LR with all covariates.
BART indicates Bayesian additive regression trees; gbm, gradient boosting machines; LR, linear regression; RFs, random forests; RMSE, root mean squared error; and XGBoost, Extreme Gradient Boosting.
Figure 6. Effect estimates and 95% credible intervals for 5 key predictors and 1 most important interaction.
Effect estimates represent average changes in percent of stroke per 10% increase in AGE65_OVER, NO_PA, or NON_HIS_BLACK; per $100 000 increase in MED_INCOME; and per 100 ppb increase in OZONE (scaled for legibility of the effect). AGE65_OVER indicates proportion of people who were ≥65 years old; MED_INCOME, median household income in the past 12 months (in thousands); NO_PA, proportion of adults who had no leisure‐time physical activity; NON_HIS_BLACK, proportion of non‐Hispanic Black people; and OZONE, ozone level in air (ppb).
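The reporting scales in this caption amount to multiplying a per-unit regression coefficient by a step size. The sketch below shows that rescaling, together with how a product term makes the effect of one predictor depend on the other in a linear model. The helper names are hypothetical and no coefficient values from the study are used.

```python
# Step sizes in each variable's native unit, taken from the caption:
# percentages move by 10, MED_INCOME is in thousands of dollars so
# $100 000 is 100 units, and OZONE is in ppb.
STEP = {
    "AGE65_OVER": 10,
    "NO_PA": 10,
    "NON_HIS_BLACK": 10,
    "MED_INCOME": 100,
    "OZONE": 100,
}


def rescale_effect(beta_per_unit, units_per_step):
    """Convert a per-unit coefficient into a per-step effect."""
    return beta_per_unit * units_per_step


def marginal_effect(beta_main, beta_interaction, other_value):
    """Per-unit effect of one predictor in a model with a product term.

    For y = ... + b1*NO_PA + b2*AGE65_OVER + b3*NO_PA*AGE65_OVER, the
    per-unit effect of NO_PA at a given AGE65_OVER is b1 + b3 * AGE65_OVER.
    """
    return beta_main + beta_interaction * other_value
```

Interval endpoints rescale the same way as the point estimate, since multiplying by a positive constant preserves their order.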