In-Soo Kim1,2, Pil-Sung Yang3, Eunsun Jang1, Hyunjean Jung1, Seng Chan You4, Hee Tae Yu1, Tae-Hoon Kim1, Jae-Sun Uhm1, Hui-Nam Pak1, Moon-Hyoung Lee1, Jong-Youn Kim5, Boyoung Joung6.
Abstract
The clinical impact of fine particulate matter (PM2.5) air pollution on incident atrial fibrillation (AF) has not been well studied. We used integrated machine learning (ML) to build several incident AF prediction models that include average hourly measurements of PM2.5 for 432,587 subjects from the Korean general population. We compared these models using the c-index, net reclassification improvement index (NRI), and integrated discrimination improvement index (IDI). ML using the boosted ensemble method exhibited a higher c-index (0.845 [0.837-0.853]) than existing traditional regression models using CHA2DS2-VASc (0.654 [0.646-0.661]), CHADS2 (0.652 [0.646-0.657]), or HATCH (0.669 [0.661-0.676]) scores (each p < 0.001) for predicting incident AF. Because feature selection algorithms identified PM2.5 as a highly important variable, we added PM2.5 to the prediction of incident AF and constructed scoring systems. Prediction performance increased significantly compared with models without PM2.5 (c-indices: boosted ensemble ML, 0.954 [0.949-0.959]; PM-CHA2DS2-VASc, 0.859 [0.848-0.870]; PM-CHADS2, 0.823 [0.810-0.836]; PM-HATCH score, 0.849 [0.837-0.860]; each interaction, p < 0.001; NRI and IDI were also positive). ML combining readily available clinical variables and PM2.5 data predicted incident AF better than models without PM2.5, and better than established risk prediction approaches, in a general population exposed to high air pollution levels.
Year: 2020 PMID: 33004983 PMCID: PMC7530980 DOI: 10.1038/s41598-020-73537-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Study cohort and included subjects from the National Health Insurance Service National Sample Cohort (NHIS-NSC; overall general population). We randomly divided the population into discovery (n = 302,811) and validation (n = 129,776) cohorts. Model construction and training were performed in the discovery cohort; model fitting and performance measurement were performed in the validation cohort. AF, atrial fibrillation.
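The random division described in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' procedure: the 70% discovery fraction is inferred from the reported cohort sizes, and the function name, the seed, and the use of integer subject IDs are assumptions.

```python
import random

def split_cohort(subject_ids, discovery_fraction=0.7, seed=0):
    """Randomly partition subject IDs into discovery and validation lists.

    discovery_fraction and seed are illustrative assumptions; the source
    reports only the resulting cohort sizes.
    """
    ids = list(subject_ids)
    rng = random.Random(seed)  # local generator, so the split is reproducible
    rng.shuffle(ids)
    n_discovery = round(len(ids) * discovery_fraction)
    return ids[:n_discovery], ids[n_discovery:]

# Hypothetical integer IDs standing in for the 432,587 subjects.
discovery, validation = split_cohort(range(432_587))
print(len(discovery), len(validation))  # 302811 129776
```

With a 0.7 fraction the split sizes match the discovery and validation counts given in the caption.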
Baseline characteristics of the study population (n = 432,587).
| Variables | Discovery: No AF (n = 300,367) | Discovery: Incident AF (n = 2,444) | Validation: No AF (n = 126,395) | Validation: Incident AF (n = 3,381) |
|---|---|---|---|---|
| Age (years) | 48.3 ± 14.1 | 63.1 ± 13.4 | 45.7 ± 14.5 | 62.8 ± 12.4 |
| 65–74 (%) | (11.3) | (30.3) | (7.5) | (35.2) |
| ≥ 75 (%) | (3.2) | (19.6) | (3.4) | (15.0) |
| Male (%) | (50.5) | (53.2) | (46.9) | (57.6) |
| BMI (kg/m2) | 23.7 ± 3.2 | 24.2 ± 3.4 | 23.6 ± 3.5 | 24.3 ± 3.4 |
| Obesity (BMI ≥ 27.5 kg/m2) (%) | (11.8) | (15.6) | (12.8) | (16.1) |
| Smoking (pack-years) | 6.1 ± 11.8 | 8.1 ± 15.3 | 5.8 ± 11.4 | 8.8 ± 15.8 |
| Non-smoker (%) | (62.6) | (63.4) | (62.0) | (61.8) |
| < 20 pyrs (%) | (24.3) | (17.6) | (25.5) | (17.6) |
| ≥ 20 pyrs (%) | (13.1) | (19.0) | (12.5) | (20.6) |
| Alcohol intake (g/week) | 61.2 ± 13.7 | 45.9 ± 12.7 | 63.9 ± 14.4 | 55.8 ± 15.1 |
| ≥ 220.5 g/week (%) | (7.8) | (6.3) | (8.3) | (7.7) |
| | (60.9) | (62.6) | (59.3) | (62.2) |
| | (22.0) | (72.2) | (19.4) | (68.1) |
| SBP (mmHg) | 122.4 ± 15.2 | 126.5 ± 16.5 | 121.4 ± 15.4 | 127.1 ± 16.3 |
| DBP (mmHg) | 76.2 ± 10.1 | 77.4 ± 10.9 | 75.6 ± 10.3 | 78.0 ± 10.4 |
| | (6.1) | (19.1) | (6.2) | (17.8) |
| Fasting blood glucose (mg/dL) | 97.6 ± 24.0 | 105.7 ± 32.8 | 97.8 ± 25.6 | 103.8 ± 29.6 |
| | (6.3) | (18.8) | (4.0) | (20.6) |
| eGFR (mL/min) | 88.0 ± 21.3 | 77.2 ± 21.7 | 93.4 ± 19.9 | 75.5 ± 21.2 |
| | (18.9) | (54.5) | (19.7) | (46.5) |
| Total cholesterol (mg/dL) | 195.4 ± 37.0 | 185.6 ± 38.8 | 193.3 ± 37.2 | 190.3 ± 39.1 |
| Triglyceride (mg/dL) | 131.8 ± 90.2 | 131.4 ± 78.3 | 124.9 ± 84.3 | 138.4 ± 84.9 |
| HDL cholesterol (mg/dL) | 56.1 ± 25.7 | 53.1 ± 24.6 | 56.3 ± 17.9 | 55.2 ± 38.2 |
| LDL cholesterol (mg/dL) | 114.2 ± 38.0 | 107.3 ± 37.0 | 112.4 ± 34.3 | 109.8 ± 37.5 |
| Previous MI (%) | (0.9) | (9.9) | (0.9) | (7.7) |
| Peripheral vascular disease (%) | (3.1) | (19.4) | (3.2) | (15.1) |
| Heart failure (%) | (2.1) | (27.5) | (2.1) | (23.0) |
| Previous stroke/TIA (%) | (3.6) | (20.6) | (3.6) | (17.3) |
| COPD (%) | (2.2) | (10.7) | (2.3) | (9.8) |
| Hemoglobin (g/dL) | 13.9 ± 1.6 | 13.7 ± 1.8 | 13.9 ± 1.7 | 13.8 ± 1.7 |
| Aspartate transaminase (IU/L) | 25.3 ± 16.3 | 27.1 ± 15.5 | 24.9 ± 17.1 | 27.8 ± 17.8 |
| Alanine transaminase (IU/L) | 25.0 ± 21.7 | 24.4 ± 20.2 | 24.1 ± 22.4 | 25.5 ± 18.9 |
| Gamma-glutamyl transferase (U/L) | 36.0 ± 49.3 | 45.7 ± 72.9 | 36.0 ± 53.9 | 46.4 ± 67.9 |
| Antiplatelet agent (%) | (9.7) | (49.2) | (8.7) | (44.5) |
| Beta-blocker (%) | (7.5) | (40.0) | (6.7) | (34.2) |
| Statin (%) | (8.1) | (29.9) | (8.3) | (23.3) |
| Average PM2.5 concentration (μg/m3) | 18.5 | 32.7 | 18.5 | 35.7 |
| Total follow-up (years) | 3.8 | 3.9 | 3.8 | 3.9 |
AF atrial fibrillation, BMI body mass index (kg/m2), CKD chronic kidney disease (eGFR lower than 60 mL/min estimated by serum creatinine using CKD-EPI formula)[37], COPD chronic obstructive pulmonary disease, DBP diastolic blood pressure, eGFR estimated glomerular filtration rate (mL/min), HDL high density lipoprotein, LDL low density lipoprotein, MI myocardial infarction, PM particulate matter < 2.5 μm in diameter, SBP systolic blood pressure, TIA transient ischemic attack.
*Socioeconomic status was divided into two groups: higher (≥ 51% of income level) and lower (< 51% of income level).
Performance of predictive models for incident AF risk during follow-up period in overall general population.
| Models | c-index (95% CI) | NRI (95% CI) | IDI (95% CI) |
|---|---|---|---|
| Clinical variables-adjusted (TR1, model 1)* | 0.643 (0.636–0.649) | Ref | Ref |
| TR1 plus PM2.5-adjusted (model 1)† | 0.819 (0.813–0.825) | 1.069 (1.038–1.103) | 0.302 (0.294–0.322) |
| Clinical variables-adjusted (TR2, model 2)* | 0.684 (0.675–0.693) | Ref | Ref |
| TR2 plus PM2.5-adjusted (model 2)† | 0.869 (0.862–0.876) | 1.087 (1.060–1.113) | 0.219 (0.209–0.228) |
| CHA2DS2-VASc score | 0.654 (0.646–0.661) | Ref | Ref |
| PM-CHA2DS2-VASc score‡ | 0.859 (0.848–0.870) | 1.078 (1.059–1.102) | 0.220 (0.208–0.233) |
| CHADS2 score | 0.652 (0.646–0.657) | Ref | Ref |
| PM-CHADS2 score‡ | 0.823 (0.810–0.836) | 0.981 (0.962–1.001) | 0.042 (0.029–0.054) |
| HATCH score | 0.669 (0.661–0.676) | Ref | Ref |
| PM-HATCH score‡ | 0.849 (0.837–0.860) | 1.004 (0.983–1.024) | 0.053 (0.042–0.064) |
| Support vector machine | |||
| Clinical variables-adjusted* | 0.766 (0.757–0.775) | Ref | Ref |
| Clinical variables plus PM2.5-adjusted† | 0.903 (0.895–0.910) | 1.061 (1.038–1.083) | 0.270 (0.260–0.281) |
| Decision tree | |||
| Clinical variables-adjusted* | 0.801 (0.787–0.815) | Ref | Ref |
| Clinical variables plus PM2.5-adjusted† | 0.931 (0.925–0.937) | 1.054 (1.027–1.081) | 0.265 (0.256–0.275) |
| Random forest | |||
| Clinical variables-adjusted* | 0.838 (0.830–0.846) | Ref | Ref |
| Clinical variables plus PM2.5-adjusted† | 0.939 (0.933–0.945) | 1.027 (1.006–1.050) | 0.242 (0.232–0.253) |
| Naïve Bayes | |||
| Clinical variables-adjusted* | 0.833 (0.825–0.841) | Ref | Ref |
| Clinical variables plus PM2.5-adjusted† | 0.894 (0.888–0.900) | 0.987 (0.959–1.014) | 0.152 (0.142–0.162) |
| Deep neural network | |||
| Clinical variables-adjusted* | 0.813 (0.800–0.826) | Ref | Ref |
| Clinical variables plus PM2.5-adjusted† | 0.849 (0.834–0.865) | 0.792 (0.745–0.837) | 0.088 (0.074–0.101) |
| Extreme gradient boosting | |||
| Clinical variables-adjusted* | 0.845 (0.837–0.853) | Ref | Ref |
| Clinical variables plus PM2.5-adjusted† | 0.954 (0.949–0.959) | 1.277 (1.218–1.334) | 0.461 (0.438–0.485) |
AF atrial fibrillation, BMI body mass index, CI confidence interval, DBP diastolic blood pressure, eGFR estimated glomerular filtration rate, HF heart failure, IDI integrated discrimination improvement index, NRI category-free net reclassification improvement index, PM particulate matter < 2.5 μm in diameter, SBP systolic blood pressure, TR traditional regression analysis.
*TR1 (model 1), c-index adjusted for six clinical variables (age, sex, BMI, SBP, previous HF history, and serum eGFR); TR2 (model 2), c-index adjusted for 12 clinical variables (age, sex, BMI, SBP, diabetes, previous HF history, previous stroke/TIA history, previous myocardial infarction history, serum eGFR, serum total cholesterol, smoking history, and alcohol intake habit). DBP was not included in these models because of multicollinearity with SBP.
†c-index adjusted for the six clinical variables (*) plus PM2.5.
‡One point was assigned if PM2.5 ≥ 15 μg/m3, based on the Korean National Ambient Air Quality Standards, and this PM point was added to the established CHADS2, CHA2DS2-VASc, and HATCH scores to form the PM-CHADS2, PM-CHA2DS2-VASc, and PM-HATCH scores[25].
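The PM-augmented scores described in the last footnote amount to adding one point to an existing score when PM2.5 is at or above 15 μg/m3. A minimal sketch follows; the function and constant names are hypothetical, and the base score is assumed to have been computed elsewhere.

```python
PM25_THRESHOLD = 15.0  # μg/m3, the cut-off stated in the footnote

def pm_augmented_score(base_score, pm25):
    """Add one PM point to a CHADS2, CHA2DS2-VASc, or HATCH score.

    base_score: the established score, computed elsewhere (assumed input).
    pm25: average PM2.5 concentration in μg/m3.
    """
    pm_point = 1 if pm25 >= PM25_THRESHOLD else 0
    return base_score + pm_point

# Example with the group averages from the baseline table as inputs.
print(pm_augmented_score(2, 32.7))  # 3
print(pm_augmented_score(2, 14.9))  # 2
```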
Association of PM2.5 with the incidence of AF in overall general population.
| Variables | Crude*, HR (95% CI) | Adjusted for clinical variables† including CHA2DS2-VASc score components plus PM2.5, HR (95% CI) | Adjusted for clinical variables† including CHADS2 score components plus PM2.5, HR (95% CI) | Adjusted for clinical variables† including HATCH score components plus PM2.5, HR (95% CI) |
|---|---|---|---|---|
| PM2.5 (≥ 15 μg/m3)‡ | 1.439 (1.231–1.623)‖ | 1.248 (1.103–1.384)‖ | 1.186 (1.122–1.251)‖ | 1.329 (1.189–1.466)‖ |
| Age ≥ 75 years§ | 6.548 (5.995–7.152)‖ | | 2.167 (1.915–2.444)‖ | 2.115 (1.873–2.381)‖ |
| Age 65–74; ≥ 75 years§ | 5.670 (5.262–6.109)‖ | 3.402 (3.127–3.701)‖ | | |
| Male sex | 1.352 (1.263–1.447)‖ | 1.634 (1.526–1.750)‖ | | |
| Heart failure | 6.775 (6.124–7.494)‖ | 2.013 (1.806–2.243)‖ | 2.102 (1.885–2.343)‖ | 2.612 (2.403–2.846)‖ |
| Hypertension | 5.196 (4.855–5.560)‖ | 1.743 (1.524–1.980)‖ | 1.931 (1.638–2.247)‖ | 2.074 (1.778–2.392)‖ |
| Diabetes | 3.178 (2.908–3.473)‖ | 1.310 (1.202–1.430)‖ | 1.364 (1.241–1.498)‖ | |
| Stroke/TIA | 4.368 (3.965–4.812)‖ | 2.282 (2.156–2.422)‖ | 2.495 (2.348–2.659)‖ | 2.503 (2.355–2.667)‖ |
| Vascular disease | 3.491 (3.081–3.955)‖ | 1.285 (1.142–1.449)‖ | | |
| COPD | 4.445 (3.954–4.997)‖ | | | 1.780 (1.574–2.014)‖ |
AF atrial fibrillation, CI confidence interval, COPD chronic obstructive pulmonary disease, HR hazard ratio, PM particulate matter < 2.5 μm in diameter, TIA transient ischemic attack.
*Unadjusted Cox proportional hazards model.
†Cox proportional hazards model adjusted for clinical variables. The clinical variables were the remaining CHA2DS2-VASc, CHADS2, and HATCH components: age, male sex, heart failure, hypertension, diabetes, stroke/transient ischemic attack, vascular disease (previous myocardial infarction or peripheral vascular disease), and chronic obstructive pulmonary disease.
‡In these Cox proportional hazards models, PM2.5 was analyzed as a categorical variable, dividing subjects into subgroups with PM2.5 ≥ 15 μg/m3 or < 15 μg/m3 based on the Korean National Ambient Air Quality Standards[25].
§Age was analyzed as a binary or three-category variable, dividing subjects into subgroups of age ≥ 75 or < 75 years (binary), or age ≥ 75, 65–74, or < 65 years (three categories).
‖p-value < 0.001.
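In a Cox proportional hazards model, each hazard ratio is the exponential of a fitted coefficient, HR = exp(β). The sketch below converts between a reported HR with its 95% CI and the underlying coefficient and standard error. It assumes a Wald interval that is symmetric on the log scale, which the source does not state; the function names are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def coef_from_hr(hr, lower, upper, z=Z_95):
    """Recover the log-hazard coefficient and its standard error.

    Assumes the interval was built as exp(beta ± z * se).
    """
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

def hr_from_coef(beta, se, z=Z_95):
    """Map a coefficient and standard error back to (HR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# PM2.5 hazard ratio from the CHADS2-adjusted column of the table above.
beta, se = coef_from_hr(1.186, 1.122, 1.251)
hr, lower, upper = hr_from_coef(beta, se)
print(f"beta={beta:.3f} se={se:.3f} -> HR {hr:.3f} ({lower:.3f}-{upper:.3f})")
```

The round trip reproduces the reported bounds only approximately, because published intervals are rounded and need not be exactly symmetric on the log scale.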
Figure 2. Kaplan–Meier curves for risk categories according to PM-CHA2DS2-VASc, PM-CHADS2, and PM-HATCH scores. Patients were divided into three groups: low (0–1 points), intermediate (2–3 points), and high risk (≥ 4 points). One point was assigned if PM2.5 ≥ 15 μg/m3, based on the Korean National Ambient Air Quality Standards.
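The three risk groups used in Figure 2 map directly onto score ranges. A small sketch of that mapping follows; the function name and the returned labels are illustrative choices.

```python
def risk_category(pm_score):
    """Map a PM-augmented score to the risk group used for stratification.

    Cut-offs follow the figure caption: 0-1 low, 2-3 intermediate, >= 4 high.
    """
    if pm_score < 0:
        raise ValueError("score must be non-negative")
    if pm_score <= 1:
        return "low"
    if pm_score <= 3:
        return "intermediate"
    return "high"

for score in (0, 2, 5):
    print(score, risk_category(score))
```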
Performance of predictive models for incident AF risk during follow-up period in overall general population (age, sex, and BMI-adjusted models).
| Predictive models* | c-index (95% CI) | NRI (95% CI) | IDI (95% CI) |
|---|---|---|---|
| Traditional regression analysis | 0.604 (0.598–0.611) | Ref | Ref |
| Support vector machine | 0.699 (0.688–0.710) | 0.280 (0.220–0.340) | 0.002 (0.001–0.003) |
| Decision tree | 0.786 (0.771–0.800) | 0.806 (0.747–0.866) | 0.010 (0.009–0.011) |
| Random forest | 0.787 (0.772–0.801) | 0.764 (0.701–0.827) | 0.006 (0.005–0.007) |
| Naïve Bayes | 0.790 (0.776–0.805) | 0.792 (0.732–0.853) | 0.009 (0.008–0.010) |
| Deep neural network | 0.779 (0.768–0.790) | 0.218 (0.182–0.253) | 0.002 (0.001–0.003) |
| Extreme gradient boosting | 0.794 (0.780–0.807) | 0.536 (0.484–0.589) | 0.005 (0.004–0.006) |
AF atrial fibrillation, BMI body mass index, CI confidence interval, IDI integrated discrimination improvement index, NRI category-free net reclassification improvement index.
*Age, sex, and BMI were used for constructing these predictive models (age, sex, and BMI were adjusted for traditional regression analysis, and these variables were used as input variables for training the listed machine learning models).
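The NRI and IDI columns compare each model against a reference model. The following sketch computes the category-free NRI and the IDI for a binary outcome from two sets of predicted risks. It is a simplified illustration that ignores follow-up time and censoring, so it may differ from the estimator the authors used; all names are hypothetical.

```python
def nri_idi(y, p_old, p_new):
    """Category-free NRI and IDI for binary outcomes (1 = event, 0 = non-event).

    p_old, p_new: predicted risks from the reference and the new model.
    """
    events = [i for i, yi in enumerate(y) if yi == 1]
    nonevents = [i for i, yi in enumerate(y) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")

    def net_up(idx):
        # Proportion moved up minus proportion moved down.
        up = sum(p_new[i] > p_old[i] for i in idx)
        down = sum(p_new[i] < p_old[i] for i in idx)
        return (up - down) / len(idx)

    def mean_gain(idx):
        # Average change in predicted risk.
        return sum(p_new[i] - p_old[i] for i in idx) / len(idx)

    nri = net_up(events) - net_up(nonevents)
    idi = mean_gain(events) - mean_gain(nonevents)
    return nri, idi

# Toy data, not taken from the study.
y = [1, 1, 0, 0]
p_old = [0.4, 0.6, 0.5, 0.3]
p_new = [0.7, 0.8, 0.2, 0.4]
nri, idi = nri_idi(y, p_old, p_new)
print(round(nri, 3), round(idi, 3))  # 1.0 0.35
```

Because the category-free NRI sums an event term and a non-event term that each lie between -1 and 1, it ranges from -2 to 2, so values above 1 are possible.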
Ranking of the 10 most important variables (among 27 clinical variables) identified by each algorithm used to predict incident AF.
| Ranking of variables | Traditional regression analysis | Support vector machine with linear kernel | Decision tree | Random forest | Extreme gradient boosting |
|---|---|---|---|---|---|
| 1 | Heart failure | Heart failure | Age | Serum eGFR | Heart failure |
| 2 | Systolic blood pressure | Systolic blood pressure | Serum eGFR | Systolic blood pressure | Systolic blood pressure |
| 3 | Age | Age | Heart failure | Age | Age |
| 4 | Previous ischemic stroke/TIA | Previous ischemic stroke/TIA | Systolic blood pressure | Heart failure | PM2.5 |
| 5 | PM2.5 | PM2.5 | Previous ischemic stroke/TIA | PM2.5 | Serum triglyceride |
| 6 | Serum eGFR | Serum eGFR | PM2.5 | Sex | Serum total cholesterol |
| 7 | Serum triglyceride | Previous MI | Serum triglyceride | BMI | Serum HDL cholesterol |
| 8 | Sex | Sex | Sex | Smoking history | BMI |
| 9 | Smoking history | Smoking history | Smoking history | Fasting blood glucose | Serum eGFR |
| 10 | Serum total cholesterol | BMI | BMI | Previous ischemic stroke/TIA | Serum LDL cholesterol |
AF atrial fibrillation, BMI body mass index, CI confidence interval, eGFR estimated glomerular filtration rate (mL/min), HDL high density lipoprotein, HR hazard ratio, LDL low density lipoprotein, MI myocardial infarction, PM particulate matter < 2.5 μm in diameter, TIA transient ischemic attack.
Figure 3. Comparison of models for predicting incident atrial fibrillation based on the c-index. DNN deep neural network model, DT decision tree model, NB naïve Bayes model, PM particulate matter < 2.5 μm in diameter, RF random forest model, SVM support vector machine, TR1 traditional regression analysis model using six clinical variables as input variables (the same adjustment variables as TR1 (model 1) in Table 2), TR2 traditional regression analysis model using 12 clinical variables (the same adjustment variables as TR2 (model 2) in Table 2), XGBM extreme gradient boosting model.
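The c-index used for the comparison in Figure 3 can be illustrated for a binary outcome as the fraction of (event, non-event) pairs in which the event subject receives the higher predicted risk, with ties counted as one half. The sketch below is a simplification: it ignores follow-up time and censoring, which a time-to-event c-index would account for, and its quadratic pairwise loop is only practical for small samples. The function name is hypothetical.

```python
def c_index(y, scores):
    """Concordance for binary outcomes (1 = event, 0 = non-event)."""
    concordant = 0.0
    pairs = 0
    for yi, si in zip(y, scores):
        if yi != 1:
            continue
        for yj, sj in zip(y, scores):
            if yj != 0:
                continue
            pairs += 1
            if si > sj:
                concordant += 1.0   # event ranked above non-event
            elif si == sj:
                concordant += 0.5   # tie counts as half
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return concordant / pairs

# Toy data, not taken from the study.
print(c_index([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of events from non-events.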