Jaehyuk Heo, Sang Jun Park, Si-Hyuck Kang, Chang Wan Oh, Jae Seung Bang, Tackeun Kim.
Abstract
An efficient method for identifying subjects at high risk of an intracranial aneurysm (IA) is needed to inform radiological screening guidelines and to allocate medical resources effectively. We developed a model for pre-diagnosis IA prediction using a national claims database and health examination records. Data from the National Health Screening Program in Korea were used as input for several machine learning algorithms: logistic regression (LR), random forest (RF), scalable tree boosting system (XGB), and deep neural networks (DNN). Algorithm performance was evaluated by the area under the receiver operating characteristic curve (AUROC) on test data separate from the data used for model training. Subjects were classified into five risk groups in ascending order of the model's predicted probability, and incidence rate ratios between the lowest- and highest-risk groups were then compared. The XGB model produced the best IA risk prediction (AUROC of 0.765) and predicted the lowest IA incidence (3.20 per 100,000 person-years) in the lowest-risk group, whereas the RF model predicted the highest IA incidence (161.34 per 100,000 person-years) in the highest-risk group. The incidence rate ratios between the lowest- and highest-risk groups were 49.85, 35.85, 34.90, and 30.26 for the XGB, LR, DNN, and RF models, respectively. The developed prediction model can aid future IA screening strategies.
Year: 2020 PMID: 32332844 PMCID: PMC7181629 DOI: 10.1038/s41598-020-63906-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Flowchart of the data processing strategy. NHIS-NSC = National Health Insurance Service-National Sample Cohort; SAH = subarachnoid hemorrhage; UIA = unruptured intracranial aneurysm.
Factor distribution differences between training and test datasets.
| Factors | Training set (n = 299,088) | Test set (n = 128,181) | p-value |
|---|---|---|---|
| Age | 46.05 ± 13.92 | 46.04 ± 13.94 | 0.93 |
| Sex (Female) | 152,746 (51.1) | 65,677 (51.2) | 0.32 |
| BMI | 23.55 ± 3.28 | 23.55 ± 3.29 | 0.87 |
| Waist circumference | 79.41 ± 9.29 | 79.41 ± 9.31 | 0.98 |
| Hypertension | 39,341 (13.2) | 16,248 (13.0) | 0.23 |
| Systolic BP (mmHg) | 121.17 ± 14.66 | 121.18 ± 14.68 | 0.92 |
| Diastolic BP (mmHg) | 75.54 ± 9.91 | 75.56 ± 9.93 | 0.58 |
| DM | 13,686 (4.6) | 5,754 (4.5) | 0.21 |
| Glucose (mg/dl) | 95.1 ± 16.19 | 75.56 ± 9.93 | 0.61 |
| Total cholesterol (mg/dl) | 193.64 ± 35.83 | 193.67 ± 35.88 | 0.81 |
| HDL cholesterol (mg/dl) | 55.81 ± 13.79 | 55.85 ± 13.81 | 0.33 |
| LDL cholesterol (mg/dl) | 113.21 ± 32.79 | 113.24 ± 32.85 | 0.79 |
| Triglyceride (mg/dl) | 122.4 ± 72.9 | 122.28 ± 72.61 | 0.62 |
| Hemoglobin (g/dl) | 13.87 ± 1.61 | 13.86 ± 1.61 | 0.52 |
| Creatinine (mg/dl) | 0.89 ± 0.22 | 0.89 ± 0.22 | 0.67 |
| AST (IU/L) | 23.75 ± 8.74 | 23.74 ± 8.72 | 0.75 |
| ALT (IU/L) | 22.96 ± 14.09 | 22.97 ± 14.11 | 0.73 |
| GGT (IU/L) | 31.36 ± 28.9 | 31.31 ± 28.85 | 0.59 |
| Smoking | | | |
| Never | 187,333 (62.6) | 80,371 (62.7) | 0.68 |
| Ex | 37,987 (12.7) | 16,248 (12.7) | 0.83 |
| Current | 73,768 (24.7) | 31,562 (24.6) | 0.78 |
| Familial history of stroke | 16,590 (5.5) | 7,098 (5.5) | 0.91 |
| Familial history of heart disease | 10,320 (3.5) | 4,351 (3.4) | 0.36 |
| Familial history of hypertension | 34,574 (11.6) | 14,982 (11.7) | 0.23 |
| Familial history of diabetes | 27,424 (9.2) | 11,805 (9.2) | 0.68 |
Continuous variables are presented as mean ± standard deviation. Categorical variables are represented as numbers (percentages). BMI = body mass index; BP = blood pressure; DM = diabetes mellitus; AST = aspartate aminotransferase; ALT = alanine transaminase; GGT = gamma-glutamyl transferase.
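The source does not state which test produced the p-values for the continuous variables. As a rough illustration only, the sketch below assumes a Welch-type comparison of means computed from the reported summary statistics, with a normal approximation that is reasonable only because both groups are very large. The function name `welch_p_from_summary` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def welch_p_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided p-value for a difference in means from summary statistics.

    Assumption: a normal approximation to Welch's t-test, acceptable here
    only because each group contains more than 100,000 subjects.
    """
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Group sizes and the Age row as printed in the table.
N_TRAIN, N_TEST = 299_088, 128_181
p_age = welch_p_from_summary(46.05, 13.92, N_TRAIN, 46.04, 13.94, N_TEST)
print(f"Age: p = {p_age:.2f}")
```

Because the table reports means rounded to two decimals, a p-value recomputed this way will not reproduce the published one exactly; it only shows that a difference of this size is far from significant.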
Figure 2. Summary of the grid search process for optimizing parameters. Each line shows a combination of parameters used in the grid search. The thick red line indicates the optimal combination of parameters, which achieved the highest AUROC (area under the receiver operating characteristic curve).
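A grid search of the kind summarized in Figure 2 can be sketched as an exhaustive loop over parameter combinations that keeps the one with the highest score. The parameter names, candidate values, and the `toy_score` function below are hypothetical placeholders and are not taken from the paper; in practice the score function would train a model and return its validation AUROC.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one.

    param_grid: dict mapping parameter name -> list of candidate values.
    score_fn:   callable taking a dict of parameters and returning a score
                (standing in for validation AUROC).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: these names and values are NOT from the paper.
example_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]}


def toy_score(params):
    # Placeholder for "train the model with params, return validation AUROC".
    return (0.75
            - 0.001 * abs(params["max_depth"] - 5)
            - 0.1 * abs(params["learning_rate"] - 0.1))


best_params, best_score = grid_search(example_grid, toy_score)
print(best_params, best_score)
```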
Model performance indicators.
| Model | AUROC | 95% confidence interval | Sensitivity | Specificity | p-value vs. LR | p-value vs. RF | p-value vs. DNN |
|---|---|---|---|---|---|---|---|
| XGB | 0.765 | 0.742–0.788 | 0.805 | 0.613 | 0.485 | 0.049 | 0.010 |
| LR | 0.762 | 0.739–0.784 | 0.788 | 0.621 | — | 0.487 | 0.021 |
| RF | 0.757 | 0.733–0.779 | 0.815 | 0.591 | | — | 0.197 |
| DNN | 0.748 | 0.724–0.770 | 0.853 | 0.571 | | | — |
AUROC = area under the receiver operating characteristic curve; LR = logistic regression; RF = random forest; DNN = deep neural networks; XGB = scalable tree boosting system.
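To make the AUROC and confidence-interval columns concrete, the sketch below computes AUROC with the rank-sum (Mann-Whitney) formulation and a percentile bootstrap interval. The source does not say how its intervals were obtained, so the bootstrap is an assumption (DeLong's method is another common choice), and the data are synthetic.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum formulation; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUROC (an assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[k] for k in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[k] for k in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Synthetic example data, not from the study.
demo_labels = [0, 0, 0, 0, 1, 1, 1, 1]
demo_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
demo_auc = auroc(demo_labels, demo_scores)
ci_low, ci_high = bootstrap_ci(demo_labels, demo_scores, n_boot=500)
print(demo_auc, ci_low, ci_high)
```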
Figure 3. IA incidence according to each risk group. IA incidence per 100,000 person-years by quintile risk group. IA = intracranial aneurysm; LR = logistic regression; RF = random forest; XGB = scalable tree boosting system; DNN = deep neural networks.
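The quantity plotted in Figure 3 can be sketched as follows: subjects are sorted by predicted probability, split into five equal-sized groups, and the incidence in each group is expressed per 100,000 person-years; the incidence rate ratio then compares the highest-risk group with the lowest. Equal-sized groups and the crude (unadjusted) ratio are assumptions, and the data below are synthetic.

```python
def incidence_by_risk_group(probs, events, person_years, n_groups=5):
    """Incidence per 100,000 person-years in groups of ascending predicted risk."""
    order = sorted(range(len(probs)), key=lambda k: probs[k])
    n = len(order)
    rates = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        cases = sum(events[k] for k in members)
        follow_up = sum(person_years[k] for k in members)
        rates.append(1e5 * cases / follow_up)
    return rates


def incidence_rate_ratio(rates):
    """Crude ratio of the highest-risk group's incidence to the lowest-risk group's."""
    return rates[-1] / rates[0]


# Synthetic data: 10 subjects, each followed for 10 years (not from the study).
demo_probs = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
demo_events = [1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
demo_years = [10.0] * 10

demo_rates = incidence_by_risk_group(demo_probs, demo_events, demo_years)
demo_irr = incidence_rate_ratio(demo_rates)
print(demo_rates, demo_irr)
```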
Figure 4. Survival curves for IA incidence by risk group predicted by the scalable tree boosting system (XGB).
Figure 5. Distribution of variables according to IA prediction score by sex. R = Pearson correlation coefficient.
Figure 6. Trade-off line graph according to various cut-off probabilities calculated by the XGB model. The x-axis indicates probability values calculated from the model. The scale of the solid line is shown on the left axis, and the scale of the dotted line is shown on the right axis. The red vertical line indicates the optimal cut-off value maximizing sensitivity + specificity. (A) Trends of the number of predictions. The y-axis represents the number of subjects. (B) Trends of performance indicators. The y-axis represents the performance scores.
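The cut-off marked in Figure 6 maximizes sensitivity + specificity. A minimal sketch of that selection is shown below. It assumes a subject is predicted positive when the score is at or above the cut-off and scans only the observed scores as candidates; the helper names and data are hypothetical.

```python
def sens_spec(labels, scores, cutoff):
    """Sensitivity and specificity when predicting positive for score >= cutoff."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(labels, scores):
    """Return (cutoff, sensitivity, specificity) maximizing sensitivity + specificity.

    Ties are resolved in favor of the lowest cut-off.
    """
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sens_spec(labels, scores, cutoff)
        if best is None or sens + spec > best[1] + best[2]:
            best = (cutoff, sens, spec)
    return best


# Synthetic example data, not from the study.
cut_labels = [0, 0, 1, 1]
cut_scores = [0.1, 0.4, 0.35, 0.8]
chosen = best_cutoff(cut_labels, cut_scores)
print(chosen)
```

Lowering the cut-off raises sensitivity at the cost of specificity, which is the trade-off the two panels of the figure describe.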