Yonglai Zhang, Yaojian Zhou, Dongsong Zhang, Wenai Song.
Abstract
BACKGROUND: Stroke is one of the most common diseases that cause mortality. Detecting an individual's risk of stroke is critical yet challenging because stroke has a large number of risk factors.
Keywords: WRHFS; feature selection; machine learning; risk; stroke
Year: 2019 PMID: 30938684 PMCID: PMC6466481 DOI: 10.2196/12437
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Table 1. Classification of feature selection methods.
| Methods | Rationale | Limitations | Sample studies |
| Filter | Mutual information based | Signal objective function | [ |
| Filter | Ranking based | Neglecting the correlation between the features and class labels | [ |
| Filter | Weighting based | Lacking the uniform standards of selecting features | [ |
| Wrapper | Evaluating the accuracy of the classifier | Overfitting and high computational complexity | [ |
| Hybrid | Guiding the wrapper using a filter | Only for certain specific fields | [ |
| Semisupervised | Guiding by the labeled samples | Relying on small labeled samples | [ |
| Unsupervised | Clustering-based models | Relying on certain data distribution | [ |
Figure 1. Weighting- and ranking-based hybrid feature selection.
Table 2. 24 blood test items.
| Full name | Abbreviation | Unit | Type of data |
| α Hydroxybutyric dehydrogenase | α-HBD | IU/L | Integer |
| Gamma glutamyl transpeptidase | GGP | IU/L | Integer |
| Lactate dehydrogenase | LDH | mmol/L | Real |
| Low-density lipoprotein | LDL | mmol/L | Real |
| High-density lipoprotein | HDL | mmol/L | Real |
| Blood urea nitrogen | BUN | mmol/L | Real |
| Uric acid | UA | umol/L | Integer |
| Total cholesterol | TC | mmol/L | Real |
| Total bilirubin | TBIL | umol/L | Real |
| Total protein | TP | g/L | Integer |
| Triglyceride | TG | mmol/L | Real |
| Albumin | Alb | g/L | Integer |
| Direct bilirubin | DBIL | umol/L | Real |
| Alkaline phosphatase | ALP | IU/L | Integer |
| Serum phosphorus | PI | mmol/L | Real |
| Serum creatinine | SCr | umol/L | Integer |
| Creatine kinase | CK | IU/L | Integer |
| Creatine kinase isoenzyme | CK-MB | IU/L | Integer |
| Glucose | Glu | mmol/L | Real |
| Alanine aminotransferase | ALT | IU/L | Integer |
| Aspartate aminotransferase | AST | IU/L | Integer |
| Apolipoprotein A1 | Apo-A1 | g/L | Real |
| Apolipoprotein B | Apo-B | g/L | Real |
| Serum calcium | Ca | mmol/L | Real |
Table 3. Descriptive statistics of age and gender of patients in the dataset (N=792).
| Age (years) and gender | Statistics, n (%) | |
| Male | 105 (13.3) | |
| Female | 167 (21.1) | |
| Male | 151 (19.1) | |
| Female | 246 (31.1) | |
| Male | 76 (9.6) | |
| Female | 47 (5.9) | |
Table 4. Effectiveness coefficients of the filter feature selection methods.
| Method | Effectiveness coefficient |
| Information gain | 63 |
| Relief | 61 |
| Standard deviation | 52 |
| Pearson correlation coefficient | 49 |
| Fisher score | 46 |
| Chi-squared test | 40 |
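To make the filter idea concrete, here is a minimal Python sketch of the simplest method listed above, the standard-deviation score: each feature is scored on its own, without training a classifier. It assumes that features are min-max scaled before scoring so that measurement units do not dominate. That preprocessing step, the function names, and the toy values are illustrative assumptions, not details taken from the study.

```python
from statistics import pstdev


def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_by_std(columns):
    """Return (feature, score) pairs sorted by the population SD of the scaled column."""
    scores = {name: pstdev(min_max_scale(vals)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy measurements, not values from the study.
toy = {
    "CK": [40, 200, 60, 180, 50, 190],
    "Ca": [2.2, 2.3, 2.2, 2.3, 2.2, 2.9],
    "Glu": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
}
ranking = rank_by_std(toy)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A score of this kind ignores the class labels entirely, which is why a filter is cheap to compute but may rank a widely varying, uninformative feature highly.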
Table 5. Weighting of the 28 features based on standard deviation.
| Featureᵃ | Standard deviation | C | q | Accuracy (%) | Contribution | SD (0-1) | Contribution (0-1) | Weight |
| CK | 0.21 | 16 | 0.5 | 56.1 | —ᵇ | 1.00 | 1.00 | — |
| LDH | 0.21 | 64 | 0.5 | 57.1 | 1.00 | 0.99 | 0.30 | 1.00 |
| α-HBD | 0.19 | 128 | 8.0 | 58.6 | 1.52 | 0.91 | 0.37 | 1.52 |
| Height | 0.17 | 8 | 8.0 | 57.3 | −1.26 | 0.81 | 0.00 | −1.26 |
| ALP | 0.15 | 2 | 4.0 | 58.8 | 1.52 | 0.72 | 0.37 | 1.52 |
| UA | 0.10 | 1 | 8.0 | 58.6 | −0.25 | 0.48 | 0.13 | −0.25 |
| SCr | 0.09 | 16 | 4.0 | 61.9 | 3.28 | 0.41 | 0.60 | 3.28 |
| GGP | 0.08 | 16 | 2.0 | 61.6 | −0.25 | 0.40 | 0.13 | −0.25 |
| TP | 0.08 | 2 | 8.0 | 61.6 | 0.00 | 0.37 | 0.17 | 0.00 |
| AGE | 0.08 | 64 | 1.0 | 67.9 | 6.31 | 0.36 | 1.00 | 6.31 |
| ALT | 0.07 | 128 | 1.0 | 67.6 | −0.38 | 0.31 | 0.12 | −0.38 |
| AST | 0.06 | 64 | 1.0 | 67.9 | 0.38 | 0.27 | 0.22 | 0.38 |
| CK-MB | 0.05 | 128 | 0.2 | 69.2 | 1.26 | 0.23 | 0.33 | 1.26 |
| Alb | 0.05 | 128 | 0.1 | 69.6 | 0.38 | 0.22 | 0.22 | 0.38 |
| TBIL | 0.04 | 256 | 0.5 | 72.1 | 2.53 | 0.16 | 0.50 | 2.53 |
| BMI | 0.03 | 64 | 0.3 | 72.7 | 0.63 | 0.12 | 0.25 | 0.63 |
| Glu | 0.01 | 128 | 0.3 | 72.7 | 0.00 | 0.04 | 0.17 | 0.00 |
| DBIL | 0.01 | 64 | 0.5 | 73.1 | 0.38 | 0.04 | 0.22 | 0.38 |
| BUN | 0.01 | 64 | 0.5 | 73.0 | −0.13 | 0.03 | 0.15 | −0.13 |
| TC | 0.01 | 64 | 0.5 | 73.2 | 0.25 | 0.02 | 0.20 | 0.25 |
| LDL | 0.01 | 128 | 1.0 | 73.0 | −0.25 | 0.02 | 0.13 | −0.25 |
| TG | 0.00 | 128 | 1.0 | 72.9 | −0.13 | 0.02 | 0.15 | −0.13 |
| Gender | 0.00 | 128 | 1.0 | 73.0 | 0.13 | 0.00 | 0.18 | 0.13 |
| Ca | 0.00 | 64 | 0.5 | 73.0 | 0.00 | 0.00 | 0.17 | 0.00 |
| Apo-A1 | 0.00 | 128 | 1.0 | 73.2 | 0.25 | 0.00 | 0.20 | 0.25 |
| HDL | 0.00 | 128 | 1.0 | 73.1 | −0.13 | 0.00 | 0.15 | −0.13 |
| Apo-B | 0.00 | 128 | 1.0 | 73.1 | 0.00 | 0.00 | 0.17 | 0.00 |
| PI | 0.00 | 128 | 1.0 | 73.0 | −0.13 | 0.00 | 0.15 | −0.13 |
ᵃThe full forms of all abbreviations are shown in Table 2.
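The Contribution column appears to track the change in accuracy, in percentage points, as each feature is added in descending order of standard deviation. This is an inference from the published numbers rather than a definition given here. The sketch below recomputes the differences for the first rows of the table so the pattern can be checked; the function name is illustrative.

```python
# (feature, accuracy %, published contribution) copied from the first rows of the table above.
rows = [
    ("CK", 56.1, None),
    ("LDH", 57.1, 1.00),
    ("α-HBD", 58.6, 1.52),
    ("Height", 57.3, -1.26),
    ("ALP", 58.8, 1.52),
    ("UA", 58.6, -0.25),
    ("SCr", 61.9, 3.28),
]


def accuracy_deltas(rows):
    """Accuracy change (percentage points) when each feature after the first is added."""
    return {
        cur[0]: round(cur[1] - prev[1], 1)
        for prev, cur in zip(rows, rows[1:])
    }


deltas = accuracy_deltas(rows)

# Largest gap between a recomputed delta and the published contribution.
max_gap = max(abs(deltas[name] - c) for name, _, c in rows if c is not None)

for name, _, c in rows[1:]:
    print(f"{name}: recomputed {deltas[name]:+.1f}, published {c:+.2f}")
```

Because the accuracies are rounded to one decimal place, the recomputed differences agree with the Contribution column only to within about 0.05.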
Table 6. Weighting of the 3 feature selection models.
| Order | Featureᵃ | Standard deviation | Relief | Information gain | Weight sum |
| 1 | α-HBD | 0.9123 | 1.0000 | 0.0001 | 1.9124 |
| 2 | GGP | 0.4000 | 0.0657 | 0.0498 | 0.5156 |
| 3 | Alb | 0.2198 | 0.0592 | 0.0211 | 0.3001 |
| 4 | LDL | 0.0197 | 0.0026 | 0.0236 | 0.0459 |
| 5 | TG | 0.0156 | 0.0002 | 0.0001 | 0.0159 |
| 6 | HDL | 0.0032 | 0.0000 | 0.0010 | 0.0042 |
| 7 | ALT | 0.3120 | 0.0055 | 0.1141 | 0.4316 |
| 8 | AST | 0.2734 | 0.0366 | 0.0985 | 0.4085 |
| 9 | SCr | 0.4142 | 0.0637 | 0.0638 | 0.5417 |
| 10 | CK | 1.0000 | 0.5919 | 0.0549 | 1.6468 |
| 11 | CK-MB | 0.2303 | 0.0190 | 0.1657 | 0.4150 |
| 12 | ALP | 0.7239 | 0.0509 | 0.1051 | 0.8799 |
| 13 | AGE | 0.3574 | 0.0503 | 1.0000 | 1.4077 |
| 14 | BUN | 0.0296 | 0.0005 | 0.0845 | 0.1146 |
| 15 | UA | 0.4817 | 0.0024 | 0.0037 | 0.4878 |
| 16 | LDH | 0.9884 | 0.9582 | 0.0788 | 2.0254 |
| 17 | Height | 0.8145 | 0.4240 | 0.1235 | 1.3621 |
| 18 | BMI | 0.1171 | 0.0011 | 0.2146 | 0.3328 |
| 19 | Gender | 0.0049 | 0.0000 | 0.1349 | 0.1398 |
| 20 | Ca | 0.0040 | 0.0000 | 0.0000 | 0.0040 |
| 21 | PI | 0.0000 | 0.0001 | 0.0812 | 0.0813 |
| 22 | Glu | 0.0430 | 0.0009 | 0.4154 | 0.4593 |
| 23 | Apo-A1 | 0.0036 | 0.0001 | 0.4525 | 0.4562 |
| 24 | Apo-B | 0.0013 | 0.0001 | 0.6987 | 0.7000 |
| 25 | DBIL | 0.0364 | 0.0003 | 0.2629 | 0.2996 |
| 26 | TC | 0.0248 | 0.0000 | 0.0382 | 0.0630 |
| 27 | TBIL | 0.1633 | 0.0323 | 0.5188 | 0.7143 |
| 28 | TP | 0.3667 | 0.0946 | 0.0417 | 0.5029 |
ᵃThe full forms of all abbreviations are shown in Table 2.
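The Weight sum column is the row-wise sum of the three normalized weights. A minimal Python sketch with a few rows copied from the table above shows the arithmetic and the ordering it would produce; the variable names are illustrative, and the sum is only an intermediate quantity, not necessarily the final ranking used by the method.

```python
# (standard deviation, Relief, information gain) weights copied from the table above.
weights = {
    "α-HBD": (0.9123, 1.0000, 0.0001),
    "CK": (1.0000, 0.5919, 0.0549),
    "AGE": (0.3574, 0.0503, 1.0000),
    "LDH": (0.9884, 0.9582, 0.0788),
    "Height": (0.8145, 0.4240, 0.1235),
    "Apo-B": (0.0013, 0.0001, 0.6987),
}

# Row-wise sum of the three filter weights, rounded to the table's precision.
weight_sum = {name: round(sum(w), 4) for name, w in weights.items()}

# Features ordered from the largest to the smallest weight sum.
ranked = sorted(weight_sum, key=weight_sum.get, reverse=True)

for name in ranked:
    print(f"{name}: {weight_sum[name]:.4f}")
```

Sums recomputed from the rounded entries can differ from the published column in the last digit (for example, Height gives 1.3620 here against 1.3621 in the table).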
Table 7. Contribution of individual features.
| Featureᵃ | Contribution (standard deviation) | Contribution (Relief) | Contribution (information gain) | Cumulative contribution (standard deviation) | Cumulative contribution (Relief) | Cumulative contribution (information gain) |
| α-HBD | 0.9123 | 1.0000 | 0.0001 | 1.6654 | 1.0000 | 6.2282 |
| GGP | 0.4000 | 0.0657 | 0.0498 | 2.8987 | 2.5000 | 5.1229 |
| Alb | 0.2198 | 0.0592 | 0.0211 | 4.9488 | 3.4428 | 5.8159 |
| LDL | 0.0197 | 0.0026 | 0.0236 | 6.5655 | 6.2714 | 5.5703 |
| TG | 0.0156 | 0.0002 | 0.0001 | 6.7155 | 7.8856 | 6.3685 |
| HDL | 0.0032 | 0.0000 | 0.0010 | 7.4155 | 9.4571 | 6.0966 |
| ALT | 0.3120 | 0.0055 | 0.1141 | 4.1821 | 6.0714 | 3.7369 |
| AST | 0.2734 | 0.0366 | 0.0985 | 4.3988 | 5.0000 | 4.0439 |
| SCr | 0.4142 | 0.0637 | 0.0638 | 2.7654 | 3.0857 | 4.8422 |
| CK | 1.0000 | 0.5919 | 0.0549 | 1.0000 | 1.6571 | 4.9562 |
| CK-MB | 0.2303 | 0.0190 | 0.1657 | 4.7321 | 5.8428 | 2.4474 |
| ALP | 0.7239 | 0.0509 | 0.1051 | 2.0320 | 3.6428 | 3.8158 |
| AGE | 0.3574 | 0.0503 | 1.0000 | 4.0654 | 4.6428 | 1.0000 |
| BUN | 0.0296 | 0.0005 | 0.0845 | 6.2321 | 7.3999 | 4.1930 |
| UA | 0.4817 | 0.0024 | 0.0037 | 2.1654 | 6.5428 | 5.9299 |
| LDH | 0.9884 | 0.9582 | 0.0788 | 1.2987 | 1.6571 | 4.5878 |
| Height | 0.8145 | 0.4240 | 0.1235 | 1.6654 | 1.7714 | 3.6492 |
| BMI | 0.1171 | 0.0011 | 0.2146 | 5.6988 | 6.8857 | 2.3070 |
| Gender | 0.0049 | 0.0000 | 0.1349 | 6.8988 | 9.2142 | 2.6492 |
| Ca | 0.0040 | 0.0000 | 0.0000 | 7.0655 | 9.7142 | 6.5264 |
| PI | 0.0000 | 0.0001 | 0.0812 | 7.7322 | 8.1285 | 4.3509 |
| Glu | 0.0430 | 0.0009 | 0.4154 | 5.8655 | 7.1428 | 2.0614 |
| Apo-A1 | 0.0036 | 0.0001 | 0.4525 | 7.2655 | 8.4142 | 1.8246 |
| Apo-B | 0.0013 | 0.0001 | 0.6987 | 7.5822 | 8.7285 | 1.4386 |
| DBIL | 0.0364 | 0.0003 | 0.2629 | 6.0821 | 7.6285 | 2.3070 |
| TC | 0.0248 | 0.0000 | 0.0382 | 6.4322 | 8.9571 | 5.4387 |
| TBIL | 0.1633 | 0.0323 | 0.5188 | 5.4488 | 5.3857 | 1.7018 |
| TP | 0.3667 | 0.0946 | 0.0417 | 3.0654 | 2.0428 | 5.2808 |
ᵃThe full forms of all abbreviations are shown in Table 2.
Table 8. Weighting of the 28 features using weighting- and ranking-based hybrid feature selection.
| Order | Featureᵃ | Weight | Contribution | Cumulative contribution | Weight (0-1) |
| 1 | Age | 176.31 | 0.13 | 0.13 | 1 |
| 2 | α-HBD | 88.36 | 0.06 | 0.19 | 0.42 |
| 3 | SCr | 83.02 | 0.06 | 0.25 | 0.38 |
| 4 | LDH | 70.59 | 0.05 | 0.30 | 0.30 |
| 5 | Height | 70.32 | 0.05 | 0.35 | 0.30 |
| 6 | TBIL | 66.18 | 0.05 | 0.39 | 0.27 |
| 7 | CK | 59.22 | 0.04 | 0.44 | 0.22 |
| 8 | Apo-B | 55.61 | 0.04 | 0.48 | 0.20 |
| 9 | CK-MB | 54.09 | 0.04 | 0.51 | 0.19 |
| 10 | Alb | 48.60 | 0.03 | 0.55 | 0.15 |
| 11 | AST | 47.49 | 0.03 | 0.58 | 0.15 |
| 12 | GGP | 45.36 | 0.03 | 0.61 | 0.13 |
| 13 | DBIL | 40.76 | 0.03 | 0.64 | 0.10 |
| 14 | Glu | 39.35 | 0.03 | 0.67 | 0.09 |
| 15 | Gender | 37.99 | 0.03 | 0.70 | 0.08 |
| 16 | ALP | 36.26 | 0.03 | 0.72 | 0.07 |
| 17 | Apo-A1 | 35.60 | 0.03 | 0.75 | 0.07 |
| 18 | TP | 35.22 | 0.03 | 0.77 | 0.06 |
| 19 | Ca | 34.34 | 0.02 | 0.80 | 0.06 |
| 20 | TC | 34.34 | 0.02 | 0.82 | 0.06 |
| 21 | BMI | 33.90 | 0.02 | 0.85 | 0.06 |
| 22 | HDL | 33.16 | 0.02 | 0.87 | 0.05 |
| 23 | BUN | 32.92 | 0.02 | 0.89 | 0.05 |
| 24 | PI | 32.61 | 0.02 | 0.92 | 0.05 |
| 25 | TG | 32.37 | 0.02 | 0.94 | 0.05 |
| 26 | UA | 30.70 | 0.02 | 0.96 | 0.03 |
| 27 | LDL | 27.46 | 0.02 | 0.98 | 0.01 |
| 28 | ALT | 25.56 | 0.02 | 1.00 | 0 |
ᵃThe full forms of all abbreviations are shown in Table 2.
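The derived columns in the table above are consistent with simple arithmetic on the Weight column: Contribution as each weight's share of the total, Cumulative contribution as the running sum of those shares, and Weight (0-1) as min-max scaling. This reading is inferred from the values rather than stated in the table, so the Python sketch below should be taken as a check of that assumption; the variable names are illustrative.

```python
from itertools import accumulate

# Weight column copied from the table above, in rank order (Age first, ALT last).
weights = [
    176.31, 88.36, 83.02, 70.59, 70.32, 66.18, 59.22, 55.61, 54.09, 48.60,
    47.49, 45.36, 40.76, 39.35, 37.99, 36.26, 35.60, 35.22, 34.34, 34.34,
    33.90, 33.16, 32.92, 32.61, 32.37, 30.70, 27.46, 25.56,
]

total = sum(weights)

# Share of the total weight carried by each feature.
contribution = [w / total for w in weights]

# Running sum of the shares, ending at 1.0 for the last feature.
cumulative = list(accumulate(contribution))

# Min-max scaling of the raw weights to the range [0, 1].
lo, hi = min(weights), max(weights)
weight_01 = [(w - lo) / (hi - lo) for w in weights]

print(round(contribution[0], 2), round(cumulative[8], 2), round(weight_01[1], 2))
```

Under this reading, the top-ranked feature (Age) carries about 13% of the total weight, and the first nine features together account for just over half of it.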
Table 9. Classification performance of support vector machine with different feature selection methods.
| Method | Features | Sensitivity (N=398), n (%) | Specificity (N=394), n (%) | Accuracy (N=792), n (%) | Youden index |
| WRHFSᵃ | 9 | 329 (82.7) | 317 (80.4) | 645 (81.5) | 0.63 |
| Information gain | 10 | 297 (74.6) | 284 (72.1) | 574 (72.5) | 0.47 |
| Relief | 13 | 277 (69.6) | 290 (73.7) | 577 (72.9) | 0.43 |
| Standard deviation | 20 | 283 (71.1) | 291 (73.9) | 580 (73.2) | 0.45 |
ᵃWRHFS: weighting- and ranking-based hybrid feature selection.
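The Youden index in the last column follows the standard definition, sensitivity + specificity − 1. The Python sketch below recomputes it from the counts in the table above, reading the sensitivity and specificity counts as correctly classified positive and negative cases; the function and variable names are illustrative.

```python
def youden_index(true_pos, n_pos, true_neg, n_neg):
    """Youden index J = sensitivity + specificity - 1."""
    sensitivity = true_pos / n_pos
    specificity = true_neg / n_neg
    return sensitivity + specificity - 1


# Column totals from the table header.
N_POS, N_NEG = 398, 394

# method -> (correctly classified positives, correctly classified negatives)
counts = {
    "WRHFS": (329, 317),
    "Information gain": (297, 284),
    "Relief": (277, 290),
    "Standard deviation": (283, 291),
}

youden = {
    method: round(youden_index(tp, N_POS, tn, N_NEG), 2)
    for method, (tp, tn) in counts.items()
}
best = max(youden, key=youden.get)

for method, j in youden.items():
    print(f"{method}: {j:.2f}")
```

The index ranges from −1 to 1, with 0 corresponding to a classifier no better than chance, so it summarizes sensitivity and specificity in a single number that does not depend on class balance.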
Table 10. Classification performance of different models with weighting- and ranking-based hybrid feature selection.
| Classifier | Sensitivity (N=398), n (%) | Specificity (N=394), n (%) | Accuracy (N=792), n (%) | Youden index |
| SVMᵃ | 329 (82.7) | 317 (80.4) | 645 (81.5) | 0.63 |
| Bayes | 319 (80.2) | 197 (50.0) | 520 (65.7) | 0.30 |
| CBAᵇ | 305 (76.6) | 300 (76.1) | 605 (76.4) | 0.53 |
| BPNNᶜ | 280 (70.4) | 220 (55.8) | 501 (63.2) | 0.26 |
| CARTᵈ | 280 (70.4) | 283 (71.8) | 562 (71.0) | 0.42 |
| C4.5 | 269 (67.6) | 302 (76.6) | 571 (72.1) | 0.44 |
| ELMᵉ | 220 (55.3) | 249 (63.2) | 469 (59.2) | 0.19 |
ᵃSVM: support vector machine.
ᵇCBA: classification based on associations.
ᶜBPNN: back-propagation neural networks.
ᵈCART: classification and regression tree.
ᵉELM: extreme learning machine.
Figure 2. A surface chart for risk detection.
Figure 3. A risk index map for ischemic stroke detection.