Keding Li1, Canyi Huang2, Jianqiang Du2, Bin Nie2, Guoliang Xu3, Wangping Xiong2, Jigen Luo2.
Abstract
The basic experimental data of traditional Chinese medicine are generally obtained by high-performance liquid chromatography and mass spectrometry. The data are typically high-dimensional with few samples and contain many irrelevant and redundant features, which makes in-depth exploration of the material basis of Chinese medicine difficult. This paper proposes a hybrid feature selection method based on an iterative approximate Markov blanket (CI_AMB). First, the method uses the maximal information coefficient to measure the correlation between each feature and the target variable, and filters out irrelevant features according to an evaluation criterion. Then, the iterative approximate Markov blanket strategy analyzes the redundancy between features, eliminates the redundant ones, and selects an effective feature subset. Comparative experiments on basic experimental data of traditional Chinese medicine substances and on several public UCI datasets show that the new method is better at selecting a small number of highly explanatory features than Lasso, XGBoost, and the classic approximate Markov blanket method.
Year: 2020 PMID: 32328156 PMCID: PMC7166270 DOI: 10.1155/2020/8308173
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Algorithm 1: CI_AMB algorithm.
Figure 1: CI_AMB model.
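The two stages named in the abstract (relevance filtering, then redundancy removal) can be illustrated with a minimal sketch. This is not the authors' CI_AMB implementation: absolute Pearson correlation stands in for the maximal information coefficient, the redundancy test is the classic approximate Markov blanket criterion rather than the paper's iterative variant, and the names `abs_pearson`, `select_features`, and `keep_percent` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def abs_pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation, used here as a stand-in for MIC (assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # a constant column is treated as uninformative
    return abs(cov / (norm_x * norm_y))


def select_features(columns: Dict[str, Sequence[float]],
                    target: Sequence[float],
                    keep_percent: int = 50) -> List[str]:
    """Two-stage filter: relevance ranking, then approximate Markov blanket pruning."""
    # Stage 1: score every feature against the target and keep the top share.
    relevance = {name: abs_pearson(col, target) for name, col in columns.items()}
    ranked = sorted(columns, key=lambda name: relevance[name], reverse=True)
    candidates = ranked[:max(1, len(ranked) * keep_percent // 100)]

    # Stage 2: walk candidates from most to least relevant. A kept feature
    # acts as an approximate Markov blanket for a later one when their mutual
    # score is at least the later feature's relevance to the target.
    selected: List[str] = []
    for name in candidates:
        covered = any(
            abs_pearson(columns[kept], columns[name]) >= relevance[name]
            for kept in selected
        )
        if not covered:
            selected.append(name)
    return selected


# Toy data (made up for illustration): "a_scaled" duplicates "a", "const" is irrelevant.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, -1, -1, 1, 1, -1, -1, 1]
columns = {"a": a, "a_scaled": [2 * v for v in a], "b": b, "const": [3] * 8}
target = [ai + 3 * bi for ai, bi in zip(a, b)]
selected = select_features(columns, target, keep_percent=75)
print(selected)
```

With this toy data the constant column is dropped in the first stage, and one of the two proportional columns is dropped in the second because the other already covers it.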
Basic dataset information (default task: regression).
| Datasets | Number of samples | Number of attributes |
|---|---|---|
| WYHXB | 54 | 799 (798 + 1) |
| NYWZ | 54 | 10284 (10283 + 1) |
| BlogData | 60021 | 281 (280 + 1) |
| RBuild | 372 | 104 (103 + 1) |
| CCrime | 1994 | 128 (127 + 1) |
Partial data of basic experiments with traditional Chinese medicine substances (WYHXB).
| 0.34_237.0119 | 0.35_735.1196 | 0.36_588.0942 | … | 0.36_590.0903 | Red blood cell flow rate ( |
|---|---|---|---|---|---|
| 0.48808 | 302.16 | 0 | … | 27.8589 | 750 |
| 100.078 | 62.016 | 0 | … | 3.80712 | 1400 |
| 11.6992 | 52.5058 | 7.61005 | … | 4.85059 | 785 |
| 143.643 | 284.113 | 0 | … | 456.607 | 790 |
| 7.75089 | 54.4535 | 0 | … | 0 | 670 |
| 18.2499 | 0 | 0 | … | 14.6621 | 680 |
| … | … | … | … | … | … |
| 28.5783 | 0 | 0 | … | 2.3551 | 850 |
| 2.91064 | 0 | 16.1624 | … | 3.41406 | 620 |
| … | … | … | … | … | … |
Partial data of basic experiments with traditional Chinese medicine substances (NYWZ).
| 11.10_787.5077 | 12.29_526.1784 | 12.29_531.2005 | … | 12.47_631.3847 | Red blood cell flow rate ( |
|---|---|---|---|---|---|
| 53.3719 | 11557.6 | 764.329 | … | 1795.79 | 2200 |
| 43.4717 | 7971.33 | 875.465 | … | 1842.39 | 2750 |
| 76.507 | 3399.9 | 870.161 | … | 1562.81 | 1980 |
| 153.145 | 51027.4 | 916.064 | … | 1619.62 | 1860 |
| 16.3197 | 10694.4 | 942.699 | … | 1612.42 | 2100 |
| 42.2836 | 11048.1 | 714.536 | … | 1649.23 | 2000 |
| … | … | … | … | … | … |
| 55.5021 | 4702.83 | 748.844 | … | 1632.9 | 2481 |
| 153.21 | 78912.8 | 835.24 | … | 1647.55 | 2970 |
| … | … | … | … | … | … |
Experimental results on the five datasets after filtering irrelevant features.
| | WYHXB: number of features | WYHXB: Ave-RMSE | NYWZ: number of features | NYWZ: Ave-RMSE | BlogData: number of features | BlogData: Ave-RMSE | RBuild: number of features | RBuild: Ave-RMSE | CCrime: number of features | CCrime: Ave-RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.95 | 758 | 234.960328 | 9768 | 233.324863 | 266 | 12.645784 | 97 | 354.101779 | 120 | 0.131535 |
| 0.9 | 718 | 235.019819 | 9254 | 233.324863 | 252 | 12.645784 | 92 | 354.090179 | 114 | 0.131695 |
| 0.85 | 678 | 234.800187 | 8740 | 233.324863 | 238 | 12.645784 | 87 | 354.134252 | 107 | 0.131858 |
| 0.8 | 638 | 235.133101 | 8226 | 233.324863 | 224 | 12.645784 | 82 | 354.146541 | 101 | 0.131792 |
| 0.75 | 598 | 235.104648 | 7712 | 233.388367 | 210 | 12.645784 | 77 | 353.914801 | 95 | 0.131897 |
| 0.7 | 558 | 235.132128 | 7198 | 233.388367 | 196 | 12.645784 | 72 | 353.914801 | 88 | 0.131853 |
| 0.65 | 518 | 235.191663 | 6683 | 233.385479 | 182 | 12.645784 | 66 | 353.923275 | 82 | 0.131902 |
| 0.6 | 478 | 235.202756 | 6169 | 233.394604 | 168 | 12.645784 | 61 | 354.042364 | 76 | 0.132113 |
| 0.55 | 438 | 235.263138 | 5655 | 233.394604 | 154 | 12.645784 | 56 | 354.050328 | 69 | 0.132164 |
| 0.5 | 399 | 235.962421 | 5141 | 233.357302 | 140 | 12.645784 | 51 | 354.053246 | 63 | 0.132310 |
| 0.45 | 359 | 235.941428 | 4627 | 233.355757 | 126 | 12.649723 | 46 | 354.770411 | 57 | 0.132497 |
| 0.4 | 319 | 236.399412 | 4113 | 233.354086 | 112 | 12.651157 | 41 | 354.849084 | 50 | 0.132620 |
| 0.35 | 279 | 236.574098 | 3599 | 233.354248 | 98 | 12.657242 | 36 | 355.659524 | 44 | 0.133428 |
| 0.3 | 239 | 376.546789 | 3084 | 233.358374 | 84 | 12.664293 | 30 | 355.714190 | 38 | 0.133759 |
| 0.25 | 199 | 406.768586 | 2570 | 233.399275 | 70 | 12.671595 | 25 | 355.700106 | 31 | 0.134865 |
| 0.2 | 159 | 445.621765 | 2056 | 233.437486 | 56 | 12.676944 | 20 | 355.714027 | 25 | 0.136386 |
| 0.15 | 119 | 545.521345 | 1542 | 233.539485 | 42 | 12.677181 | 15 | 355.722452 | 19 | 0.137433 |
| 0.1 | 79 | 553.326100 | 1028 | 233.550540 | 28 | 12.677343 | 10 | 355.785519 | 12 | 0.139937 |
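The feature counts in this table are consistent with keeping a fixed share of each dataset's features and rounding down. That rule is inferred from the numbers, not stated in the source, and the helper name `kept_features` is hypothetical. A small check against the 0.9 and 0.65 rows, using integer percentages to avoid floating-point rounding:

```python
# Feature counts without the target column, taken from the dataset table.
N_FEATURES = {"WYHXB": 798, "NYWZ": 10283, "BlogData": 280, "RBuild": 103, "CCrime": 127}


def kept_features(n_features: int, keep_percent: int) -> int:
    """Number of features kept when retaining keep_percent percent, rounded down."""
    return n_features * keep_percent // 100


row_90 = {name: kept_features(n, 90) for name, n in N_FEATURES.items()}
row_65 = {name: kept_features(n, 65) for name, n in N_FEATURES.items()}
print(row_90)
print(row_65)
```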
Comparison of experimental data between raw data and candidate feature subsets.
| Dataset | Original data: number of features | Original data: RMSE | Candidate feature subset: number of features | Candidate feature subset: RMSE |
|---|---|---|---|---|
| WYHXB | 798 | 234.967849 | 678 | 234.800187 |
| NYWZ | 10283 | 234.052699 | 8226 | 233.324863 |
| BlogData | 280 | 12.645784 | 140 | 12.645784 |
| RBuild | 103 | 352.473674 | 72 | 353.914801 |
| CCrime | 128 | 0.131377 | 120 | 0.131535 |
Figure 2: WYHXB parameter K selection.
Figure 3: NYWZ parameter K selection.
Figure 4: BlogData parameter K selection.
Figure 5: RBuild parameter K selection.
Figure 6: CCrime parameter K selection.
Comparison of experimental results between CI_AMB and other methods (RMSE evaluation index of GBDT).
| Dataset | Original data: number of features | Original data: RMSE | CI_AMB: number of features | CI_AMB: RMSE | XGBoost: number of features | XGBoost: RMSE | Lasso: number of features | Lasso: RMSE | FCBF-MIC: number of features | FCBF-MIC: RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| WYHXB | 798 | 267.5115 | 80 (19 + 61) | | 83 | 269.1644 | 89 | 255.9661 | 15 | 265.0474 |
| NYWZ | 10283 | 258.4021 | 220 (59 + 161) | | 212 | 263.3908 | 215 | 256.2172 | 60 | 265.2352 |
| BlogData | 280 | 22.7247 | 48 (5 + 43) | | 43 | 14.5660 | 47 | 18.7933 | 9 | 24.2629 |
| RBuild | 103 | 458.0302 | 35 (16 + 19) | | 23 | 458.2780 | 26 | 466.8546 | 3 | 461.7130 |
| CCrime | 127 | | 37 (3 + 34) | 0.1091 | 37 | 0.1176 | 31 | 0.1121 | 5 | 0.1231 |
| Average value | | 201.3550 | | | | 201.1034 | | 199.5887 | | 203.2763 |
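The "Average value" row appears to be the arithmetic mean of the five per-dataset RMSE values in each column. A quick check, using the XGBoost column of this table as input (the variable names are illustrative only):

```python
from statistics import fmean

# RMSE values of the XGBoost feature-selection column (GBDT as the evaluator),
# in the order WYHXB, NYWZ, BlogData, RBuild, CCrime.
xgboost_rmse = [269.1644, 263.3908, 14.5660, 458.2780, 0.1176]

average_rmse = round(fmean(xgboost_rmse), 4)
print(average_rmse)  # matches the 201.1034 reported in the table
```

Because the datasets have very different RMSE scales, this average is dominated by RBuild, WYHXB, and NYWZ; BlogData and CCrime contribute little to it.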
Comparison of experimental results of CI_AMB with other methods (RMSE evaluation index of XGBoost).
| Dataset | Original data: number of features | Original data: RMSE | CI_AMB: number of features | CI_AMB: RMSE | XGBoost: number of features | XGBoost: RMSE | Lasso: number of features | Lasso: RMSE | FCBF-MIC: number of features | FCBF-MIC: RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| WYHXB | 798 | 227.9061 | 80 (19 + 61) | | 83 | 221.8774 | 89 | 214.0560 | 15 | 229.7367 |
| NYWZ | 10283 | 219.7160 | 220 (59 + 161) | | 212 | 220.3312 | 215 | 225.1712 | 60 | 225.1525 |
| BlogData | 280 | 8.6356 | 48 (5 + 43) | | 43 | 10.0949 | 47 | 10.2909 | 9 | 10.8045 |
| RBuild | 103 | 264.5195 | 35 (16 + 19) | | 23 | 269.8928 | 26 | 261.3095 | 3 | 278.6242 |
| CCrime | 127 | 0.1447 | 37 (3 + 34) | | 37 | 0.1487 | 31 | 0.1483 | 5 | 0.1492 |
| Average value | | 144.1844 | | | | 144.4690 | | 142.1952 | | 148.8934 |
Figure 7: The average RMSE trend of the five sets of datasets.
Figure 8: The average RMSE trend of the five sets of datasets.