Abstract
OBJECTIVE: Early disease screening and diagnosis are important for improving patient survival, so identifying early predictive features of disease is necessary. This paper presents a comprehensive comparative analysis of different Machine Learning (ML) systems and reports the standard deviation of the results obtained through sampling with replacement. The research focuses on (a) analyzing and comparing ML strategies used to predict Breast Cancer (BC) and Cardiovascular Disease (CVD) and (b) using feature importance ranking to identify early high-risk features.
Keywords: Ensemble learning; Feature selection; Gain; Hyperparameter optimization
Year: 2020 PMID: 32276658 PMCID: PMC7146897 DOI: 10.1186/s13104-020-05050-0
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Fig. 1 (a) Overall diagnosis in the breast cancer diagnosis dataset. (b) Overall diagnosis in the cardiovascular disease dataset. (c) Comparison of different hyperparameter optimization methods for the breast cancer dataset. (d) Comparison of different hyperparameter optimization methods for the cardiovascular disease dataset
Performance indicators of different classifiers on the breast cancer diagnostic and cardiovascular disease datasets
| Classifier | Accuracy (%) | Precision (%) | Recall (%) | F1 score (%) | AUC | KS Value |
|---|---|---|---|---|---|---|
| XGBoost_BC | | | | | | 0.9061 |
| LightGBM_BC | 94.74 (94.05,1.69) | 92.19 (92.65,3.49) | 93.65 (91.26,3.61) | 92.91 (92.00,2.33) | 0.9821 (0.9835,0.80) | |
| GBDT_BC | 94.15 ( | 90.77 ( | 93.65 ( | 92.19 ( | 0.9856 ( | 0.8968 |
| LR_BC | 92.40 (93.64,1.62) | 89.06 (92.77,3.15) | 90.48 (90.09,3.42) | 89.76 (91.25,2.19) | 0.9825 (0.9847,0.58) | 0.8796 |
| RF_BC | 92.40 (91.81,1.94) | 90.32 (90.24,3.64) | 88.89 (87.77,4.67) | 89.60 (88.93,2.78) | 0.9710 (0.9757,0.94) | 0.8690 |
| BPNN_BC | 89.47 (92.86,1.85) | 89.47 (91.64,4.18) | 80.95 (88.80,3.67) | 85.00 (90.23,2.56) | 0.9669 (0.9778,0.92) | 0.8439 |
| DT_BC | 87.72 (90.75,1.83) | 90.38 (86.76,3.84) | 74.60 (88.53,4.92) | 81.74 (87.41,2.80) | 0.9314 (0.9500,1.53) | 0.6997 |
| XGBoost_CVD | 73.50 (73.51,0.27) | 75.80 (75.55,0.49) | 69.54 (69.53,0.51) | 72.54 ( | | 0.4733 |
| LightGBM_CVD | 73.53 ( | 75.38 (75.82,0.47) | 70.40 (69.17,0.60) | 72.81 (72.32,0.32) | 0.8042 (0.8023,0.26) | |
| GBDT_CVD | | 75.70 (75.60,0.49) | 69.90 (69.43,0.58) | 72.68 (72.38,0.33) | 0.8041 (0.8023,0.25) | 0.4746 |
| LR_CVD | 72.32 (71.92,0.41) | 74.90 (74.02,0.72) | 67.69 (67.50,0.54) | 71.11 (70.62,0.40) | 0.7869 (0.7829,0.38) | 0.4503 |
| RF_CVD | 73.55 (73.51,0.27) | 75.98 (76.02,0.72) | 69.39 (68.70,0.60) | 72.54 (72.17,0.32) | 0.8026 (0.8012,0.26) | 0.4717 |
| BPNN_CVD | 72.85 (72.81,0.31) | 73.07 (73.73,1.11) | | | 0.7945 (0.7917,0.28) | 0.4686 |
| DT_CVD | 73.26 (73.12,0.17) | | 67.72 (66.42,1.33) | 71.83 (71.22,0.48) | 0.7954 (0.7942,0.30) | 0.4667 |
(a) Values in parentheses are the average and standard deviation of each performance indicator over repeated sampling. For the BC dataset, 300 samples were randomly selected from the 569 samples in each of 1000 repetitions. For the CVD dataset, 1000 samples were randomly selected from the 65,535 samples in each of 1000 repetitions.
(b) Italicized numbers indicate optimal values
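The resampling summarized in note (a) can be illustrated with a minimal Python sketch. It assumes that a fixed set of predictions is rescored on each random draw (with replacement, as the abstract states); the source does not say whether the models were retrained per draw. The function names `accuracy` and `resampled_metric` and the toy labels are hypothetical and are not taken from the paper.

```python
import random
import statistics


def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def resampled_metric(y_true, y_pred, metric, sample_size, repeats, seed=0):
    """Mean and standard deviation of `metric` over repeated random draws.

    Assumption: each draw picks `sample_size` indices with replacement and
    rescores the same predictions; nothing is retrained.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(repeats):
        idx = [rng.randrange(n) for _ in range(sample_size)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Toy labels only; these are not the BC or CVD data.
    y_true = [0, 1] * 50
    y_pred = [0, 1] * 45 + [1, 0] * 5
    mean, std = resampled_metric(y_true, y_pred, accuracy,
                                 sample_size=30, repeats=1000)
    print(f"accuracy: mean={mean:.2f}, std={std:.2f}")
```

Passing a different `metric` callable (precision, recall, F1) would give the other parenthesized columns in the same way.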
Fig. 2 (a) ROC curves for the breast cancer diagnosis dataset. (b) PR curves for the breast cancer diagnosis dataset. (c) ROC curve for the cardiovascular disease dataset. (d) PR curves for the cardiovascular disease dataset. (e) Feature importance rankings for the breast cancer dataset. (f) Feature importance rankings for the cardiovascular disease dataset
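For readers relating the ROC curves to the AUC and KS Value columns of the table, the following Python sketch shows one common way to derive both from a ROC curve. It assumes that the KS value is the usual Kolmogorov-Smirnov statistic, the largest gap between true positive rate and false positive rate over all thresholds; the source does not define it. The function names and example scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, from (0, 0) to (1, 1).

    Assumes binary labels (1 = positive) with at least one of each class.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def ks_statistic(points):
    """Largest vertical gap between the ROC curve and the diagonal (TPR - FPR)."""
    return max(tpr - fpr for fpr, tpr in points)


if __name__ == "__main__":
    # Hypothetical scores; not values from the paper.
    pts = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
    print(auc(pts), ks_statistic(pts))
```

Under this definition a KS value near 0.9 (as in the BC rows) means that some threshold separates the classes almost completely, while values near 0.47 (the CVD rows) indicate much more overlap between the score distributions.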