Manal A Abdel-Fattah1, Nermin Abdelhakim Othman1,2, Nagwa Goher1,3.
Abstract
Chronic kidney disease (CKD) has become widespread. It carries serious risks, including cardiovascular disease and end-stage renal disease, which can often be avoided by early detection and treatment of people at risk. Machine learning algorithms can help medical scientists diagnose the disease accurately at an early stage. Recently, big data platforms have been integrated with machine learning algorithms to add value to healthcare. This paper therefore proposes hybrid machine learning techniques that combine feature selection methods with machine learning classification algorithms on a big data platform (Apache Spark) to detect CKD. Two feature selection techniques, Relief-F and chi-squared, were applied to select the important features. Six machine learning classification algorithms were used: decision tree (DT), logistic regression (LR), Naive Bayes (NB), Random Forest (RF), support vector machine (SVM), and Gradient-Boosted Trees (GBT Classifier) as an ensemble learning algorithm. Four evaluation measures, namely accuracy, precision, recall, and F1-measure, were used to validate the results. For each algorithm, cross-validation and test results were computed on the full feature set, the features selected by Relief-F, and the features selected by chi-squared. The results showed that the SVM, DT, and GBT classifiers with the selected features achieved the best performance, at 100% accuracy. Overall, the features selected by Relief-F performed better than the full feature set and the features selected by chi-squared.
Year: 2022 PMID: 35251161 PMCID: PMC8890824 DOI: 10.1155/2022/9898831
Source DB: PubMed Journal: Comput Intell Neurosci
Related works for prediction of CKD.
| REF | Year | Models | Feature selection methods | Dataset |
|---|---|---|---|---|
| [ | 2021 | SVM, KNN, DT, and RF | Recursive feature elimination (RFE) | CKD dataset |
| [ | 2020 | ANN, C5.0, LR, LSVM, KNN, and RF | CFS, Lasso, and wrapper method | CKD dataset |
| [ | 2020 | RF, SVM, NB, and LR | RF-FS, FS, FES, BS, and BES | CKD dataset |
| [ | 2020 | An ensemble of decision tree models | Cost-sensitive ensemble feature ranking | CKD dataset |
| [ | 2020 | Bagging and random subspace methods based on KNN, NB, and DT | No | CKD dataset |
| [ | 2020 | Decision Table, J48, MLP, and NB | Genetic search algorithm | CKD dataset |
| [ | 2019 | LR, RF, SVM, KNN, NB, FNN, and a hybrid model of LR and RF | No | CKD dataset |
| [ | 2019 | Artificial neural network (ANN) and SVM | Correlation coefficients | CKD dataset |
| [ | 2018 | NB, K-Star, SVM, and J48 | No | CKD dataset |
| [ | 2018 | AdaBoost and KNN; NB and SVM | CFS | CKD dataset |
Figure 1: The steps of CKD prediction based on Apache Spark.
The CKD dataset description.
| Feature | Description |
|---|---|
| age | Age |
| bp | Blood pressure |
| sg | Specific gravity |
| al | Albumin |
| su | Sugar |
| rbc | Red blood cells |
| pc | Pus cell |
| pcc | Pus cell clumps |
| ba | Bacteria |
| bgr | Blood glucose random |
| bu | Blood urea |
| sc | Serum creatinine |
| sod | Sodium |
| pot | Potassium |
| hemo | Hemoglobin |
| pcv | Packed cell volume |
| wc | White blood cell count |
| rc | Red blood cell count |
| htn | Hypertension |
| dm | Diabetes mellitus |
| cad | Coronary artery disease |
| appet | Appetite |
| pe | Pedal edema |
| ane | Anemia |
| class | Class |
Figure 2: The important features selected by chi-square.
The scores of all features that are selected by chi-square.
| Features | Scores |
|---|---|
| wc | 12733.72 |
| bgr | 2428.327 |
| bu | 2336.00 |
| sc | 354.410 |
| pcv | 324.706 |
| al | 228.104 |
| hemo | 125.065 |
| age | 113.460 |
| su | 100.94 |
| htn | 86.29 |
| dm | 80.44 |
| bp | 80.02 |
| pe | 45.10 |
| ane | 35.611 |
| sod | 28.793 |
| pcc | 24.075 |
| rc | 20.84 |
| cad | 19.93 |
| pc | 14.16 |
| ba | 12.58 |
| appet | 12.58 |
| rbc | 9.41 |
| pot | 4.07 |
| sg | 0.0050 |
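As a rough illustration of how a chi-square score ranks a feature against the class label, the following is a minimal sketch of the textbook Pearson statistic for one categorical feature. It is not the authors' Spark pipeline: the function name `chi_square_score`, the toy `htn` values, and the `ckd`/`notckd` labels are assumptions made for the example, and this section does not say how continuous features such as `wc` were treated before scoring.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between one categorical feature and the class.

    Hypothetical helper for illustration; a higher score means the feature's
    values are distributed less independently of the class label.
    """
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # joint counts per (value, class)
    feature_totals = Counter(feature_values)         # marginal counts per feature value
    label_totals = Counter(labels)                   # marginal counts per class
    score = 0.0
    for value, value_total in feature_totals.items():
        for label, label_total in label_totals.items():
            expected = value_total * label_total / n  # count expected under independence
            diff = observed.get((value, label), 0) - expected
            score += diff * diff / expected
    return score


# Toy data (assumed, not taken from the CKD dataset).
htn = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
cls = ["ckd", "ckd", "ckd", "ckd", "notckd", "notckd", "notckd", "notckd"]
htn_score = chi_square_score(htn, cls)  # approximately 4.8

# A feature that is independent of the class scores zero.
noise_score = chi_square_score(["a", "b", "a", "b"], ["x", "x", "y", "y"])
```

Sorting features by such a score in descending order gives a ranking of the same kind as the table above; where to cut the ranking is a separate choice that this sketch does not make.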
The performance of ML with the features selected by chi-square (values in %; AC = accuracy, PR = precision, RE = recall, FS = F1-measure; CV = cross-validation).
| Models | CV AC | CV PR | CV RE | CV FS | Test AC | Test PR | Test RE | Test FS |
|---|---|---|---|---|---|---|---|---|
| DT | 97 | 98 | 98 | 98 | 92 | 93 | 93 | 93 |
| RF | 100 | 100 | 100 | 100 | 95 | 95 | 95 | 95 |
| LR | 97 | 97 | 97 | 97 | 97 | 98 | 97 | 97 |
| SVM | 97 | 97 | 97 | 97 | 100 | 100 | 100 | 100 |
| NB | 81 | 85 | 82 | 82 | 82 | 88 | 82 | 82 |
| GBT Classifier | 98 | 98 | 98 | 98 | 95 | 95 | 95 | 95 |
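The four measures reported in the performance tables can be computed from predicted and true labels as in the following minimal sketch. It assumes a binary problem with `ckd` as the positive class; the label strings, the `Metrics` container, and the function name `binary_metrics` are illustrative assumptions, and the averaging mode the authors used in Spark is not stated in this section.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive="ckd"):
    """Accuracy, precision, recall, and F1 for one positive class (illustrative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Metrics(accuracy, precision, recall, f1)


# Toy labels (assumed): 2 true positives, 1 false negative, 1 false positive, 1 true negative.
y_true = ["ckd", "ckd", "ckd", "notckd", "notckd"]
y_pred = ["ckd", "ckd", "notckd", "notckd", "ckd"]
m = binary_metrics(y_true, y_pred)
```

On these toy labels the accuracy is 3/5, and precision, recall, and F1 are all 2/3; a model that reaches 100 in every column of the tables makes no false positives and no false negatives on the evaluated split.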
The best parameter values of the ML models applied to the features selected by chi-square.
| Model | Parameters | Values |
|---|---|---|
| DT | Impurity | Gini |
| | maxDepth | 3 |
| | maxBins | 10 |
| RF | Impurity | Gini |
| | maxDepth | 6 |
| | maxBins | 32 |
| LR | regParam | 0.8 |
| | maxIter | 20 |
| SVM | regParam | 0.01 |
| | maxIter | 100 |
| NB | Smoothing | 0.2 |
| GBT Classifier | maxDepth | 2 |
| | maxBins | 60 |
Figure 3: The weights of the most essential features selected by Relief-F.
The performance of ML with the features selected by Relief-F (values in %; AC = accuracy, PR = precision, RE = recall, FS = F1-measure; CV = cross-validation).
| Models | CV AC | CV PR | CV RE | CV FS | Test AC | Test PR | Test RE | Test FS |
|---|---|---|---|---|---|---|---|---|
| DT | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| RF | 100 | 100 | 100 | 100 | 98 | 99 | 99 | 99 |
| LR | 99 | 99 | 99 | 99 | 98 | 99 | 99 | 99 |
| SVM | 99 | 99 | 99 | 99 | 98 | 99 | 99 | 99 |
| NB | 88 | 89 | 89 | 89 | 95 | 95 | 95 | 95 |
| GBT Classifier | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
The best parameter values of the ML models applied to the features selected by Relief-F.
| Model | Parameters | Values |
|---|---|---|
| DT | Impurity | Gini |
| | maxDepth | 4 |
| | maxBins | 32 |
| RF | Impurity | Gini |
| | maxDepth | 5 |
| | maxBins | 32 |
| LR | regParam | 0.1 |
| | maxIter | 20 |
| SVM | regParam | 0.01 |
| | maxIter | 100 |
| NB | Smoothing | 0.1 |
| GBT Classifier | maxDepth | 4 |
| | maxBins | 20 |
The weights of all features that are selected by Relief-F.
| Features | Weights |
|---|---|
| rbc | 0.4551 |
| hemo | 0.365745 |
| pcv | 0.31156 |
| sg | 0.289825 |
| htn | 0.275375 |
| al | 0.257775 |
| dm | 0.24085 |
| rc | 0.160433 |
| pc | 0.136225 |
| sod | 0.104587 |
| age | 0.065923 |
| appet | 0.062875 |
| pe | 0.056825 |
| su | 0.03165 |
| bgr | 0.029549 |
| ane | 0.027 |
| bu | 0.022733 |
| sc | 0.015806 |
| pcc | 0.015675 |
| wc | 0.006426 |
| ba | −0.00012 |
| pot | −0.00411 |
| cad | −0.01197 |
| bp | −0.01584 |
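To show why some weights in this table are positive and others negative, the following is a minimal sketch of the original two-class Relief update, which Relief-F generalizes by averaging over several nearest neighbors per class. It is a simplification and not the authors' implementation: it assumes numeric features already scaled to [0, 1], at least two samples per class, and a hypothetical function name `relief_weights`.

```python
def relief_weights(X, y):
    """Basic Relief weights for numeric features in [0, 1] and two classes.

    Illustrative simplification of Relief-F: one nearest hit and one nearest
    miss per sample, Manhattan distance, every sample visited once.
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def distance(a, b):
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))     # nearest same-class sample
        miss = min(misses, key=lambda j: distance(X[i], X[j]))  # nearest other-class sample
        for f in range(d):
            # Reward features that differ on the miss, penalize those that differ on the hit.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


# Toy data (assumed): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)  # approximately [0.8, -0.6]
```

A feature whose values vary more within a class than between classes ends up with a negative weight, which is the situation of `ba`, `pot`, `cad`, and `bp` in the table.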
The performance of ML with full features (values in %; AC = accuracy, PR = precision, RE = recall, FS = F1-measure; CV = cross-validation).
| Models | CV AC | CV PR | CV RE | CV FS | Test AC | Test PR | Test RE | Test FS |
|---|---|---|---|---|---|---|---|---|
| DT | 98.43 | 98 | 98 | 98 | 95 | 95 | 95 | 95 |
| RF | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| LR | 99 | 99 | 99 | 99 | 98 | 99 | 99 | 99 |
| SVM | 99 | 99 | 99 | 99 | 100 | 100 | 100 | 100 |
| NB | 84 | 88 | 84 | 84 | 87 | 91 | 88 | 88 |
| GBT Classifier | 99 | 99 | 99 | 99 | 95 | 95 | 95 | 95 |
The best parameter values of the ML models applied to full features.
| Model | Parameters | Values |
|---|---|---|
| DT | Impurity | Gini |
| | maxDepth | 4 |
| | maxBins | 10 |
| RF | Impurity | Gini |
| | maxDepth | 7 |
| | maxBins | 32 |
| LR | regParam | 0.3 |
| | maxIter | 10 |
| SVM | regParam | 0.01 |
| | maxIter | 1000 |
| NB | Smoothing | 0.2 |
| GBT Classifier | maxDepth | 2 |
| | maxBins | 60 |
Best models for cross-validation results.
| Best models | Features | AC | PR | RE | FS |
|---|---|---|---|---|---|
| RF | Full features | 100 | 100 | 100 | 100 |
| RF | Features selected by chi-square | 100 | 100 | 100 | 100 |
| DT | Features selected by Relief-F | 100 | 100 | 100 | 100 |
| RF | Features selected by Relief-F | 100 | 100 | 100 | 100 |
| GBT Classifier | Features selected by Relief-F | 100 | 100 | 100 | 100 |
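Cross-validation results of this kind come from repeatedly holding out one fold for evaluation and training on the rest. The following is a minimal sketch of how the fold indices can be generated; the number of folds is not stated in this section, so `k` and the function name `k_fold_indices` are hypothetical, and in practice the rows would normally be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (illustrative)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Example with assumed sizes: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one test fold, so averaging a metric over the folds uses every row once for evaluation.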
Best models for the testing results.
| Best models | Features | AC | PR | RE | FS |
|---|---|---|---|---|---|
| SVM | Full features | 100 | 100 | 100 | 100 |
| RF | Full features | 100 | 100 | 100 | 100 |
| SVM | Features selected by chi-square | 100 | 100 | 100 | 100 |
| DT | Features selected by Relief-F | 100 | 100 | 100 | 100 |
| GBT Classifier | Features selected by Relief-F | 100 | 100 | 100 | 100 |
The comparison of performance between the previous studies and our work on the same dataset.
| REF | Feature selection methods | The best model | Dataset | Result |
|---|---|---|---|---|
| [ | RFE | RF | CKD dataset | AC = 100%, PR = 100%, RE = 100%, FS = 100% |
| [ | No | A hybrid model of LR and RF | CKD dataset | AC = 99.94% |
| [ | CFS | AdaBoost based on KNN | CKD dataset | AC = 98.1%, PR = 98%, RE = 98%, FS = 98% |
| [ | RF-FS, FS, FES, BS, BES | RF | CKD dataset | AC = 98.825%, RE = 98.04% |
| [ | Cost-sensitive ensemble feature ranking | An ensemble of decision tree models | CKD dataset | AC = 97.27%, PR = 99.44%, RE = 96.25%, FS = 97.68% |
| [ | No | Random subspace-based KNN | CKD dataset | AC = 100%, RE = 100% |
| [ | Genetic search algorithm | Multilayer perceptron | CKD dataset | AC = 99.75% |
| Our work | Relief-F | DT | CKD dataset | Cross-validation: AC = 100%, PR = 100%, RE = 100%, FS = 100%; testing: AC = 100%, PR = 100%, RE = 100%, FS = 100% |
| Our work | Relief-F | GBT Classifier | CKD dataset | Cross-validation: AC = 100%, PR = 100%, RE = 100%, FS = 100%; testing: AC = 100%, PR = 100%, RE = 100%, FS = 100% |