Sunday O Olatunji, Aisha Alansari, Heba Alkhorasani, Meelaf Alsubaii, Rasha Sakloua, Reem Alzahrani, Yasmeen Alsaleem, Mona Almutairi, Nada Alhamad, Albandari Alyami, Zainab Alshobbar, Reem Alassaf, Mehwash Farooqui, Mohammed Imran Basheer Ahmed.
Abstract
Rheumatoid arthritis (RA) is a chronic inflammatory disease, driven by numerous genetic and environmental factors, that causes musculoskeletal pain. RA may damage other tissues and organs, causing complications that severely reduce patients' quality of life. According to the World Health Organization (WHO), over 1.71 billion individuals worldwide had musculoskeletal problems in 2021. Rheumatologists face challenges in detecting RA early because its symptoms resemble those of other illnesses and there is no definitive test to diagnose the disease. Computational intelligence techniques that can identify hidden patterns are therefore well suited to diagnosing RA early. Multiple studies have attempted early RA diagnosis, but their performance was unsatisfactory; the highest reported accuracy was 87.5%, obtained with imaging data. Imaging data, however, requires diagnostic tools, is challenging to collect and examine, and is more costly. Recent studies indicated that neither a blood test nor a physical finding can confirm the diagnosis early. Therefore, this study proposes a novel ensemble technique for the preemptive prediction of RA and investigates the possibility of diagnosing the disease from clinical data before symptoms appear. Two datasets were obtained from King Fahad University Hospital (KFUH), Dammam, Saudi Arabia, covering 446 patients, of whom 251 were positive and 195 were negative for RA. Two experiments were conducted: the first without upsampling the dataset and the second with an upsampled dataset. Multiple machine learning (ML) algorithms were used to build the novel voting ensemble, including support vector machine (SVM), logistic regression (LR), and adaptive boosting (Adaboost).
The results indicated that clinical laboratory tests fed to the proposed voting ensemble could accurately diagnose RA preemptively, with an accuracy, recall, and precision of 94.03%, 96.00%, and 93.51%, respectively, using 30 clinical features selected by the sequential forward feature selection (SFFS) technique on the original data. It is concluded that deploying the proposed model in local hospitals can aid medical specialists in preemptively diagnosing RA from clinical laboratory tests and in stopping or delaying its course.
Year: 2022 PMID: 36158117 PMCID: PMC9492338 DOI: 10.1155/2022/2339546
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.809
Figure 1. Rheumatoid arthritis prediction framework.
Dataset description.
| Feature | Description |
|---|---|
| Sex | Male or female. |
| Age | Age of the patient in years. |
| Mean corpuscular hemoglobin (MCH) | The average amount of hemoglobin in a red blood cell. |
| Mean corpuscular volume (MCV) | The average size of red blood cells. |
| Mean corpuscular hemoglobin concentration (MCHC) | The hemoglobin concentration in a given volume of a red blood cell. |
| Red cell distribution width (RDW) | The variance in size and volume of red blood cells. |
| Platelet count | The number of platelets in the blood. |
| Mean platelet volume (MPV) | The average size of platelets. |
| Hemoglobin (HGB) | The amount of HGB in red blood cells. |
| Hematocrit (HCT) | The proportion of red blood cells in the blood. |
| White blood cells (WBC) | The WBC count in the blood. |
| Red blood cells (RBC) | The RBC count in the blood. |
| Monocytes (MONO)% | The percentage of monocytes in WBCs. |
| Monocytes (MONO)# | The absolute count of monocytes in WBCs. |
| Eosinophils (EOS)% | The percentage of eosinophils in WBCs. |
| Eosinophils (EOS)# | The absolute count of eosinophils in WBCs. |
| Neutrophil (NEU)% | The percentage of neutrophils in WBCs. |
| Neutrophil (NEU)# | The absolute count of neutrophils in WBCs. |
| Lymphocytes (LYMPH)% | The percentage of lymphocytes in the blood. |
| Lymphocytes (LYMPH)# | The absolute count of lymphocytes in the blood. |
| Basophils (BASO)% | The percentage of basophils in the blood. |
| Basophils (BASO)# | The absolute count of basophils in the blood. |
| Sodium | The amount of sodium in the blood. |
| Potassium | The amount of potassium in the blood. |
| Chloride | The amount of chloride in the blood. |
| Carbon dioxide (CO2) | The amount of CO2 in the blood. |
| Anion gap | The measurement of acid-base balance in the blood. |
| Gamma-glutamyl transferase (GGTP) | The amount of GGTP in the blood. |
| Alkaline phosphatase (ALP) | The measurement of ALP in the blood. |
| Serum glutamic-oxaloacetic transaminase (SGOT) | The measurement of aspartate aminotransferase (AST) enzyme in the blood serum. |
| Serum glutamic pyruvic transaminase (SGPT) | The amount of glutamate pyruvate transaminase (GPT) in blood serum. |
Statistical analysis of numerical features.
| Feature | Mean | Standard deviation | Minimum | 25% | 50% | 75% | Maximum | Missing values |
|---|---|---|---|---|---|---|---|---|
| Age | 57.54 | 14.86 | 17.00 | 48.00 | 58.00 | 68.00 | 92.00 | 1.00 |
| MCH | 26.79 | 3.36 | 14.00 | 25.00 | 27.05 | 29.00 | 36.20 | 13.00 |
| MCV | 81.55 | 8.09 | 48.00 | 77.15 | 82.70 | 87.00 | 106.50 | 14.00 |
| MCHC | 32.77 | 1.43 | 25.00 | 32.00 | 33.00 | 33.90 | 36.00 | 14.00 |
| RDW | 15.19 | 2.39 | 12.00 | 13.70 | 14.50 | 16.00 | 26.70 | 14.00 |
| Platelet count | 278.09 | 92.43 | 53.00 | 213.00 | 268.00 | 329.00 | 664.00 | 14.00 |
| MPV | 9.34 | 1.20 | 6.00 | 8.40 | 9.30 | 10.20 | 13.10 | 22.00 |
| HGB | 12.30 | 1.91 | 7.00 | 11.20 | 12.40 | 13.58 | 17.30 | 15.00 |
| HCT | 37.55 | 5.25 | 20.30 | 34.70 | 37.80 | 41.10 | 50.80 | 15.00 |
| WBC | 7.45 | 3.45 | 0.73 | 5.30 | 6.60 | 8.90 | 28.35 | 35.00 |
| RBC | 4.62 | 0.70 | 2.14 | 4.18 | 4.65 | 5.06 | 7.24 | 33.00 |
| MONO% | 8.68 | 2.46 | 1.00 | 7.00 | 8.60 | 10.20 | 18.50 | 24.00 |
| EOS% | 3.30 | 2.56 | 0.00 | 1.70 | 2.70 | 4.20 | 18.00 | 24.00 |
| NEU% | 53.09 | 13.61 | 14.00 | 43.98 | 52.85 | 61.90 | 93.00 | 25.00 |
| LYMPH% | 34.12 | 12.38 | 1.00 | 26.15 | 33.55 | 42.13 | 78.00 | 25.00 |
| BASO% | 0.60 | 0.36 | 0.00 | 0.30 | 0.50 | 0.80 | 2.20 | 25.00 |
| BASO# | 0.03 | 0.05 | 0.00 | 0.00 | 0.00 | 0.10 | 0.40 | 28.00 |
| Sodium | 139.07 | 3.13 | 117.00 | 137.00 | 139.00 | 141.00 | 146.00 | 23.00 |
| NEU# | 4.00 | 2.22 | 0.70 | 2.40 | 3.50 | 5.10 | 13.60 | 26.00 |
| LYMPH# | 2.34 | 1.18 | 0.20 | 1.70 | 2.20 | 2.80 | 18.50 | 28.00 |
| MONO# | 0.61 | 0.25 | 0.10 | 0.40 | 0.60 | 0.70 | 2.20 | 28.00 |
| EOS# | 0.23 | 0.23 | 0.00 | 0.10 | 0.20 | 0.30 | 2.20 | 29.00 |
| Potassium | 4.29 | 0.48 | 2.90 | 4.00 | 4.30 | 4.60 | 6.40 | 27.00 |
| Chloride | 102.81 | 3.00 | 82.00 | 101.00 | 103.00 | 105.00 | 111.00 | 27.00 |
| CO2 | 27.40 | 3.12 | 6.00 | 26.00 | 28.00 | 29.00 | 38.00 | 70.00 |
| Anion gap | 8.64 | 2.79 | 1.00 | 7.00 | 9.00 | 10.00 | 29.00 | 61.00 |
| GGTP | 51.92 | 69.29 | 11.00 | 23.00 | 33.50 | 48.25 | 838.00 | 81.00 |
| ALP | 85.07 | 54.89 | 11.00 | 61.00 | 74.00 | 95.00 | 703.00 | 46.00 |
| SGOT | 26.51 | 32.10 | 7.00 | 16.00 | 21.00 | 27.00 | 442.00 | 48.00 |
| SGPT | 31.52 | 23.20 | 9.00 | 20.00 | 26.00 | 35.00 | 277.00 | 49.00 |
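The columns of the table above (mean, standard deviation, quartiles, extremes, and missing-value count) can be reproduced for a single feature with a short routine. The following is a minimal sketch using only the Python standard library. The function name `summarize` and the sample ages are hypothetical, and the source does not state whether the sample or population standard deviation was reported, so the sample version is assumed here.

```python
import statistics

def summarize(values):
    """Summarize one numeric column; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    # "inclusive" quartiles interpolate linearly between order statistics.
    q25, q50, q75 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation (assumed)
        "min": present[0],
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": present[-1],
        "missing": len(values) - len(present),
    }

# Hypothetical ages, not taken from the study data.
ages = [48, 58, None, 68, 17, 92]
summary = summarize(ages)
```

Missing entries are excluded from every statistic and counted separately, which matches the layout of the table's last column.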
Figure 2. Correlation heatmap.
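A correlation heatmap is a plot of a pairwise correlation matrix. The sketch below computes such a matrix with the Pearson coefficient, which is an assumption because the source does not name the coefficient used. The helper names and the HGB/HCT values are hypothetical and chosen only to illustrate the calculation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map each pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values, not taken from the study data.
columns = {
    "HGB": [11.2, 12.4, 13.6, 10.0],
    "HCT": [34.0, 37.5, 41.0, 30.5],
}
matrix = correlation_matrix(columns)
```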
Figure 3. (a) RBF kernel; (b) sigmoid kernel; (c) linear kernel; and (d) poly kernel.
Figure 4. (a) None penalty; (b) L1 penalty; and (c) L2 penalty.
Figure 5. Adaboost hyperparameter tuning.
Figure 6. (a) RBF kernel; (b) sigmoid kernel; (c) linear kernel; and (d) poly kernel.
Figure 7. (a) None penalty; (b) L1 penalty; and (c) L2 penalty.
Figure 8. Adaboost hyperparameter tuning.
The optimal hyperparameters for each classifier.
| Experiment | Classifier | Hyperparameter | Value | Validation accuracy |
|---|---|---|---|---|
| Experiment 1 | SVM | Cost | 5 | 85.90% |
| | | Gamma | 1 | |
| | | Kernel | Linear | |
| | LR | Cost | 1 | 85.90% |
| | | Penalty | L1 | |
| | | Solver | Saga | |
| | Adaboost | N_estimators | 80 | 84.61% |
| | | Learning_rate | 0.1 | |
| Experiment 2 | SVM | Cost | 5 | 87.49% |
| | | Gamma | 0.1 | |
| | | Kernel | RBF | |
| | LR | Cost | 3 | 87.77% |
| | | Penalty | L2 | |
| | | Solver | Newton-cg | |
| | Adaboost | N_estimators | 100 | 86.06% |
| | | Learning_rate | 0.1 | |
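One common way to arrive at a table of optimal hyperparameters is an exhaustive grid search scored by validation accuracy. The source does not describe its search procedure, so the following is only a sketch of that general idea. The grid values and the scoring function `toy_validation_score` are made up for illustration; a real scorer would train the classifier and evaluate it on held-out data.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Return (best_params, best_score) over every combination in `grid`."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space; the study's actual grid is not given in the source.
svm_grid = {
    "cost": [1, 3, 5],
    "gamma": [0.1, 1],
    "kernel": ["linear", "rbf", "sigmoid", "poly"],
}

def toy_validation_score(params):
    # Stand-in for "train the model, then score it on validation data".
    # The integer points below are arbitrary, not measured results.
    points = params["cost"]
    points += 3 if params["kernel"] == "linear" else 0
    points += 1 if params["gamma"] == 1 else 0
    return points

best_params, best_score = grid_search(svm_grid, toy_validation_score)
```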
Figure 9. Proposed voting ensemble (a) experiment 1 and (b) experiment 2.
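A voting ensemble combines the predictions of its member classifiers into a single label per patient. The source does not state whether hard (majority) or soft (probability-averaged) voting was used, so the sketch below assumes hard voting. The three prediction lists are hypothetical stand-ins for the outputs of the SVM, LR, and Adaboost members.

```python
from collections import Counter

def hard_vote(model_predictions):
    """Majority label per sample, given one list of labels per model."""
    voted = []
    for labels in zip(*model_predictions):
        # most_common(1) returns the label with the highest vote count.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical predictions for four patients (1 = RA positive, 0 = negative).
svm_pred = [1, 0, 1, 1]
lr_pred = [1, 1, 0, 1]
ada_pred = [0, 0, 1, 1]
ensemble_pred = hard_vote([svm_pred, lr_pred, ada_pred])
```

With three members and two classes a tie cannot occur, which is one practical reason to use an odd number of voters. Libraries such as scikit-learn offer a ready-made `VotingClassifier` for the same purpose.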
Classifiers testing accuracy, precision, and recall using the optimal hyperparameters.
| Experiment | Classifier | Test accuracy | Test precision | Test recall |
|---|---|---|---|---|
| Experiment 1 | SVM | 93.28% | 93.42% | 94.67% |
| Experiment 1 | LR | 91.04% | 89.87% | 94.67% |
| Experiment 1 | Adaboost | 91.79% | 93.24% | 92.00% |
| Experiment 1 | Voting | 94.03% | 93.51% | 96.00% |
| Experiment 2 | SVM | 91.79% | 92.11% | 93.33% |
| Experiment 2 | LR | 93.28% | 93.42% | 94.67% |
| Experiment 2 | Adaboost | 89.55% | 94.20% | 86.67% |
| Experiment 2 | Voting | 93.28% | 93.42% | 94.67% |
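Accuracy, precision, and recall all derive from the four cells of a confusion matrix. The counts in the sketch below are not reported in the source. They are one set of integers, inferred here, that reproduces the percentages reported for the experiment 1 voting ensemble under the assumption of a 134-patient test set.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return accuracy, precision, recall

# Inferred counts (an assumption, not data from the source).
accuracy, precision, recall = classification_metrics(tp=72, fp=5, fn=3, tn=54)
percentages = [round(100 * m, 2) for m in (accuracy, precision, recall)]
```

Rounded to two decimals, these counts give 94.03% accuracy, 93.51% precision, and 96.00% recall, matching the reported figures.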
Comparison of the results using forward feature selection.
| Experiment | Selection method | Classifier | Number of features | Features selected | Test accuracy |
|---|---|---|---|---|---|
| Experiment 1 | Forward selection | SVM | 15 | {Sex, age, MCHC, RDW, platelet count, HGB, HCT, NEU%, LYMPH%, BASO#, NEU#, MONO#, chloride, anion gap, SGPT} | 89.55% |
| Experiment 1 | Forward selection | LR | 30 | {Sex, age, MCV, MCHC, RDW, platelet count, MPV, HGB, HCT, WBC, RBC, MONO%, EOS%, NEU%, LYMPH%, BASO%, BASO#, sodium, NEU#, LYMPH#, MONO#, EOS#, potassium, chloride, CO2, anion gap, GGTP, ALP, SGOT, SGPT} | 91.04% |
| Experiment 1 | Forward selection | Adaboost | 19 | {Sex, age, MCV, platelet count, MPV, RBC, MONO%, EOS%, NEU%, BASO%, sodium, LYMPH#, EOS#, potassium, chloride, anion gap, GGTP, SGOT, SGPT} | 88.81% |
| Experiment 1 | Forward selection | Voting | 30 | {Sex, age, MCH, MCV, MCHC, RDW, platelet count, MPV, HGB, HCT, WBC, RBC, MONO%, EOS%, NEU%, LYMPH%, BASO%, BASO#, sodium, NEU#, LYMPH#, MONO#, EOS#, potassium, chloride, CO2, anion gap, GGTP, ALP, SGOT} | 94.03% |
| Experiment 2 | Forward selection | SVM | 16 | {Sex, age, platelet count, MPV, HCT, EOS%, NEU%, LYMPH%, BASO%, BASO#, sodium, NEU#, EOS#, potassium, chloride, SGOT} | 88.81% |
| Experiment 2 | Forward selection | LR | 20 | {Sex, age, MCHC, RDW, platelet count, MPV, HGB, HCT, WBC, RBC, BASO%, sodium, NEU#, MONO#, potassium, CO2, anion gap, GGTP, ALP, SGPT} | 93.28% |
| Experiment 2 | Forward selection | Adaboost | 6 | {Sex, age, MPV, HGB, MONO%, BASO#} | 85.82% |
| Experiment 2 | Forward selection | Voting | 29 | {Sex, age, MCH, MCV, MCHC, RDW, platelet count, HGB, HCT, WBC, RBC, MONO%, EOS%, NEU%, LYMPH%, BASO%, BASO#, sodium, NEU#, MONO#, EOS#, potassium, chloride, CO2, anion gap, GGTP, ALP, SGOT, SGPT} | 91.04% |
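Sequential forward selection grows a feature subset greedily: at each step it adds the single feature that most improves a score, and it remembers the best subset seen. The sketch below shows that loop in general form and is not the study's implementation. The `toy_score` function and its weights are made up; in practice the score would be a classifier's validation accuracy on the candidate subset.

```python
def forward_selection(features, score):
    """Greedy forward selection; returns (best_subset, best_score)."""
    selected = []
    remaining = list(features)
    best_subset, best_score = [], float("-inf")
    while remaining:
        # Try each remaining feature and keep the best single addition.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(candidate)
        remaining.remove(candidate)
        current = score(selected)
        if current > best_score:
            best_subset, best_score = list(selected), current
    return best_subset, best_score

# Made-up per-feature contributions, used only to exercise the loop.
toy_weights = {"age": 5, "MCHC": 3, "RDW": -2}

def toy_score(subset):
    return sum(toy_weights[f] for f in subset)

best_subset, best_score = forward_selection(toy_weights, toy_score)
```

Here the loop adds `age`, then `MCHC`, and finds that adding `RDW` lowers the score, so the two-feature subset is returned.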
Comparison of the results using backward feature selection.
| Experiment | Selection method | Classifier | Number of features | Features selected | Test accuracy |
|---|---|---|---|---|---|
| Experiment 1 | Backward elimination | SVM | 16 | {Sex, age, MCV, platelet count, MPV, WBC, EOS%, NEU%, LYMPH%, BASO#, sodium, EOS#, potassium, chloride, CO2, SGPT} | 92.54% |
| Experiment 1 | Backward elimination | LR | 31 | {Sex, age, MCH, MCV, MCHC, RDW, platelet count, MPV, HGB, HCT, WBC, RBC, MONO%, EOS%, NEU%, LYMPH%, BASO%, BASO#, sodium, NEU#, LYMPH#, MONO#, EOS#, potassium, chloride, CO2, anion gap, GGTP, ALP, SGOT, SGPT} | 91.04% |
| Experiment 1 | Backward elimination | Adaboost | 9 | {Sex, age, MCHC, RDW, platelet count, MPV, NEU%, potassium, SGPT} | 91.04% |
| Experiment 1 | Backward elimination | Voting | 12 | {Sex, age, MCHC, RDW, platelet count, MPV, WBC, MONO%, NEU%, BASO#, potassium, GGTP} | 93.28% |
| Experiment 2 | Backward elimination | SVM | 22 | {Sex, age, MCH, MCV, MCHC, platelet count, MPV, HGB, HCT, RBC, MONO%, EOS%, NEU%, LYMPH%, BASO%, BASO#, sodium, NEU#, potassium, chloride, anion gap, SGPT} | 91.79% |
| Experiment 2 | Backward elimination | LR | 24 | {Sex, age, MCH, MCV, MCHC, RDW, platelet count, MPV, HGB, HCT, WBC, RBC, MONO%, EOS%, NEU%, BASO%, BASO#, sodium, MONO#, potassium, CO2, anion gap, GGTP, SGOT} | 92.54% |
| Experiment 2 | Backward elimination | Adaboost | 6 | {Sex, age, platelet count, MPV, NEU%, potassium} | 92.54% |
| Experiment 2 | Backward elimination | Voting | 22 | {Sex, age, MCH, MCV, MCHC, RDW, platelet count, MPV, HGB, HCT, WBC, RBC, MONO%, EOS%, NEU%, BASO#, sodium, NEU#, LYMPH#, potassium, chloride, ALP} | 92.54% |
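Backward elimination works in the opposite direction: it starts from all features and repeatedly drops the one whose removal leaves the highest-scoring subset. The following is a minimal sketch of that procedure under the same caveats as any illustration: `toy_score` and its weights are made up, and the source does not specify its stopping rule, so this version scans down to a single feature and returns the best subset encountered.

```python
def backward_elimination(features, score):
    """Greedy backward elimination; returns (best_subset, best_score)."""
    selected = list(features)
    best_subset, best_score = list(selected), score(selected)
    while len(selected) > 1:
        # Drop the feature whose removal leaves the highest-scoring subset.
        to_drop = max(
            selected,
            key=lambda f: score([g for g in selected if g != f]),
        )
        selected.remove(to_drop)
        current = score(selected)
        if current > best_score:
            best_subset, best_score = list(selected), current
    return best_subset, best_score

# Made-up per-feature contributions, used only to exercise the loop.
toy_weights = {"age": 5, "MCHC": 3, "RDW": -2}

def toy_score(subset):
    return sum(toy_weights[f] for f in subset)

kept_subset, kept_score = backward_elimination(toy_weights, toy_score)
```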
Final results of the best-selected classifiers.
| Sampling | Classifier | Test accuracy | Test precision | Test recall |
|---|---|---|---|---|
| Without sampling | SVM | 92.54% | 92.21% | 94.67% |
| With sampling | LR | 93.28% | 93.42% | 94.67% |
| With sampling | Adaboost | 92.54% | 95.77% | 90.67% |
| Without sampling | Voting | 94.03% | 93.51% | 96.00% |
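The "with sampling" rows refer to the experiment run on an upsampled dataset. The source does not name the upsampling method, so the sketch below assumes the simplest option, random oversampling of the minority class with replacement, applied to the class sizes given in the abstract (251 positive, 195 negative). The function name, the seed, and the placeholder row identifiers are hypothetical.

```python
import random

def upsample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes match in size."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Sample with replacement to fill the gap to the largest class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

labels = [1] * 251 + [0] * 195     # class sizes from the abstract
rows = list(range(len(labels)))    # placeholder row identifiers
new_rows, new_labels = upsample_minority(rows, labels)
```

Resampling is usually applied to the training split only, so that duplicated rows cannot leak into the test set; whether the study did so is not stated in the source.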
Figure 10. (a) SVM confusion matrix; (b) LR confusion matrix; (c) Adaboost confusion matrix; and (d) voting confusion matrix.
Figure 11. (a) SVM AUROC; (b) LR AUROC; (c) Adaboost AUROC; and (d) voting AUROC.
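The AUROC shown in such plots equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The labels and scores are hypothetical, and the pairwise loop is quadratic, so it is meant for illustration rather than large datasets.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise positive/negative comparisons."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and classifier scores for four patients.
area = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this example three of the four positive/negative pairs are ranked correctly, giving an area of 0.75.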