Hanaa Torkey, Nahla A Belal.
Abstract
Multiple sclerosis (MS) is a disease that attacks the central nervous system. According to the most recent MS Atlas statistics, more than 2.8 million people worldwide have been diagnosed with MS. Recent studies have begun to explore machine learning techniques for predicting MS from various types of data. The objective of this paper is to develop an ensemble approach for diagnosing MS from gene expression profiles while handling the class imbalance problem associated with the data. A hierarchical ensemble approach employing voting and boosting techniques is proposed. It uses heterogeneous voting with two base learners, random forest and support vector machine. Experiments show that our approach outperforms state-of-the-art methods, with the highest recorded accuracy being 92.81% and 93.5% with BoostFS and differentially expressed genes (DEGs) for feature selection, respectively. In conclusion, the proposed approach is able to efficiently diagnose MS using the gene expression profiles that are more relevant to the disease. The approach is not merely an ensemble classifier that outperforms previous work; it also identifies differentially expressed genes between normal samples and patients with multiple sclerosis using a genome-wide expression microarray. The results show that the proposed approach is an efficient diagnostic tool for MS.
Keywords: diagnosis; differentially expressed genes; ensemble learning; gene expression; multiple sclerosis
Year: 2022 PMID: 35885672 PMCID: PMC9316893 DOI: 10.3390/diagnostics12071771
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
MS datasets’ description.
| Datasets | Classes | Features (Genes) | Patient Samples | Control Samples |
|---|---|---|---|---|
| | Two classes: control or patient | 18,722 | 691 | 126 |
| | Two classes: control or patient | 21,653 | 85 | 28 |
| | Only patient samples | 22,653 | 250 | – |
| | Only patient samples | 21,147 | 144 | – |
Figure 1. Proposed model framework.
Figure 2. Proposed method in voting stage.
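The voting stage combines two heterogeneous base learners, random forest and support vector machine. The abstract does not state the voting rule, so the following is only a minimal sketch that assumes weighted soft voting over each learner's estimated probability of the patient class. The function name `soft_vote`, the equal weights, the 0.5 threshold, and the probability values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def soft_vote(prob_rf: Sequence[float],
              prob_svm: Sequence[float],
              weights: Sequence[float] = (0.5, 0.5),
              threshold: float = 0.5) -> List[int]:
    """Combine two learners' P(patient) estimates by weighted averaging.

    Assumption: soft voting; the source does not specify the voting rule.
    Returns 1 (patient) or 0 (control) for each sample.
    """
    if len(prob_rf) != len(prob_svm):
        raise ValueError("both learners must score the same samples")
    w_rf, w_svm = weights
    total = w_rf + w_svm
    combined = [(w_rf * p + w_svm * q) / total
                for p, q in zip(prob_rf, prob_svm)]
    return [1 if c >= threshold else 0 for c in combined]


# Hypothetical probability outputs of the two base learners on four samples.
rf_probs = [0.90, 0.40, 0.20, 0.65]
svm_probs = [0.80, 0.70, 0.10, 0.30]
labels = soft_vote(rf_probs, svm_probs)
print(labels)
```

Averaging probabilities rather than counting hard votes avoids ties when only two learners vote, which is one reason a soft rule is a plausible choice here.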
Benchmark datasets: the KEEL imbalanced datasets.
| Dataset | Imbalance Ratio | No. of Samples | No. of Features |
|---|---|---|---|
| Pima | 1.87 | 768 | 8 |
| vehicle0 | 3.25 | 846 | 18 |
| new-thyroid2 | 5.14 | 215 | 5 |
| ecoli3 | 8.60 | 336 | 7 |
| ecoli-0 | 9.28 | 257 | 7 |
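The imbalance ratio column can be illustrated with a short sketch. It assumes the usual KEEL convention, in which the ratio is the number of majority-class samples divided by the number of minority-class samples. The class counts used for Pima (500 negative, 268 positive) are the commonly reported figures for that dataset and are not stated in this record; the helper name `imbalance_ratio` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def imbalance_ratio(labels: Iterable[Hashable]) -> float:
    """Return majority-class count divided by minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# Assumed class counts for Pima: 500 negative and 268 positive samples.
pima_labels = [0] * 500 + [1] * 268
pima_ir = round(imbalance_ratio(pima_labels), 2)
print(pima_ir)  # 1.87, matching the Pima row in the table above
```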
Evaluation results for the proposed algorithm to handle the imbalanced class problem: random forest.
| Random Forest | | | |
|---|---|---|---|
| Pima | 95% | 0.069 | 0.94 |
| vehicle0 | 94% | 0.068 | 0.91 |
| new-thyroid2 | 92% | 0.047 | 0.89 |
| ecoli3 | 90% | 0.063 | 0.88 |
| ecoli-0 | 91% | 0.067 | 0.90 |
Evaluation results for the proposed algorithm to handle the imbalanced class problem: SVM.
| SVM | | | |
|---|---|---|---|
| Pima | 93% | 0.051 | 0.94 |
| vehicle0 | 95% | 0.056 | 0.92 |
| new-thyroid2 | 94% | 0.041 | 0.95 |
| ecoli3 | 89% | 0.053 | 0.88 |
| ecoli-0 | 90% | 0.076 | 0.89 |
Evaluation results for the proposed algorithm to handle the imbalanced class problem: KNN.
| KNN | | | |
|---|---|---|---|
| Pima | 94% | 0.092 | 0.85 |
| vehicle0 | 89% | 0.071 | 0.87 |
| new-thyroid2 | 90% | 0.077 | 0.84 |
| ecoli3 | 90% | 0.079 | 0.89 |
| ecoli-0 | 90% | 0.072 | 0.90 |
Evaluation results for the proposed algorithm to handle the imbalanced class problem: Guo et al.
| Guo et al. | | | |
|---|---|---|---|
| Pima | 95% | 0.081 | 0.91 |
| vehicle0 | 92% | 0.069 | 0.89 |
| new-thyroid2 | 91% | 0.064 | 0.88 |
| ecoli3 | 92% | 0.090 | 0.91 |
| ecoli-0 | 87% | 0.097 | 0.84 |
Evaluation results for the proposed algorithm to handle the imbalanced class problem: proposed method.
| Proposed Method | | | |
|---|---|---|---|
| Pima | 95% | 0.029 | 0.94 |
| vehicle0 | 96% | 0.046 | 0.95 |
| new-thyroid2 | 94% | 0.065 | 0.95 |
| ecoli3 | 95% | 0.048 | 0.94 |
| ecoli-0 | 92% | 0.051 | 0.91 |
Figure 3. Venn diagram of differentially expressed genes (DEGs) in MS for three stages: control vs. patient (C-vs-P), baseline to first-year follow-up (B-1-fu), and first-year follow-up to second-year follow-up (1-2-fu). (A) Up-regulated genes and (B) down-regulated genes in the three MS stages.
Proposed MS detection approach evaluation.
| Classification Algorithm | Feature Selection | Accuracy | MCC | RMSE | F1 Score |
|---|---|---|---|---|---|
| | All Features | 82.41% | 82% | 0.058 | 0.79 |
| | Chi-Squared | 82.50% | 83% | 0.064 | 0.81 |
| | RFE-SVC | 83.30% | 84% | 0.065 | 0.85 |
| | BoostFS | 84.81% | 85% | 0.061 | 0.86 |
| | With DEGs | 86.81% | 87% | 0.023 | 0.86 |
| | All Features | 84.61% | 84% | 0.075 | 0.84 |
| | Chi-Squared | 86.50% | 85% | 0.056 | 0.86 |
| | RFE-SVC | 87.53% | 87% | 0.099 | 0.87 |
| | BoostFS | 88.81% | 89% | 0.091 | 0.89 |
| | With DEGs | 89.50% | 86% | 0.089 | 0.88 |
| | All Features | 84.61% | 85% | 0.084 | 0.84 |
| | Chi-Squared | 88.50% | 88% | 0.056 | 0.88 |
| | RFE-SVC | 89.50% | 90% | 0.072 | 0.89 |
| | BoostFS | 90.81% | 91% | 0.063 | 0.90 |
| | With DEGs | 90.17% | 89% | 0.043 | 0.89 |
| | All Features | 89.61% | 89% | 0.081 | 0.88 |
| | Chi-Squared | 89.50% | 88% | 0.043 | 0.89 |
| | RFE-SVC | 90.90% | 91% | 0.079 | 0.91 |
| | BoostFS | 92.81% | 93% | 0.067 | 0.93 |
| | With DEGs | 93.54% | 94% | 0.059 | 0.93 |
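The four metrics reported in the table above (accuracy, MCC, RMSE, and F1 score) can be computed from a confusion matrix as in the sketch below. It uses the standard textbook definitions; how the authors computed RMSE is not stated in this record, so taking it between the true labels and the predicted probabilities is an assumption. The function name `evaluate`, the 0.5 threshold, and the sample data are hypothetical.

```python
import math
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_prob: Sequence[float],
             threshold: float = 0.5) -> Dict[str, float]:
    """Compute accuracy, MCC, RMSE, and F1 for binary labels (1 = patient)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Assumption: RMSE between true labels and predicted probabilities.
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / n)
    return {"accuracy": accuracy, "mcc": mcc, "rmse": rmse, "f1": f1}


# Hypothetical labels and predicted probabilities for eight samples.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.2, 0.6, 0.7, 0.1, 0.95]
scores = evaluate(y_true, y_prob)
print(scores)
```

MCC is worth reporting next to accuracy on imbalanced data because it uses all four confusion-matrix cells, so a classifier that always predicts the majority class scores 0 rather than a misleadingly high value.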
Figure 4. ROC curve for the proposed approach with different feature selection algorithms.
Figure 5. Comparison of the feature selection algorithms with different numbers of features.
Figure 6. Enriched Gene Ontology terms in the identified DEGs.