Tapas Bhadra^1, Saurav Mallik^2, Neaj Hasan^1, Zhongming Zhao^3,4
Abstract
BACKGROUND: As many complex omics datasets have been generated during the last two decades, dimensionality reduction has become a challenging issue in mining such data effectively. Omics data typically consist of many features, and accordingly many feature selection algorithms have been developed. The performance of these feature selection methods often varies by specific data, making the discovery and interpretation of results challenging.
Keywords: Classifier; Feature selection; Multi-omics data; Redundancy rate; Representation entropy
Year: 2022 PMID: 35484501 PMCID: PMC9052461 DOI: 10.1186/s12859-022-04678-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1: Pipeline of the proposed method
Fundamental characteristics of the datasets used: number of classes among samples, number of statistically significant features, and number of samples
| Data profile | # classes | # statistically significant features | # samples |
|---|---|---|---|
| Exp | 3 | 728 | 161 |
| ExpExon | 3 | 1100 | 161 |
| hMethyl27 | 3 | 272 | 161 |
| Gistic2 | 3 | 904 | 161 |
| Pathway activity data | 3 | 265 | 161 |
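The feature counts above refer to features found statistically significant across the three sample classes. The record does not state which test was used; the sketch below assumes a per-feature one-way ANOVA filter at α = 0.05, purely for illustration, on synthetic data.

```python
# Hypothetical per-feature significance filter across three classes.
# The actual test used in the paper is not specified; one-way ANOVA is an
# assumption made here for illustration only.
import numpy as np
from scipy.stats import f_oneway

def significant_features(X, y, alpha=0.05):
    """Return indices of features whose one-way ANOVA p-value is below alpha.

    X : (n_samples, n_features) array; y : (n_samples,) class labels.
    """
    keep = []
    for j in range(X.shape[1]):
        groups = [X[y == c, j] for c in np.unique(y)]
        _, p = f_oneway(*groups)
        if p < alpha:
            keep.append(j)
    return keep

# Synthetic demo: 3 classes, 150 samples, 3 features; feature 0 is informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 3))
X[:, 0] += y                      # class-dependent shift makes feature 0 significant
idx = significant_features(X, y)
```

On such data the informative feature reliably passes the filter, while purely random features are retained only at roughly the α rate.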
Comparison of classification accuracies (10-fold cross-validation, repeated 10 times in each case) obtained with the feature subsets selected by the various feature selection algorithms

| Dataset | Classifier | mRMR Avg. (std) | INMIFS Avg. (std) | DFS Avg. (std) | SVM-RFE-CBR Avg. (std) | VWMRmR Avg. (std) |
|---|---|---|---|---|---|---|
| Exp | C4.5 | 72.67(3.29) | 73.98(2.08) | 72.73(2.08) | 73.29(3.06) | |
| | Naive Bayes | 83.04(0.88) | 83.66(0.83) | 80.94(0.72) | 84.47(0.83) | |
| | KNN | 88.01(0.66) | 84.35(1.16) | 86.15(0.51) | 88.01(0.83) | |
| | AdaBoost | 81.86(1.30) | 81.43(1.42) | 74.47(2.14) | 81.37(0.97) | |
| ExpExon | C4.5 | 77.08(1.69) | 75.09(2.72) | 73.23(3.43) | 75.03(2.76) | |
| | Naive Bayes | 85.22(0.57) | 84.10(0.43) | 85.90(0.59) | 85.46(0.60) | |
| | KNN | 87.58(0.65) | 87.45(1.01) | 87.01(1.19) | 84.35(0.39) | |
| | AdaBoost | 84.03(1.17) | 82.24(1.99) | 77.20(1.44) | 83.35(0.71) | |
| hMethyl27 | C4.5 | 72.05(2.27) | 70.87(1.26) | 70.99(1.94) | 71.99(2.18) | |
| | Naive Bayes | 76.21(1.01) | 78.82(1.26) | 80.94(1.28) | 80.68(1.39) | |
| | KNN | 82.48(1.09) | 83.48(0.89) | 82.80(0.83) | 83.23(0.97) | |
| | AdaBoost | 74.29(1.74) | 77.89(1.38) | 77.52(2.41) | 78.14(1.77) | |
| Gistic2 | C4.5 | 68.01(2.45) | 71.49(1.42) | 72.55(0.76) | 67.95(2.07) | |
| | Naive Bayes | 76.89(1.05) | 75.84(0.46) | 78.07(0.42) | 77.20(1.02) | |
| | KNN | 76.02(0.73) | 75.22(0.85) | 74.10(0.42) | 74.84(0.60) | |
| | AdaBoost | 78.57(0.73) | 75.84(0.80) | 78.20(0.46) | 77.33(0.98) | |
| Pathway activity data | C4.5 | 66.65(2.70) | 68.82(3.28) | 64.97(2.72) | 66.89(3.26) | |
| | Naive Bayes | 79.44(1.19) | 82.30(1.56) | 66.65(0.78) | 76.83(1.02) | |
| | KNN | 78.32(0.99) | 77.08(0.80) | 70.37(1.17) | 78.32(0.80) | |
| | AdaBoost | 77.39(2.07) | 78.94(1.86) | 70.87(1.32) | 76.15(1.38) | |

^a The best mean percentage accuracy in each row is highlighted in bold font
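The Avg. (std) entries above are accuracies under 10-fold cross-validation repeated 10 times. A minimal sketch of that protocol using scikit-learn's `RepeatedStratifiedKFold` with a default KNN classifier on synthetic data (the paper's datasets and classifier hyperparameters are not reproduced here):

```python
# Repeated stratified 10-fold CV producing a mean and std accuracy,
# as in the table above.  Data and classifier settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for an omics dataset: 161 samples, 3 classes.
X, y = make_classification(n_samples=161, n_features=50, n_classes=3,
                           n_informative=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv,
                         scoring="accuracy")          # 10 x 10 = 100 scores
avg, std = 100 * scores.mean(), 100 * scores.std()    # percentage Avg(std)
```

Each table cell is then reported as `avg(std)` over the 100 fold-level accuracies.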
Summary of the comparative performance of the proposed feature selection algorithm against other feature selection algorithms
| Dataset | Criteria | mRMR | INMIFS | DFS | SVM-RFE-CBR |
|---|---|---|---|---|---|
| Exp | W-D-L (VWMRmR against other^a) | 2-0-2 | 1-1-2 | 4-0-0 | 1-0-3 |
| ExpExon | W-D-L (VWMRmR against other^a) | 3-0-1 | 4-0-0 | 4-0-0 | 3-0-1 |
| hMethyl27 | W-D-L (VWMRmR against other^a) | 3-0-1 | 3-0-1 | 4-0-0 | 2-0-2 |
| Gistic2 | W-D-L (VWMRmR against other^a) | 0-0-4 | 1-0-3 | 3-0-1 | 0-0-4 |
| Pathway activity data | W-D-L (VWMRmR against other^a) | 2-1-1 | 4-0-0 | 2-0-2 | 4-0-0 |

^a "Other" is the feature selection method denoted by the corresponding column (e.g., mRMR in the third column)
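Each W-D-L entry tallies, over the four classifiers evaluated on a dataset, how often VWMRmR's mean accuracy wins, draws, or loses against one competitor. A sketch of that tally (the score lists below are illustrative, not taken from the tables):

```python
# Win-draw-loss tally of paired mean accuracies, as in the summary table.
def wdl(proposed, other, tol=1e-9):
    """Count wins/draws/losses of `proposed` vs `other` over paired scores."""
    w = sum(p > o + tol for p, o in zip(proposed, other))
    d = sum(abs(p - o) <= tol for p, o in zip(proposed, other))
    l = sum(p < o - tol for p, o in zip(proposed, other))
    return w, d, l

# Illustrative numbers only: mean accuracies over the four classifiers.
vwmrmr = [85.0, 80.0, 78.0, 74.0]
mrmr   = [84.0, 80.0, 79.0, 73.0]
result = wdl(vwmrmr, mrmr)        # -> (2, 1, 1), reported as "2-1-1"
```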
Average redundancy rate, measured by normalized mutual information, of the feature subsets selected by the various algorithms. A lower value indicates a better feature selection algorithm

| Dataset | mRMR | INMIFS | DFS | SVM-RFE-CBR | VWMRmR |
|---|---|---|---|---|---|
| Exp | 0.0978 | 0.1053 | 0.0988 | 0.1053 | |
| ExpExon | 0.0972 | 0.0991 | 0.1050 | 0.0961 | |
| hMethyl27 | 0.0984 | 0.1000 | 0.1045 | 0.0894 | |
| Gistic2 | 0.3159 | 0.2150 | 0.4258 | 0.3209 | |
| Pathway activity data | 0.0659 | 0.0533 | 0.0778 | 0.1215 | |

^a Bold font indicates the lowest value among all algorithms (i.e., best performance) for each dataset (row)
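The NMI-based redundancy rate is commonly computed as the mean normalized mutual information over all pairs of selected features. The sketch below assumes equal-width discretization into 10 bins and geometric-mean normalization of the mutual information; both choices are assumptions, as the record does not specify them.

```python
# Average pairwise NMI among selected features (a redundancy measure).
# Binning scheme and normalization are assumptions for illustration.
import numpy as np
from itertools import combinations

def _entropy(labels):
    """Empirical Shannon entropy (nats) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def nmi(x, y):
    """Normalized mutual information between two non-negative int vectors."""
    hx, hy = _entropy(x), _entropy(y)
    hxy = _entropy(x * (int(y.max()) + 1) + y)   # joint codes
    denom = np.sqrt(hx * hy)
    return (hx + hy - hxy) / denom if denom > 0 else 0.0

def avg_redundancy_nmi(X, n_bins=10):
    """Mean pairwise NMI over the columns of X, after equal-width binning."""
    cols = [np.digitize(c, np.histogram_bin_edges(c, bins=n_bins)[1:-1])
            for c in X.T]
    return float(np.mean([nmi(cols[i], cols[j])
                          for i, j in combinations(range(len(cols)), 2)]))

# Demo: a near-duplicate pair is far more redundant than independent features.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
X_red = np.column_stack([base, base + 0.01 * rng.normal(size=200)])
X_ind = rng.normal(size=(200, 2))
r_red, r_ind = avg_redundancy_nmi(X_red), avg_redundancy_nmi(X_ind)
```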
Average redundancy rate, measured by the Pearson correlation coefficient, of the feature subsets selected by the various algorithms. A lower value indicates a better feature selection algorithm

| Dataset | mRMR | INMIFS | DFS | SVM-RFE-CBR | VWMRmR |
|---|---|---|---|---|---|
| Exp | 0.0281 | 0.0345 | 0.0145 | 0.0219 | |
| ExpExon | 0.0258 | 0.0378 | 0.0545 | 0.0250 | |
| hMethyl27 | 0.1090 | 0.1235 | 0.1324 | 0.0840 | |
| Gistic2 | 0.2953 | 0.1665 | 0.3244 | 0.2437 | |
| Pathway activity data | 0.0420 | 0.0291 | 0.0467 | 0.0684 | |

^a Bold font indicates the lowest value among all algorithms (i.e., best performance) for each dataset (row)
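The Pearson-based redundancy rate can be read as the mean absolute pairwise correlation among the selected features; lower means less redundancy. A minimal NumPy sketch:

```python
# Mean absolute pairwise Pearson correlation among selected features.
import numpy as np

def avg_redundancy_pearson(X):
    """Mean |Pearson r| over all feature pairs; X is (n_samples, n_features)."""
    r = np.corrcoef(X, rowvar=False)
    iu = np.triu_indices_from(r, k=1)     # each pair counted once
    return float(np.abs(r[iu]).mean())

# Demo: four features, of which one pair is nearly duplicated.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)
score = avg_redundancy_pearson(X)
```

The one strongly correlated pair pulls the average well above the near-zero level of independent features.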
Representation entropy (RE) of the feature subsets obtained using the different supervised feature selection algorithms. A higher RE indicates a better feature selection algorithm

| Dataset | mRMR | INMIFS | DFS | SVM-RFE-CBR | VWMRmR |
|---|---|---|---|---|---|
| Exp | 4.4272 | 4.3292 | 4.4012 | 4.3777 | |
| ExpExon | 4.4364 | 4.3815 | 4.4418 | 4.4570 | |
| hMethyl27 | 4.3622 | 4.3389 | 4.3951 | 4.4462 | |
| Gistic2 | 2.2750 | 2.9071 | 1.5355 | 1.9769 | |
| Pathway activity data | 4.7957 | 4.8386 | 4.3463 | 4.0656 | |

^a Bold font indicates the highest RE among all algorithms (i.e., best performance) for each dataset (row)
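Representation entropy is typically defined as the entropy of the normalized eigenvalues of the covariance matrix of the selected feature subset: it is maximal (log d for d features) when variance is spread evenly over all directions, i.e. when the subset is least redundant. A sketch under that standard definition (natural log is assumed):

```python
# Representation entropy of a feature subset: entropy of the normalized
# eigenvalues of its covariance matrix.  Log base (natural) is an assumption.
import numpy as np

def representation_entropy(X):
    """RE of the columns of X (n_samples, n_features)."""
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    lam = np.clip(lam, 0.0, None)        # guard against tiny negative eigenvalues
    p = lam / lam.sum()                  # normalized eigenvalue "distribution"
    p = p[p > 0]                         # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

# Demo: evenly spread variance scores near log(5); redundant columns score low.
rng = np.random.default_rng(3)
X_even = rng.normal(size=(500, 5))                                 # ~isotropic
X_red = np.column_stack([X_even[:, :1]] * 5) + 0.01 * rng.normal(size=(500, 5))
re_even, re_red = representation_entropy(X_even), representation_entropy(X_red)
```

This matches the pattern in the table: subsets with variance concentrated in few directions (e.g., Gistic2) score well below log of the subset size.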
Fig. 2: Venn diagrams showing the intersection of the top 50 extracted statistically significant features (genes) among the five feature selection algorithms: A expression data, B exon expression data, C methylation data, D copy number variation (Gistic2) data, and E pathway activity data