Muhammad Hamraz1, Naz Gul1, Mushtaq Raza2, Dost Muhammad Khan1, Umair Khalil1, Seema Zubair3, Zardad Khan1.
Abstract
This paper proposes a novel feature selection method for microarray gene expression datasets, the Robust Proportional Overlapping Score (RPOS), which uses a robust measure of dispersion, the Median Absolute Deviation (MAD). The method identifies the most discriminative genes in binary class problems by considering the overlapping scores of gene expression values: genes with a high degree of overlap between classes are discarded, and those that discriminate between the classes are selected. The proposed method is compared with five state-of-the-art gene selection methods on eleven gene expression datasets in terms of classification error, Brier score, and sensitivity. Observations are classified, for different sets of genes selected by the proposed method, with three classifiers: random forest, k-nearest neighbors (k-NN), and support vector machine (SVM). Box plots and stability scores of the results are also reported. The results show that the proposed method outperforms the other methods in most cases.
Keywords: Binary classification; Feature selection; Functional genomic; Overlapping analysis
Year: 2021 PMID: 34141889 PMCID: PMC8176540 DOI: 10.7717/peerj-cs.562
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. Gene expression data.
Figure 2. Workflow of RPOS.
Algorithm of RPOS Method For Gene Selection.
| 1: |
| 2: |
| 3: |
| 4: |
| 5: Compute the relative dominant class for each gene, i.e., |
| 6: |
| 7: |
| 8: Compute the gene mask for each gene, i.e., |
| 9: Compute the |
| 10: Assign |
| 11: |
| 12: let |
| 13: Compute the total or aggregate mask of genes and denote it by |
| 14: Use the Greedy search approach to select the minimum subset of genes from |
| 15: Perform |
| 16: Arrange the genes in |
| 17: |
| 18: |
| 19: |
| 20: Then |
| 21: |
| 22: Increase |
| 23: |
| 24: |
| 25: |
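
The formulas in the algorithm listing above did not survive extraction, so the following is only a minimal sketch of the ideas the surviving steps name: a per-gene mask (step 8), an aggregate mask (step 13), and a greedy search for a minimum subset of genes (step 14). It assumes, without confirmation from the source, that each class's core interval for a gene is median ± MAD and that a sample is "masked" by a gene when it falls inside its own class's interval and outside the other class's. The function names `mad`, `core_interval`, `gene_mask`, and `greedy_min_subset` are hypothetical.

```python
import numpy as np

def mad(x):
    """Unscaled median absolute deviation of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

def core_interval(x):
    """ASSUMPTION: robust core interval taken as median +/- MAD."""
    m = float(np.median(x))
    d = mad(x)
    return m - d, m + d

def gene_mask(expr, labels):
    """Boolean mask over samples for one gene (binary labels 0/1).

    A sample is True when it lies inside its own class's core interval
    and outside the other class's interval (assumed rule, see lead-in).
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    lo0, hi0 = core_interval(expr[labels == 0])
    lo1, hi1 = core_interval(expr[labels == 1])
    in0 = (expr >= lo0) & (expr <= hi0)
    in1 = (expr >= lo1) & (expr <= hi1)
    return np.where(labels == 0, in0 & ~in1, in1 & ~in0)

def greedy_min_subset(masks):
    """Greedy set cover over a (genes x samples) boolean mask matrix.

    Repeatedly picks the gene that covers the most still-uncovered
    samples until the aggregate mask (union of all masks) is covered.
    """
    masks = np.asarray(masks, dtype=bool)
    target = masks.any(axis=0)            # aggregate mask
    covered = np.zeros_like(target)
    chosen = []
    while not np.array_equal(covered, target):
        gains = (masks & ~covered).sum(axis=1)
        best = int(np.argmax(gains))      # ties resolved by lowest index
        chosen.append(best)
        covered |= masks[best]
    return chosen

# Toy data: 3 genes x 6 samples, first three samples in class 0.
labels = np.array([0, 0, 0, 1, 1, 1])
X = np.array([
    [1, 2, 3, 10, 11, 12],   # well separated
    [5, 5, 5, 5, 5, 5],      # complete overlap
    [1, 2, 3, 2, 11, 12],    # one class-1 sample overlaps class 0
])
masks = np.array([gene_mask(row, labels) for row in X])
selected = greedy_min_subset(masks)
```

On this toy input the well-separated gene alone covers every sample, so the greedy search stops after one pick; the constant gene contributes nothing because its two class intervals coincide.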
Description of the datasets: number of samples, number of genes, and class-wise distribution of samples.
| Dataset | Samples | Genes | Class wise distribution | Source |
|---|---|---|---|---|
| Leukaemia | 68 | 7,029 | 49/23 | |
| nki | 144 | 76 | 96/48 | |
| Colon | 62 | 2,000 | 40/22 | |
| Breast | 78 | 4,948 | 34/44 | |
| GSE4045 | 37 | 22,215 | 29/8 | |
| Prostate | 412 | 10,936 | 343/69 | |
| Srbct | 54 | 2,308 | 28/25 | |
| Lung | 148 | 12,600 | 134/14 | |
| DLBCL | 76 | 7,070 | 58/19 | |
| TumorC | 60 | 7,129 | 39/21 |
Classification error rate, sensitivity and Brier score produced by Random Forest, k-Nearest Neighbors and Support Vector Machine classifiers on TumorC dataset based on genes selected by the given methods. The best result is shown in bold.
| RF | kNN | SVM | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genes | POS | RPOS | GClust | sigF | Wilc | mRmR | POS | RPOS | GClust | sigF | Wilc | mRmR | POS | RPOS | GClust | sigF | Wilc | mRmR | |
| Err | 0.362 | 0.334 | 0.482 | 0.450 | 0.451 | 0.423 | 0.383 | 0.407 | 0.398 | 0.396 | 0.362 | 0.333 | 0.277 | 0.442 | 0.373 | ||||
| 5 | BS | 0.013 | 0.023 | 0.274 | 0.268 | 0.276 | 0.026 | 0.259 | 0.260 | 0.257 | 0.035 | 0.037 | 0.188 | 0.254 | 0.244 | ||||
| sen | 0.311 | 0.363 | 0.217 | 0.236 | 0.261 | 0.344 | 0.454 | 0.346 | 0.348 | 0.389 | 0.557 | 0.700 | 0.579 | 0.070 | 0.189 | ||||
| Err | 0.336 | 0.313 | 0.482 | 0.401 | 0.348 | 0.332 | 0.355 | 0.471 | 0.391 | 0.395 | 0.341 | 0.336 | 0.302 | 0.396 | 0.349 | ||||
| 10 | BS | 0.022 | 0.274 | 0.252 | 0.231 | 0.019 | 0.029 | 0.282 | 0.254 | 0.260 | 0.086 | 0.072 | 0.204 | 0.241 | 0.230 | ||||
| sen | 0.358 | 0.375 | 0.217 | 0.278 | 0.408 | 0.427 | 0.505 | 0.332 | 0.365 | 0.387 | 0.532 | 0.699 | 0.589 | 0.150 | 0.268 | ||||
| Err | 0.351 | 0.312 | 0.482 | 0.391 | 0.338 | 0.293 | 0.344 | 0.415 | 0.382 | 0.386 | 0.312 | 0.249 | 0.311 | 0.400 | 0.351 | ||||
| 15 | BS | 0.016 | 0.026 | 0.274 | 0.250 | 0.228 | 0.018 | 0.039 | 0.272 | 0.249 | 0.249 | 0.052 | 0.066 | 0.166 | 0.245 | 0.232 | |||
| sen | 0.286 | 0.399 | 0.217 | 0.177 | 0.439 | 0.462 | 0.511 | 0.246 | 0.363 | 0.398 | 0.472 | 0.714 | 0.588 | 0.095 | 0.279 | ||||
| Err | 0.297 | 0.303 | 0.482 | 0.464 | 0.371 | 0.305 | 0.345 | 0.426 | 0.393 | 0.383 | 0.270 | 0.313 | 0.233 | 0.387 | 0.387 | ||||
| 20 | BS | 0.021 | 0.274 | 0.266 | 0.234 | 0.017 | 0.033 | 0.280 | 0.260 | 0.255 | 0.074 | 0.054 | 0.170 | 0.249 | 0.247 | ||||
| sen | 0.440 | 0.404 | 0.217 | 0.091 | 0.312 | 0.478 | 0.555 | 0.245 | 0.347 | 0.365 | 0.561 | 0.721 | 0.666 | 0.033 | 0.115 | ||||
| Err | 0.306 | 0.300 | 0.482 | 0.423 | 0.373 | 0.336 | 0.333 | 0.459 | 0.377 | 0.392 | 0.281 | 0.217 | 0.296 | 0.379 | 0.378 | ||||
| 25 | BS | 0.015 | 0.026 | 0.274 | 0.263 | 0.237 | 0.018 | 0.028 | 0.284 | 0.245 | 0.253 | 0.031 | 0.049 | 0.157 | 0.245 | 0.250 | |||
| sen | 0.399 | 0.411 | 0.217 | 0.120 | 0.257 | 0.364 | 0.539 | 0.262 | 0.331 | 0.348 | 0.518 | 0.716 | 0.678 | 0.044 | 0.062 | ||||
| Err | 0.335 | 0.309 | 0.482 | 0.441 | 0.379 | 0.373 | 0.331 | 0.467 | 0.388 | 0.400 | 0.283 | 0.286 | 0.380 | 0.384 | |||||
| 30 | BS | 0.014 | 0.020 | 0.274 | 0.263 | 0.240 | 0.020 | 0.025 | 0.282 | 0.253 | 0.259 | 0.024 | 0.039 | 0.151 | 0.252 | 0.248 | |||
| sen | 0.317 | 0.423 | 0.217 | 0.064 | 0.304 | 0.304 | 0.560 | 0.174 | 0.331 | 0.382 | 0.480 | 0.665 | 0.661 | 0.019 | 0.066 | ||||
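
The three metrics reported in the table above (Err, BS, sen) can be illustrated with their standard textbook definitions. This is a sketch only: the source does not state its exact evaluation protocol, so the positive-class convention, the 0.5 decision threshold, and the function names `classification_error`, `sensitivity`, and `brier_score` are assumptions made here.

```python
import numpy as np

def classification_error(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true one."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def sensitivity(y_true, y_pred, positive=1):
    """True positive rate: share of positive samples predicted positive.

    Returns nan if y_true contains no positive samples.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    return float(np.mean(y_pred[pos] == positive))

def brier_score(y_true, p_pos, positive=1):
    """Mean squared gap between predicted P(positive) and the 0/1 outcome."""
    y = (np.asarray(y_true) == positive).astype(float)
    return float(np.mean((np.asarray(p_pos, dtype=float) - y) ** 2))

# Toy example: four samples with predicted probabilities of class 1.
y_true = np.array([1, 1, 0, 0])
p_pos = np.array([0.9, 0.4, 0.2, 0.6])
y_pred = (p_pos >= 0.5).astype(int)   # assumed 0.5 decision threshold

err = classification_error(y_true, y_pred)
sen = sensitivity(y_true, y_pred)
bs = brier_score(y_true, p_pos)
```

Lower is better for error and Brier score, while higher is better for sensitivity, which is why a method can lead on one metric and trail on another in the same row.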
Classification error rate, sensitivity and Brier score produced by Random Forest, k-Nearest Neighbors and Support Vector Machine classifiers on Breast cancer dataset based on genes selected by the given methods. The best result is shown in bold.
| RF | kNN | SVM | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genes | POS | RPOS | GClust | sigF | Wilc | mRmR | POS | RPOS | GClust | sigF | Wilc | mRmR | POS | RPOS | GClust | sigF | Wilc | mRmR | |
| 5 | Err | 0.296 | 0.261 | 0.490 | 0.390 | 0.455 | 0.314 | 0.313 | 0.448 | 0.405 | 0.402 | 0.310 | 0.512 | 0.522 | 0.384 | 0.367 | |||
| BS | 0.013 | 0.165 | 0.287 | 0.254 | 0.277 | 0.014 | 0.290 | 0.275 | 0.261 | 0.254 | 0.021 | 0.262 | 0.251 | 0.262 | 0.244 | ||||
| sen | 0.784 | 0.810 | 0.621 | 0.722 | 0.661 | 0.706 | 0.798 | 0.703 | 0.761 | 0.760 | 0.704 | 0.558 | 0.506 | 0.714 | 0.791 | ||||
| 10 | Err | 0.308 | 0.261 | 0.514 | 0.360 | 0.462 | 0.276 | 0.297 | 0.501 | 0.390 | 0.396 | 0.272 | 0.522 | 0.484 | 0.351 | 0.456 | |||
| BS | 0.013 | 0.168 | 0.278 | 0.225 | 0.266 | 0.202 | 0.304 | 0.251 | 0.254 | 0.022 | 0.261 | 0.260 | 0.237 | 0.260 | |||||
| sen | 0.757 | 0.818 | 0.613 | 0.709 | 0.654 | 0.786 | 0.819 | 0.677 | 0.754 | 0.764 | 0.750 | 0.575 | 0.412 | 0.704 | 0.743 | ||||
| 15 | Err | 0.323 | 0.202 | 0.519 | 0.337 | 0.414 | 0.297 | 0.241 | 0.514 | 0.391 | 0.401 | 0.262 | 0.515 | 0.462 | 0.350 | 0.427 | |||
| BS | 0.013 | 0.145 | 0.275 | 0.222 | 0.246 | 0.012 | 0.182 | 0.324 | 0.255 | 0.256 | 0.046 | 0.262 | 0.260 | 0.235 | 0.255 | ||||
| sen | 0.709 | 0.848 | 0.643 | 0.767 | 0.719 | 0.781 | 0.810 | 0.685 | 0.768 | 0.763 | 0.741 | 0.564 | 0.354 | 0.742 | 0.798 | ||||
| 20 | Err | 0.290 | 0.199 | 0.481 | 0.377 | 0.468 | 0.279 | 0.257 | 0.473 | 0.408 | 0.395 | 0.225 | 0.542 | 0.409 | 0.386 | 0.474 | |||
| BS | 0.014 | 0.155 | 0.265 | 0.234 | 0.258 | 0.014 | 0.188 | 0.284 | 0.260 | 0.256 | 0.041 | 0.259 | 0.254 | 0.251 | 0.265 | ||||
| sen | 0.767 | 0.851 | 0.694 | 0.717 | 0.686 | 0.794 | 0.815 | 0.734 | 0.745 | 0.763 | 0.793 | 0.526 | 0.390 | 0.673 | 0.781 | ||||
| 25 | Err | 0.300 | 0.223 | 0.495 | 0.366 | 0.473 | 0.256 | 0.271 | 0.462 | 0.404 | 0.393 | 0.250 | 0.223 | 0.523 | 0.406 | 0.377 | 0.427 | ||
| BS | 0.012 | 0.156 | 0.270 | 0.229 | 0.265 | 0.012 | 0.178 | 0.279 | 0.260 | 0.251 | 0.033 | 0.264 | 0.254 | 0.246 | 0.259 | ||||
| sen | 0.777 | 0.832 | 0.694 | 0.726 | 0.659 | 0.801 | 0.829 | 0.693 | 0.753 | 0.759 | 0.790 | 0.567 | 0.397 | 0.691 | 0.790 | ||||
| 30 | Err | 0.268 | 0.242 | 0.411 | 0.350 | 0.454 | 0.261 | 0.258 | 0.436 | 0.387 | 0.394 | 0.249 | 0.455 | 0.418 | 0.351 | 0.457 | |||
| BS | 0.009 | 0.158 | 0.248 | 0.222 | 0.261 | 0.011 | 0.182 | 0.281 | 0.250 | 0.253 | 0.031 | 0.260 | 0.253 | 0.236 | 0.262 | ||||
| sen | 0.823 | 0.836 | 0.733 | 0.736 | 0.694 | 0.813 | 0.818 | 0.708 | 0.776 | 0.767 | 0.787 | 0.661 | 0.365 | 0.698 | 0.767 | ||||
Classification error rate, sensitivity and Brier score produced by Random Forest, k-Nearest Neighbors and Support Vector Machine classifiers on srbct dataset based on genes selected by the given methods. The best result is shown in bold.
| RF | kNN | SVM | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genes | POS | RPOS | GClust | sigF | Wilc | mRmR | POS | RPOS | GClust | sigF | Wilc | mRmR | POS | RPOS | GClust | sigF | Wilc | mRmR | |
| 5 | Err | 0.048 | 0.096 | 0.040 | 0.021 | 0.390 | 0.078 | 0.034 | 0.100 | 0.074 | 0.078 | 0.086 | 0.021 | 0.035 | 0.328 | 0.412 | |||
| BS | 0.005 | 0.096 | 0.029 | 0.023 | 0.236 | 0.007 | 0.057 | 0.002 | 0.071 | 0.074 | 0.037 | 0.028 | 0.011 | 0.217 | 0.255 | ||||
| sen | 0.919 | 0.961 | 0.980 | 0.978 | 0.549 | 0.718 | 0.915 | 0.914 | 0.878 | 0.942 | 0.984 | 0.608 | 0.574 | ||||||
| 10 | Err | 0.018 | 0.021 | 0.027 | 0.035 | 0.086 | 0.039 | 0.038 | 0.055 | 0.071 | 0.069 | 0.016 | 0.011 | 0.029 | 0.204 | 0.143 | |||
| BS | 0.003 | 0.029 | 0.027 | 0.022 | 0.089 | 0.004 | 0.002 | 0.041 | 0.076 | 0.071 | 0.016 | 0.031 | 0.013 | 0.138 | 0.093 | ||||
| sen | 0.991 | 0.957 | 0.981 | 0.977 | 0.879 | 0.852 | 0.925 | 0.918 | 0.992 | 0.995 | 0.943 | 0.766 | 0.785 | ||||||
| 15 | Err | 0.004 | 0.014 | 0.016 | 0.013 | 0.165 | 0.039 | 0.035 | 0.075 | 0.074 | 0.071 | 0.004 | 0.004 | 0.015 | 0.188 | 0.182 | |||
| BS | 0.028 | 0.021 | 0.024 | 0.142 | 0.002 | 0.002 | 0.047 | 0.071 | 0.073 | 0.005 | 0.015 | 0.010 | 0.118 | 0.129 | |||||
| sen | 0.995 | 0.991 | 0.956 | 0.977 | 0.805 | 0.807 | 0.927 | 0.910 | 0.995 | 0.962 | 0.756 | 0.764 | |||||||
| 20 | Err | 0.009 | 0.016 | 0.010 | 0.009 | 0.081 | 0.036 | 0.036 | 0.053 | 0.066 | 0.071 | 0.011 | 0.003 | 0.020 | 0.144 | 0.130 | |||
| BS | 0.029 | 0.021 | 0.023 | 0.088 | 0.002 | 0.002 | 0.041 | 0.069 | 0.072 | 0.007 | 0.019 | 0.010 | 0.098 | 0.082 | |||||
| sen | 0.987 | 0.990 | 0.956 | 0.986 | 0.875 | 0.895 | 0.919 | 0.911 | 0.997 | 0.999 | 0.986 | 0.797 | 0.816 | ||||||
| 25 | Err | 0.009 | 0.017 | 0.011 | 0.009 | 0.067 | 0.038 | 0.020 | 0.060 | 0.066 | 0.074 | 0.011 | 0.008 | 0.030 | 0.134 | 0.098 | |||
| BS | 0.031 | 0.021 | 0.024 | 0.084 | 0.002 | 0.039 | 0.071 | 0.072 | 0.006 | 0.023 | 0.008 | 0.087 | 0.067 | ||||||
| sen | 0.992 | 0.997 | 0.956 | 0.987 | 0.881 | 0.870 | 0.923 | 0.915 | 0.999 | 0.997 | 0.977 | 0.826 | 0.885 | ||||||
| 30 | Err | 0.023 | 0.007 | 0.005 | 0.075 | 0.034 | 0.002 | 0.047 | 0.064 | 0.065 | 0.009 | 0.014 | 0.018 | 0.131 | 0.129 | ||||
| BS | 0.029 | 0.022 | 0.024 | 0.094 | 0.002 | 0.040 | 0.069 | 0.070 | 0.006 | 0.017 | 0.006 | 0.087 | 0.090 | ||||||
| sen | 0.992 | 0.997 | 0.957 | 0.994 | 0.883 | 0.866 | 0.914 | 0.924 | 0.998 | 0.999 | 0.951 | 0.828 | 0.855 | ||||||
Figure 3. Boxplots of classification error rates for 20 selected genes for the datasets: (A) TumorC, (B) Breast, (C) srbct, (D) DLBCL, (E) Prostate, (F) nki, (G) Lung, (H) GSE4045, (I) Colon and (J) Leukaemia.
Figure 4. Classification error rates of the methods for different numbers of genes for the datasets: (A) TumorC, (B) Breast, (C) srbct, (D) DLBCL, (E) Lung and (F) Leukaemia.
Figure 5. Brier scores of the methods for different numbers of genes for the datasets: (A) TumorC, (B) Breast, (C) srbct, (D) DLBCL, (E) Lung and (F) Leukaemia.
Figure 6. Sensitivity of the methods for different numbers of genes for the datasets: (A) TumorC, (B) Breast, (C) srbct, (D) DLBCL, (E) Lung and (F) Leukaemia.