Juanying Xie, Mingzhao Wang, Shengquan Xu, Zhao Huang, Philip W Grant.
Abstract
Genomic data are difficult to analyze because they have tens of thousands of dimensions, a small number of examples, and unbalanced examples between classes. To tackle these challenges, this paper proposes an unsupervised feature selection technique based on standard deviation and cosine similarity, referred to as SCFS (Standard deviation and Cosine similarity based Feature Selection). SCFS defines the discernibility of a feature to measure its capability to distinguish between classes, and the independence of a feature to measure its redundancy with respect to other features. All features are represented in a 2-dimensional space with discernibility as the x-axis and independence as the y-axis, so that the features in the upper right corner have both comparatively high discernibility and high independence. The importance of a feature is defined as the product of its discernibility and its independence (i.e., the area of the rectangle enclosed by the feature's coordinate lines and the axes). The upper right corner features are by far the most important and comprise the optimal feature subset. Three feature selection algorithms are derived from SCFS, based on different definitions of independence using cosine similarity: SCEFS (Standard deviation and Exponent Cosine similarity based Feature Selection), SCRFS (Standard deviation and Reciprocal Cosine similarity based Feature Selection) and SCAFS (Standard deviation and Anti-Cosine similarity based Feature Selection). KNN and SVM classifiers are built on the optimal feature subsets detected by each of these algorithms. The experimental results on 18 genomic datasets of cancers demonstrate that the proposed unsupervised feature selection algorithms SCEFS, SCRFS and SCAFS can detect stable biomarkers with strong classification capability, which supports the effectiveness of the proposed idea. The functional analysis of these biomarkers shows that the occurrence of cancer is closely related to the regulation level of the biomarker genes. This finding will benefit cancer pathology research, drug development, early diagnosis, treatment and prevention.
Keywords: 2-dimensional space; cosine similarity; gene selection; standard deviation; unsupervised feature selection
Year: 2021 PMID: 34054930 PMCID: PMC8155687 DOI: 10.3389/fgene.2021.684100
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
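The scoring idea described in the abstract (discernibility from standard deviation, independence from cosine similarity, importance as their product) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact formulas, so the function name `scfs_scores`, the variant labels, the use of the population standard deviation, and in particular the rule that a feature's independence is its smallest dissimilarity to any feature with higher discernibility are all assumptions made for this example.

```python
import numpy as np

def scfs_scores(X, variant="exp", eps=1e-12):
    """Hypothetical SCFS-style scoring. X has shape (samples, features); needs >= 2 features.

    Returns (discernibility, independence, score), one value per feature.
    """
    X = np.asarray(X, dtype=float)
    # Discernibility: per-feature standard deviation (population form, an assumption).
    dis = X.std(axis=0)

    # Absolute cosine similarity between every pair of feature columns.
    norms = np.linalg.norm(X, axis=0)
    cos = np.abs(X.T @ X) / (np.outer(norms, norms) + eps)
    cos = np.clip(cos, 0.0, 1.0)

    # Turn similarity into a dissimilarity; the three forms loosely mirror the
    # exponent / reciprocal / anti-cosine naming in the abstract (assumed formulas).
    if variant == "exp":
        dissim = np.exp(-cos)
    elif variant == "reciprocal":
        dissim = 1.0 / (cos + eps)
    elif variant == "arccos":
        dissim = np.arccos(cos)
    else:
        raise ValueError(f"unknown variant: {variant}")

    n = X.shape[1]
    ind = np.empty(n)
    for i in range(n):
        higher = dis > dis[i]
        if higher.any():
            # Assumed rule: smallest dissimilarity to any more discernible feature.
            ind[i] = dissim[i, higher].min()
        else:
            # Most discernible feature: largest dissimilarity to any other feature.
            ind[i] = dissim[i, np.arange(n) != i].max()
    return dis, ind, dis * ind

def top_k_features(X, k, variant="exp"):
    """Indices of the k features with the largest discernibility * independence."""
    _, _, score = scfs_scores(X, variant)
    return np.argsort(-score)[:k]
```

Under these assumptions, a feature that is nearly collinear with a more discernible feature receives a low independence value and therefore a small rectangle area, even when its own standard deviation is large.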
FIGURE 1. The toy case to test SCFS: (A) discernibility is the x-coordinate and independence is the y-coordinate; (B) the number of features is the x-axis and the feature score is the y-axis.
Descriptions of datasets.
| Dataset name | Ng (genes) | Ns (samples) | Nc (classes) | Source |
| Colon | 2000 | 62 | 2 | |
| Leukemia | 7129 | 72 | 2 | |
| CNS | 7129 | 90 | 2 | |
| CNS2 | 7129 | 60 | 2 | |
| DLBCL | 7129 | 77 | 2 | |
| Lymphoma | 4026 | 45 | 2 | |
| Carcinoma | 7457 | 36 | 2 | |
| SRBCT | 2308 | 83 | 4 | |
| ALL1 | 12625 | 128 | 2 | |
| ALL4 | 12625 | 93 | 2 | |
| Lung cancer | 12600 | 203 | 5 | |
| Prostate1 | 12625 | 102 | 2 | |
| Prostate2 | 12558 | 108 | 3 | |
| 11_Tumors | 12533 | 174 | 11 | |
| Leukemia_MLL | 12582 | 72 | 3 | |
| Gastric | 22645 | 65 | 2 | |
| Gastric1 | 22283 | 144 | 2 | |
| Non-small lung cancer | 54675 | 58 | 2 | |
FIGURE 2. The flow chart of the experiments in this paper.
FIGURE 3. Features displayed in the 2-dimensional space of each algorithm on the Colon dataset: (A) and (G) SCEFS, (B) and (H) SCRFS, (C) and (I) SCAFS, (D) and (J) EDPFS, (E) and (K) RDPFS, (F) and (L) DGFS.
Performance comparison of KNN and SVM classifiers on Colon dataset by our algorithms and unsupervised feature selection algorithms based on density peaks.
| Algorithms | KNN Acc | KNN AUC | KNN F2 | KNN Sen | KNN Spe | SVM Acc | SVM AUC | SVM F2 | SVM Sen | SVM Spe | Feature numbers |
| SCEFS | 0.878 | 0.930 | 0.673 | 0.786 | 0.823 | 0.691 | 0.945 | 0.493 | 4 | ||
| 0.824 | 0.768 | 0.610 | 0.795 | 0.814 | 0.716 | 0.940 | 0.527 | 10 | |||
| SCRFS | 0.821 | 0.873 | 0.809 | 0.910 | 0.660 | 0.757 | 0.784 | 0.539 | 0.970 | 0.363 | 3 |
| 0.814 | 0.897 | 0.753 | 0.880 | 0.617 | 0.803 | 0.832 | 0.734 | 0.955 | 0.527 | 5 | |
| SCAFS | 0.794 | 0.855 | 0.760 | 0.890 | 0.617 | 0.761 | 0.800 | 0.594 | 0.955 | 0.403 | 3 |
| 0.834 | 0.894 | 0.809 | 0.920 | 0.805 | 0.827 | 0.745 | 0.940 | 0.560 | 8 | ||
| EDPFS | 0.614 | 0.779 | 0.246 | 0.850 | 0.180 | 0.648 | 0.716 | 0 | 0 | 2 | |
| 0.674 | 0.799 | 0.487 | 0.835 | 0.380 | 0.647 | 0.772 | 0.016 | 0.995 | 0.007 | 3 | |
| 0.811 | 0.886 | 0.788 | 0.890 | 0.670 | 0.853 | 0.935 | 10 | ||||
| 0.789 | 0.887 | 0.736 | 0.895 | 0.597 | 0.812 | 0.764 | 0.930 | 0.603 | 12 | ||
| RDPFS | 0.647 | 0.780 | 0.288 | 0.900 | 0.180 | 0.648 | 0.776 | 0 | 0 | 1 | |
| 0.647 | 0.780 | 0.288 | 0.900 | 0.180 | 0.644 | 0.776 | 0 | 0.995 | 0 | 2 | |
| 0.740 | 0.850 | 0.661 | 0.825 | 0.580 | 0.691 | 0.842 | 0.272 | 0.975 | 0.163 | 6 | |
| DGFS | 0.551 | 0.698 | 0.164 | 0.790 | 0.11 | 0.648 | 0.738 | 0 | 0 | 1 | |
| 0.628 | 0.803 | 0.361 | 0.805 | 0.297 | 0.648 | 0.690 | 0 | 0 | 4 | ||
| 0.601 | 0.763 | 0.346 | 0.765 | 0.303 | 0.648 | 0.743 | 0 | 0 | 9 | ||
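The Acc, F2, Sen and Spe columns in the table above can all be derived from a binary confusion matrix. The sketch below assumes the standard definitions and takes F2 to be the F-beta score with beta = 2; the source text here does not define the metrics or state which class is treated as positive, so the function name `binary_metrics` and these conventions are assumptions for illustration only.

```python
def binary_metrics(tp, fp, tn, fn, beta=2.0):
    """Accuracy, sensitivity, specificity and F-beta from confusion-matrix counts.

    Assumes the usual textbook definitions; any ratio with a zero denominator
    is reported as 0.0.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    sen = tp / (tp + fn) if (tp + fn) else 0.0      # recall on the positive class
    spe = tn / (tn + fp) if (tn + fp) else 0.0      # recall on the negative class
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    b2 = beta * beta
    denom = b2 * prec + sen
    # F-beta weights sensitivity beta times as heavily as precision.
    f_beta = (1 + b2) * prec * sen / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "Spe": spe, "F2": f_beta}
```

With these definitions, a classifier that never predicts the positive class gets Sen = 0 and F2 = 0 regardless of its accuracy, which is why accuracy alone can be misleading on unbalanced data.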
FIGURE 4. Features displayed in the 2-dimensional space of each algorithm on the Leukemia_MLL dataset: (A) and (G) SCEFS, (B) and (H) SCRFS, (C) and (I) SCAFS, (D) and (J) EDPFS, (E) and (K) RDPFS, (F) and (L) DGFS.
Performance comparison of KNN and SVM classifiers on Leukemia_MLL dataset by our algorithms and unsupervised feature selection algorithms based on density peaks.
| Algorithms | KNN Acc | KNN MAUC | KNN F2 | KNN Sen | KNN Spe | SVM Acc | SVM MAUC | SVM F2 | SVM Sen | SVM Spe | Feature numbers |
| SCEFS | 0.642 | 0.841 | 0.539 | 0.672 | 0.719 | 0.624 | 0.883 | 0.397 | 0.564 | 0.637 | 1 |
| 0.800 | 0.945 | 0.881 | 0.882 | 0.946 | 0.891 | 0.966 | 0.900 | 0.882 | 0.961 | 7 | |
| 0.803 | 0.920 | 10 | |||||||||
| SCRFS | 0.466 | 0.764 | 0.424 | 0.591 | 0.606 | 0.410 | 0.773 | 0.058 | 0.128 | 0.634 | 2 |
| 0.745 | 0.919 | 0.774 | 0.883 | 0.789 | 0.723 | 0.926 | 0.639 | 0.790 | 0.763 | 10 | |
| 0.721 | 0.919 | 0.757 | 0.901 | 0.754 | 0.752 | 0.949 | 0.764 | 0.914 | 0.769 | 14 | |
| SCAFS | 0.719 | 0.896 | 0.684 | 0.798 | 0.779 | 0.719 | 0.918 | 0.595 | 0.739 | 0.780 | 4 |
| 0.948 | 0.895 | 0.911 | 0.907 | 0.875 | 0.976 | 0.917 | 0.927 | 20 | |||
| EDPFS | 0.416 | 0.774 | 0.311 | 0.561 | 0.479 | 0.388 | 0.730 | 0 | 0 | 0.667 | 1 |
| 0.477 | 0.765 | 0.422 | 0.644 | 0.549 | 0.388 | 0.713 | 0 | 0 | 0.667 | 2 | |
| 0.565 | 0.803 | 0.504 | 0.670 | 0.663 | 0.538 | 0.795 | 0.293 | 0.516 | 0.631 | 5 | |
| RDPFS | 0.416 | 0.774 | 0.311 | 0.561 | 0.479 | 0.388 | 0.730 | 0 | 0 | 0.667 | 1 |
| 0.477 | 0.765 | 0.422 | 0.644 | 0.549 | 0.388 | 0.713 | 0 | 0 | 0.667 | 2 | |
| DGFS | 0.424 | 0.761 | 0.350 | 0.566 | 0.508 | 0.412 | 0.758 | 0.055 | 0.106 | 0.656 | 1 |
| 0.606 | 0.846 | 0.528 | 0.670 | 0.702 | 0.606 | 0.828 | 0.285 | 0.596 | 0.632 | 5 | |
| 0.670 | 0.860 | 0.658 | 0.794 | 0.738 | 0.665 | 0.868 | 0.429 | 0.690 | 0.672 | 11 | |
FIGURE 5. The average accuracy (Acc) and F2 of all algorithms on three datasets using the KNN classifier: (A) and (D) Leukemia, (B) and (E) ALL1, (C) and (F) Non-small lung cancer.
FIGURE 6. The maximal mean Acc comparison of each algorithm on 18 datasets using the KNN classifier.
The comparison between the proposed algorithms and the other algorithms in terms of win/draw/loss based on the maximal mean Acc.
| Algorithms | SCEFS | SCRFS | SCAFS | EDPFS | RDPFS | DGFS | MCFS | Laplacian | RUFS | NDFS | UDFS |
| SCEFS | 0/18/0 | 7/1/10 | |||||||||
| SCRFS | 8/1/9 | 0/18/0 | 6/1/11 | ||||||||
| SCAFS | 0/18/0 |
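Each win/draw/loss entry in the table above summarizes a pairwise comparison over the 18 datasets, which is why an algorithm compared with itself yields 0/18/0. A minimal sketch of how such an entry could be computed is given below; the function name `win_draw_loss` and the tolerance used to decide a draw are assumptions, not details taken from the source.

```python
def win_draw_loss(scores_a, scores_b, tol=1e-9):
    """Count datasets where algorithm A beats, ties with, or loses to algorithm B.

    scores_a and scores_b hold one score per dataset (e.g. maximal mean Acc),
    in the same dataset order. Returns a "win/draw/loss" string from A's view.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both algorithms need one score per dataset")
    wins = draws = losses = 0
    for a, b in zip(scores_a, scores_b):
        if abs(a - b) <= tol:
            draws += 1
        elif a > b:
            wins += 1
        else:
            losses += 1
    return f"{wins}/{draws}/{losses}"
```

For example, `win_draw_loss([0.9, 0.8, 0.7], [0.85, 0.8, 0.75])` gives one win, one draw and one loss over three hypothetical datasets.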
FIGURE 7. The maximal mean F2 comparison of each algorithm on 18 datasets using the KNN classifier.
The comparison between the proposed algorithms and the other algorithms in terms of win/draw/loss based on the maximal mean F2.
| Algorithms | SCEFS | SCRFS | SCAFS | EDPFS | RDPFS | DGFS | MCFS | Laplacian | RUFS | NDFS | UDFS |
| SCEFS | 0/18/0 | 6/2/10 | |||||||||
| SCRFS | 8/1/9 | 0/18/0 | 8/1/9 | ||||||||
| SCAFS | 0/18/0 |
FIGURE 8. Comparison of unsupervised feature selection algorithms against each other on maximal mean Acc and F2 with Nemenyi's test: (A) Acc, (B) F2.
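A Nemenyi comparison of this kind is usually based on the average rank of each algorithm across datasets and a critical difference, CD = q_alpha * sqrt(k(k+1)/(6N)), where k is the number of algorithms and N the number of datasets (here 11 and 18). The sketch below illustrates that calculation under the usual conventions (rank 1 is best, ties share the average rank); the function names are hypothetical, and `q_alpha` must be looked up in a studentized-range-based table for the chosen significance level, since the source text here does not state the value used.

```python
import math

def average_ranks(score_rows):
    """Average rank of each algorithm over all datasets.

    score_rows: one list per dataset with one score per algorithm (higher is better).
    Rank 1 is the best score; tied scores share the mean of their ranks.
    """
    k = len(score_rows[0])
    totals = [0.0] * k
    for row in score_rows:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            # Extend the group while the next score equals the current one.
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(score_rows) for t in totals]

def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference for k algorithms compared over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
```

Two algorithms are then considered significantly different when their average ranks differ by more than the critical difference.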
Runtime of each unsupervised feature selection algorithm on five datasets (in seconds).
| Datasets | SCEFS | SCRFS | SCAFS | EDPFS | RDPFS | DGFS | MCFS | Laplacian | RUFS | NDFS | UDFS |
| SRBCT | 0.335 ± 0.53 | 0.244 ± 0.17 | 0.223 ± 0.10 | 0.590 ± 0.12 | 0.582 ± 0.13 | 0.413 ± 0.08 | 0.956 ± 0.75 | 0.027 ± 0.02 | 11.15 ± 5.50 | 13.56 ± 3.89 | 37.29 ± 8.92 |
| CNS | 1.617 ± 0.36 | 1.687 ± 0.53 | 1.611 ± 0.37 | 6.182 ± 0.95 | 6.220 ± 0.98 | 4.789 ± 0.79 | 2.975 ± 1.76 | 0.073 ± 0.02 | 17.85 ± 7.49 | 240.71 ± 22.15 | 1003.2 ± 43.63 |
| Prostate2 | 5.478 ± 1.84 | 5.480 ± 1.43 | 5.900 ± 2.07 | 26.05 ± 9.28 | 24.19 ± 7.54 | 16.32 ± 4.76 | 2.871 ± 0.99 | 0.174 ± 0.05 | 43.10 ± 16.77 | 1851.4 ± 264.9 | – |
| Gastric | 12.49 ± 0.63 | 12.68 ± 0.95 | 12.63 ± 0.98 | 51.63 ± 4.26 | 51.71 ± 4.63 | 37.85 ± 2.82 | 2.412 ± 0.34 | 0.115 ± 0.02 | 40.71 ± 6.47 | 6957.0 ± 461.5 | – |
| NLC | 96.08 ± 9.59 | 95.50 ± 10.17 | 98.18 ± 12.83 | 756.97 ± 33.47 | 753.75 ± 40.19 | 353.86 ± 36.60 | 8.616 ± 1.59 | 0.350 ± 0.08 | 391.18 ± 28.30 | – | – |
The gene biomarkers of Prostate2 and Non-small lung cancer selected by our algorithms.
| Datasets | Algorithms | Gene biomarkers |
| Prostate2 | SCEFS | FOS, DNALI1, VWA5A, BTRC, PMF1-BGLAP, MGAT4C, KAT5, IER2, TRAF6, CYP27A1, CSPG4, MET, TIGR: HG3999-HT4269, LOC100289561, CDKN3, AP2B1, TK2, MSMB, TTPA, YME1L1, B3GALT2 |
| | SCRFS | SEMG1, ALB, TNNT1, CRP, MYL1, CTNNB1, FGB, TNNC1, ACTA1, MYH7, MYLPF, CST4, FGG, HP, APOA1, DDN, MYL3, TPM2, FGA, SEMG2, NEB, SLN, APOC3, PCK1, ENO3, APOC4-APOC2 |
| | SCAFS | CDKN3, FOS, CYP27A1, SSX2B, VWA5A, TTN, TGM4, CCL19, HPGD, CSPG4, AR, MSMB, TNNT1, MYL1, HDAC9, TNNI1, ALOX15B, PMF1-BGLAP, ACTA1, COL2A1, ACTC1, SERPINB5, PEG10, HBB |
| Non-small lung cancer | SCEFS | KRT5, SPRR1B, DSG3, DSC3, NTS, MAGEA6, MAGEA9B, XIST, SERPINB13, SPRR3, CLCA2, SPRR1A, MAGEA6, MAGEA10-MAGEA5 |
| | SCRFS | GP2, RHOXF1, REG4, ACTN2, NCAN, PRL, REG1B, CYP2F1, FGF3, REG4, RHOXF2B, DEFA5, FRG2EP, GFI1B, BPIFB4, MUC6, EREG |
| | SCAFS | DSG3, NTS, XIST, SERPINB13, DSC3, SPRR1B, MAGEA9B, CLCA2, LIN28B, MAGEC2, SPRR3 |