Da Xu, Jialin Zhang, Hanxiao Xu, Yusen Zhang, Wei Chen, Rui Gao, Matthias Dehmer.
Abstract
BACKGROUND: Small sample sizes and the curse of dimensionality hamper the application of deep learning techniques to disease classification. In addition, the performance of clustering-based feature selection algorithms remains unsatisfactory because of the limitations of the unsupervised learning methods they rely on. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets. Some current feature selection methods suffer from low sensitivity and specificity in this field.
Keywords: Biomarker; Classification; Clustering; Feature selection; Machine learning; Therapeutic target
Year: 2020 PMID: 32962626 PMCID: PMC7510277 DOI: 10.1186/s12864-020-07038-3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary of ten gene expression data sets
| Types | Data sets | Samples | Genes | Classes | References |
|---|---|---|---|---|---|
| Two-class cancer data sets | AMLALL | 72 | 7129 | 2 | [ |
| Two-class cancer data sets | DLBCL | 77 | 7129 | 2 | [ |
| Two-class cancer data sets | Gastric cancer | 40 | 1519 | 2 | [ |
| Two-class cancer data sets | Colon Cancer | 62 | 2000 | 2 | [ |
| Multi-class cancer data sets | Lymphoma | 62 | 4026 | 3 | [ |
| Multi-class cancer data sets | SRBCT | 83 | 2308 | 4 | [ |
| Multi-class cancer data sets | Brain-Tumor1 | 90 | 5920 | 5 | [ |
| Multi-class cancer data sets | Lung-Cancer | 203 | 12,600 | 5 | [ |
| Single-cell data sets | Pollen | 249 | 14,805 | 11 | [ |
| Single-cell data sets | Usoskin | 622 | 17,772 | 4 | [ |
Fig. 1. The relationship between the average classification error rates and the number of selected genes
Fig. 2. Comparison results of the multi-scale distance method and the single distance method. a The average results of four methods on four two-class cancer data sets. b The average results of four methods on four multi-class cancer data sets
Fig. 3. Comparison results of MCBFS and seven benchmark feature selection methods. a The average performance on four two-class data sets of different feature selection methods with the SVM classifier. b The average performance on four two-class data sets of different feature selection methods with the kNN classifier. c The average performance on four multi-class data sets of different feature selection methods with the SVM classifier. d The average performance on four multi-class data sets of different feature selection methods with the kNN classifier
Fig. 4. Comparison results of MCBFS and six state-of-the-art feature selection methods. a The average performance on four two-class data sets with the SVM classifier. b The average performance on four two-class data sets with the kNN classifier. c The average performance on four multi-class data sets with the SVM classifier. d The average performance on four multi-class data sets with the kNN classifier
Fig. 5. The sample distributions of four data sets, as described by PCA. a PCA results using all genes. b PCA results using the top 100 genes
Fig. 6. The t-test results of genes. a The t-test results of the top 200 genes selected by MCBFS in GSE10072. b The t-test results of the top 200 genes selected by MCBFS in GSE7670
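The caption of Fig. 6 refers to per-gene t-tests on the selected genes. As a rough illustration only, the sketch below scores each gene with a two-sample t statistic between two classes and ranks genes by its absolute value. The record does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here. The function names `welch_t` and `rank_genes_by_t` are hypothetical, and the toy expression values are made up rather than taken from GSE10072 or GSE7670.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_genes_by_t(expr, labels, top_k):
    """Rank genes by |t| between class 0 and class 1.

    expr:   dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels aligned with the samples
    Returns the top_k gene names, largest |t| first.
    """
    scores = {}
    for gene, values in expr.items():
        group0 = [v for v, y in zip(values, labels) if y == 0]
        group1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(group0, group1))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data, invented for illustration (three samples per class).
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],  # clearly separated classes
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no difference in means
    "geneC": [5.0, 4.0, 6.0, 6.5, 5.5, 7.0],  # weak difference
}
top = rank_genes_by_t(expr, labels, top_k=2)
print(top)
```

On real expression matrices, a library routine that also returns p-values would normally replace the hand-written statistic. The pure-Python version is used here only to keep the sketch self-contained.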
Summary of ten hub informative genes
| Gene name | Protein name |  | Reference |
|---|---|---|---|
| TEK | Angiopoietin-1 receptor | 8.90e-10 | [ |
| ANGPT1 | Angiopoietin-1 | 4.30e-05 | [ |
| CAV1 | Caveolin-1 | 4.90e-05 | [ |
| SPP1 | Osteopontin (secreted phosphoprotein 1) | 0.0015 | [ |
| CDH5 | Cadherin-5 | 0.0034 | [ |
| PECAM1 | Platelet endothelial cell adhesion molecule | 0.0036 | [ |
| CLDN5 | Claudin-5 | 0.045 | [ |
| AGTR1 | Type-1 angiotensin II receptor | 0.054 | [ |
| GJA4 | Gap junction alpha-4 protein | 0.13 | [ |
| FABP4 | Fatty acid-binding protein | 0.25 | [ |
Fig. 7. a The cluster heat map of the expression of the 10 hub genes. b Genetic alterations network of hub genes
Fig. 8. a The classification performance of typical combinations of the key genes. b Expression levels of two genes, SPP1 and CDH5, in 70 LUAD tissue samples
Fig. 9. The integrative drug-target network
Fig. 10. The flowchart of McbfsNW. a The workflow of the MCBFS algorithm. b The iterative process of the MCBFS algorithm. c The network analysis and wrapper of McbfsNW