Subrata Saha, Ahmed Soliman, Sanguthevar Rajasekaran.
Abstract
BACKGROUND: Gene expression data annotated with phenotypes are growing explosively. Such data make it possible to accurately identify the genes responsible for a given medical condition and to classify them as drug targets. Like other phenotype data in the medical domain, gene expression data with phenotypes form a highly underdetermined system. In domains with a very large number of features but a very small sample size (e.g. DNA microarray, RNA-seq, and GWAS data), several contrasting feature subsets are often reported to yield nearly equally optimal results. This phenomenon is known as instability. With these facts in mind, we have developed a robust and stable supervised gene selection algorithm that selects a set of robust and stable genes with better prediction ability from gene expression datasets with phenotypes. Stability and robustness are ensured by class-level and instance-level perturbations, respectively.
Keywords: Gain ratio (GR); Linear Support Vector Machine (LSVM); Robust and Stable Gene Selection Algorithm (RSGSA); Support vector machine-recursive feature elimination (SVM-RFE); Symmetric Uncertainty (SU)
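Two of the baseline filters named in the keywords, gain ratio (GR) and symmetric uncertainty (SU), are entropy-based scores of how much a feature tells us about the class label. The sketch below uses their standard textbook definitions, GR = IG / H(feature) and SU = 2·IG / (H(feature) + H(label)). It assumes each gene's expression values have already been discretized (for example into "hi" and "lo"); the function names and the toy data are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature)."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature."""
    h_feature = entropy(feature)
    return information_gain(feature, labels) / h_feature if h_feature > 0 else 0.0


def symmetric_uncertainty(feature, labels):
    """2 * IG / (H(feature) + H(labels)), which lies in [0, 1]."""
    denom = entropy(feature) + entropy(labels)
    return 2 * information_gain(feature, labels) / denom if denom > 0 else 0.0


# Toy, hypothetical data: one discretized gene that tracks the class
# perfectly and one that carries no information about it.
labels = [1, 1, 0, 0]
informative_gene = ["hi", "hi", "lo", "lo"]
uninformative_gene = ["hi", "lo", "hi", "lo"]
```

Ranking genes by either score and keeping the top k gives a simple filter-style selector, which is the role SU and GR play as baselines in the tables that follow.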
Year: 2021 PMID: 34753514 PMCID: PMC8579680 DOI: 10.1186/s40246-021-00366-9
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
Datasets used for gene selection
| Dataset | Name | # of Genes | # of Instances | # of classes |
|---|---|---|---|---|
| D1 | Colon tumor | 2000 | 60 | 2 |
| D2 | Central nervous system | 7129 | 60 | 2 |
| D3 | ALL-AML | 7129 | 72 | 2 |
| D4 | Breast cancer | 24,481 | 97 | 2 |
| D5 | Ovarian cancer | 15,154 | 253 | 2 |
| D6 | ALL-AML | 7129 | 72 | 3 |
| D7 | ALL-AML | 7129 | 72 | 4 |
| D8 | Lung cancer | 12,533 | 181 | 3 |
| D9 | MLL | 12,581 | 72 | 3 |
| D10 | SRBCT | 2308 | 83 | 4 |
Performance of two-class gene selection
| | | Average Jaccard indices | | | | | | Average informedness | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | Top genes | SU | GR | KLD | RELIEF | SVM-RFE | RSGSA | SU | GR | KLD | RELIEF | SVM-RFE | RSGSA |
| D1 | 50 | 0.18 | 0.21 | 0.05 | 0.29 | 0.22 | 0.63 | 0.62 | 0.52 | 0.67 | 0.75 | ||
| 100 | 0.21 | 0.21 | 0.11 | 0.31 | 0.28 | 0.67 | 0.68 | 0.67 | 0.64 | 0.75 | |||
| 150 | 0.23 | 0.23 | 0.14 | 0.32 | 0.31 | 0.66 | 0.67 | 0.62 | 0.62 | 0.74 | |||
| 200 | 0.25 | 0.25 | 0.18 | 0.33 | 0.34 | 0.66 | 0.68 | 0.61 | 0.62 | 0.74 | |||
| D2 | 50 | 0.04 | 0.06 | 0.02 | 0.05 | 0.14 | 0.14 | 0.06 | 0.05 | 0.33 | 0.56 | ||
| 100 | 0.05 | 0.06 | 0.03 | 0.06 | 0.19 | 0.21 | 0.17 | 0.01 | 0.34 | 0.59 | |||
| 150 | 0.07 | 0.08 | 0.05 | 0.07 | 0.21 | 0.19 | 0.27 | 0.11 | 0.38 | 0.63 | |||
| 200 | 0.08 | 0.08 | 0.07 | 0.08 | 0.24 | 0.18 | 0.33 | 0.16 | 0.36 | 0.68 | |||
| D3 | 50 | 0.39 | 0.44 | 0.05 | 0.24 | 0.32 | 0.92 | 0.90 | 0.81 | 0.93 | 0.96 | ||
| 100 | 0.35 | 0.43 | 0.05 | 0.22 | 0.34 | 0.95 | 0.92 | 0.83 | 0.92 | 0.98 | |||
| 150 | 0.33 | 0.43 | 0.07 | 0.22 | 0.35 | 0.92 | 0.92 | 0.88 | 0.91 | 0.98 | |||
| 200 | 0.31 | 0.42 | 0.08 | 0.22 | 0.37 | 0.93 | 0.93 | 0.88 | 0.94 | 0.97 | |||
| D4 | 50 | 0.04 | 0.09 | 0.01 | 0.03 | 0.12 | 0.38 | 0.23 | 0.26 | 0.45 | 0.47 | ||
| 100 | 0.05 | 0.13 | 0.02 | 0.04 | 0.17 | 0.39 | 0.22 | 0.31 | 0.40 | 0.45 | |||
| 150 | 0.06 | 0.11 | 0.03 | 0.05 | 0.18 | 0.38 | 0.20 | 0.25 | 0.44 | 0.50 | |||
| 200 | 0.07 | 0.11 | 0.04 | 0.06 | 0.20 | 0.40 | 0.25 | 0.29 | 0.43 | 0.52 | |||
| D5 | 50 | 0.78 | 0.66 | 0.61 | 0.56 | 0.65 | 0.99 | 0.99 | 0.99 | 0.99 | |||
| 100 | 0.65 | 0.54 | 0.56 | 0.63 | 0.99 | ||||||||
| 150 | 0.73 | 0.61 | 0.48 | 0.55 | 0.64 | ||||||||
| 200 | 0.74 | 0.60 | 0.44 | 0.56 | 0.65 | 0.99 | |||||||
| Average | 50 | 0.30 | 0.32 | 0.16 | 0.24 | 0.27 | 0.61 | 0.56 | 0.53 | 0.67 | 0.76 | ||
| 100 | 0.28 | 0.30 | 0.15 | 0.27 | 0.31 | 0.64 | 0.60 | 0.56 | 0.66 | 0.78 | |||
| 150 | 0.28 | 0.29 | 0.15 | 0.29 | 0.32 | 0.63 | 0.61 | 0.57 | 0.67 | 0.79 | |||
| 200 | 0.29 | 0.29 | 0.16 | 0.30 | 0.34 | 0.63 | 0.64 | 0.59 | 0.67 | 0.80 | |||
| Average RSGSA gain | 50 | 0.33 | 0.25 | 0.67 | 0.48 | 0.33 | 0.45 | 0.21 | 0.07 | ||||
| 100 | 0.43 | 0.33 | 0.48 | 0.29 | 0.28 | 0.37 | 0.24 | 0.05 | |||||
| 150 | 0.50 | 0.45 | 0.45 | 0.31 | 0.30 | 0.34 | 0.22 | 0.04 | |||||
| 200 | 0.48 | 0.48 | 0.43 | 0.26 | 0.33 | 0.31 | 0.25 | 0.05 | |||||
The best performance metric value among the algorithms on each dataset is highlighted in bold typeface
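The two metrics reported in the table can be sketched as follows. Stability is taken here as the average pairwise Jaccard index between the gene subsets selected on different perturbed versions of the data, and informedness as the usual binary definition, sensitivity + specificity − 1. Both are standard definitions; whether the paper averages over exactly these pairs is an assumption, and the function names and toy inputs are illustrative only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two gene subsets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def average_jaccard(subsets):
    """Mean Jaccard index over all unordered pairs of subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def informedness(y_true, y_pred, positive=1):
    """Binary informedness: true positive rate + true negative rate - 1."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn) + tn / (tn + fp) - 1


# Hypothetical gene subsets selected on three perturbed samples.
runs = [{"g1", "g2", "g3"}, {"g2", "g3", "g4"}, {"g1", "g2", "g3"}]
stability = average_jaccard(runs)  # (0.5 + 1.0 + 0.5) / 3

# Hypothetical predictions from a classifier trained on the selected genes.
score = informedness([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])  # 2/3 + 2/3 - 1
```

A Jaccard index of 1 means every run selected the same genes, and values near 0 mean the selections barely overlap. Informedness ranges from −1 to 1, with 0 corresponding to chance-level prediction.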
Fig. 1 Performance evaluations: (a) average stability for binary class, (b) average accuracy for binary class, (c) average stability for multi-class, (d) average accuracy for multi-class
Fig. 2 Performance evaluations of RSGSA over other notable algorithms: (a) average gain over informedness, (b) average gain over stability
Fig. 3 Classification accuracy of selected genes by employing LSVM, RF, and KNN classifiers for the (a) GR algorithm, (b) SU algorithm, (c) KLD algorithm, (d) RELIEF algorithm, (e) SVM-RFE algorithm, (f) RSGSA algorithm
Performance of multi-class gene selection
| | | Average Jaccard indices | | | | | | Average informedness | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | Top genes | SU | GR | KLD | RELIEF | SVM-RFE | RSGSA | SU | GR | KLD | RELIEF | SVM-RFE | RSGSA |
| D6 | 50 | 0.34 | 0.46 | 0.10 | 0.30 | 0.30 | 0.93 | 0.94 | 0.87 | 0.94 | 0.96 | ||
| 100 | 0.32 | 0.46 | 0.08 | 0.27 | 0.36 | 0.94 | 0.94 | 0.90 | 0.95 | 0.96 | |||
| 150 | 0.31 | 0.43 | 0.08 | 0.27 | 0.36 | 0.94 | 0.94 | 0.90 | 0.96 | 0.96 | |||
| 200 | 0.30 | 0.42 | 0.09 | 0.27 | 0.38 | 0.94 | 0.94 | 0.92 | 0.95 | 0.96 | |||
| D7 | 50 | 0.39 | 0.49 | 0.12 | 0.17 | 0.25 | 0.90 | 0.90 | 0.82 | 0.84 | 0.92 | ||
| 100 | 0.35 | 0.48 | 0.13 | 0.18 | 0.28 | 0.90 | 0.91 | 0.89 | 0.86 | 0.94 | |||
| 150 | 0.34 | 0.45 | 0.13 | 0.18 | 0.31 | 0.90 | 0.91 | 0.90 | 0.87 | 0.95 | |||
| 200 | 0.33 | 0.44 | 0.12 | 0.18 | 0.32 | 0.90 | 0.91 | 0.90 | 0.88 | 0.94 | |||
| D8 | 50 | 0.60 | 0.19 | 0.20 | 0.22 | 0.50 | 0.89 | 0.91 | 0.89 | 0.90 | 0.95 | ||
| 100 | 0.60 | 0.17 | 0.24 | 0.28 | 0.54 | 0.91 | 0.91 | 0.91 | 0.91 | 0.96 | |||
| 150 | 0.60 | 0.17 | 0.26 | 0.31 | 0.57 | 0.92 | 0.91 | 0.93 | 0.93 | 0.96 | |||
| 200 | 0.59 | 0.16 | 0.28 | 0.34 | 0.61 | 0.93 | 0.92 | 0.93 | 0.93 | 0.96 | |||
| D9 | 50 | 0.34 | 0.36 | 0.10 | 0.24 | 0.29 | 0.92 | 0.93 | 0.91 | 0.93 | 0.97 | ||
| 100 | 0.36 | 0.36 | 0.07 | 0.27 | 0.33 | 0.94 | 0.94 | 0.93 | 0.94 | ||||
| 150 | 0.36 | 0.37 | 0.07 | 0.27 | 0.35 | 0.95 | 0.95 | 0.93 | 0.94 | ||||
| 200 | 0.36 | 0.37 | 0.07 | 0.27 | 0.36 | 0.96 | 0.95 | 0.93 | 0.94 | 0.98 | |||
| D10 | 50 | 0.32 | 0.50 | 0.16 | 0.27 | 0.42 | 0.99 | 0.98 | 0.98 | 0.99 | |||
| 100 | 0.34 | 0.49 | 0.16 | 0.31 | 0.47 | 0.99 | 0.99 | ||||||
| 150 | 0.36 | 0.49 | 0.18 | 0.34 | 0.48 | ||||||||
| 200 | 0.38 | 0.49 | 0.19 | 0.36 | 0.51 | ||||||||
| Average | 50 | 0.40 | 0.48 | 0.13 | 0.24 | 0.30 | 0.93 | 0.94 | 0.89 | 0.92 | 0.96 | ||
| 100 | 0.39 | 0.50 | 0.12 | 0.25 | 0.34 | 0.94 | 0.94 | 0.92 | 0.93 | 0.97 | |||
| 150 | 0.39 | 0.49 | 0.13 | 0.26 | 0.36 | 0.94 | 0.94 | 0.93 | 0.94 | 0.97 | |||
| 200 | 0.39 | 0.48 | 0.13 | 0.27 | 0.38 | 0.95 | 0.94 | 0.94 | 0.94 | 0.97 | |||
| Average RSGSA gain | 50 | 0.38 | 0.15 | 1.29 | 0.83 | 0.06 | 0.05 | 0.08 | 0.03 | ||||
| 100 | 0.49 | 0.16 | 1.32 | 0.71 | 0.05 | 0.05 | 0.06 | 0.02 | |||||
| 150 | 0.51 | 0.20 | 1.27 | 0.64 | 0.05 | 0.05 | 0.05 | 0.02 | |||||
| 200 | 0.56 | 0.27 | 1.26 | 0.61 | 0.04 | 0.02 | |||||||
The best performance metric value among the algorithms on each dataset is highlighted in bold typeface
Fig. 4 Outcome of a recursive stage of SSVM-RFE: (a) weights of top 10 genes, (b) CV of top 10 genes, (c) weights of least 10 genes, (d) CV of least 10 genes, (e) weights of all genes produced by 5 LSVM and their average in a recursive stage
Top 10 genes selected by RSGSA from D1 dataset
| Rank | Probe ID | Gene symbol | Full name |
|---|---|---|---|
| 0 | X59871 | TCF7 | Transcription factor 7 |
| 1 | X05276 | TPM4 | Tropomyosin 4 |
| 2 | X63432 | ACTB | Actin beta |
| 3 | J05032 | DARS1 | Aspartyl-tRNA synthetase 1 |
| 4 | D26535 | DLST | Dihydrolipoamide S-succinyltransferase |
| 5 | H68220 | FAU | FAU ubiquitin like and ribosomal protein S30 fusion |
| 6 | T97199 | ITGB4 | Integrin subunit beta 4 |
| 7 | T56244 | PSMB2 | Proteasome subunit beta 2 |
| 8 | R16255 | PPP3CB | Protein phosphatase 3 catalytic subunit beta |
| 9 | T70063 | EIF4G2 | Eukaryotic translation initiation factor 4 gamma 2 |
Top 10 enriched (GO-BP) terms
| ID | Description | ||
|---|---|---|---|
| GO:0001503 | Ossification | 2.89E-06 | 0.007458585 |
| GO:0016051 | Carbohydrate biosynthetic process | 9.61E-06 | 0.012383589 |
| GO:0006352 | DNA-templated transcription, initiation | 3.30E-05 | 0.018847646 |
| GO:0006367 | Transcription initiation from RNA polymerase II promoter | 3.96E-05 | 0.018847646 |
| GO:0035270 | Endocrine system development | 4.28E-05 | 0.018847646 |
| GO:0070371 | ERK1 and ERK2 cascade | 5.02E-05 | 0.018847646 |
| GO:0002683 | Negative regulation of immune system process | 6.27E-05 | 0.018847646 |
| GO:0006006 | Glucose metabolic process | 7.11E-05 | 0.018847646 |
| GO:0010038 | Response to metal ion | 7.30E-05 | 0.018847646 |
| GO:0031018 | Endocrine pancreas development | 7.31E-05 | 0.018847646 |
Top 10 enriched disease ontology (DO) terms
| ID | Description | ||
|---|---|---|---|
| DOID:3996 | Urinary system cancer | 0.000161338 | 0.031918678 |
| DOID:4450 | Renal cell carcinoma | 0.00027404 | 0.031918678 |
| DOID:0060116 | Sensory system cancer | 0.000278766 | 0.031918678 |
| DOID:2174 | Ocular cancer | 0.000278766 | 0.031918678 |
| DOID:4451 | Renal carcinoma | 0.000670596 | 0.03438733 |
| DOID:768 | Retinoblastoma | 0.000674808 | 0.03438733 |
| DOID:771 | Retinal cell cancer | 0.000674808 | 0.03438733 |
| DOID:4645 | Retinal cancer | 0.000738581 | 0.03438733 |
| DOID:14067 | Plasmodium falciparum malaria | 0.000771616 | 0.03438733 |
| DOID:2377 | Multiple sclerosis | 0.000903255 | 0.03438733 |
Top 10 enriched biological pathways
| Pathway | Source | ID | Hypergeometric |
|---|---|---|---|
| IL3 | NetPath | Pathway_IL3 | 2.75E-06 |
| Cellular senescence | KEGG | path:hsa04218 | 4.51E-06 |
| CD4 T cell receptor signaling | INOH | None | 8.12E-06 |
| VEGF | INOH | None | 1.59E-05 |
| Alpha6Beta4Integrin | NetPath | Pathway_Alpha6Beta4Integrin | 1.93E-05 |
| TCR | NetPath | Pathway_TCR | 2.14E-05 |
| Fibroblast growth factor-1 | NetPath | Pathway_Fibroblast__growth__factor-1 | 2.44E-05 |
| a6b1 and a6b4 Integrin signaling | PID | a6b1_a6b4_integrin_pathway | 2.59E-05 |
| BCR | NetPath | Pathway_BCR | 6.87E-05 |
| EPO signaling | INOH | None | 1.02E-04 |
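The "Hypergeometric" column suggests an over-representation test: given a background of N genes of which K belong to a pathway, how likely is it that a selected list of n genes contains at least k pathway members by chance? The sketch below computes that upper-tail probability with the standard library only. The background size, pathway sizes, and overlaps used by the authors are not given here, so the numbers in the example are purely hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, pathway_size, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population   -- number of background genes (N)
    pathway_size -- number of background genes annotated to the pathway (K)
    selected     -- number of genes in the selected list (n)
    overlap      -- number of selected genes that fall in the pathway (k)
    """
    upper = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 background genes, 4 in the pathway,
# 3 genes selected, 2 of them in the pathway.
p = hypergeom_pvalue(population=10, pathway_size=4, selected=3, overlap=2)
```

In the toy case the tail sums the outcomes with 2 or 3 pathway genes among the 3 selected, (36 + 4) / 120 = 1/3. Real enrichment analyses typically also correct such p-values for multiple testing across pathways.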