Sehee Wang, Hyun-Hwan Jeong, Kyung-Ah Sohn.
Abstract
BACKGROUND: Feature selection and scoring methods are essential for biomarker detection in bioinformatics. Many such methods have been developed, and several studies have employed information-theoretic approaches. However, most of these methods require a long processing time. In addition, information-theoretic methods discretize continuous features, which can lead to a loss of information.
Keywords: Breast cancer; Dimension reduction; Feature scoring; Feature selection; Low-dimensional embedding; Mutual information (MI); Principal component analysis (PCA); Reconstruction error
Year: 2019 PMID: 31296201 PMCID: PMC6624178 DOI: 10.1186/s12920-019-0512-9
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 An overview of a supervised feature scoring method using class-wise embedding and reconstruction
Fig. 2 An illustration of feature-wise reconstruction error computation
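As a rough illustration of what a feature-wise reconstruction error could look like, the sketch below projects the data onto its top principal components, maps it back, and averages the squared residual per feature. It uses plain linear PCA via NumPy's SVD; the function name `featurewise_reconstruction_error`, the toy data, and the choice of mean squared error are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def featurewise_reconstruction_error(X, n_components):
    """Per-feature mean squared error after a linear PCA round trip.

    Hypothetical helper: the error definition (mean squared residual)
    is an assumption, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    W = Vt[:n_components]           # shape: (n_components, n_features)
    embedding = centered @ W.T      # low-dimensional embedding
    X_hat = embedding @ W + mean    # reconstruction in the original space
    return ((X - X_hat) ** 2).mean(axis=0)


# Toy data: features 0 and 1 share one latent factor, feature 2 is pure noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.column_stack([
    latent,
    2.0 * latent + 0.01 * rng.normal(size=200),
    rng.normal(size=200),
])
errors = featurewise_reconstruction_error(X, n_components=1)
print(errors)
```

With a single component, the two correlated features should be reconstructed well, while the independent noise feature keeps most of its variance as error.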
Fig. 3 Conversion of a mutual information concept to a formula using the reconstruction error
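Mutual information satisfies the standard identity I(X; Y) = H(X) − H(X | Y). One way to read the caption above is that each entropy term is replaced by a reconstruction error: the error on the whole dataset stands in for H(X), and the class-size-weighted average of the class-wise errors stands in for H(X | Y). The sketch below implements that reading with linear PCA. The exact form of the score, the helper names `recon_error` and `mi_like_scores`, and the toy data are all assumptions for illustration; the published formula may differ.

```python
import numpy as np


def recon_error(X, k):
    """Per-feature mean squared PCA reconstruction error (k components)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    W = Vt[:k]
    return ((centered - centered @ W.T @ W) ** 2).mean(axis=0)


def mi_like_scores(X, y, k=1):
    """Assumed analogue of H(X) - H(X|Y) built from reconstruction errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    total = recon_error(X, k)                 # stands in for H(X)
    conditional = np.zeros(X.shape[1])        # stands in for H(X|Y)
    for label in np.unique(y):
        X_class = X[y == label]
        weight = len(X_class) / len(X)        # empirical class prior
        conditional += weight * recon_error(X_class, k)
    return total - conditional


# Toy data: feature 0 separates the classes, features 1-2 are noise,
# feature 3 is a high-variance nuisance feature unrelated to the class.
rng = np.random.default_rng(0)
n = 100
y = np.repeat([0, 1], n)
X = np.column_stack([
    rng.normal(size=2 * n) + np.where(y == 0, -2.0, 2.0),
    rng.normal(size=2 * n),
    rng.normal(size=2 * n),
    5.0 * rng.normal(size=2 * n),
])
scores = mi_like_scores(X, y, k=1)
print(scores)
```

In this toy setup the class-separating feature is hard to reconstruct from the pooled data but easy within each class, so it receives the largest score.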
Fig. 4 Simulation results of the relationship between the entropy and reconstruction error
Fig. 5 Simulation results confirming the applicability of the scoring method for feature selection. The red arrows indicate the direction and size of the components. a shows the result on the dataset with two features that differ widely between classes, and b shows the result when there is little difference between classes
Detailed information of benchmark datasets
| Data set | Data type | Number of classes | Number of features | Number of samples | Data information |
|---|---|---|---|---|---|
| Leukemia | Discrete | 2 | 7070 | 72 | SNP |
| ProstateGE | Continuous | 2 | 5966 | 102 | Gene expression |
| TOX171 | Continuous | 4 | 5748 | 171 | Gene expression |
| Lung | Continuous | 5 | 3312 | 203 | Gene expression |
| LungDiscrete | Discrete | 7 | 325 | 73 | SNP |
Fig. 6 Cross-validation accuracy for the Lung dataset with respect to the number of selected features. a presents the results of the PCA (ClearF-normal), KernelPCA with RBF kernel (ClearF-rbf) and KernelPCA with polynomial kernel (ClearF-poly) used in the proposed method; and b compares the results of the other algorithms with our method using the best-performing kernel
Fig. 7 Cross-validation accuracy for the LungDiscrete dataset with respect to the number of selected features. a presents the results of the PCA (ClearF-normal), KernelPCA with RBF kernel (ClearF-rbf) and KernelPCA with polynomial kernel (ClearF-poly) used in the proposed method; and b compares the results of the other algorithms with our method using the best-performing kernel
Fig. 8 Cross-validation accuracy for the ProstateGE dataset with respect to the number of selected features. a presents the results of the PCA (ClearF-normal), KernelPCA with RBF kernel (ClearF-rbf) and KernelPCA with polynomial kernel (ClearF-poly) used in the proposed method; and b compares the results of the other algorithms with our method using the best-performing kernel
Average accuracy of using 5 to 50 features per method and dataset
| Data set | Fisher score | Trace ratio | Multi SURF | ClearF normal | ClearF rbf | ClearF poly | CMIM | mRMR | t-score |
|---|---|---|---|---|---|---|---|---|---|
| Leukemia | 0.945 | 0.945 | 0.945 | | 0.959 | 0.945 | | 0.959 | 0.959 |
| ProstateGE | 0.912 | 0.857 | 0.857 | 0.868 | 0.912 | | – | – | 0.902 |
| TOX171 | 0.672 | 0.713 | | | 0.666 | 0.683 | – | – | – |
| Lung | 0.865 | 0.841 | 0.901 | | | 0.902 | – | – | – |
| LungDiscrete | 0.716 | 0.689 | 0.811 | | | 0.811 | 0.841 | 0.743 | – |
The bold italic numbers indicate the best results for each dataset, and the bold non-italic numbers indicate the second-best result
Fig. 9 Comparison of execution times for the LungDiscrete (a) and ProstateGE (b) datasets
Fig. 10 Cross-validation accuracy for the TCGA dataset with respect to the number of selected features. a presents the results of the PCA (ClearF-normal), KernelPCA with RBF kernel (ClearF-rbf) and KernelPCA with polynomial kernel (ClearF-poly) used in the proposed method; and b compares the results of the other algorithms with our method using the best-performing kernel
The top 30 genes with the highest scores obtained from the TCGA dataset
| Rank | Gene symbol | Entrez Gene Id | Gene Description | Score |
|---|---|---|---|---|
| 1 | ERBB2 | 2064 | v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian) | 0.416 |
| 2 | STARD3 | 10948 | StAR-related lipid transfer (START) domain containing 3 | 0.339 |
| 3 | PGAP3 | 93210 | post-GPI attachment to proteins 3 | 0.295 |
| 4 | FOXC1 | 2296 | forkhead box C1 | 0.276 |
| 5 | CDKN2A | 1029 | cyclin-dependent kinase inhibitor 2A (melanoma, p16, inhibits CDK4) | 0.270 |
| 6 | ORMDL3 | 94103 | ORM1-like 3 ( | 0.259 |
| 7 | GSDMB | 55876 | gasdermin B | 0.245 |
| 8 | B3GNT5 | 84002 | UDP-GlcNAc:betaGal beta-1,3-N-acetylglucosaminyltransferase 5 | 0.236 |
| 9 | PSMD3 | 5709 | proteasome (prosome, macropain) 26S subunit, non-ATPase, 3 | 0.235 |
| 10 | HAPLN3 | 145864 | hyaluronan and proteoglycan link protein 3 | 0.231 |
| 11 | CDCA7 | 83879 | cell division cycle associated 7 | 0.222 |
| 12 | PSAT1 | 29968 | phosphoserine aminotransferase 1 | 0.216 |
| 13 | C17orf37 | 84299 | migration and invasion enhancer 1 | 0.215 |
| 14 | GABRP | 2568 | gamma-aminobutyric acid (GABA) A receptor, pi | 0.215 |
| 15 | TMSB15B | 286527 | thymosin beta 15B | 0.214 |
| 16 | MED1 | 5469 | mediator complex subunit 1 | 0.208 |
| 17 | CDCA2 | 157313 | cell division cycle associated 2 | 0.207 |
| 18 | FAM171A1 | 221061 | family with sequence similarity 171, member A1 | 0.203 |
| 19 | CCNE1 | 898 | cyclin E1 | 0.197 |
| 20 | CDK12 | 51755 | cyclin-dependent kinase 12 | 0.194 |
| 21 | DSC2 | 1824 | desmocollin 2 | 0.192 |
| 22 | STAC | 6769 | SH3 and cysteine rich domain | 0.189 |
| 23 | PADI2 | 11240 | peptidyl arginine deiminase, type II | 0.189 |
| 24 | RCOR2 | 283248 | REST corepressor 2 | 0.179 |
| 25 | IGF2BP2 | 10644 | insulin-like growth factor 2 mRNA binding protein 2 | 0.176 |
| 26 | CDH3 | 1001 | cadherin 3, type 1, P-cadherin (placental) | 0.175 |
| 27 | ZNF695 | 57116 | zinc finger protein 695 | 0.175 |
| 28 | CLCN4 | 1183 | chloride channel 4 | 0.172 |
| 29 | MEX3A | 92312 | mex-3 homolog A ( | 0.171 |
| 30 | CBS | 875 | cystathionine-beta-synthase | 0.171 |
Significant gene sets overlapping between MSigDB and the selected genes
| Gene Set Name (# Genes) | Description | # Genes in overlap | p-value | FDR q-value |
|---|---|---|---|---|
| SMID_BREAST_CANCER_BASAL_UP (648) | Genes up-regulated in basal subtype of breast cancer samples. | 13 | 7.43e-17 | 7.85e-13 |
| NIKOLSKY_BREAST_CANCER_17Q11_Q21_AMPLICON (133) | Genes within amplicon 17q11-q21 identified in a copy number alterations study of 191 breast tumor samples. | 9 | 1.47e-16 | 7.85e-13 |
| FARMER_BREAST_CANCER_CLUSTER_8 (7) | Cluster 8: selected ERBB2 (GeneID = 2064) amplicon genes clustered together across breast cancer samples. | 5 | 1.75e-15 | 6.23e-12 |
| VANTVEER_BREAST_CANCER_ESR1_DN (240) | Down-regulated genes from the optimal set of 550 markers discriminating breast cancer samples by ESR1 (GeneID = 2099) expression: ER(+) vs ER(−) tumors. | 9 | 3.23e-14 | 8.16e-11 |
| SMID_BREAST_CANCER_LUMINAL_B_DN (564) | Genes down-regulated in the luminal B subtype of breast cancer. | 11 | 3.82e-14 | 8.16e-11 |
| SMID_BREAST_CANCER_ERBB2_UP (147) | Genes up-regulated in the erbb2 subtype of breast cancer samples, characterized by higher expression of ERBB2 (GeneID = 2064). | 7 | 5.68e-12 | 1.01e-8 |
| FARMER_BREAST_CANCER_BASAL_VS_LULMINAL (330) | Genes which best discriminated between two groups of breast cancer according to the status of ESR1 and AR (GeneID = 2099;367): basal (ESR1- AR-) and luminal (ESR1+ AR+). | 8 | 3.31e-11 | 5.05e-8 |
| SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN (315) | Genes down-regulated in bone relapse of breast cancer. | 7 | 1.18e-9 | 1.58e-6 |
| DOANE_BREAST_CANCER_ESR1_DN (48) | Genes down-regulated in breast cancer samples positive for ESR1 (GeneID = 2099) compared to the ESR1 negative tumors. | 4 | 2.81e-8 | 3.34e-5 |
| FONTAINE_PAPILLARY_THYROID_CARCINOMA_UA_UP (66) | Genes up-regulated in papillary thyroid carcinoma (PTC) compared to other thyroid tumors. | 4 | 1.03e-7 | 1.1e-4 |
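Overlap significance of this kind is commonly assessed with a one-sided hypergeometric test: given a gene universe, a gene set, and a list of selected genes, how likely is an overlap at least as large as the one observed? The sketch below shows that calculation using only the Python standard library. Whether the values in the table were computed exactly this way is an assumption, and the function name `overlap_p_value` and the small numbers in the usage line are purely illustrative; they are not taken from the table.

```python
from math import comb


def overlap_p_value(universe, set_size, n_selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe   -- total number of genes considered (assumed background size)
    set_size   -- number of genes in the gene set
    n_selected -- number of selected genes
    overlap    -- observed number of selected genes that are in the gene set
    """
    upper = min(set_size, n_selected)
    total = comb(universe, n_selected)
    # Sum the probability mass of every overlap size from `overlap` upward.
    tail = sum(
        comb(set_size, i) * comb(universe - set_size, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 20 genes in total, a 5-gene set, 5 selected
# genes, all 5 of which fall in the set.
p = overlap_p_value(universe=20, set_size=5, n_selected=5, overlap=5)
print(p)
```

When many gene sets are tested at once, the raw p-values are usually adjusted for multiple testing, which is why a separate FDR column is reported alongside them.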
Fig. 11 Cluster information based on the overlap between MSigDB and the selected genes