Yuanting Yan, Tao Dai, Meili Yang, Xiuquan Du, Yiwen Zhang, Yanping Zhang.
Abstract
(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods for estimating MVs have been proposed in the past few years, yet recent studies show that the choice of imputation algorithm makes little difference to classification. Some scholars therefore argue that selecting informative genes for downstream classification matters more than how MVs are imputed. However, most feature-selection (FS) algorithms require imputation beforehand, and the impact of that imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS method is introduced for gene-expression data. To cope with the small sample sizes typical of gene-expression data, a heuristic called recursive element aggregation is proposed. The approach handles incomplete data directly, without any imputation method or missing-data assumption. The most informative genes are selected through a threshold, after which a best-first search strategy is used to find optimal feature subsets for classification.
Keywords: best first search; classification; feature selection; gene-expression data
Year: 2018 PMID: 30380746 PMCID: PMC6274900 DOI: 10.3390/ijms19113398
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Number of selected genes with respect to different thresholds.
| Dataset | 0.01 | 0.005 | 0.001 | 0.0005 | 0.0001 | 0.00001 |
|---|---|---|---|---|---|---|
| alizadeh-v1 | 215 | 151 | 72 | 48 | 23 | 7 |
| alizadeh-v2 | 1811 | 1534 | 1077 | 924 | 589 | 306 |
| alizadeh-v3 | 1771 | 1535 | 1049 | 891 | 560 | 262 |
| bredel | 2904 | 2324 | 809 | 551 | 215 | 63 |
| chen | 5759 | 5007 | 3714 | 3276 | 2499 | 1700 |
| garber | 2125 | 1494 | 705 | 512 | 231 | 75 |
| lapointe-v1 | 3834 | 2875 | 1238 | 875 | 349 | 85 |
| lapointe-v2 | 7615 | 6113 | 3696 | 3012 | 1895 | 957 |
| liang | 2349 | 1302 | 717 | 623 | 17 | 3 |
| risinger | 681 | 419 | 114 | 61 | 16 | 1 |
| tomlins-v1 | 4699 | 3874 | 2460 | 1976 | 1284 | 570 |
| tomlins-v2 | 2650 | 2874 | 1678 | 1320 | 745 | 335 |
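As an illustration of how counts like those in the table arise, the sketch below keeps every gene whose p-value falls below a threshold. The strict `<` comparison, the function name `select_genes`, and the p-values are assumptions made for this example; none of them is taken from the paper.

```python
def select_genes(p_values, threshold):
    """Return the names of genes whose p-value is below the threshold."""
    return [gene for gene, p in p_values.items() if p < threshold]


# Hypothetical p-values, for illustration only (not results from the paper).
p_values = {
    "gene_a": 0.02,
    "gene_b": 0.004,
    "gene_c": 0.0007,
    "gene_d": 0.00003,
}

thresholds = [0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001]

# Number of genes retained at each threshold.
counts = {t: len(select_genes(p_values, t)) for t in thresholds}
for t in thresholds:
    print(t, counts[t])
```

Under this rule each stricter threshold selects a subset of the genes selected by the previous one, so the counts along a row cannot increase from left to right.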
Accuracies of a voting-based extreme-learning machine (V-ELM) under different thresholds.
| Dataset | 0.01 | 0.005 | 0.001 | 0.0005 | 0.0001 |
|---|---|---|---|---|---|
| alizadeh-v1 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| alizadeh-v2 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| alizadeh-v3 | 0.9379 | 0.9390 | 0.9445 | 0.9455 | 0.9491 |
| bredel | 0.8653 | 0.8650 | 0.8690 | 0.8739 | 0.8782 |
| chen | 0.9578 | 0.9606 | 0.9638 | 0.9751 | 0.9751 |
| garber | 0.9009 | 0.9014 | 0.9024 | 0.9060 | 0.9035 |
| lapointe-v1 | 0.8732 | 0.8767 | 0.8827 | 0.8887 | 0.9211 |
| lapointe-v2 | 0.8679 | 0.8685 | 0.8670 | 0.8699 | 0.8709 |
| liang | 0.9751 | 0.9760 | 0.9822 | 0.9830 | 0.9575 |
| risinger | 0.8577 | 0.8627 | 0.8672 | 0.8852 | 0.8242 |
| tomlins-v1 | 0.8855 | 0.8895 | 0.8960 | 0.9048 | 0.9135 |
| tomlins-v2 | 0.8866 | 0.8874 | 0.8914 | 0.8944 | 0.9105 |
Figure 1. Relationship between FB-MCFS performance and threshold k of mean imputation.
Accuracy with beforehand/afterward MCFS under three imputation methods.
| Dataset | BPCA MCFS1 | BPCA MCFS2 | KNN MCFS1 | KNN MCFS2 | MEAN MCFS1 | MEAN MCFS2 |
|---|---|---|---|---|---|---|
| alizadeh-v1 | | 0.9928 | | 0.9956 | | 0.9944 |
| alizadeh-v2 | 0.9967 | | | 1.0000 | | 0.9949 |
| alizadeh-v3 | | 0.9461 | | 0.9486 | | 0.9432 |
| bredel | | 0.8579 | | 0.8481 | | 0.8644 |
| chen | | 0.9597 | | 0.9641 | | 0.9581 |
| garber | | 0.9011 | 0.8889 | | 0.9054 | |
| lapointe-1 | 0.8523 | | 0.8533 | | 0.8492 | |
| lapointe-2 | | 0.8470 | | 0.8565 | 0.8506 | |
| liang | | 0.9863 | 0.9860 | | 0.9820 | |
| risinger | 0.8643 | | 0.8575 | | 0.8656 | |
| tomlins-v1 | | 0.8809 | | 0.8792 | | 0.8847 |
| tomlins-v2 | | 0.8678 | | 0.8637 | | 0.8776 |
| Average | | 0.9130 | | 0.9129 | | 0.9149 |
Bold: best performance.
Performance comparison of FS algorithms under the three MV imputation methods.
| Dataset | NCA BPCA | NCA KNN | NCA MEAN | PCA BPCA | PCA KNN | PCA MEAN | UFF BPCA | UFF KNN | UFF MEAN | ReliefF BPCA | ReliefF KNN | ReliefF MEAN | MCFS BPCA | MCFS KNN | MCFS MEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ali1 | | | | | | | | | | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| ali2 | 1.0000 | 1.0000 | 1.0000 | 0.9748 | 0.9933 | 0.9818 | 0.9023 | 0.9023 | 0.8860 | 1.0000 | 1.0000 | 1.0000 | 0.9967 | 1.0000 | 0.9971 |
| ali3 | | | | 0.8893 | 0.8952 | 0.8882 | 0.8138 | 0.8209 | 0.8156 | 0.9722 | 0.9746 | 0.9686 | 0.9565 | 0.9503 | 0.9449 |
| bredel | | | | 0.7840 | 0.7994 | 0.7931 | / | / | / | | | | 0.8638 | 0.8706 | 0.8719 |
| chen | 0.9833 | 0.9925 | 0.9946 | 0.9379 | 0.9316 | 0.9374 | 0.9448 | 0.9422 | 0.9475 | 0.9877 | 0.9865 | 0.9835 | 0.9701 | 0.9679 | 0.9677 |
| garber | 0.9327 | 0.9496 | 0.9242 | 0.7837 | 0.7985 | 0.7860 | | | | 0.8896 | 0.8965 | 0.9036 | 0.9074 | 0.8889 | 0.9054 |
| lap1 | | | | 0.7176 | 0.7352 | 0.7202 | 0.7048 | 0.7052 | 0.7110 | | | | 0.8523 | 0.8533 | 0.8492 |
| lap2 | 0.9353 | 0.9431 | 0.9401 | 0.7270 | 0.7324 | 0.7135 | 0.7169 | 0.7355 | 0.7285 | 0.9011 | 0.8948 | 0.9060 | 0.8621 | 0.8583 | 0.8506 |
| liang | 1.0000 | 1.0000 | 1.0000 | | | | | | | 1.0000 | 1.0000 | 1.0000 | 0.9923 | 0.9860 | 0.9820 |
| risinger | 0.8693 | 0.8679 | 0.8829 | | | | | | | 0.8267 | 0.8323 | 0.8427 | 0.8643 | 0.8575 | 0.8656 |
| tom1 | | | | | | | | | | | | | 0.8879 | 0.8892 | 0.8965 |
| tom2 | | | | | | | | | | | | | 0.8850 | 0.8943 | 0.8839 |
Bold: performance with more than 2% differences under the 3 imputation methods.
Figure 2. Comparison of balanced accuracies under the three imputation methods.
Summary of Friedman p-values between FB-MCFS and the other algorithms under three imputation methods.
| Algorithm | BPCA | KNN | MEAN |
|---|---|---|---|
| NCA | 1 | 0.5271 | 0.5236 |
| PCA | | | |
| UFF | | | |
| ReliefF | 0.0578 | 0.0578 | 0.0557 |
| MCFS | | | |
Bold: Friedman p-values smaller than 0.05.
Figure 3. Average accuracy over 100 trials of V-ELM trained on selected genes.
Figure 4. p-values corresponding to the 103 selected genes.
Top 30 genes selected by MCFS.
| MCFS | Gene Name | Ref [ | UFF |
|---|---|---|---|
| 1 | ‘growth arrest-specific 1’ | 4 | 33 |
| 2 | ‘selenium binding protein 1’ | 63 | / |
| 3 | ‘cyclin D1 (PRAD1: parathyroid adenomatosis 1)’ | 3 | 11 |
| 4 | ‘olfactomedin related ER (endoplasmic reticulum) localized protein’ | 19 | 23 |
| 5 | ‘recoverin’ | 29 | / |
| 6 | ‘thioredoxin’ | / | / |
| 7 | ‘quinone oxidoreductase homolog’ | 61 | / |
| 8 | ‘glycogen synthase 1 (muscle)’ | / | / |
| 9 | ‘amyloid precursor-like protein 1’ | 32 | / |
| 10 | ‘ESTs (EST: expressed sequence tag), Moderately similar to skeletal muscle LIM-protein (named for ‘LIN11, ISL1, and MEC3,’) FHL3 (FHL: four-and-a-half lim domains 3) (H.sapiens)’ | / | / |
| 11 | ‘type II integral membrane protein’ | / | / |
| 12 | ‘GLI (glioma-associated oncogene homolog)-Kruppel family member GLI3 (Greig cephalopolysyndactyly syndrome)’ | / | / |
| 13 | ‘transducin-like enhancer of split 2, homolog of Drosophila E(sp1)’ | 35 | / |
| 14 | ‘interferon-inducible’ | 44 | 78 |
| 15 | ‘calponin 3, acidic’ | 5 | 83 |
| 16 | ‘Fc (fragment, crystallizable) fragment of IgG (immunoglobulin G), receptor, transporter, alpha’ | 6 | 50 |
| 17 | ‘protein tyrosine phosphatase, non-receptor type 12’ | / | / |
| 18 | ‘cold shock domain protein A’ | / | / |
| 19 | ‘antigen identified by monoclonal antibodies 12E7, F21 and O13’ | 73 | 44 |
| 20 | ‘lectin, galactoside-binding, soluble, 3 binding protein (galectin 6 binding protein)’ | 20 | / |
| 21 | ‘Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain, 2’ | / | / |
| 22 | ‘dihydropyrimidinase-like 2’ | 60 | / |
| 23 | ‘suppression of tumorigenicity 5’ | / | / |
| 24 | ‘complement component 1 inhibitor (angioedema, hereditary)’ | 51 | 48 |
| 25 | ‘caveolin 1, caveolae protein, 22kD’ | 18 | 18 |
| 26 | ‘homeo box B7’ | / | / |
| 27 | ‘guanine nucleotide exchange factor; 115-kD; mouse Lsc homolog’ | / | / |
| 28 | ‘EphB4 (ephrin type-B receptor 4)’ | / | / |
| 29 | ‘death-associated protein kinase 1’ | 82 | / |
| 30 | ‘insulin-like growth factor 2 (somatomedin A)’ | 1 | 2 |
Figure 5. Experimentally identified protein–protein interaction (PPI) network containing the reported genes. Original figures appear in the Supplementary Material.
Specification of cancer gene-expression data.
| Dataset | Array Type | Tissue | Dimensionality | Samples per Class | Classes |
|---|---|---|---|---|---|
| alizadeh-v1 | Double Channel | Blood | 4026 | 21, 21 | DLBCL1, DLBCL2 |
| alizadeh-v2 | Double Channel | Blood | 4026 | 42, 9, 11 | DLBCL, FL, CLL |
| alizadeh-v3 | Double Channel | Blood | 4026 | 21, 21, 9, 11 | DLBCL1, DLBCL2, FL, CLL |
| bredel | Double Channel | Brain | 41472 | 31, 14, 5 | GBM, OG, A |
| chen | Double Channel | Liver | 24192 | 104, 75 | HCC, liver |
| garber | Double Channel | Lung | 24192 | 17, 40, 4, 5 | SCC, AC, LCLC, SCLC |
| lapointe-v1 | Double Channel | Prostate | 42640 | 11, 39, 19 | PT1, PT2, PT3 |
| lapointe-v2 | Double Channel | Prostate | 42640 | 11, 39, 19, 41 | PT1, PT2, PT3, Normal |
| liang | Double Channel | Brain | 24192 | 28, 6, 3 | GBM, ODG, Normal |
| risinger | Double Channel | Endometrium | 8872 | 13, 3, 19, 7 | PS, CC, E, N |
| tomlins-v1 | Double Channel | Prostate | 20000 | 27, 20, 32, 13, 12 | EPI, MET, PCA, PIN, STROMA |
| tomlins-v2 | Double Channel | Prostate | 20000 | 27, 20, 32, 13 | EPI, MET, PCA, PIN |
Figure 6. Flowchart of the proposed method. The dotted lines divide the framework into three main steps.
Contingency table M.
Frequency table of a1 with respect to d.
| a1 | Yes | No | ? |
|---|---|---|---|
| low | 1 | 1 | 0 |
| medium | 2 | 0 | 0 |
| high | 1 | 1 | 0 |
| ? | 1 | 1 | 0 |
Contingency table M.
| a1 | yes | no |
|---|---|---|
| low | 4/3 | 4/3 |
| medium | 7/3 | 1/3 |
| high | 4/3 | 4/3 |
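The entries of this table are consistent with splitting each count in the "?" row of the frequency table equally among the three observed levels (for example, 1 + 1/3 = 4/3 and 2 + 1/3 = 7/3). That reading is inferred from the two tables and is not spelled out in this excerpt. The sketch below reproduces the table under that assumption and then evaluates the standard Pearson chi-square statistic on it; the paper's modified statistic may differ, the function names are hypothetical, and the all-zero "?" class column is omitted.

```python
from fractions import Fraction

MISSING = "?"

# Frequency table of a1 with respect to d, copied from the text.
frequency = {
    "low": {"yes": 1, "no": 1},
    "medium": {"yes": 2, "no": 0},
    "high": {"yes": 1, "no": 1},
    MISSING: {"yes": 1, "no": 1},
}


def redistribute_missing(freq):
    """Split the counts of the missing-value row equally among observed levels.

    Assumption: equal splitting is inferred from the numbers in the text.
    """
    observed = [level for level in freq if level != MISSING]
    missing_row = freq.get(MISSING, {})
    table = {}
    for level in observed:
        table[level] = {
            label: Fraction(n) + Fraction(missing_row.get(label, 0), len(observed))
            for label, n in freq[level].items()
        }
    return table


def pearson_chi_square(table):
    """Standard Pearson statistic: sum of (observed - expected)^2 / expected."""
    labels = list(next(iter(table.values())))
    row_totals = {lvl: sum(row.values()) for lvl, row in table.items()}
    col_totals = {lab: sum(row[lab] for row in table.values()) for lab in labels}
    total = sum(row_totals.values())
    stat = Fraction(0)
    for lvl, row in table.items():
        for lab in labels:
            expected = row_totals[lvl] * col_totals[lab] / total
            stat += (row[lab] - expected) ** 2 / expected
    return stat


M = redistribute_missing(frequency)
print(M["medium"])            # {'yes': Fraction(7, 3), 'no': Fraction(1, 3)}
print(pearson_chi_square(M))  # 16/15
```

Exact fractions are used so that the output can be compared directly with the 4/3, 7/3, and 1/3 entries above.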
Figure 7. Example of the recursive element aggregation process.
Figure 8. Flowchart of the Forward Best-First Search (FBFS) strategy on modified chi-square test-based feature selection (MCFS) subsets.
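The flowchart itself is not reproduced in this excerpt, so the details of FBFS are not available here. For orientation only, the following is a minimal sketch of a generic forward best-first search over feature subsets. The scoring callback `evaluate`, the stopping parameter `patience`, and the toy scoring function are all hypothetical and are not taken from the paper.

```python
import heapq
from itertools import count


def forward_best_first_search(features, evaluate, patience=3):
    """Generic forward best-first search over feature subsets.

    Starts from the empty subset, always expands the highest-scoring
    unexpanded subset by adding one feature at a time, and stops after
    `patience` consecutive expansions that do not improve the best score.
    """
    tie_breaker = count()  # keeps heap entries comparable when scores tie
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    open_heap = [(-best_score, next(tie_breaker), start)]  # max-heap via negation
    visited = {start}
    stale = 0
    while open_heap and stale < patience:
        _, _, subset = heapq.heappop(open_heap)
        improved = False
        for feature in features:
            child = subset | {feature}
            if feature in subset or child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_heap, (-score, next(tie_breaker), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score


# Toy scoring function (hypothetical): reward two "informative" genes,
# penalise subset size.
INFORMATIVE = {"g1", "g3"}


def toy_score(subset):
    return 10 * len(subset & INFORMATIVE) - len(subset)


best, score = forward_best_first_search(["g1", "g2", "g3", "g4"], toy_score)
print(sorted(best), score)  # ['g1', 'g3'] 18
```

In a feature-selection setting, `evaluate` would typically be a cross-validated classifier accuracy on the candidate subset; that choice is left open here.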
Example table with missing values.
| Sample | Wind (a1) | Humidity (a2) | Temperature (a3) | Trip (d) |
|---|---|---|---|---|
| u1 | low | low | high | yes |
| u2 | medium | medium | medium | yes |
| u3 | high | high | ? | yes |
| u4 | low | medium | high | no |
| u5 | ? | ? | high | no |
| u6 | medium | high | low | yes |
| u7 | ? | low | low | yes |
| u8 | high | high | high | no |
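The frequency table given earlier can be recovered from this example by counting each attribute value, including the missing marker "?", against the Trip outcome. The sketch below does this for any column; the function name `frequency_table` and the tuple layout of `samples` are choices made for this illustration.

```python
from collections import Counter

MISSING = "?"

# Rows of the example table: (wind, humidity, temperature, trip).
samples = [
    ("low", "low", "high", "yes"),          # u1
    ("medium", "medium", "medium", "yes"),  # u2
    ("high", "high", "?", "yes"),           # u3
    ("low", "medium", "high", "no"),        # u4
    ("?", "?", "high", "no"),               # u5
    ("medium", "high", "low", "yes"),       # u6
    ("?", "low", "low", "yes"),             # u7
    ("high", "high", "high", "no"),         # u8
]


def frequency_table(rows, column,
                    levels=("low", "medium", "high", MISSING),
                    labels=("yes", "no")):
    """Count how often each attribute level co-occurs with each class label."""
    counts = Counter((row[column], row[-1]) for row in rows)
    return {lvl: {lab: counts[(lvl, lab)] for lab in labels} for lvl in levels}


wind = frequency_table(samples, column=0)
for level, row in wind.items():
    print(level, row)
```

Counting the Wind column this way gives one "yes" and one "no" for low, high, and "?", and two "yes" for medium, matching the frequency table of a1 with respect to d.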