Yuan Chen1,2, Dan Cao3, Jun Gao4,5, Zheming Yuan1,2.
Abstract
Informative gene selection can have important implications for improving cancer diagnosis and identifying new drug targets. Individual-gene-ranking methods ignore interactions between genes. Furthermore, popular pair-wise gene evaluation methods, e.g. TSP and TSG, cannot discover pair-wise interactions. Several efforts to discover pair-wise synergy have been made based on information-theoretic approaches, such as EMBP and FeatKNN. However, the methods employed to estimate mutual information, e.g. binarization, histogram-based and KNN estimators, depend on known data or domain characteristics. Recently, Reshef et al. proposed a novel maximal information coefficient (MIC) measure that captures a wide range of associations between two variables and has the property of generality. An extension from MIC(X; Y) to MIC(X1; X2; Y) is therefore desired. We developed an approximation algorithm for estimating MIC(X1; X2; Y) where Y is a discrete variable. MIC(X1; X2; Y) is employed to detect pair-wise synergy in simulation and cancer microarray data. The results indicate that MIC(X1; X2; Y) also has the property of generality. It can discover synergic genes that are undetectable by reference feature selection methods such as MIC(X; Y) and TSG. Synergic genes can distinguish different phenotypes. Finally, the biological relevance of these synergic genes is validated with GO annotation and the OUgene database.
Year: 2016 PMID: 27470995 PMCID: PMC4965793 DOI: 10.1038/srep30672
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
A typical pair-wise synergy between X1 and X2.
| Y | X1 | X2 | X1 ⊕ X2 |
|---|---|---|---|
| − | 1 | 1 | 0 |
| − | 0 | 0 | 0 |
| + | 1 | 0 | 1 |
| + | 0 | 1 | 1 |
⊕ is an exclusive-or operation.
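The table above can be checked numerically. The following is a minimal sketch, not the paper's MIC algorithm: it computes plain plug-in mutual information (in bits) on the four tabulated rows, assuming each row is equally likely. The helper name `mutual_information` is hypothetical and introduced only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in mutual information I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# The four rows of the table, in order; Y is taken as X1 XOR X2.
x1 = [1, 0, 1, 0]
x2 = [1, 0, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

mi_x1 = mutual_information(x1, y)                    # X1 alone
mi_x2 = mutual_information(x2, y)                    # X2 alone
mi_joint = mutual_information(list(zip(x1, x2)), y)  # the pair (X1, X2)

print(mi_x1, mi_x2, mi_joint)  # 0.0 0.0 1.0
```

Neither variable alone carries information about Y, while the pair determines Y completely. This is the kind of pair-wise synergy that individual-gene ranking cannot see.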
Figure 1. Synergic pairs constructed by a function.
Y = |X1 − X2| (n = 200). Y is binarized at its median. Red point: positive sample. Green point: negative sample.
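The setting of this figure can be simulated roughly as follows. This sketch rests on several assumptions not stated in the caption: X1 and X2 are drawn uniformly from [0, 1], and information is estimated with a simple equal-width histogram (4 bins per variable) rather than with MIC. The names `bin_index` and `mutual_information` are hypothetical helpers.

```python
import random
from collections import Counter
from math import log2
from statistics import median


def mutual_information(xs, ys):
    """Plug-in mutual information I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def bin_index(value, n_bins=4):
    """Equal-width bin index for a value in [0, 1] (assumed domain)."""
    return min(int(value * n_bins), n_bins - 1)


random.seed(0)
n = 200
x1 = [random.random() for _ in range(n)]  # assumption: uniform on [0, 1]
x2 = [random.random() for _ in range(n)]
raw = [abs(a - b) for a, b in zip(x1, x2)]

# Binarize Y at its median: 1 = positive sample, 0 = negative sample.
cut = median(raw)
y = [1 if v > cut else 0 for v in raw]

b1 = [bin_index(v) for v in x1]
b2 = [bin_index(v) for v in x2]
mi_x1 = mutual_information(b1, y)
mi_x2 = mutual_information(b2, y)
mi_joint = mutual_information(list(zip(b1, b2)), y)

print(f"I(X1;Y)={mi_x1:.3f}  I(X2;Y)={mi_x2:.3f}  I(X1,X2;Y)={mi_joint:.3f}")
```

Under these assumptions the joint estimate should come out far larger than either individual estimate. The histogram estimator also depends on the chosen bin count, which is the kind of data-dependence the abstract attributes to such estimators.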
Figure 2. Examples of scatter plots of discretization for gene expression.
(A,B) are real-world gene expression values for the prostate dataset [74] and the yeast dataset [75]; the values of the HTB1 gene are binarized at 0. (C,D) are simulation datasets from Y = 4·X² and Y = sin(4·π·X); Y is binarized at 0.5 and 0, respectively. Red point: positive sample. Green point: negative sample.
Figure 3. Schematic of obtaining the superclumps partition for three variables.
The points with the same color belong to the same superclump.
Figure 4. Y completely determined by the synergy between X1 and X2.
X1 and X2 ∈ [10, 30], and result from binarization vector of X1 and X2, respectively. Y = (n = 1000). Green and red dots represent Y = 1 and Y = 0, respectively.
Figure 5. Ten noiseless functions with Y = f(X1, X2).
Y is binarized at its median; green and red dots represent Y = 1 and Y = 0, respectively.
Mean scores of the three components and the joint effect for 10 noiseless functions (n = 1000, 1000 replicates).
| Function | Domain of X1 | Domain of X2 | Y = f(X1, X2) | Component 1 | Component 2 | Component 3 | Joint effect |
|---|---|---|---|---|---|---|---|
| A | [0, 1] | [0, 1] | x1+x2 | 0.3667 | 0.3817 | 0.3798 | 1.1283 |
| B | [0, 1] | [0, 1] | x1 | 0.3793 | 0.3824 | 0.3663 | 1.1280 |
| C | [0, 1] | [0, 1] | ABS(x1−x2) | 0.8222 | 0.1287 | 0.1281 | 1.0790 |
| D | [0, 1] | [0, 1] | x1×x2 | 0.3215 | 0.4134 | 0.4144 | 1.1493 |
| E | [0, 1] | [0, 1] | x1/x2 | 0.3835 | 0.3804 | 0.3653 | 1.1292 |
| F | [5, 23.3] | [5, 23.3] | 10x1+10x2 | 0.2390 | 0.4657 | 0.4628 | 1.1675 |
| G | [0, 1] | [0, 1] | ABS(1000x1−1000x2) | 0.4555 | 0.3386 | 0.3381 | 1.1322 |
| H | [0, 1] | [0, 1] | ABS(ABS(x1−0.5)−ABS(x2−0.5)) | 0.7080 | 0.1295 | 0.1298 | 0.9672 |
| I | [0, 3.13] | [1.5, 4.75] | LOG2(ABS(SIN(x1)−COS(x2))) | 0.2853 | 0.3824 | 0.4274 | 1.0950 |
| J | [0, 3] | [0, 3] | SIN(x1)−SIN(x2) | 0.3044 | 0.3848 | 0.3832 | 1.0723 |
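The tabulated numbers suggest that the joint effect is the sum of the three component scores, up to rounding. The caption does not state this relation explicitly, so the check below is only an observation about the printed values. The variable names are illustrative.

```python
# (function label, component scores as tabulated, joint effect as tabulated)
rows = [
    ("A", (0.3667, 0.3817, 0.3798), 1.1283),
    ("B", (0.3793, 0.3824, 0.3663), 1.1280),
    ("C", (0.8222, 0.1287, 0.1281), 1.0790),
    ("D", (0.3215, 0.4134, 0.4144), 1.1493),
    ("E", (0.3835, 0.3804, 0.3653), 1.1292),
    ("F", (0.2390, 0.4657, 0.4628), 1.1675),
    ("G", (0.4555, 0.3386, 0.3381), 1.1322),
    ("H", (0.7080, 0.1295, 0.1298), 0.9672),
    ("I", (0.2853, 0.3824, 0.4274), 1.0950),
    ("J", (0.3044, 0.3848, 0.3832), 1.0723),
]

# Values are printed to four decimals, so allow a small rounding tolerance.
TOLERANCE = 1.5e-4

max_gap = 0.0
for label, components, joint in rows:
    gap = abs(sum(components) - joint)
    max_gap = max(max_gap, gap)
    print(f"{label}: sum={sum(components):.4f} joint={joint:.4f} gap={gap:.4f}")

all_consistent = max_gap < TOLERANCE
print("consistent with additive decomposition:", all_consistent)
```

In functions C and H the first component dominates while the other two are small. This is consistent with those two functions being driven mainly by the interaction between X1 and X2.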
Three binary-class gene expression datasets.
| Dataset | No. of Genes | No. of samples | No. of samples in class I | No. of samples in class II | Reference |
|---|---|---|---|---|---|
| Prostate | 12600 | 102 | 52 | 50 | |
| Lung | 12533 | 181 | 150 | 31 | |
| DLBCL | 7129 | 77 | 58 | 19 | |
Figure 6. Overlaps among the Top200s selected by MIC(X; Y), MRMR, SVM-RFE and TSG in the Prostate dataset.
Figure 7. Overlaps among the Top200s selected by MIC(X; Y), MRMR, SVM-RFE and TSG in the DLBCL dataset.
Figure 8. Overlaps among the Top200s selected by MIC(X; Y), MRMR, SVM-RFE and TSG in the Lung dataset.
Figure 9. Overlaps between the Top200 selected by MIC(X1; X2; Y) and the Top200s selected by MIC(X; Y), MRMR, SVM-RFE and TSG in the Prostate dataset.
Figure 10. Overlaps between the Top200 selected by MIC(X1; X2; Y) and the Top200s selected by MIC(X; Y), MRMR, SVM-RFE and TSG in the DLBCL dataset.
Figure 11. Overlaps between the Top200 selected by MIC(X1; X2; Y) and the Top200s selected by MIC(X; Y), MRMR, SVM-RFE and TSG in the Lung dataset.
Figure 12. Prediction accuracy of five feature selection methods combined with the SVC classifier over three datasets.
Figure 13. GO annotations for the Top200s selected by different methods in the Prostate dataset.
A deeper color at a point indicates that the term covers more genes. Terms whose total gene count across all methods is less than 25 have been removed.
The 67 cancer-related genes out of the Top200 selected by MIC(X1; X2; Y) in the Prostate dataset.
| Genes | Related tumors |
|---|---|
| ABCB1, AMACR, CAV1, CCND1, CSF2, DPT, E2F3, ETV4, GOT2, GREB1, HBP1, HCLS1, HMGA1, PAX2, SFRP1, SOX9, TRAF4, ZNF143 | Prostate |
| ABCA4, CASC3, CD81, COMP, MAP1LC3B, PPP3CA, SLN, TFAP2C, TRO | Breast cancer |
| DSC2, EDG4, FBLN1, GALNT3, KRT10, NDN | Ovarian carcinomas |
| CTSE, DNAJA1, LY6E | Pancreatic cancer |
| NR2F6, TERF2, TPP1 | Colorectal cancer |
| PCBP2, RAF1 | Glioma |
| COL6A1, CYP2A13 | Lung cancer |
| PPP2R5C | Leukemia |
| PPP6C | Hepatocellular carcinoma |
| AGXT | Lymphomas |
| DIO2 | Thyroid carcinomas |
| DYRK2 | Lung adenocarcinomas |
| FGFBP1 | Gallbladder cancer |
| PROP1 | Pituitary adenoma |
| PITX3 | Liposarcoma |
| RFP | Oligodendroglioma |
| CDKN1C | Adrenal adenoma |
| VAV1 | Ovarian carcinomas, Leukemia |
| JAG1 | Breast cancer, Cervical cancer |
| PHGDH | Breast cancer, Cervical cancer |
| HYAL1 | Breast cancer, Laryngeal carcinoma, Pancreatic cancer |
| NCAM1 | Sarcoidosis, Leukemia, Lymphomas |
| PPP2R2A | Squamous cell carcinoma, Leukemia, Esophageal cancer, Lung cancer |
| GATA2 | Breast cancer, Leukemia, Neuroblastoma, Choriocarcinoma |
| THBS2 | Breast cancer, Adenocarcinoma, Colorectal cancer, Ovarian carcinomas |
| WNT5A | Breast cancer, Leukemia, Pancreatic cancer, Ovarian carcinomas, Melanoma |
| TGM2 | Adenocarcinoma, Neuroblastoma, Pancreatic cancer, Ovarian carcinomas, Lung cancer, Hepatocellular carcinoma, Melanoma |
| GSTP1 | Squamous cell carcinoma, Leukemia, Lymphomas, Ovarian carcinomas, Lung cancer, Hepatocellular carcinoma, Melanoma, Colon cancer, Glioblastoma multiforme, Astrocytoma, Osteosarcoma |
| BAI1 | Carcinoma |
| PTP4A3 | Carcinoma |
| TGFBR3 | Carcinoma |
Results of independent test for erpos and pCR of Breast cancer.
| Dataset | Model | Number of genes | Validation accuracy | Validation |
|---|---|---|---|---|
| Breast erpos | Individual model, genes selected by | 8 | 89% | 0.77 |
| | Synergic model, genes selected by | 34 | 90% | 0.79 |
| | Combined model, genes selected by | 42 | 92% | 0.83 |
| | Candidate model in reference | 6 | 87% | 0.73 |
| | Best model in reference | 316 | 90% | 0.79 |
| Breast pCR | Individual model, genes selected by | 59 | 82% | 0.36 |
| | Synergic model, genes selected by | 32 | 81% | 0.35 |
| | Combined model, genes selected by | 91 | 84% | 0.37 |
| | Candidate model in reference | 206 | 72% | 0.30 |
| | Best model in reference | 40 | 73% | 0.38 |
Figure 14. Three representative patterns of pair-wise synergy identified by the MIC(X1; X2; Y) method.
(A–E) are from real-world datasets; (F–H) are the corresponding hypothetical extreme examples.