Pengwei Xing1,2, Yuan Chen1,2, Jun Gao3, Lianyang Bai4, Zheming Yuan5,6.
Abstract
Selecting informative genes, including individually discriminant genes and synergic genes, from expression data has been useful for medical diagnosis and prognosis. Detecting synergic genes is more difficult than selecting individually discriminant genes. Several efforts have recently been made to detect gene-gene synergies, such as dendrogram-based I(X1; X2; Y) (mutual information), doublets (gene pairs) and MIC(X1; X2; Y), which is based on the maximal information coefficient. It is unclear whether dendrogram-based I(X1; X2; Y) and doublets can capture synergies efficiently. Although MIC(X1; X2; Y) can capture a wide range of interactions, it has a high computational cost caused by its 3-D search. In this paper, we developed a simple and fast approach based on the abs conversion type (i.e. Z = |X1 - X2|) and the t-test to detect interactions in simulated and real-world datasets. Our results showed that dendrogram-based I(X1; X2; Y) and doublets are ineffective at discovering pair-wise gene interactions, whereas our approach can discover typical pair-wise synergic genes efficiently. These synergic genes can reach accuracy comparable to that of the individually discriminant genes using the same number of genes. Classifiers cannot learn well if synergic genes have not been converted properly. Combining individually discriminant and synergic genes can improve the prediction performance.
Year: 2017 PMID: 29180805 PMCID: PMC5703944 DOI: 10.1038/s41598-017-16748-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
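The approach described in the abstract (convert each gene pair with Z = |X1 - X2|, then apply a t-test to Z) can be illustrated with a minimal sketch. The gene names, the toy data, the threshold 0.3 and the choice of Welch's form of the t statistic are assumptions made for illustration; the record above does not specify them.

```python
import math
import random
from itertools import combinations
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_pairs_by_abs(expr, labels):
    """Score every gene pair by |t| of Z = |X1 - X2| between the two classes.

    expr: dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels aligned with the samples
    """
    scores = []
    for g1, g2 in combinations(sorted(expr), 2):
        z = [abs(a - b) for a, b in zip(expr[g1], expr[g2])]
        z0 = [v for v, y in zip(z, labels) if y == 0]
        z1 = [v for v, y in zip(z, labels) if y == 1]
        scores.append((abs(welch_t(z1, z0)), g1, g2))
    return sorted(scores, reverse=True)


# Toy data (hypothetical): the class depends only on the gap between geneA
# and geneB, so neither gene is discriminant on its own; geneC is noise.
rng = random.Random(0)
n = 200
gene_a = [rng.random() for _ in range(n)]
gene_b = [rng.random() for _ in range(n)]
gene_c = [rng.random() for _ in range(n)]
labels = [1 if abs(a - b) > 0.3 else 0 for a, b in zip(gene_a, gene_b)]

ranked = rank_pairs_by_abs(
    {"geneA": gene_a, "geneB": gene_b, "geneC": gene_c}, labels
)
top_score, top_g1, top_g2 = ranked[0]
print(top_g1, top_g2, round(top_score, 2))
```

The scan is quadratic in the number of genes, but each pair costs only one subtraction per sample and one t statistic, which is the source of the speed advantage claimed over a 3-D search.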
Figure 1. Four typical examples of pair-wise synergies. Red and green dots represent cancer and normal samples, respectively.
Four binary class gene expression datasets.
| Datasets | Sample size | Number of genes | Reference |
|---|---|---|---|
| Prostate 1 | 102 (52, 50) | 12,600 | Singh, D (2002) |
| Lung cancer | 187 (97, 90) | 22,215 | Spira, A (2007) |
| Prostate 2 | 424 (264, 160) | 20,280 | Penney, K (2015) |
| Cardiovascular disease | 378 (138, 240) | 22,277 | Ellsworth, D (2014) |
Figure 2. Top 2 gene pairs selected by different methods in the Prostate1 dataset. Red and green dots represent cancer and control samples, respectively. Gene expression levels are represented by their ranked values. K and L are from dendrogram-based I(X1; X2; Y)[4]; M and N are from MIC(X1; X2; Y)[5].
Overlaps among the informative genes selected by different methods in the Prostate1 dataset.
| | | | | | | |
|---|---|---|---|---|---|---|
| | | | | | | |
| | 35 | | | | | |
| | 36 | 41 | | | | |
| | 23 | 20 | 21 | | | |
| | 25 | 28 | 30 | 18 | | |
| | 1 | 0 | 0 | 0 | 0 | |
Ind(100): the top 100 individually discriminant genes selected by t-test. Sum(98): the top 100 gene pairs selected by the Sum conversion type and t-test, of which 98 genes remained after removing repeated genes; the other labels are defined likewise.
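As a rough illustration of how such an overlap table can be computed, the sketch below flattens each list of selected gene pairs into a set of distinct genes and counts the genes shared by every two methods. The gene identifiers and the method labels "Ind", "Sum" and "Abs" are hypothetical placeholders, not values from the paper.

```python
from itertools import combinations


def genes_in_pairs(pairs):
    """Flatten gene pairs into the set of distinct genes they involve."""
    return {gene for pair in pairs for gene in pair}


def overlap_table(gene_sets):
    """Count the genes shared by every two named gene sets."""
    return {
        (a, b): len(gene_sets[a] & gene_sets[b])
        for a, b in combinations(gene_sets, 2)
    }


# Hypothetical selections, for illustration only.
ind_genes = {"g1", "g2", "g3", "g4"}
sum_pairs = [("g1", "g5"), ("g2", "g5"), ("g6", "g7")]
abs_pairs = [("g8", "g9"), ("g3", "g9")]

gene_sets = {
    "Ind": ind_genes,
    "Sum": genes_in_pairs(sum_pairs),  # repeated genes collapse automatically
    "Abs": genes_in_pairs(abs_pairs),
}
table = overlap_table(gene_sets)
print(table)
```

Using sets makes the "remove repeated genes" step implicit: a gene that appears in several selected pairs is counted once.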
Figure 3. Heat maps generated by the same top 10 synergic genes, which were selected by the abs conversion type. Each row corresponds to a pair of genes (A–E) or a gene (F), and each column corresponds to a sample. Gene expression levels are represented by their ranked values, normalized to [−1, 1].
The top10 synergic genes selected by abs conversion type in Prostate1 dataset.
| Pair-wise synergic genes | Related carcinoma and Ref. |
|---|---|
| | Breast cancer |
| | Breast cancer |
| | Colorectal cancer |
| | Oral carcinoma |
| | Breast cancer |
| | Oral carcinoma |
| | Bladder cancer |
| | Prostate cancer |
| | Skin cancer |
| | Prostate cancer |
Ten simulation datasets and their input features.
| Dataset | Function | Non-converted input features | Converted input features |
|---|---|---|---|
| 1 | | { | { |
| 2 | | { | { |
| … | … | … | … |
| 10 | | { | { |
Here, X is assigned random values between 0 and 1, and Y is binarized at its median. The sample size for each dataset is 200.
Prediction accuracy with converted and non-converted input features.
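The construction described in this note (uniform X, a continuous response binarized at its median, 200 samples) might be sketched as follows. The response function |X1 - X2| and the number of extra noise features are assumptions chosen for illustration, since the actual functions of the ten datasets are not recoverable from the table above.

```python
import random
from statistics import median


def simulate(n_samples=200, n_noise=3, seed=0):
    """Return (X_raw, X_converted, y) for one simulated dataset."""
    rng = random.Random(seed)
    # Every feature is drawn uniformly from [0, 1).
    X_raw = [
        [rng.random() for _ in range(2 + n_noise)] for _ in range(n_samples)
    ]
    # Hypothetical response: the gap between the first two features.
    response = [abs(row[0] - row[1]) for row in X_raw]
    # Binarize the response at its median, giving two balanced classes.
    cut = median(response)
    y = [1 if r > cut else 0 for r in response]
    # Converted input: the pair (X1, X2) is replaced by Z = |X1 - X2|.
    X_converted = [[abs(row[0] - row[1])] + row[2:] for row in X_raw]
    return X_raw, X_converted, y


X_raw, X_converted, y = simulate()
print(len(X_raw), len(X_raw[0]), len(X_converted[0]), sum(y))
```

Binarizing at the median keeps the two classes the same size, so a classifier that learns nothing scores close to 0.5.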
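| Dataset | SVM-RBF(a) Con | SVM-RBF(a) No con | SVM-linear(b) Con | SVM-linear(b) No con | SVM-poly(c) Con | SVM-poly(c) No con | SVM-sig(d) Con | SVM-sig(d) No con | RF Con | RF No con | ANNs Con | ANNs No con | DT Con | DT No con |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|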
| 1 | 0.985 | 0.985 | 0.990 | 0.605 | 1.00 | 0.56 | 0.990 | 0.540 | 1.00 | 0.865 | 1.00 | 0.975 | 0.995 | 0.895 |
| 2 | 0.970 | 0.905 | 0.975 | 0.600 | 0.985 | 0.640 | 0.995 | 0.455 | 0.960 | 0.795 | 0.990 | 0.930 | 0.965 | 0.785 |
| 3 | 0.985 | 0.860 | 0.975 | 0.465 | 0.980 | 0.575 | 0.975 | 0.500 | 0.860 | 0.780 | 0.995 | 0.910 | 0.900 | 0.705 |
| 4 | 0.960 | 0.810 | 0.925 | 0.515 | 0.985 | 0.400 | 0.980 | 0.420 | 0.850 | 0.655 | 0.985 | 0.825 | 0.865 | 0.695 |
| 5 | 0.970 | 0.790 | 0.910 | 0.535 | 0.965 | 0.550 | 0.980 | 0.460 | 0.810 | 0.615 | 0.995 | 0.780 | 0.840 | 0.600 |
| 6 | 0.945 | 0.815 | 0.860 | 0.500 | 0.985 | 0.475 | 0.980 | 0.485 | 0.770 | 0.620 | 0.990 | 0.770 | 0.795 | 0.615 |
| 7 | 0.940 | 0.715 | 0.905 | 0.530 | 0.980 | 0.500 | 0.980 | 0.535 | 0.865 | 0.610 | 0.985 | 0.670 | 0.795 | 0.585 |
| 8 | 0.970 | 0.675 | 0.955 | 0.410 | 0.970 | 0.455 | 0.955 | 0.455 | 0.760 | 0.545 | 0.995 | 0.695 | 0.760 | 0.610 |
| 9 | 0.955 | 0.660 | 0.885 | 0.515 | 0.960 | 0.460 | 0.955 | 0.435 | 0.790 | 0.510 | 0.990 | 0.665 | 0.770 | 0.580 |
| 10 | 0.955 | 0.655 | 0.860 | 0.480 | 0.955 | 0.525 | 0.975 | 0.525 | 0.735 | 0.520 | 0.960 | 0.600 | 0.750 | 0.625 |
Here, a: SVM with radial basis function (RBF) kernel; b: SVM with linear kernel; c: SVM with polynomial kernel; d: SVM with sigmoid kernel. RF: Random Forest; ANNs: artificial neural networks; DT: Decision Tree; Con: the converted input features; No con: the non-converted input features.
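The gap between the Con and No con columns can be reproduced in miniature. The sketch below is not one of the classifiers in the table: it swaps in a simple nearest-centroid classifier, and it assumes a response of the form |X1 - X2| and a held-out test set of 1000 samples, all of which are illustrative choices. On the raw pair (X1, X2) the two class centroids nearly coincide, so the classifier stays near chance; on the converted feature Z it separates the classes well.

```python
import random
from statistics import mean, median


def make_data(n, rng):
    """Two uniform features and their absolute difference as the response."""
    X = [(rng.random(), rng.random()) for _ in range(n)]
    return X, [abs(a - b) for a, b in X]


def fit_centroids(X, y):
    """Return the mean feature vector of each class (labels 0 and 1)."""
    return {
        c: [mean(col) for col in zip(*[x for x, lab in zip(X, y) if lab == c])]
        for c in (0, 1)
    }


def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    dist = {
        c: sum((a - b) ** 2 for a, b in zip(x, m)) for c, m in centroids.items()
    }
    return min(dist, key=dist.get)


def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == lab for x, lab in zip(X, y)) / len(y)


rng = random.Random(1)
X_train, r_train = make_data(200, rng)
X_test, r_test = make_data(1000, rng)
cut = median(r_train)  # binarize at the training median
y_train = [int(r > cut) for r in r_train]
y_test = [int(r > cut) for r in r_test]

# "No con": the raw pair (X1, X2). "Con": the single feature Z = |X1 - X2|.
Z_train = [(r,) for r in r_train]
Z_test = [(r,) for r in r_test]

acc_raw = accuracy(fit_centroids(X_train, y_train), X_test, y_test)
acc_con = accuracy(fit_centroids(Z_train, y_train), Z_test, y_test)
print(f"No con: {acc_raw:.3f}  Con: {acc_con:.3f}")
```

More flexible models (RBF-kernel SVMs, neural networks) can partly recover the interaction from raw inputs, which is consistent with the smaller gaps in their columns of the table.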
Prediction accuracies of 5-fold CV in different schemes of input features (%).
| Input features | Lung | Prostate2 | Cardiovascular | Average |
|---|---|---|---|---|
| Top10_ | 74.41 (43.81) | 84.20 (64.39) | 73.29 (63.22) | 77.30 (57.14) |
| Top20_ | 76.49 (43.31) | 85.13 (61.08) | 74.59 (61.65) | 78.74 (55.35) |
| Top40_ | 75.93 (46.02) | 84.20 (61.09) | 80.96 (62.95) | 80.36 (56.69) |
| Top5_ | 76.54 (47.03) | 74.52 (62.25) | 75.67 (62.99) | 75.58 (57.42) |
| Top10_ | 84.44 (50.28) | 76.18 (55.90) | 84.40 (61.38) | 81.67 (55.85) |
| Top20_ | 83.98 (47.06) | 80.20 (62.96) | 89.70 (62.17) | 84.63 (57.40) |
| Top10_ | 82.33 (48.17) | 86.34 (62.27) | 82.55 (63.22) | 83.74 (57.89) |
| Top20_ | 83.91 (40.11) | 86.31 (57.54) | 87.04 (62.44) | 85.75 (53.36) |
Ind denotes the individually discriminant genes and Syn denotes the synergic genes. Numbers in parentheses are the results of the label randomization test.
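A label randomization test of the kind reported in parentheses can be sketched as follows: run 5-fold cross-validation once with the true labels and once with shuffled labels, and check that the second accuracy falls to about chance. The single Gaussian feature, the nearest-centroid classifier and the fold assignment are assumptions for illustration; the record does not describe the paper's exact procedure.

```python
import random
from statistics import mean


def fit(xs, ys):
    """Nearest-centroid model on one feature: store the two class means."""
    m0 = mean([x for x, y in zip(xs, ys) if y == 0])
    m1 = mean([x for x, y in zip(xs, ys) if y == 1])
    return m0, m1


def predict(model, x):
    m0, m1 = model
    return 1 if abs(x - m1) < abs(x - m0) else 0


def cv_accuracy(xs, ys, k=5, seed=0):
    """Pooled accuracy over k cross-validation folds."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(k):
        test_idx = idx[f::k]  # every k-th shuffled index forms one fold
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct += sum(predict(model, xs[i]) == ys[i] for i in test_idx)
    return correct / len(ys)


# Hypothetical data: one feature whose mean differs by 2 between classes.
rng = random.Random(3)
ys = [0] * 100 + [1] * 100
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

acc_true = cv_accuracy(xs, ys)

shuffled = ys[:]
rng.shuffle(shuffled)  # break any link between features and labels
acc_random = cv_accuracy(xs, shuffled)

print(f"true labels: {acc_true:.3f}  randomized labels: {acc_random:.3f}")
```

If the randomized-label accuracy stayed well above chance, that would suggest information leaking from the test folds into feature selection or training.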
Prediction accuracies of 5-fold CV in different conversion types (%).
| Features | Lung | Prostate2 | Cardiovascular | Average |
|---|---|---|---|---|
| Top20_ | 76.49 | 85.13 | 74.59 | 78.73 |
| Top10_ | 80.68 | 81.61 | 78.83 | 80.37 |
| Top10_ | 83.37 | 85.84 | 76.97 | 82.06 |
| Top10_ | 80.81 | 81.61 | 79.09 | 80.50 |
| Top10_ | 78.08 | 84.68 | 79.38 | 80.71 |
| Top10_ | 84.44 | 76.18 | 84.40 | 81.67 |
| Top10_ | 79.70 | 85.14 | 80.42 | 81.75 |
| Top10_ | 82.33 | 84.44 | 83.33 | 83.37 |
| Top10_ | 78.11 | | 79.64 | 81.43 |
| Top10_ | 81.35 | 84.43 | 76.21 | 80.66 |
| Top10_ | | 86.31 | | |
Top20_Ind: the top 20 individually discriminant genes selected by t-test. Top10_Sum: the top 10 gene pairs selected by the Sum conversion type + t-test; the other labels are defined likewise.
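Comparing conversion types for a given gene pair amounts to applying each conversion, scoring the result with a t statistic, and keeping the best. The sketch below covers only the two conversion types named in this record; the form Z = X1 + X2 for the Sum type, the toy labels and the use of Welch's t statistic are assumptions.

```python
import math
import random
from statistics import mean, variance

# Only the conversion types named in this record are included.
CONVERSIONS = {
    "abs": lambda a, b: abs(a - b),  # Z = |X1 - X2|, as stated in the abstract
    "sum": lambda a, b: a + b,       # assumed form of the "Sum" conversion
}


def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def best_conversion(x1, x2, labels):
    """Return the conversion with the largest |t|, plus all scores."""
    scores = {}
    for name, convert in CONVERSIONS.items():
        z = [convert(a, b) for a, b in zip(x1, x2)]
        pos = [v for v, y in zip(z, labels) if y == 1]
        neg = [v for v, y in zip(z, labels) if y == 0]
        scores[name] = abs(welch_t(pos, neg))
    return max(scores, key=scores.get), scores


# Hypothetical pair with two different labelings of the same samples.
rng = random.Random(2)
x1 = [rng.random() for _ in range(200)]
x2 = [rng.random() for _ in range(200)]
labels_sum = [int(a + b > 1.0) for a, b in zip(x1, x2)]
labels_abs = [int(abs(a - b) > 0.3) for a, b in zip(x1, x2)]

print(best_conversion(x1, x2, labels_sum)[0])
print(best_conversion(x1, x2, labels_abs)[0])
```

Keeping the conversions in a dictionary means that further conversion types can be added without changing the scoring loop.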