Wei Dai, Qi Chang, Wei Peng, Jiancheng Zhong, Yongjiang Li.
Abstract
Essential genes are genes that are indispensable for cell survival and fertility. Studying human essential genes not only helps scientists reveal the underlying biological mechanisms of a human cell but also guides disease treatment. The recent publication of human essential gene data makes it possible to train a machine-learning classifier on features of known human essential genes and to use that classifier to predict new ones. Previous studies have found that the essentiality of a gene is closely related to its properties in the protein-protein interaction (PPI) network. In this work, we propose a novel supervised method that predicts human essential genes by embedding the PPI network. Our approach performs a biased random walk on the network to obtain each node's network context. The resulting node pairs are then fed into an artificial neural network to learn representation vectors that maximally preserve the network structure and the properties of the nodes in the network. Finally, these features are passed to an SVM classifier to predict human essential genes. Prediction results on two human PPI networks show that our method outperforms methods that use either gene sequence information or gene centrality properties in the network as input features. Moreover, it also outperforms methods that represent the PPI network with other existing embedding approaches.
Keywords: feature representation; human essential genes; network embedding; protein–protein interaction network
Year: 2020 PMID: 32023848 PMCID: PMC7074227 DOI: 10.3390/genes11020153
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. An overview of identifying human essential genes by network embedding of the protein–protein interaction (PPI) network.
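The first two steps of the pipeline (a biased random walk, then node pairs as training input) can be sketched as follows. This is only an illustration under stated assumptions: the abstract says the walk is biased but does not give the bias rule, so the sketch borrows the node2vec-style second-order scheme with a return parameter `p` and an in-out parameter `q`. The names `walk_length` and `window` are hypothetical and are not claimed to correspond to the w and c parameters in the figure captions. The toy graph and node labels are made up.

```python
import random
from collections import defaultdict


def build_graph(edges):
    """Build undirected adjacency sets from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def biased_walk(adj, start, walk_length, p, q, rng):
    """One second-order biased walk (ASSUMED node2vec-style bias)."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # No previous node yet: take a uniform first step.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def context_pairs(walk, window):
    """(center, context) pairs for nodes within `window` steps of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


# Toy network with made-up node labels (not real gene identifiers).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = build_graph(edges)
rng = random.Random(0)
walk = biased_walk(adj, "A", walk_length=6, p=1.0, q=2.0, rng=rng)
pairs = context_pairs(walk, window=2)
print(walk)
print(len(pairs), "node pairs")
```

In a full pipeline, pairs like these would be the training input for the embedding network, and the learned vectors would then be passed to the classifier.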
Details of two human PPI networks.
| Data set | Genes | Interactions | Essential genes | Training and testing genes |
|---|---|---|---|---|
| FIs | 12277 | 230243 | 1359 | 6747 |
| InWeb_IM | 17428 | 625641 | 1512 | 10548 |
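The 1:4 and 1:6 essential to non-essential ratios used in the comparison tables appear to follow from the counts above. The short check below assumes that the "Training and testing genes" column counts essential and non-essential genes together; that reading is an assumption, not something the table states.

```python
# Counts copied from the table of the two human PPI networks.
datasets = {
    "FIs": {"essential": 1359, "train_test": 6747},
    "InWeb_IM": {"essential": 1512, "train_test": 10548},
}


def non_essential_per_essential(essential, train_test):
    """Return (non-essential count, non-essential genes per essential gene).

    ASSUMPTION: train_test = essential + non-essential genes.
    """
    non_essential = train_test - essential
    return non_essential, non_essential / essential


for name, d in datasets.items():
    count, ratio = non_essential_per_essential(d["essential"], d["train_test"])
    print(f"{name}: {count} non-essential genes, about 1:{ratio:.2f}")
```

Under that assumption the FIs set comes out at roughly 1:4 and the InWeb_IM set at roughly 1:6, which matches the unbalanced settings reported in the tables.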
Figure 2. Parameter adjustments on the FIs dataset. Different colors indicate different data ratios. (a) Fixing p, w, and c and adjusting q; (b) fixing q, w, and c and adjusting p; (c) fixing p, q, and c and adjusting w; and (d) fixing p, q, and w and adjusting c.
Figure 3. Parameter adjustments on the InWeb_IM dataset. Different colors indicate different data ratios. (a) Fixing p, w, and c and adjusting q; (b) fixing q, w, and c and adjusting p; (c) fixing p, q, and c and adjusting w; and (d) fixing p, q, and w and adjusting c.
Performance comparison between our method and existing methods with different essential to non-essential gene ratios on the FIs dataset.
| Methods | Precision | Recall | SP | NPV | F-measure | MCC | ACC | AUC | AP |
|---|---|---|---|---|---|---|---|---|---|
| DeepWalk (1:4) | 0.771 |  | 0.954 |  | 0.688 | 0.625 | 0.887 | 0.905 | 0.734 |
| LINE (1:4) |  | 0.394 |  | 0.866 | 0.539 | 0.513 | 0.863 | 0.856 | 0.693 |
| Centrality (1:4) | 0.608 | 0.523 | 0.917 | 0.879 | 0.562 | 0.464 | 0.836 | 0.753 | 0.550 |
| Z-curve (1:4) | 0.526 | 0.530 | 0.880 | 0.882 | 0.528 | 0.409 | 0.810 | 0.834 | 0.522 |
| Our method (1:4) | 0.822 | 0.597 | 0.968 | 0.905 |  |  |  |  |  |
| DeepWalk (1:1) | 0.833 | 0.846 | 0.830 |  | 0.839 | 0.676 | 0.838 | 0.907 | 0.896 |
| LINE (1:1) | 0.705 |  | 0.647 | 0.808 | 0.770 | 0.503 | 0.747 | 0.839 | 0.848 |
| Centrality (1:1) | 0.852 | 0.553 |  | 0.669 | 0.671 | 0.488 | 0.728 | 0.760 | 0.797 |
| Z-curve (1:1) | 0.733 | 0.800 | 0.708 | 0.780 | 0.765 | 0.511 | 0.754 | 0.824 | 0.783 |
| Our method (1:1) |  | 0.836 | 0.863 | 0.840 |  |  |  |  |  |
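The threshold-based columns in the comparison tables (Precision, Recall, SP, NPV, F-measure, MCC, ACC) can all be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the counts are hypothetical and chosen only to mimic a 1:4 split, and they are not taken from the paper. AUC and AP depend on ranked prediction scores and are not covered here.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (no zero-division guards)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity
    specificity = tn / (tn + fp)   # SP
    npv = tn / (tn + fn)           # negative predictive value
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Precision": precision,
        "Recall": recall,
        "SP": specificity,
        "NPV": npv,
        "F-measure": f_measure,
        "MCC": mcc,
        "ACC": accuracy,
    }


# HYPOTHETICAL counts: 100 essential and 400 non-essential test genes.
metrics = confusion_metrics(tp=60, fp=15, tn=385, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With an unbalanced split, SP, NPV, and ACC are dominated by the larger negative class, which is one reason the tables also report F-measure and MCC.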
Performance comparison between our method and existing methods with different essential to non-essential gene ratios on the InWeb_IM dataset.
| Methods | Precision | Recall | SP | NPV | F-measure | MCC | ACC | AUC | AP |
|---|---|---|---|---|---|---|---|---|---|
| DeepWalk (1:6) | 0.749 | 0.515 | 0.971 | 0.923 | 0.610 | 0.571 | 0.906 | 0.904 | 0.667 |
| LINE (1:6) | 0.819 | 0.165 |  | 0.877 | 0.275 | 0.332 | 0.875 | 0.832 | 0.547 |
| Centrality (1:6) | 0.527 |  | 0.916 | 0.926 | 0.544 | 0.465 | 0.865 | 0.828 | 0.511 |
| Z-curve (1:6) | 0.485 | 0.462 | 0.918 | 0.911 | 0.473 | 0.388 | 0.853 | 0.840 | 0.456 |
| Our method (1:6) |  | 0.550 | 0.983 |  |  |  |  |  |  |
| DeepWalk (1:1) | 0.817 | 0.848 | 0.811 | 0.842 | 0.832 | 0.659 | 0.829 | 0.908 | 0.894 |
| LINE (1:1) | 0.641 |  | 0.492 | 0.839 | 0.750 | 0.436 | 0.699 | 0.804 | 0.792 |
| Centrality (1:1) | 0.816 | 0.713 | 0.839 | 0.745 | 0.761 | 0.557 | 0.776 | 0.851 | 0.841 |
| Z-curve (1:1) | 0.730 | 0.777 | 0.713 | 0.762 | 0.753 | 0.491 | 0.745 | 0.827 | 0.801 |
| Our method (1:1) |  | 0.858 |  |  |  |  |  |  |  |
Performance comparison of our method with different classifiers and different essential to non-essential gene ratios on the FIs dataset.
| Methods | Precision | Recall | SP | NPV | F-measure | MCC | Accuracy | AUC | AP |
|---|---|---|---|---|---|---|---|---|---|
| DNN (1 layer, 1:4) | 0.258 | 0.770 | 0.220 | 0.805 | 0.234 | 0.026 | 0.667 | 0.460 | 0.242 |
| DNN (3 layers, 1:4) | 0.248 | 0.407 | 0.690 | 0.822 | 0.308 | 0.083 | 0.633 | 0.534 | 0.245 |
| DT (1:4) | 0.623 | 0.619 | 0.906 | 0.904 | 0.621 | 0.526 | 0.848 | 0.763 | 0.660 |
| NB (1:4) | 0.553 |  | 0.850 |  | 0.632 | 0.531 | 0.827 | 0.880 | 0.683 |
| KNN (1:4) | 0.795 | 0.628 | 0.959 | 0.911 | 0.702 | 0.644 | 0.893 | 0.889 | 0.782 |
| LR (1:4) | 0.787 | 0.585 | 0.960 | 0.902 | 0.671 | 0.613 | 0.885 | 0.914 | 0.755 |
| SVM (1:4) | 0.822 | 0.597 |  | 0.905 | 0.692 | 0.641 | 0.893 | 0.913 | 0.769 |
| RF (1:4) | 0.826 | 0.646 | 0.966 | 0.916 | 0.726 | 0.674 | 0.902 | 0.927 | 0.799 |
| ET (1:4) |  | 0.648 | 0.966 | 0.916 |  |  |  |  |  |
| DNN (1 layer, 1:1) | 0.554 | 0.452 | 0.503 | 0.504 | 0.527 | 0.007 | 0.503 | 0.500 | 0.519 |
| DNN (3 layers, 1:1) | 0.535 | 0.538 | 0.532 | 0.536 | 0.536 | 0.070 | 0.535 | 0.562 | 0.553 |
| DT (1:1) | 0.768 | 0.788 | 0.762 | 0.783 | 0.778 | 0.551 | 0.775 | 0.775 | 0.831 |
| NB (1:1) | 0.822 | 0.765 | 0.834 | 0.780 | 0.792 | 0.600 | 0.799 | 0.876 | 0.873 |
| KNN (1:1) | 0.837 | 0.805 | 0.844 | 0.812 | 0.821 | 0.649 | 0.824 | 0.895 | 0.906 |
| LR (1:1) | 0.839 | 0.827 | 0.841 | 0.829 | 0.833 | 0.668 | 0.834 | 0.910 | 0.907 |
| SVM (1:1) | 0.859 | 0.836 | 0.863 | 0.840 | 0.847 | 0.699 | 0.849 | 0.914 | 0.902 |
| RF (1:1) | 0.859 |  | 0.861 |  | 0.851 | 0.705 | 0.852 | 0.921 | 0.921 |
| ET (1:1) |  | 0.840 |  | 0.845 |  |  |  |  |  |
Performance comparison of our method with different classifiers and different essential to non-essential gene ratios on the InWeb_IM dataset.
| Methods | Precision | Recall | SP | NPV | F-measure | MCC | Accuracy | AUC | AP |
|---|---|---|---|---|---|---|---|---|---|
| DNN (1 layer, 1:6) | 0.355 | 0.772 | 0.206 | 0.877 | 0.261 | 0.103 | 0.712 | 0.499 | 0.206 |
| DNN (3 layers, 1:6) | 0.350 | 0.255 | 0.921 | 0.881 | 0.295 | 0.202 | 0.826 | 0.483 | 0.221 |
| DT (1:6) | 0.584 | 0.588 | 0.930 | 0.931 | 0.586 | 0.517 | 0.881 | 0.759 | 0.616 |
| NB (1:6) | 0.456 | 0.697 | 0.861 |  | 0.551 | 0.473 | 0.837 | 0.877 | 0.615 |
| KNN (1:6) | 0.778 | 0.564 | 0.973 | 0.930 | 0.654 | 0.617 | 0.914 | 0.888 | 0.740 |
| LR (1:6) | 0.785 | 0.591 | 0.973 | 0.934 | 0.675 | 0.637 | 0.918 | 0.931 | 0.749 |
| SVM (1:6) |  | 0.550 |  | 0.929 | 0.665 | 0.641 | 0.921 | 0.915 | 0.762 |
| RF (1:6) | 0.799 |  | 0.974 | 0.938 |  | 0.659 | 0.923 | 0.940 | 0.776 |
| ET (1:6) | 0.816 | 0.600 | 0.977 | 0.936 | 0.692 |  |  |  |  |
| DNN (1 layer, 1:1) | 0.652 | 0.504 | 0.568 | 0.591 | 0.607 | 0.157 | 0.578 | 0.603 | 0.640 |
| DNN (3 layers, 1:1) | 0.737 | 0.497 | 0.823 | 0.620 | 0.593 | 0.338 | 0.659 | 0.637 | 0.692 |
| DT (1:1) | 0.802 | 0.791 | 0.805 | 0.794 | 0.797 | 0.596 | 0.798 | 0.798 | 0.849 |
| NB (1:1) | 0.836 | 0.708 | 0.862 | 0.747 | 0.767 | 0.576 | 0.785 | 0.874 | 0.872 |
| KNN (1:1) | 0.853 | 0.843 | 0.854 | 0.845 | 0.848 | 0.697 | 0.849 | 0.904 | 0.906 |
| LR (1:1) |  | 0.834 |  | 0.840 | 0.849 | 0.704 | 0.852 | 0.925 | 0.920 |
| SVM (1:1) | 0.855 | 0.858 | 0.854 | 0.858 | 0.857 | 0.713 | 0.856 | 0.928 | 0.921 |
| RF (1:1) | 0.844 |  | 0.836 |  | 0.864 | 0.723 | 0.861 | 0.932 | 0.920 |
| ET (1:1) | 0.853 | 0.879 | 0.849 | 0.876 |  |  |  |  |  |
Pearson's correlation coefficients between the centrality indices of the human essential genes in the original PPI network and in the rebuilt network.
| Data set | DC | BC | CC | NC | IC |
|---|---|---|---|---|---|
| FIs | 0.9262 | 0.8040 | 0.9998 | 0.9911 | 0.9794 |
| InWeb_IM | 0.9617 | 0.8372 | 0.9999 | 0.9938 | 0.9839 |
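As a rough illustration of how one entry of this table could be computed, the sketch below takes node degree as the centrality index (the DC column) and computes Pearson's correlation between the degrees in two toy networks. Both edge lists are made up, and how the rebuilt network is actually derived from the learned embeddings is not described in this excerpt, so the `rebuilt` edges here are purely hypothetical.

```python
import math


def degree_vector(edges, nodes):
    """Degree of each node in `nodes`, in the same order."""
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [degree[n] for n in nodes]


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


nodes = ["A", "B", "C", "D"]  # made-up node labels
original = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
rebuilt = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]  # HYPOTHETICAL

r = pearson(degree_vector(original, nodes), degree_vector(rebuilt, nodes))
print(f"Pearson r for degree centrality: {r:.4f}")
```

The same `pearson` helper would apply unchanged to any other centrality index once its per-gene values are available for both networks.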
Figure 4. k-means clustering of essential genes using two different feature types on the FIs dataset. (a) Connection relationship features; (b) feature representation.
Figure 5. k-means clustering of essential genes using two different feature types on the InWeb_IM dataset. (a) Connection relationship features; (b) feature representation.
Details of the subgroups aggregated by two different features.
| Data Set | Features | Max Size | Min Size | Median Size | Silhouette | Dunn | Avg(-log(p-value)) |
|---|---|---|---|---|---|---|---|
| FIs | Feature representation | 289 | 21 | 51.5 | 0.3242 | 0.59 | 63.50 |
| FIs | Connection relationship | 902 | 6 | 19.5 | 0.2448 | 0.58 | 45.01 |
| InWeb_IM | Feature representation | 176 | 20 | 61.5 | 0.2279 | 0.63 | 54.95 |
| InWeb_IM | Connection relationship | 1022 | 1 | 13 | 0.1619 | 0.25 | 30.11 |
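The Dunn column rewards clusterings that are both compact and well separated. Several variants of the index exist, and this excerpt does not say which one was used, so the sketch below assumes the most common form: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster, with Euclidean distance. The toy points are made up.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Min between-cluster distance over max within-cluster diameter.

    ASSUMED variant: single-linkage separation, complete diameter, Euclidean
    distance. Needs at least two clusters and one cluster with two distinct points.
    """
    max_diameter = max(
        (math.dist(a, b) for pts in clusters for a, b in combinations(pts, 2)),
        default=0.0,
    )
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter


# Two made-up 2-D clusters: tight (diameter 1) and far apart (separation 3).
clusters = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(3.0, 0.0), (3.0, 1.0)],
]
dunn = dunn_index(clusters)
print(f"Dunn index: {dunn:.2f}")
```

Higher values indicate better-separated clusters, which is consistent with the table reporting a larger Dunn value for the feature representation on both datasets.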
Ten example essential gene clusters with the smallest p-values gathered by feature representation on FIs dataset.
| GO ID | Description | p-value | Genes in Cluster | Gene Ratio |
|---|---|---|---|---|
| GO:0000377 | RNA splicing, via transesterification reactions with bulged adenosine as nucleophile | 6.64 × 10^−191 |  | 110/116 |
| GO:0070125 | mitochondrial translational elongation | 6.97 × 10^−158 |  | 64/67 |
| GO:0000184 | nuclear-transcribed mRNA catabolic process, nonsense-mediated decay | 7.98 × 10^−107 |  | 58/96 |
| GO:0000819 | sister chromatid segregation | 1.97 × 10^−94 |  | 61/91 |
| GO:0006364 | rRNA processing | 5.08 × 10^−94 |  | 50/50 |
| GO:0031145 | anaphase-promoting complex-dependent catabolic process | 5.71 × 10^−89 |  | 40/51 |
| GO:0098781 | ncRNA transcription | 2.19 × 10^−81 |  | 44/80 |
| GO:0006270 | DNA replication initiation | 2.12 × 10^−66 |  | 25/28 |
| GO:0048193 | Golgi vesicle transport | 1.46 × 10^−61 |  | 40/47 |
| GO:0042254 | ribosome biogenesis | 5.70 × 10^−53 |  | 43/81 |