Xue Zhang, Marcio Luis Acencio, Ney Lemke.
Abstract
Essential proteins/genes are indispensable to the survival or reproduction of an organism, and the deletion of such essential proteins results in lethality or infertility. The identification of essential genes is important not only for understanding the minimal requirements for survival of an organism, but also for finding human disease genes and new drug targets. Experimental methods for identifying essential genes are costly, time-consuming, and laborious. With the accumulation of sequenced genome data and high-throughput experimental data, many computational methods for identifying essential proteins have been proposed as useful complements to experimental methods. In this review, we present the state-of-the-art methods for identifying essential genes and proteins based on machine learning and network topological features, point out the progress and limitations of current methods, and discuss the challenges and directions for further research.
Keywords: essential genes/proteins; machine learning; network topological features; prediction models; systems biology
Year: 2016 PMID: 27014079 PMCID: PMC4781880 DOI: 10.3389/fphys.2016.00075
Source DB: PubMed Journal: Front Physiol ISSN: 1664-042X Impact factor: 4.566
Summary of prediction methods using machine learning and network topological features alone or combined with other features.
| Machine learning method | Network | Topological features | Other features | Training/testing organisms | Reference |
|---|---|---|---|---|---|
| NN, SVM | PIN, GCN | DC | No | Same | Chen and Xu, |
| WKNN, SVM, ensemble | PIN | DC | Sequence-related | Same | Saha and Heber, |
| NB | PIN | DC | Sequence-related | Same | Gustafson et al., |
| C4.5 decision tree | PIN, TRN, MN | DC | No | Same | Silva et al., |
| SVM | PIN | DC, BC, CC, KL, CCo, EI, CFD | Sequence-related | Same | Hwang et al., |
| Decision tree-based ensemble for prediction; single C4.5 decision tree for description | PIN, TRN, MN | DC, BC, CC, CCo, identicalness | Related to functional annotation | Same | Acencio and Lemke, |
| GEP | PIN | DC, BC, CC, SC, EC, IC, NC, PeC, WDC, ION | Related to functional annotation | Same | Zhong et al., |
| SVM | MN | RUP, PUP, ND, APL, LSP, NS, NP, NNR, NNNR, CCV, DIR, CP, LS, NDR, NDC, NDRD, NDCD, NDCR, NDCC, NDCRD, NDCCD, BC, CC, EC, eccentricity centrality | No | Different | Plaimas et al., |
| Ensemble | GCN | DC, BC | Sequence and gene expression-related | Different | Deng et al., |
| FWM (NB, logistic regression, genetic algorithm) | PIN | DC, CC, BC, CCo | Sequence and gene expression-related | Different | Cheng et al., |
| NB | PIN | DC, CC, BC, CCo | Sequence and gene expression-related | Different | Cheng et al., |
| Ensemble | GCN | DC, BC | Sequence and gene expression-related | Different | Lu et al., |
Abbreviations: NN, neural network; WKNN, weighted k-nearest-neighbor; SVM, support vector machine; NB, naive Bayes; GEP, gene expression programming; FWM, feature-based weighted naive Bayes model; PIN, protein-protein interaction network; GCN, gene co-expression network; TRN, transcriptional regulatory network; MN, metabolic network; DC, degree centrality; BC, betweenness centrality; CC, closeness centrality; KL, clique level; CCo, clustering coefficient; EI, essentiality index; CFD, common function degree; SC, subgraph centrality; EC, eigenvector centrality; IC, information centrality; NC, edge-clustering coefficient centrality; WDC, weighted degree centrality; RUP, reachable/unreachable products; PUP, percentage of unreachable products; ND, number of deviations; APL, average path length; LSP, length of the shortest path; NS, number of substrates; NP, number of products; NNR, number of neighboring reactions; NNNR, number of neighbors of neighboring reactions; CCV, clustering coefficient value; DIR, directionality of a reaction; CP, choke point; LS, load score; NDR, number of damaged reactions; NDC, number of damaged compounds; NDRD, number of damaged reactions having no deviations; NDCD, number of damaged compounds having no deviations; NDCR, number of damaged choke point reactions; NDCC, number of damaged choke point compounds; NDCRD, number of damaged choke point reactions having no deviations; NDCCD, number of damaged choke point compounds having no deviations.
Same, the training and testing data sets come from the same organism; Different, the training and testing data sets come from different organisms.
Figure 1. A toy network showing the calculation of network topological features. Node C (yellow node) serves as the example. The degree centrality (DC) of node C is 4 because it has four edges, connecting it to nodes A, B, D, and E. The betweenness centrality (BC) of node C is the fraction of shortest paths between pairs of other nodes on which node C acts as a bridge. There are six shortest paths between the pairs of other nodes (ACD, ACE, AB, BCD, BCE, DE), and node C acts as a bridge on four of them, so the BC of node C is 4/6 = 0.67. The closeness centrality (CC) of node C is the reciprocal of the average distance from node C to the other nodes. Because node C is directly connected to each of the other four nodes, this average distance is 1, so the CC of node C is 1. The clustering coefficient (CCo) of node C is the ratio of the number of actual connections among its neighbors (A, B, D, and E), which is 2 in this case, to the number of all possible connections among them (6 in this case). Therefore, the CCo of node C is 2/6 = 0.33.
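The four features in the caption can be reproduced with a short script. The following is a minimal sketch in plain Python, not the code used by any of the reviewed methods. The edge list is an assumption reconstructed from the caption: node C is linked to A, B, D, and E, and the A-B and D-E links are inferred from the direct shortest paths AB and DE. The function names are hypothetical, and the betweenness value is normalized by the number of node pairs, as in the caption.

```python
from collections import deque
from itertools import combinations

# Assumed edge list, reconstructed from the figure caption (not from the figure itself).
EDGES = [("C", "A"), ("C", "B"), ("C", "D"), ("C", "E"), ("A", "B"), ("D", "E")]


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs(adj, source):
    """Return (distances, shortest-path counts) from source to all reachable nodes."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:  # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u lies on a shortest path to w
                sigma[w] += sigma[u]
    return dist, sigma


def degree(adj, node):
    """Degree centrality (DC): number of edges incident to the node."""
    return len(adj[node])


def betweenness(adj, node):
    """BC: share of shortest paths between other node pairs that pass through node.

    Assumes a connected graph.
    """
    others = sorted(n for n in adj if n != node)
    dist_v, sigma_v = bfs(adj, node)
    pairs = list(combinations(others, 2))
    total = 0.0
    for s, t in pairs:
        dist_s, sigma_s = bfs(adj, s)
        if dist_s[node] + dist_v[t] == dist_s[t]:
            total += sigma_s[node] * sigma_v[t] / sigma_s[t]
    return total / len(pairs)


def closeness(adj, node):
    """CC: reciprocal of the average distance to all other nodes (connected graph)."""
    dist, _ = bfs(adj, node)
    return (len(adj) - 1) / sum(d for n, d in dist.items() if n != node)


def clustering(adj, node):
    """CCo: actual links among the node's neighbors divided by possible links."""
    neighbors = sorted(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    actual = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return actual / (k * (k - 1) / 2)


ADJ = build_adjacency(EDGES)
print("DC :", degree(ADJ, "C"))
print("BC :", round(betweenness(ADJ, "C"), 2))
print("CC :", round(closeness(ADJ, "C"), 2))
print("CCo:", round(clustering(ADJ, "C"), 2))
```

In practice these features would be computed on a full interaction network with a graph library rather than by hand; the sketch only mirrors the arithmetic of the toy example.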