Karthik Azhagesan, Balaraman Ravindran, Karthik Raman.
Abstract
Machine learning approaches to predict essential genes have gained considerable traction in recent years. These approaches predominantly make use of sequence and network-based features to predict essential genes. However, the scope of network-based features used by the existing approaches is very narrow. Further, many of these studies focus on predicting essential genes within the same organism, so their models cannot be readily used to predict essential genes across organisms. Therefore, there is a clear need for a method that can predict essential genes across organisms by leveraging network-based features. In this study, we extract several sets of network-based features from protein-protein association networks available from the STRING database. Our network features include some common measures of centrality, and also some novel recursive measures recently proposed in the social network literature. We extract hundreds of network-based features from the networks of 27 diverse organisms to predict the essentiality of more than 87,000 genes. Our results show that network-based features are statistically significantly better at classifying essential genes across diverse bacterial species than the current state-of-the-art methods, which use mostly sequence and a few 'conventional' network-based features. Our diverse set of network properties gave an AUROC of 0.847 and a precision of 0.320 across 27 organisms. When we augmented the complete set of network features with sequence-derived features, we achieved an improved AUROC of 0.857 and a precision of 0.335. We also constructed a reduced set of 100 sequence and network features, which gave comparable performance. Further, using leave-one-species-out validation, we show that our features are useful for predicting essential genes in new organisms.
Our network features capture the local, global and neighbourhood properties of the network and are hence effective for prediction of essential genes across diverse organisms, even in the absence of other complex biological knowledge. Our approach can be readily exploited to predict essentiality for organisms in interactome databases such as STRING, where both network and sequence are readily available. All code is available at https://github.com/RamanLab/nbfpeg.
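The leave-one-species-out validation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `leave_one_species_out`, the toy species labels and the one-label-per-gene input format are all assumptions made for illustration. The idea is simply that every gene of one organism is held out for testing while the genes of all other organisms are used for training.

```python
from collections import defaultdict


def leave_one_species_out(species_labels):
    """Yield (held_out_species, train_indices, test_indices) for each species.

    species_labels holds one species identifier per gene
    (a hypothetical input format, chosen for this sketch).
    """
    by_species = defaultdict(list)
    for idx, species in enumerate(species_labels):
        by_species[species].append(idx)

    for held_out in sorted(by_species):
        # All genes of the held-out organism form the test set.
        test_idx = by_species[held_out]
        # Genes of every other organism form the training set.
        train_idx = sorted(
            i
            for species, indices in by_species.items()
            if species != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


# Toy data: five genes from three made-up organisms.
species = ["org_a", "org_a", "org_b", "org_c", "org_b"]
splits = list(leave_one_species_out(species))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Because no gene from the held-out organism is seen during training, the test score for each split estimates how well a model would transfer to an organism it has never encountered.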
Year: 2018 PMID: 30543651 PMCID: PMC6292609 DOI: 10.1371/journal.pone.0208722
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Performance comparison of various feature sets for classification of essential genes.
| Method/Feature set | AUROC | Precision | Recall |
|---|---|---|---|
| | 0.784 | 0.254 | 0.688 |
| | 0.705 | 0.255 | 0.663 |
| | 0.800 | 0.289 | 0.702 |
| | 0.838 | 0.321 | 0.754 |
| | 0.830 | 0.317 | 0.733 |
| | 0.835 | 0.321 | 0.742 |
The values in bold highlight the better-performing methods, based on the AUROC, precision and recall measures. This table compares the different network-based feature sets with the Liu et al., ZUPLS and naïve network baselines. The combined network properties, “12 centrality measures”, “14 network measures” and the “ReFeX feature set” are all more effective than the baseline methods at transferring essential-gene predictions across organisms. Adding sequence-based features to the network-based features improves performance further. All improvements over the baselines are statistically significant, as shown in S5 Table and described in Methods. The larger feature sets, which include the smaller ones as subsets, perform significantly better than those subsets, as shown in S6 Table. The area under the receiver operating characteristic curve (AUROC) is the area under the plot of true positive rate against false positive rate. Precision = True Positives / (True Positives + False Positives) and Recall = True Positives / (True Positives + False Negatives). The area under the precision-recall curve (AUPRC) is the area under the plot of precision against recall; the AUPRC results are in S7 Table.
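As a minimal sketch of the measures defined above, the following computes precision and recall from binary predictions, and AUROC through its rank-based interpretation: the probability that a randomly chosen positive (essential gene) receives a higher score than a randomly chosen negative, with ties counted as one half. The function names, the toy labels and scores, and the 0.5 decision threshold are assumptions for illustration and are not taken from the paper.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = essential."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Quadratic in the number of genes, which is fine for a toy example.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: three essential and three non-essential genes (made up).
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

precision, recall = precision_recall(y_true, y_pred)
area = auroc(y_true, scores)
print(precision, recall, area)
```

Precision and recall depend on the chosen threshold, whereas AUROC summarises ranking quality over all thresholds, which is one reason both kinds of measure are reported together.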
Centrality measures and their significance across 27 organisms.
| Centrality measure | Bootstrap test | Wilcoxon Rank-Sum test |
|---|---|---|
| Edge Clustering Coefficient Centrality | 0 | 8 |
| Betweenness Centrality | 27 | 23 |
| Load Centrality | 27 | 24 |
| Random Walk Betweenness Centrality | 19 | 25 |
| Information Centrality | 19 | 26 |
| Closeness Centrality | 27 | 26 |
| Degree Centrality | 27 | 26 |
| Harmonic Centrality | 27 | 26 |
| PageRank | 27 | 26 |
| Reaching Centrality | 27 | 26 |
| Subgraph Centrality | 27 | 26 |
| Eigenvector Centrality | 27 | 27 |
The table shows the number of organisms (out of 27) in which a given measure was found to be significant (p-value < 0.05). For further details on the p-value computation, refer to the text.
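To make two of the listed measures concrete, here is a small sketch that computes degree centrality and harmonic centrality on a toy undirected network using only the standard library. The gene names and edges are made up, and the normalisation choices (degree divided by n - 1, harmonic centrality left as an unnormalised sum of reciprocal shortest-path distances) are assumptions of this sketch; the paper's own implementation may differ.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


def harmonic_centrality(adj):
    """Sum of 1/d over all other reachable nodes, d = shortest-path length."""
    result = {}
    for source in adj:
        # Breadth-first search gives shortest paths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        result[source] = sum(1.0 / d for node, d in dist.items() if node != source)
    return result


# Toy protein association network: a hub g1 with a short tail g4-g5.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g4", "g5")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

deg = degree_centrality(adj)
har = harmonic_centrality(adj)
for gene in sorted(adj):
    print(gene, deg[gene], round(har[gene], 3))
```

In this toy network the hub `g1` scores highest on both measures. Per-gene scores of this kind are what a significance test would compare between essential and non-essential genes.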