Hong-Li Hua, Fa-Zhan Zhang, Abraham Alemayehu Labena, Chuan Dong, Yan-Ting Jin, Feng-Biao Guo.
Abstract
Identifying essential genes is important for understanding the minimal gene set of a cell and for discovering potential drug targets. In this study, a novel approach based on multiple homology mapping and machine learning was introduced to predict essential genes. We focused on 25 bacteria with characterized essential genes. In a tenfold cross-validation test, the predictions yielded a highest area under the receiver operating characteristic (ROC) curve (AUC) of 0.9716. Suitable features were then used to construct models for making predictions in distantly related bacteria. Prediction accuracy was evaluated by the consistency between the predictions and the known essential genes of the target species. The highest AUC of 0.9552 and an average AUC of 0.8314 were achieved when making predictions across organisms. A recently released independent dataset from Synechococcus elongatus was used to further assess the performance of our model. The AUC of these predictions is 0.7855, which is higher than that of other methods. This research shows that features obtained by homology mapping alone can achieve results as good as, or even better than, those obtained with integrated features. The work also indicates that a machine learning-based method can assign more effective weight coefficients than an empirical formula based on biological knowledge.
Year: 2016 PMID: 27660763 PMCID: PMC5021884 DOI: 10.1155/2016/7639397
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Bacteria used in this work.
| Species | Abbreviation | Number of essential genes | Number of total genes |
|---|---|---|---|
| | | 499 | 3307 |
| | | 271 | 4175 |
| | | 325 | 4778 |
| | | 406 | 5632 |
| | | 480 | 3885 |
| | | 222 | 1572 |
| | | 296 | 4140 |
| | | 287 | 4146 |
| | | 390 | 1719 |
| | | 611 | 3906 |
| | | 378 | 475 |
| | | 310 | 782 |
| | | 463 | 2089 |
| | | 335 | 5892 |
| Salmonella enterica serovar Typhimurium 14028S | | 105 | 5315 |
| Salmonella enterica serovar Typhimurium LT2 | | 230 | 4451 |
| Salmonella enterica serovar Typhimurium SL1344 | | 353 | 4446 |
| Salmonella enterica serovar Typhi Ty2 | | 358 | 4352 |
| | | 402 | 4065 |
| | | 535 | 4850 |
| | | 302 | 2582 |
| | | 346 | 2767 |
| | | 111 | 2105 |
| | | 127 | 1814 |
| | | 218 | 2270 |
| | | 591 | 3503 |
The numbers of essential genes and total genes were counted after filtering unmatched data.
Figure 1. Comparison of the number of conserved genes and essential genes between two organisms. (a) SAS and ESC, two relatively closely related organisms, share 3204 orthologous genes and 244 common essential genes. The broken circle represents the 353 SAS essential genes, and the dash-dotted circle represents the 296 ESC essential genes. For ESC, there are 310 orthologous-essential genes in SAS. (b) BAT and ESC, two relatively distantly related organisms, share 1457 orthologous genes and 123 common essential genes. The broken circle represents the 325 BAT essential genes, and the dash-dotted circle represents the 296 ESC essential genes. For ESC, there are 198 orthologous-essential genes in BAT. The closer species thus shares more orthologous sequences and more common essential genes with the target species than the distant one does.
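The contrast in the caption can be made concrete with a short calculation. The sketch below only uses the counts quoted for Figure 1; the helper name `shared_fraction` is hypothetical and introduced here for illustration.

```python
# Counts quoted in the Figure 1 caption; ESC is the target organism.
ESC_ESSENTIAL = 296


def shared_fraction(common_essential: int, target_essential: int) -> float:
    """Fraction of the target's essential genes that are also essential
    in the reference organism."""
    return common_essential / target_essential


sas = shared_fraction(244, ESC_ESSENTIAL)  # SAS: closely related reference
bat = shared_fraction(123, ESC_ESSENTIAL)  # BAT: distantly related reference
print(f"SAS: {sas:.3f}  BAT: {bat:.3f}")  # prints "SAS: 0.824  BAT: 0.416"
```

About 82% of the ESC essential genes are also essential in the close relative, against about 42% in the distant one.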
Figure 2. Flowchart for obtaining training sets by multiple homology mapping and training the model to predict essential genes. The sequences of the species under test were aligned against each of the other 24 species in turn, and each alignment result was used as a training feature. The training sets obtained from these alignments were used to train and test the SVM prediction model. Meanwhile, the F-score was used to evaluate the discriminative capability of each feature, and the optimal feature subsets were selected to train and test the model. Tenfold cross-validation was used to assess the performance of the classifier. For cross-organism prediction, the feature sets of the closest organism, or of the organism/feature with the largest F-score for the target species, were selected as the training sets; the resulting model was then used to predict essential genes in the target species.
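The caption does not spell out the F-score formula. The following is a minimal sketch, assuming the commonly used two-class F-score (separation of the class means from the overall mean, divided by the sum of the within-class sample variances). The binary feature encoding (1 when the gene's homolog in a reference species is essential, 0 otherwise), the toy data, and the names `ref_A` and `ref_B` are hypothetical.

```python
from statistics import mean


def f_score(column, labels):
    """Two-class F-score of one feature column.

    labels: 1 = essential, 0 = non-essential.
    Assumes each class has at least two samples.
    """
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    m_all, m_pos, m_neg = mean(column), mean(pos), mean(neg)
    # Numerator: how far each class mean sits from the overall mean.
    between = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    # Denominator: sum of the within-class sample variances.
    var_pos = sum((x - m_pos) ** 2 for x in pos) / (len(pos) - 1)
    var_neg = sum((x - m_neg) ** 2 for x in neg) / (len(neg) - 1)
    within = var_pos + var_neg
    return between / within if within > 0 else float("inf")


# Toy data: six genes of a target species, two reference species.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "ref_A": [1, 1, 0, 0, 0, 1],  # partly tracks essentiality
    "ref_B": [1, 0, 1, 1, 0, 1],  # same mean in both classes
}
scores = {name: f_score(col, labels) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A larger F-score means the feature separates essential from non-essential genes better, so features would be kept from the top of `ranked` downward.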
Figure 3. The 26 AUC scores of 10-fold cross-validation within the 25 bacteria and ESC_PEC. The last AUC score belongs to ESC, whose data were obtained from the PEC database. More than 70% of the results exceed an AUC score of 0.80, and the predictions for 9 organisms yielded AUC scores above 0.90.
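As an illustration of how a tenfold split can be built, here is a minimal sketch. It assumes a stratified split, so that each fold keeps roughly the same ratio of essential to non-essential genes; the source does not state whether stratification was used, and the function name and toy labels are hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, class by class, so that every
    fold receives a near-equal share of each class."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Toy genome: 30 essential genes (1) and 270 non-essential genes (0).
labels = [1] * 30 + [0] * 270
folds = stratified_folds(labels, k=10)
for test_fold in folds:
    held_out = set(test_fold)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on train_idx and scored on test_fold.
```

Each fold serves once as the test set while the other nine are used for training, and the ten test-set results are combined into one AUC score per organism.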
Correlations between evolutionary distances and feature scores for each target organism.
| Organisms | Correlations | P value |
|---|---|---|
| | −0.45204 | 0.0266 |
| | −0.37043 | 0.0075 |
| | −0.50001 | 0.0128 |
| | −0.41124 | 0.0459 |
| | −0.41482 | 0.0439 |
| | −0.23786 | 0.2631 |
| | −0.50218 | 0.0120 |
| | −0.52353 | 0.0087 |
| | −0.39292 | 0.0575 |
| | −0.35883 | 0.0851 |
| | −0.49728 | 0.0134 |
| | −0.54766 | 0.0056 |
| | −0.46123 | 0.0233 |
| | −0.60836 | 0.0016 |
| | −0.60533 | 0.0017 |
| | −0.28669 | 0.1744 |
| | −0.24910 | 0.2405 |
| | −0.31248 | 0.1371 |
| | −0.50456 | 0.0119 |
| | −0.65577 | 0.0005 |
| | −0.11162 | 0.6036 |
| | 0.03883 | 0.8570 |
| | 0.11619 | 0.5887 |
| | 0.24868 | 0.2413 |
| | 0.19591 | 0.3589 |
| | −0.50718 | 0.0114 |
∗ indicates that the correlation is significant at the 0.05 level; ∗∗ indicates that the correlation is significant at the 0.01 level.
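A correlation coefficient of the kind tabulated above can be computed as in the sketch below. It assumes the Pearson coefficient, which the table itself does not name, and the distances and feature scores are made-up toy values rather than data from the study. The significance test behind the second numeric column is not reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy values: evolutionary distance from the target organism and the
# feature score of each reference organism (hypothetical numbers).
distances = [0.1, 0.4, 0.7, 1.0]
feature_scores = [0.9, 0.5, 0.6, 0.2]
r = pearson(distances, feature_scores)
print(f"r = {r:.3f}")
```

A negative `r` matches the pattern in most rows of the table: the more distant the reference organism, the lower its feature score for the target.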
Figure 4. Comparison of the AUC scores of interspecies prediction for 25 bacteria between SVM and Geptop. The last AUC score belongs to ESC, whose data were obtained from the PEC database. The vertical axis, ranging from 0.5 to 1, represents AUC scores. More than 65% of the results exceed an AUC score of 0.80, and the predictions for 8 organisms yielded AUC scores above 0.90. Of the 26 genomes, including ESC_PEC, 18 score better with SVM than with Geptop.
Figure 5. Comparison of the ROC curves of Geptop and SVM for SYE. The blue curve represents the results obtained with SVM, and the area under it is 0.7855. The red curve represents the results obtained with Geptop, and the area under it is 0.7578.
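For readers unfamiliar with how an area such as 0.7855 is obtained, the sketch below builds an ROC curve from classifier scores and integrates it with the trapezoidal rule. The scores and labels are toy values, not data from the study, and the function names are chosen here for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, sweeping
    the decision threshold from the highest score to the lowest."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    prev = None
    for i in order:
        # Emit a point only when the score changes, so ties share a point.
        if prev is not None and scores[i] != prev:
            points.append((fp / n_neg, tp / n_pos))
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        prev = scores[i]
    points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy predictions: a higher score means "more likely essential".
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
print(f"AUC = {area:.4f}")  # prints "AUC = 0.8889"
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the vertical axis in the interspecies comparison starts at 0.5.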
Correlations between rank changes and AUC scores of 10-fold cross-validation.
| Organisms | Correlations | P value |
|---|---|---|
| | 0.66774 | 0.0005 |
| | 0.50475 | 0.0140 |
| | 0.61735 | 0.0017 |
| | 0.64683 | 0.0009 |
| | 0.64730 | 0.0008 |
| | 0.52945 | 0.0094 |
| | 0.69090 | 0.0003 |
| | 0.70211 | 0.0002 |
| | 0.58115 | 0.0036 |
| | 0.58017 | 0.0037 |
| | 0.67868 | 0.0004 |
| | 0.58558 | 0.0033 |
| | 0.62930 | 0.0013 |
| | 0.36461 | 0.0872 |
| | 0.66214 | 0.0006 |
| | 0.69220 | 0.0003 |
| | 0.70091 | 0.0002 |
| | 0.73831 | 5.77 |
| | 0.66206 | 0.0006 |
| | 0.50613 | 0.0137 |
| | 0.54941 | 0.0066 |
| | 0.37110 | 0.0813 |
| | 0.58267 | 0.0035 |
| | 0.70091 | 0.0002 |
| | 0.65402 | 0.0007 |
∗ indicates that the correlation is significant at the 0.05 level; ∗∗ indicates that the correlation is significant at the 0.01 level.