Dawit Nigatu, Patrick Sobetzko, Malik Yousef, Werner Henkel.
Abstract
BACKGROUND: Identification of essential genes is not only useful for our understanding of the minimal gene set required for cellular life but also aids the identification of novel drug targets in pathogens. In this work, we present a simple and effective gene essentiality prediction method using information-theoretic features that are derived exclusively from the gene sequences.
Keywords: Essential genes; Information-theoretic features; Machine learning; Random Forest
Year: 2017 PMID: 29121868 PMCID: PMC5679510 DOI: 10.1186/s12859-017-1884-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
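The abstract describes features computed purely from gene sequences using information theory. The sketch below illustrates what two such features could look like: the Shannon entropy of the k-mer distribution and the mutual information between bases a fixed distance apart. The function names, the choice of these two quantities, and the parameters `k` and `gap` are assumptions made for illustration; the record above does not specify the exact feature set used in the paper.

```python
import math
from collections import Counter


def kmer_entropy(sequence: str, k: int = 1) -> float:
    """Shannon entropy (in bits) of the overlapping k-mer distribution.

    Hypothetical helper: one possible sequence-only feature.
    """
    seq = sequence.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mutual_information(sequence: str, gap: int = 1) -> float:
    """Mutual information (in bits) between bases separated by `gap` positions.

    Hypothetical helper: probabilities are estimated from the pairs
    (seq[i], seq[i + gap]) observed in the sequence itself.
    """
    seq = sequence.upper()
    pairs = [(seq[i], seq[i + gap]) for i in range(len(seq) - gap)]
    if not pairs:
        return 0.0
    total = len(pairs)
    joint = Counter(pairs)
    first = Counter(x for x, _ in pairs)
    second = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), n in joint.items():
        p_xy = n / total
        p_x = first[x] / total
        p_y = second[y] / total
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


if __name__ == "__main__":
    gene = "ATGGCGTACGTTAGCGCGATATTAG"  # made-up sequence, not real data
    features = [kmer_entropy(gene, k=1), kmer_entropy(gene, k=3),
                mutual_information(gene, gap=1)]
    print(features)
```

A feature vector built this way (for several values of `k` and `gap`) could then be passed to any standard classifier, such as the Random Forest named in the keywords.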
List and details of the organisms used in this work
| No. | Organism | Abbr. | Number of essential genes | Number of non-essential genes | Accession No. |
|---|---|---|---|---|---|
| 1 | Acinetobacter baylyi ADP1 | AB | 499 | 2594 | NC_005966 |
| 2 | Bacillus subtilis 168 | BS | 271 | 3904 | NC_000964 |
| 3 | Escherichia coli MG1655 | EC | 296 | 4077 | NC_000913 |
| 4 | | FN | | | NC_008601 |
| 5 | | HI | | | NC_000907 |
| 6 | | HP | | | NC_000915 |
| 7 | Mycoplasma genitalium G37 | MG | 381 | 94 | NC_000908 |
| 8 | Mycoplasma pulmonis UAB CTIP | MP | 310 | 322 | NC_002771 |
| 9 | | MT | | | NC_000962 |
| 10 | Pseudomonas aeruginosa UCBPP-PA14 | PA | 335 | 960 | NC_008463 |
| 11 | Staphylococcus aureus N315 | SA | 302 | 2281 | NC_002745 |
| 12 | | SA2 | | | |
| 13 | Salmonella enterica serovar Typhi | SE | 353 | 4005 | NC_004631 |
| 14 | Salmonella typhimurium LT2 | ST | 230 | 4228 | NC_003197 |
| 15 | | VC | | | NC_002505 |
| 16 | Schizosaccharomyces pombe 972h- | SP | 1260 | 3573 | NC_003424 |
Fig. 1 Average AUC scores of intra-organism essential gene predictions in 15 bacterial species. The prediction performance of the top 50, 60, 70, and 80 features, ranked by information gain, is also shown
Fig. 2 Pairwise cross-organism prediction results. The 15×15 average AUC scores are presented. The phylogenetic relationship and the taxonomic classification of the bacteria are also shown
Comparison of prediction performance (average AUC score) among AB, BS, EC and PA
| Train | Test | Deng et al. | Song et al. | Our method |
|---|---|---|---|---|
| AB | EC | 0.89 | 0.91 | 0.86 |
| BS | AB | - | 0.86 | 0.84 |
| BS | EC | 0.86 | 0.91 | 0.86 |
| BS | PA | - | 0.81 | 0.78 |
| EC | AB | 0.8 | 0.86 | 0.84 |
| EC | BS | 0.8 | 0.93 | 0.86 |
| EC | PA | - | 0.81 | 0.81 |
| PA | EC | 0.82 | - | 0.82 |
Leave-one-species-out results using SVM and Random Forest classifiers
| Organism | Our method (Random Forest) | Our method (SVM) | Liu et al. (SVM) | Palaniappan and Mukherjee (SVM) | Geptop (homology, score based) | Geptop* (composition, score based) |
|---|---|---|---|---|---|---|
| Training on (No. of species) | 14 | 14 | 30 | 14 | 18 | 18 |
| AB | 0.81 | 0.83 | 0.75 | 0.74 | 0.85 | 0.79 |
| BS | 0.84 | 0.84 | 0.77 | 0.58 | 0.95 | 0.81 |
| EC | 0.87 | 0.88 | 0.83 | 0.65 | 0.95 | 0.84 |
| FN | 0.83 | 0.83 | 0.67 | 0.66 | 0.84 | 0.74 |
| HI | 0.75 | 0.77 | 0.54 | 0.46 | 0.57 | 0.59 |
| HP | 0.75 | 0.74 | 0.52 | 0.59 | 0.60 | 0.64 |
| MG | 0.68 | 0.66 | 0.60 | 0.64 | 0.72 | 0.56 |
| MP | 0.75 | 0.74 | 0.64 | 0.61 | 0.87 | 0.76 |
| MT | 0.80 | 0.77 | 0.70 | 0.49 | 0.73 | 0.77 |
| PA | 0.80 | 0.80 | 0.65 | 0.66 | 0.80 | 0.79 |
| SA | 0.88 | 0.90 | 0.81 | 0.66 | 0.84 | 0.86 |
| SA2 | 0.86 | 0.85 | 0.80 | - | 0.88 | 0.83 |
| SE | 0.86 | 0.86 | 0.69 | - | 0.95 | 0.86 |
| ST | 0.81 | 0.79 | 0.84 | 0.53 | 0.71 | 0.69 |
| VC | 0.75 | 0.72 | 0.69 | - | 0.61 | 0.72 |
The average AUC scores of four existing methods are also presented for comparison. Geptop* is a sequence-composition-based predictor presented along with Geptop [23]
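The leave-one-species-out protocol behind the table above can be sketched as follows: each of the 15 bacteria is held out in turn while a model is trained on the remaining 14, which matches the "Training on (No. of species)" entry for our method. The generator name `leave_one_species_out` is hypothetical, and the sketch only produces the train/test splits; model training and scoring are left out because the record does not give those details.

```python
from typing import Iterator, List, Tuple

# Abbreviations of the 15 bacteria, as listed in the results table.
SPECIES: List[str] = [
    "AB", "BS", "EC", "FN", "HI", "HP", "MG", "MP",
    "MT", "PA", "SA", "SA2", "SE", "ST", "VC",
]


def leave_one_species_out(species: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training species, held-out species) pairs.

    Hypothetical helper: every species is used exactly once as the test set,
    and the training set never contains the held-out species.
    """
    for held_out in species:
        train = [s for s in species if s != held_out]
        yield train, held_out


if __name__ == "__main__":
    for train, test in leave_one_species_out(SPECIES):
        # A real pipeline would pool the genes of `train`, fit a classifier,
        # and report the AUC on the genes of `test`.
        print(f"test={test:>3}  train on {len(train)} species")
```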
Fig. 3 Cross-taxon prediction results
Fig. 4 Leave-one-taxon-out predictions of our method and an existing method [26]
Fig. 5 ROC curve for the prediction of Schizosaccharomyces pombe essential genes
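The AUC scores reported throughout summarize ROC curves such as the one in Fig. 5. As a minimal sketch, the AUC can be computed directly as the probability that a randomly chosen essential gene receives a higher score than a randomly chosen non-essential gene, with ties counted as one half. The function name `auc_score` and the toy labels and scores are assumptions for illustration, not values from the paper.

```python
from typing import Sequence


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for essential, 0 for non-essential (illustrative encoding).
    scores: classifier outputs, higher meaning "more likely essential".
    Quadratic in the number of genes, which is fine for a small example.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    toy_labels = [1, 0, 1, 0]          # made-up labels
    toy_scores = [0.9, 0.8, 0.3, 0.1]  # made-up classifier scores
    print(auc_score(toy_labels, toy_scores))
```

The pairwise form is equivalent to integrating the ROC curve; for genome-scale data a rank-based implementation from a standard library would be the more practical choice.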