Jingyuan Deng, Lei Deng, Shengchang Su, Minlu Zhang, Xiaodong Lin, Lan Wei, Ali A Minai, Daniel J Hassett, Long J Lu.
Abstract
Rapid and accurate identification of new essential genes in under-studied microorganisms will significantly improve our understanding of how a cell works and the ability to re-engineer microorganisms. However, predicting essential genes across distantly related organisms remains a challenge. Here, we present a machine learning-based integrative approach that reliably transfers essential gene annotations between distantly related bacteria. We focused on four bacterial species that have well-characterized essential genes, and tested the transferability between three pairs among them. For each pair, we trained our classifier to learn traits associated with essential genes in one organism, and applied it to make predictions in the other. The predictions were then evaluated by examining the agreements with the known essential genes in the target organism. Ten-fold cross-validation in the same organism yielded AUC scores between 0.86 and 0.93. Cross-organism predictions yielded AUC scores between 0.69 and 0.89. The transferability is likely affected by growth conditions, quality of the training data set and the evolutionary distance. We are thus the first to report that gene essentiality can be reliably predicted using features trained and tested in a distantly related organism. Our approach proves more robust and portable than existing approaches, significantly extending our ability to predict essential genes beyond orthologs.
Year: 2010 PMID: 20870748 PMCID: PMC3035443 DOI: 10.1093/nar/gkq784
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Features correlated with gene essentiality in S. cerevisiae and E. coli
| References | Genomic features |
|---|---|
| Jeong | (i) Fluctuation in mRNA expression; (ii) Protein functions; (iii) Connectivity in protein–protein interaction (PPI) network |
| Chen and Xu | (i) Evolutionary rate; (ii) Duplication rate; (iii) Gene expression correlation network; (iv) Connectivity in PPI network |
| Saha and Heber | (i) Phylogenetic conservation; (ii) Degree of paralogy; (iii) Number of PPIs |
| Seringhaus | 14 intrinsic features, such as: GC content; length of protein; hydrophobicity; codon adaptation index; predicted subcellular localization in six compartments, etc |
| Gustafson | (i) Codon usage; (ii) Paralogs; (iii) Size and localization; (iv) Protein interaction network degree; (v) Phyletic retention measure; (vi) Recombination rate; (vii) Strand bias; (viii) Regulatory complexity, etc. |
Thirteen features that are selected for 10-fold cross-validation in EC
| Intrinsic features (sequence based) | Intrinsic features (sequence derived) | Context-dependent features (from functional genomics experiments) |
|---|---|---|
| Codon bias index (CBI) | Domain enrichment score (DES) | Fluctuation in gene expression (FLU) |
| Hydrophobicity score (Nc) | Phylogenetic score (PHYS) | Co-expression network bottlenecks (CEB) |
| Length of Amino Acid (L_aa) | Subcellular localization: cytoplasm (Cyto) | Co-expression network hubs (CEH) |
| Aromaticity (Aromo) | Subcellular localization: extracellular (Extra) | |
| | Paralogy (PA) | |
| | Subcellular localization: inner membrane (Inner) | |
Figure 1. Comparison of genomes and essential genes in EC and AB. The square represents 4289 EC total genes; the rectangle represents 3308 AB total genes. The overlap of the two represents 1198 orthologs determined by the RBH method. The rectangle with dashed border represents the total 302 EC essential genes. The rectangle with diagonal brick shades represents the total 499 AB essential genes. The rectangle within the dashed border and with diagonal brick shades represents the common essential genes in both species. The area of each rectangle is approximately proportional to the number of genes it represents.
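The ortholog overlap in this figure is defined by reciprocal best hits (RBH): two genes are paired only if each is the other's top-scoring match. The following is a minimal sketch of that rule, assuming pairwise similarity scores are already available (for example BLAST bit scores, which is an assumption; the caption does not name the score). The function names, gene identifiers and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score.
    On ties, the first pair encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical toy scores, not data from the paper.
ec_vs_ab = {("ecA", "abX"): 250.0, ("ecA", "abY"): 90.0, ("ecB", "abY"): 120.0}
ab_vs_ec = {("abX", "ecA"): 245.0, ("abY", "ecA"): 95.0, ("abY", "ecB"): 80.0}
orthologs = reciprocal_best_hits(ec_vs_ab, ab_vs_ec)
print(orthologs)
```

In this toy input, `ecB` has `abY` as its best hit, but `abY` prefers `ecA`, so that pair is not reciprocal and is dropped.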
Figure 2. Nomogram for visualization of the 13 selected features. Each feature has a corresponding line indicating the relationship between a feature value and its predictive contribution assessed by Naïve Bayes analysis. The number on the line is the value of the feature and each value corresponds to a point score above. The longer the line, the more predictive power the feature has.
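The per-value contribution that a Naïve Bayes nomogram displays is a log-odds ratio. The sketch below computes it for a single discretized feature, assuming add-one (Laplace) smoothing; the smoothing choice, the binning and the toy data are assumptions, and the normalization into nomogram points is not reproduced here.

```python
import math
from collections import Counter


def log_odds_by_value(values, labels, alpha=1.0):
    """Per-value log-odds contribution of one discrete feature.

    Returns {value: log(P(value | essential) / P(value | non-essential))}
    with add-`alpha` smoothing. Positive numbers favour "essential";
    negative numbers favour "non-essential".
    """
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == 0)
    levels = sorted(set(values))
    n_pos = sum(pos.values()) + alpha * len(levels)
    n_neg = sum(neg.values()) + alpha * len(levels)
    return {
        v: math.log(((pos[v] + alpha) / n_pos) / ((neg[v] + alpha) / n_neg))
        for v in levels
    }


# Hypothetical binned feature for eight genes (1 = essential, 0 = not).
# These values are illustrative only, not data from the paper.
bins = ["high", "high", "high", "low", "high", "low", "low", "low"]
essential = [1, 1, 1, 1, 0, 0, 0, 0]
points = log_odds_by_value(bins, essential)
spread = max(points.values()) - min(points.values())
print(points, spread)
```

The `spread` between the largest and smallest contribution plays the role of the line length in the nomogram: a feature whose values barely move the log-odds has a short line.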
Figure 3. ROC curves plot the TPR versus FPR for different thresholds of classifier probability output. (A) and (B): EC → AB; (C) and (D): AB → EC. (A) Ten-fold cross-validations on the EC essential gene data set. (B) Predictions of AB essential genes. The classifier was trained on EC data set and evaluated on AB essential genes. (C) Ten-fold cross-validations on the AB essential gene data set. (D) Predictions of EC essential genes. The classifier was trained on AB data set and evaluated on EC essential genes.
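A ROC curve of this kind can be traced by sorting genes by predicted probability and lowering the threshold one distinct score at a time. The following is a minimal sketch with a trapezoidal AUC; the function names and the six toy scores are hypothetical, and the code assumes both classes are present in `labels`.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the threshold.

    `scores` are classifier probabilities; `labels` are 1 for essential
    and 0 for non-essential genes. Genes with tied scores are handled
    together so each distinct threshold yields one point.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last gene sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores for six genes, not results from the paper.
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
truth = [1, 1, 0, 1, 0, 0]
curve = roc_points(probs, truth)
area = auc(curve)
print(curve, area)
```

For a cross-organism panel such as (B), `probs` would come from a classifier trained on the source organism and `truth` from the known essential genes of the target organism.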
Figure 4. Precision of predictions from EC to three target organisms. The precision versus rank plot for the three pairs of bacteria: EC→AB (gray solid), EC→PA (gray dashed) and EC→BS (black solid). The small cross on each curve marks the precision (PPV) at a probability threshold of 0.5.
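Precision versus rank can be computed by walking down the ranked gene list and tracking the fraction of known essential genes seen so far. The sketch below also computes the PPV at a fixed probability threshold; treating the threshold as inclusive (`>=`) is an assumption, and the names and toy data are hypothetical.

```python
def precision_at_ranks(scores, labels):
    """Precision among the top-k predictions for k = 1..n."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    precisions = []
    hits = 0
    for k, (_, label) in enumerate(ranked, start=1):
        hits += label  # label is 1 for a known essential gene, else 0
        precisions.append(hits / k)
    return precisions


def ppv_at_threshold(scores, labels, threshold=0.5):
    """Precision (PPV) over genes whose probability is >= threshold."""
    called = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(called) / len(called) if called else float("nan")


# Hypothetical scores for six genes, not results from the paper.
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
truth = [1, 1, 0, 1, 0, 0]
by_rank = precision_at_ranks(probs, truth)
ppv = ppv_at_threshold(probs, truth)
print(by_rank, ppv)
```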
Figure 5. The integrative approach significantly extends the coverage of homology mapping. IG stands for the integrative approach. RBH stands for the reciprocal best hit approach. For the IG method, the cutoff for each organism is set to its number of essential genes (PA: 678, AB: 499, BS: 192).
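The comparison in this figure can be sketched as two counts of correctly recovered essential genes: one from the top-N classifier predictions, with N set to the number of known essential genes in the target organism, and one from transferring essentiality through RBH ortholog pairs. The function names and all gene identifiers below are hypothetical, and the toy numbers say nothing about the actual results.

```python
def top_n_true_positives(scores, essential, n):
    """Count known essential genes among the n highest-scoring genes.

    `scores` maps gene -> predicted probability; `essential` is the set
    of known essential genes in the target organism.
    """
    ranked = sorted(scores, key=lambda g: -scores[g])
    return len(set(ranked[:n]) & essential)


def rbh_true_positives(orthologs, source_essential, target_essential):
    """Count target essential genes recovered by homology mapping.

    `orthologs` maps source gene -> target gene (RBH pairs). A target
    gene is predicted essential when its source ortholog is essential.
    """
    transferred = {t for s, t in orthologs.items() if s in source_essential}
    return len(transferred & target_essential)


# Hypothetical toy data, not data from the paper.
scores = {"t1": 0.95, "t2": 0.90, "t3": 0.60, "t4": 0.20, "t5": 0.10}
target_essential = {"t1", "t2", "t3"}
orthologs = {"s1": "t1", "s2": "t4"}
source_essential = {"s1", "s2"}

ig_hits = top_n_true_positives(scores, target_essential, n=len(target_essential))
rbh_hits = rbh_true_positives(orthologs, source_essential, target_essential)
print(ig_hits, rbh_hits)
```

Homology mapping can only reach target genes that have an ortholog in the source organism, whereas the ranked predictions cover every scored gene.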
Examples of correct and incorrect predictions
The number in parentheses is the normalized log-odds ratio (points). A larger number indicates a stronger association with essentiality. Shaded features are those that determine the prediction outcome in each case.