Jingyuan Deng, Lirong Tan, Xiaodong Lin, Yao Lu, Long J Lu.
Abstract
Accurately predicting essential genes is important in many aspects of biology, medicine and bioengineering. In previous research, we developed a machine-learning-based integrative algorithm to predict essential genes in bacterial species. This algorithm lends itself to two approaches for predicting essential genes: learning the traits from known essential genes in the target organism, or transferring essential gene annotations from a closely related model organism. However, for an understudied microbe, each approach has potential limitations. The first is constrained by the often small number of known essential genes. The second is limited by the availability of model organisms and by evolutionary distance. In this study, we aim to determine the optimal strategy for predicting essential genes by examining four microbes with well-characterized essential genes. Our results suggest that, unless the known essential genes are few, learning from the known essential genes in the target organism usually outperforms transferring essential gene annotations from a related model organism. In fact, the number of known essential genes required to make accurate predictions is surprisingly small. In prokaryotes, when the number of known essential genes is greater than 2% of total genes, this approach already comes close to its optimal performance. In eukaryotes, achieving the same best performance requires over 4% of total genes, reflecting the increased complexity of eukaryotic organisms. Combining the two approaches improved performance when the known essential genes were few. Our investigation thus provides key information on accurately predicting essential genes and will greatly facilitate annotations of microbial genomes.
Year: 2011 PMID: 24970124 PMCID: PMC4030871 DOI: 10.3390/biom2010001
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Thirty-five considered features.
| Feature | Description | Class * | Data type | Available ** |
|---|---|---|---|---|
| Aromo | Aromaticity score | A | Real | |
| A3s | Base composition A | A | Real | EC/AB/SC/NC |
| C3s | Base composition C | A | Real | EC/AB/SC/ |
| G3s | Base composition G | A | Real | EC/ |
| T3s | Base composition T | A | Real | EC/ |
| CAI | Codon adaptation index | A | Real | EC/ |
| CBI | Codon bias index | A | Real | |
| Fop | Frequency of optimal codons | A | Real | EC/AB/ |
| Nc | Effective number of codons | A | Real | |
| L_sym | Frequency of synonymous codons | A | Integer | EC/AB/SC/NC |
| L_aa | Length in amino acids | A | Integer | |
| GC | GC content | A | Real | EC/AB/ |
| GC3s | GC content at the 3rd position of synonymous codons | A | Real | EC/AB/SC/NC |
| Gravy | Hydrophobicity score | A | Real | EC/AB/ |
| Cytoplasm | Subcellular localization: Cytoplasm | B | Boolean | |
| Extracellular | Subcellular localization: Extracellular | B | Boolean | |
| Inner | Subcellular localization: Inner membrane | B | Boolean | |
| Outer | Subcellular localization: Outer membrane | B | Boolean | EC/AB |
| Periplasm | Subcellular localization: Periplasm | B | Boolean | EC/AB |
| Golgi | Subcellular localization: Golgi | B | Boolean | SC/NC |
| Nucleus | Subcellular localization: Nucleus | B | Boolean | |
| Mito | Subcellular localization: Mitochondrion | B | Boolean | SC/NC |
| Plasma | Subcellular localization: Plasma membrane | B | Boolean | SC/ |
| Vacuole | Subcellular localization: Vacuole | B | Boolean | SC/NC |
| Peroxisome | Subcellular localization: Peroxisome | B | Boolean | SC/NC |
| ER | Subcellular localization: Endoplasmic reticulum | B | Boolean | SC/NC |
| ExpAA | Expected number of amino acids in helices | B | Real | EC/AB/SC/NC |
| First60 | Expected number of amino acids in helices within the first 60 amino acids | B | Real | EC/AB/SC/NC |
| PredHel | Number of predicted TM helices | B | Integer | EC/AB/ |
| PHYS | Phylogenetic score | B | Real | |
| PA | Paralogy | B | Boolean | |
| DES | Domain enrichment score | B | Real | |
| FLU | Fluctuation | C | Real | |
| CEH | Coexpression network hubs | C | Boolean | |
| CEB | Coexpression network bottlenecks | C | Boolean | |
*: Class A: sequence-based intrinsic features; Class B: sequence-derived intrinsic features; Class C: context-dependent features.
**: Features used in the training and testing in each organism are in bold.
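As an illustration of how a few of the Class A (sequence-based) features could be derived from a coding sequence, the following is a minimal sketch and not the authors' pipeline. The function names are hypothetical, and the GC3s calculation assumes that "synonymous codons" means all codons other than ATG (Met), TGG (Trp), and the three stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: codons with no synonymous alternative are excluded from GC3s.
NON_SYNONYMOUS = {"ATG", "TGG"} | STOP_CODONS


def split_codons(cds):
    """Split a coding sequence into complete codons, ignoring trailing bases."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def gc_content(cds):
    """GC: fraction of G or C bases over the whole sequence."""
    seq = cds.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc3s(cds):
    """GC3s: fraction of G or C at the third position of synonymous codons."""
    thirds = [c[2] for c in split_codons(cds) if c not in NON_SYNONYMOUS]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


def length_aa(cds):
    """L_aa: number of encoded amino acids, not counting a final stop codon."""
    codons = split_codons(cds)
    if codons and codons[-1] in STOP_CODONS:
        codons.pop()
    return len(codons)


if __name__ == "__main__":
    example = "ATGGCCAAATAA"  # toy sequence: Met-Ala-Lys-stop
    print(gc_content(example), gc3s(example), length_aa(example))
```

Real-valued features such as CAI, Fop, and Nc need codon usage reference data and are not covered by this sketch.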
Summary of the three approaches (see Experimental Section for details).
| Approach | Description | “Gold standard” training set | “Gold standard” testing set | Prediction set |
|---|---|---|---|---|
| Same-organism approach | Learning from the limited number of known essential genes in the target organism | 9/10 of the “gold standard” set of the target organism | 1/10 of the “gold standard” set of the target organism | The entire set of genes except the “gold standard” in the target organism |
| Cross-organism approach | Learning from essential genes from a closely-related model organism | 9/10 of the “gold standard” set in the related model organism | 1/10 of the “gold standard” set in the related model organism | The entire set of genes except the “gold standard” in the target organism |
| Combined approach | Learning from known essential genes in the target organism as well as a closely-related model organism, with higher weights given to the former | 9/10 of the “gold standard” combined set. The weights assigned to genes in the target and model organisms are w:1 | 1/10 of the “gold standard” combined set | The entire set of genes except the “gold standard” in the target organism |
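The 9/10 versus 1/10 split and the w:1 weighting in the table above can be sketched as follows. This is an illustrative outline only: the function names, the shuffling seed, and the representation of each gene as a `(gene_id, weight)` pair are assumptions, and the source does not state how w is chosen.

```python
import random


def weighted_combined_set(target_genes, model_genes, w):
    """Build the combined "gold standard" set as (gene_id, weight) pairs.

    Target-organism genes get weight w, model-organism genes get weight 1.
    """
    return ([(gene, float(w)) for gene in target_genes]
            + [(gene, 1.0) for gene in model_genes])


def split_gold_standard(items, n_folds=10, test_fold=0, seed=0):
    """Shuffle the items and hold out one fold (1/n_folds) for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    testing = [x for i, x in enumerate(shuffled) if i % n_folds == test_fold]
    training = [x for i, x in enumerate(shuffled) if i % n_folds != test_fold]
    return training, testing


if __name__ == "__main__":
    target = ["t1", "t2"]                     # hypothetical target-organism genes
    model = [f"m{i}" for i in range(18)]      # hypothetical model-organism genes
    combined = weighted_combined_set(target, model, w=5)
    train, test = split_gold_standard(combined)
    print(len(train), len(test))
```

For the same-organism and cross-organism approaches, the same split would be applied to the unweighted gold standard set of the target or model organism, respectively.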
Figure 1. Comparison of three approaches in EC. (a) The distribution of AUC across different numbers of known essential genes in EC: red curve: same-organism approach without DES (“no-DES”); black curve: same-organism approach with DES (“with-DES”); blue curve: combined approach; green curve: the DES feature only; dashed line: cross-organism approach. Bar charts of the correctly classified essential genes among the top 400 predictions for different numbers of known essential genes in EC using the (b) “no-DES” model; (c) “with-DES” model; and (d) combined model. The black bar shows the correctly classified essential genes in the “gold standard” set.
Figure 2. Comparison of three approaches in AB. (a) The distribution of AUC across different numbers of known essential genes in AB: red curve: same-organism approach without DES (“no-DES”); black curve: same-organism approach with DES (“with-DES”); blue curve: combined approach; dashed line: cross-organism approach. Bar charts of the correctly classified essential genes among the top 400 predictions for different numbers of known essential genes in AB using the (b) “no-DES” model; (c) “with-DES” model; and (d) combined model. The black bar shows the correctly classified essential genes in the “gold standard” set.
Figure 3. Comparison of three approaches in SC. (a) The distribution of AUC across different numbers of known essential genes in SC: red curve: same-organism approach without DES (“no-DES”); black curve: same-organism approach with DES (“with-DES”); blue curve: combined approach; dashed line: cross-organism approach. Bar charts of the correctly classified essential genes among the top 1200 predictions for different numbers of known essential genes in SC using the (b) “no-DES” model; (c) “with-DES” model; and (d) combined model. The black bar shows the correctly classified essential genes in the “gold standard” set.
Figure 4. Comparison of three approaches in NC. (a) The distribution of AUC across different numbers of known essential genes in NC: red curve: same-organism approach without DES (“no-DES”); black curve: same-organism approach with DES (“with-DES”); blue curve: combined approach; dashed line: cross-organism approach. Bar charts of the correctly classified essential genes among the top 1500 predictions for different numbers of known essential genes in NC using the (b) “no-DES” model; (c) “with-DES” model; and (d) combined model. The black bar shows the correctly classified essential genes in the “gold standard” set.
Figure 5. The distribution of features among true positives (TPs) and false negatives (FNs) in AB.
Figure S1. Functional distribution of false negative genes according to the clusters of orthologous groups of proteins (COGs) classification in EC (a) and SC (b), respectively.
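The two quantities reported in Figures 1 to 4, the AUC and the number of correctly classified essential genes among the top-ranked predictions, can be computed from per-gene scores as in the sketch below. The function names and the toy scores are hypothetical. The AUC here uses the pairwise-comparison definition (the probability that a randomly chosen essential gene scores higher than a randomly chosen non-essential gene, with ties counted as one half), which may differ in implementation from whatever tool the authors used.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels are truthy for essential genes and falsy for non-essential genes.
    This O(n_pos * n_neg) form is meant for clarity, not for large genomes.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one essential and one non-essential gene")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hits_in_top_k(scores, labels, k):
    """Count essential genes among the k highest-scoring predictions."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, y in ranked[:k] if y)


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]   # hypothetical classifier outputs
    toy_labels = [1, 0, 1, 0, 0]             # 1 = essential, 0 = non-essential
    print(auc(toy_scores, toy_labels))
    print(hits_in_top_k(toy_scores, toy_labels, k=2))
```

In the figures, k corresponds to 400 for EC and AB, 1200 for SC, and 1500 for NC.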