Rodrigo Liberal, John W Pinney.
Abstract
MOTIVATION: Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison with sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are, therefore, needed to assist in the identification of misannotated gene products. In the case of enzymatic functions, each functional assignment implies the existence of a reaction within the organism's metabolic network; a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead-end or disconnected reactions, can, therefore, be strong indications of misannotation.
Year: 2013 PMID: 23812979 PMCID: PMC3694667 DOI: 10.1093/bioinformatics/btt236
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Genome-scale model validation results
| KEGG ID | Species name | AUC | Citation |
|---|---|---|---|
| ani | | 0.56 | |
| ath | | 0.57 | |
| bsu | | 0.61 | |
| buc | | 0.68 | |
| det | | 0.60 | |
| eco | | 0.55 | |
| hsl | | 0.60 | |
| lpl | | 0.64 | |
| mge | | 0.43 | |
| nme | | 0.58 | |
| nph | | 0.60 | |
| pfa | | 0.59 | |
| pgi | | 0.60 | |
| pic | | 0.48 | |
| sau | | 0.52 | |
| sce | | 0.56 | |
| sce | | 0.53 | |
| sco | | 0.64 | |
| sco | | 0.63 | |
| son | | 0.55 | |
| syn | | 0.57 | |
| vvu | | 0.52 | |
| ypm | | 0.55 | |
| zmo | | 0.61 | |
Note: The final classifier was applied to KEGG metabolic models, and the results were compared with curated genome-scale metabolic models for these species.
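An AUC can be read as the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one, so values near 0.5 indicate performance close to chance. The sketch below shows that rank-based calculation. It assumes that reactions supported by the curated model are labelled 1 and all others 0; the function name `auc_from_scores` and the toy scores are illustrative and do not come from the paper.

```python
def auc_from_scores(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores and curated-model labels for five reactions.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auc_from_scores(toy_scores, toy_labels)
print(round(toy_auc, 3))
```

The pairwise loop is quadratic in the number of reactions, which is acceptable for an illustration; a sort-based version would be preferable for whole-network evaluations.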
Classification features
| Group | Feature | Definition |
|---|---|---|
| 1 | Number of compounds connected to >2 reactions. | |
| | Number of unpaired compounds. | |
| | Reaction type: 1 = unpaired compounds on both sides of the reaction, 2 = unpaired compounds on only one side, 3 = no unpaired compounds. | |
| | Number of chokepoint compounds. | |
| | Number of compounds. | |
| | Number of compounds connected to >2 and <10 reactions. | |
| | Number of compounds connected to 10–50 reactions. | |
| | Number of compounds connected to >50 reactions. | |
| | Number of other reactions sharing a compound with this reaction. | |
| | Mean number of other reactions connected to each compound. | |
| | Number of connections of the least connected compound. | |
| | Number of connections of the second least connected compound. | |
| | Number of connections of the third least connected compound. | |
| | Number of connections of the fourth least connected compound. | |
| 2 | Eccentricity using unweighted edges. | |
| | Normalized eccentricity using unweighted edges. | |
| | Eccentricity using weighted edges. | |
| | Normalized eccentricity using weighted edges. | |
| | Betweenness using unweighted edges. | |
| | Betweenness using weighted edges. | |
| | Number of reactions in the connected component. | |
| 3 | Fraction of reactions of type 1 or 2 in the network. | |
| 4 | Domain: 1 = Bacteria, 2 = Eukaryota, 3 = Archaea. | |
| | 1 = species is related to disease, 0 = species is not related to disease. | |
Note: The features chosen were divided into four groups as shown: 1 = local, 2 = semi-local, 3 = global and 4 = non-topological.
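Several of the local (group 1) features can be computed directly from a reaction-to-compound mapping. The following is a minimal sketch, not the authors' implementation. It assumes that a compound is "unpaired" when no other reaction in the network uses it (this record does not define the term), and the network layout, the function name `local_features` and the toy reactions are all hypothetical.

```python
from collections import defaultdict


def local_features(network, reaction_id):
    """Compute a few local features for one reaction.

    network maps reaction id -> (substrates, products), each an iterable of
    compound ids. This layout is a hypothetical choice for the sketch.
    """
    # Map every compound to the set of reactions that use it.
    usage = defaultdict(set)
    for rid, (subs, prods) in network.items():
        for compound in set(subs) | set(prods):
            usage[compound].add(rid)

    subs, prods = (set(side) for side in network[reaction_id])
    compounds = subs | prods

    # ASSUMPTION: a compound is "unpaired" if no other reaction uses it.
    unpaired = {c for c in compounds if len(usage[c]) == 1}
    on_left, on_right = bool(unpaired & subs), bool(unpaired & prods)
    if on_left and on_right:
        reaction_type = 1  # unpaired compounds on both sides
    elif on_left or on_right:
        reaction_type = 2  # unpaired compounds on only one side
    else:
        reaction_type = 3  # no unpaired compounds

    # Other reactions sharing at least one compound with this reaction.
    neighbours = set().union(*(usage[c] for c in compounds)) - {reaction_id}
    degrees = sorted(len(usage[c]) for c in compounds)
    return {
        "n_compounds": len(compounds),
        "n_unpaired": len(unpaired),
        "reaction_type": reaction_type,
        "n_highly_connected": sum(1 for d in degrees if d > 2),
        "n_neighbour_reactions": len(neighbours),
        "mean_other_reactions": sum(d - 1 for d in degrees) / len(degrees),
        "least_connected": degrees[0],
    }


# Toy network (hypothetical): R4 is disconnected from the rest.
toy_network = {
    "R1": (["A"], ["B"]),
    "R2": (["B"], ["C"]),
    "R3": (["B", "C"], ["D"]),
    "R4": (["X"], ["Y"]),
}
for rid in toy_network:
    print(rid, local_features(toy_network, rid))
```

In this toy network the disconnected reaction R4 is assigned type 1 and has no neighbouring reactions, which is the kind of signal the abstract describes as a possible indication of misannotation.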
Fig. 1. Feature histograms. Visualization of the potential value of each attribute in distinguishing the correct functional assignments from the incorrect ones (red: incorrect annotations; blue: correct annotations). The Kolmogorov–Smirnov test shows that each of these attributes has a significantly different distribution for the correct and incorrect annotations. The corresponding P-values are shown on each histogram. Similar histograms for the remaining features are shown in Supplementary Figure S2.
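The two-sample Kolmogorov–Smirnov statistic behind such a comparison is the largest vertical gap between the empirical cumulative distribution functions of the two groups. The sketch below computes only that statistic, using the standard library; the function name `ks_statistic` and the sample values are illustrative assumptions. Turning the statistic into a P-value requires the KS distribution, for which a library routine such as `scipy.stats.ks_2samp` could be used.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    largest_gap = 0.0
    # The ECDFs are step functions, so the maximum gap occurs at a data point.
    for x in a + b:
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Hypothetical feature values for correct and incorrect annotations.
correct = [1, 2, 3, 4]
incorrect = [3, 4, 5, 6]
print(ks_statistic(correct, incorrect))
```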
The 5-fold cross-validation results
| | Mean (SD) |
|---|---|
| Accuracy | 0.86 (0.005) |
| Precision | 0.91 (0.009) |
| Recall | 0.88 (0.011) |
| AUC | 0.92 (0.007) |
Note: The predictive model performance was assessed by a 5-fold cross-validation. The table shows the accuracy, precision, recall and AUC of this analysis and their standard deviations.
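The accuracy, precision and recall reported here are standard confusion-matrix quantities, summarized across folds as a mean and standard deviation. A minimal sketch of that bookkeeping follows. The function names and the toy fold data are hypothetical, and the sketch uses the sample standard deviation; the source does not say which estimator the authors used.

```python
from statistics import mean, stdev


def fold_metrics(y_true, y_pred):
    """Accuracy, precision and recall for one fold (1 = correct annotation)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def summarise(per_fold):
    """Return {metric: (mean, sample SD)} across a list of per-fold dicts."""
    summary = {}
    for name in per_fold[0]:
        values = [fold[name] for fold in per_fold]
        summary[name] = (mean(values), stdev(values))
    return summary


# Two hypothetical folds, only to show the data flow (the paper used five).
folds = [
    fold_metrics([1, 1, 0, 0], [1, 0, 0, 1]),
    fold_metrics([1, 1, 0, 0], [1, 1, 0, 0]),
]
print(summarise(folds))
```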
Superfamily cross-validation results
| Superfamily | Accuracy | Precision | Recall | AUC |
|---|---|---|---|---|
| Enolase | 0.60 | 0.57 | 0.97 | 0.60 |
| Vicinal oxygen chelate | 0.52 | 0.86 | 0.51 | 0.59 |
| Haloacid dehalogenase | 0.60 | 0.77 | 0.46 | 0.67 |
| Amidohydrolase | 0.66 | 0.69 | 0.74 | 0.68 |
Note: To test performance on unseen enzyme classes, the classifier was assessed in a leave-one-out cross-validation at the superfamily level. The table shows the accuracy, precision, recall and the AUC of each analysis, where each superfamily in turn was used as the test dataset.
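Holding out one superfamily at a time amounts to a grouped split in which every sample from the held-out superfamily goes to the test set and all remaining samples are used for training. The generator below sketches that split. The function name `leave_one_group_out` and the sample-to-superfamily assignment are hypothetical; only the superfamily names come from the table.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical superfamily label for each of six annotated samples.
sample_groups = [
    "Enolase",
    "Amidohydrolase",
    "Enolase",
    "Haloacid dehalogenase",
    "Vicinal oxygen chelate",
    "Amidohydrolase",
]
splits = list(leave_one_group_out(sample_groups))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Because no superfamily appears in both sets, each test score reflects performance on an enzyme class the classifier has not seen, which is the purpose stated in the table note.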
Fig. 2. Predicted quality of draft metabolic networks across a prokaryote phylogeny. The classifier was applied to all prokaryote species present in the iTOL phylogeny (Letunic and Bork, 2007, 2011). Coloured clades represent the different phyla present (only phyla with more than one species were coloured). The names of the phyla are shown to the right. Predicted annotation quality values are represented by grey bars next to the species name.
Fig. 3. Variation of predicted annotation quality with phylogenetic distance to model organism. Left: Scatter plot showing predicted annotation quality (precision of annotated reactions according to the classifier) for eukaryotes against phylogenetic distance to H. sapiens. Right: Scatter plot showing predicted annotation quality for prokaryotes against phylogenetic distance to E. coli (Ciccarelli ). The shaded region shows the 95% confidence interval for the regression line.
Fig. 4. Variation of predicted annotation quality with genome size. Left: Scatter plot showing predicted annotation quality against genome size in eukaryotes: species are classified as animals, fungi, plants, protists and others. Right: Scatter plot showing predicted annotation quality against genome size in prokaryotes (orange: well-studied species, i.e. E. coli strains and the closely related species Salmonella and Yersinia; green: intracellular obligate species). The shaded region shows the 95% confidence interval for the regression line.
Fig. 5. Variation of predicted annotation quality with organism type. Box plot of the distribution of quality scores in different sets of prokaryote species (orange: well-studied species, i.e. E. coli strains and the closely related species Salmonella and Yersinia; olive: species for which there is a GENRE (Price ) available; green: facultative intracellular species; blue: intracellular obligate species; magenta: all other species).