Tzu-Lin Hsiao, Olga Revelles, Lifeng Chen, Uwe Sauer, Dennis Vitkup.
Abstract
With the increasing role of computational tools in the analysis of sequenced genomes, there is an urgent need to maintain high accuracy of functional annotations. Misannotations can be easily generated and propagated through databases by functional transfer based on sequence homology. We developed and optimized an automatic policing method to detect biochemical misannotations using context genomic correlations. The method works by finding genes with unusually weak genomic correlations in their assigned network positions. We demonstrate the accuracy of the method using a cross-validated approach. In addition, we show that the method identifies a significant number of potential misannotations in Bacillus subtilis, including metabolic assignments already shown to be incorrect experimentally. The experimental analysis of the mispredicted genes forming the leucine degradation pathway in B. subtilis demonstrates that computational policing tools can generate important biological hypotheses.
Year: 2009 PMID: 19935659 PMCID: PMC2935526 DOI: 10.1038/nchembio.266
Source DB: PubMed Journal: Nat Chem Biol ISSN: 1552-4450 Impact factor: 15.040
Figure 1. Illustration of the developed approach. In the figure, network nodes represent metabolic genes and edges represent connections established by shared metabolites. Using sequence homology, genes X and Y from different organisms have been assigned to EC 1.2.3.4. Gene X displays strong context-based correlations (darker blue indicating stronger correlations) with neighboring network genes. Consequently, the annotation of X is likely to be correct. In contrast, gene Y does not fit well in the assigned network position and is likely to be misannotated.
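The idea in Figure 1 can be illustrated with a minimal sketch. It is not the authors' classifier: it assumes a single context correlation score per gene pair, and the gene names, scores, network, and the 0.3 threshold are all hypothetical values chosen for illustration.

```python
from statistics import mean

# Hypothetical toy data: one context correlation score (e.g. co-expression)
# per unordered gene pair. Real analyses combine several correlation types.
correlations = {
    frozenset({"X", "A"}): 0.82,
    frozenset({"X", "B"}): 0.75,
    frozenset({"Y", "C"}): 0.05,
    frozenset({"Y", "D"}): 0.10,
}

# Hypothetical network: each gene maps to the neighbors it shares a metabolite with.
neighbors = {"X": ["A", "B"], "Y": ["C", "D"]}


def position_fit(gene, neighbors, correlations):
    """Mean context correlation between a gene and its assigned network neighbors."""
    scores = [correlations.get(frozenset({gene, n}), 0.0) for n in neighbors[gene]]
    return mean(scores) if scores else 0.0


def flag_weak_fits(genes, neighbors, correlations, threshold=0.3):
    """Return genes whose fit is below the (illustrative, assumed) threshold."""
    return [g for g in genes if position_fit(g, neighbors, correlations) < threshold]


print(flag_weak_fits(["X", "Y"], neighbors, correlations))  # prints ['Y']
```

Gene X correlates strongly with its neighbors and passes, while gene Y fits its assigned position poorly and is flagged as a candidate misannotation.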
Figure 2. Performance on identifying misannotations. a) The ROC curves for different types of artificially generated misannotations in the yeast network. The True Negative set 1 (TN1) was generated by randomly assigning incorrect metabolic functions to a fraction of network genes. The TN2 set was generated by reassigning network genes to new metabolic activities only if they had at least 30% sequence identity to the newly assigned (incorrect) activities. The TN3 set was generated by assigning genes to new activities only if they had similar (within 10%) or higher sequence identity to the reassigned (incorrect) activities. In all cases the remaining (not reassigned) activities were used as true positive examples. For realistic misannotation models, simulated by the sets TN2 and TN3, the method correctly identifies about 70%–90% of misannotations at a 5%–15% false positive rate. The red dot in the figure approximately corresponds to 70% true positives and 5% false positives. b) The cumulative distributions of the classification confidence scores for B. subtilis metabolic assignments. The B. subtilis annotations made simultaneously by all analyzed databases (KEGG, MetaCyc and Swiss-Prot) are shown in red; annotations unique to KEGG, MetaCyc, or Swiss-Prot are shown in black. For comparison we also show the true negative set TN3 from S. cerevisiae in blue. The cumulative distributions demonstrate that the consensus annotations (red) are, on average, more accurate than the ones unique to individual databases (black, Kolmogorov-Smirnov test P = 2×10⁻¹⁹). However, on average, database-specific annotations still score significantly better than true misannotations (KS P = 2×10⁻⁴).
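The two quantities reported in Figure 2 are a point on a ROC curve and a two-sample Kolmogorov-Smirnov comparison of score distributions. Both can be sketched as follows. The confidence scores and the 0.5 threshold are hypothetical, the function names are illustrative, and the sketch assumes that a lower score means a more suspicious annotation.

```python
def roc_point(scores_correct, scores_misannotated, threshold):
    """Flag an annotation as suspicious when its score is below the threshold.

    Returns (detection_rate, false_alarm_rate): the fraction of known
    misannotations flagged, and the fraction of correct annotations flagged.
    """
    detected = sum(s < threshold for s in scores_misannotated)
    false_alarms = sum(s < threshold for s in scores_correct)
    return detected / len(scores_misannotated), false_alarms / len(scores_correct)


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


# Hypothetical confidence scores (assumed: higher = more likely correct).
correct = [0.9, 0.8, 0.85, 0.7, 0.2]
misannotated = [0.1, 0.3, 0.6, 0.15]

detection, false_alarm = roc_point(correct, misannotated, threshold=0.5)
print(detection, false_alarm)
print(ks_statistic(correct, misannotated))
```

Sweeping the threshold over all observed scores traces the full ROC curve. The KS statistic alone does not give a P value; that additionally requires the sample sizes and the KS distribution, which a statistics library would normally supply.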
Potential misannotations in the B. subtilis metabolic network. Annotation source: K: KEGG, M: MetaCyc, and S: Swiss-Prot. Homology score is the highest protein-protein sequence identity to another Swiss-Prot protein with the target activity; the corresponding BLAST E-value is also shown. The context genomic correlations are represented as the relative percentile ranks. For example, the “expression profile” rank of 20% indicates that the target gene has better co-expression values in 20% of all other possible network locations compared to the location assigned in the database. Lower percentile ranks indicate better consistencies with genomic context correlations. For the protein fusion, “Y” (“N”) indicates that fusion event(s) between an ortholog of the candidate gene and a network neighbor was detected (not detected). The presence of a significantly better alternative location (“Y”/“N”) was determined by the ALR ratio as described in Supplementary Methods.
| Gene name | Annotated function (EC number) | Homology score (% identity / BLAST E-value) | Phylogenetic profile rank (%) | Expression profile rank (%) | Clustering profile rank (%) | Gene distance rank (%) | Protein fusion? | Significantly better alternative location? |
|---|---|---|---|---|---|---|---|---|
| | 1.1.1.284 (K) | 40.7/3E-74 | 90 | 90 | 91 | 83 | N | Y |
| | 2.6.1.17 (K) | 48.5/3E-98 | 58 | 23 | 28 | 72 | N | Y |
| | 2.3.1.74 (K, S, M) | 29.7/1E-04 | 74 | 79 | 73 | 84 | N | Y |
| | 1.11.1.9 (K, S, M) | 55/2E-51 | 47 | 57 | 55 | 52 | N | Y |
| | 4.1.1.18 (M) | 22.8/2E-14 | 45 | 80 | 60 | 84 | N | Y |
| | 2.7.1.107 (K, S, M) | 32.3/2E-11 | 64 | 21 | 55 | 81 | N | N |
| | 3.5.1.32 (K, M) | 35.9/6E-59 | 45 | 64 | 54 | 8 | N | N |
| | 2.7.9.2 (K, M) | 43.5/0.002 | 44 | 38 | 71 | 30 | N | Y |
| | 2.4.2.7 (M) | 29.2/5E-07 | 4 | 1 | 7 | 12 | N | N |
| | 3.2.1.52 (K) | 34.2/1E-27 | 49 | 1 | 54 | 36 | N | N |
| | 1.8.1.9 (K) | 29.8/2E-25 | 50 | 21 | 33 | 16 | N | N |
| | 1.1.1.205 (K) | 37/0.002 | 22 | 68 | 44 | 46 | Y | N |
| | 2.6.1.1 (K) | 30.1/3E-30 | 2 | 1 | 9 | 25 | N | Y |
| | 5.4.2.1 (K) | 38.3/1E-12 | 22 | 65 | 22 | 17 | N | N |
| | 2.5.1.32 (K) | 27.8/8E-24 | 87 | 49 | 60 | 73 | N | Y |
| | 3.1.3.71 (K, S) | 38.7/4E-18 | 87 | 42 | 72 | 10 | N | N |
| | 1.1.1.37 (K) | 39.8/2E-60 | 68 | 30 | 48 | 37 | N | Y |
| | 3.1.3.25 (K, S) | 38.1/2E-28 | 73 | 49 | 49 | 61 | N | N |
| | 3.5.1.47 (K) | 35.6/3E-43 | 75 | 70 | 50 | 79 | N | Y |
| | 6.4.1.3 (K) | 40.1/8E-92 | 1 | 4 | 2 | 12 | Y | Y |
| | 4.2.1.17 (K, M) | 38.9/5E-39 | 1 | 2 | 2 | 14 | Y | Y |
| | 6.2.1.3 (K) | 31/6E-63 | 1 | 10 | 56 | 31 | Y | Y |
| | 1.1.1.95 (K) | 33.8/1E-39 | 1 | 1 | 24 | 74 | N | Y |
| | 1.1.1.1 (K) | 29.7/2E-21 | 39 | 81 | 71 | 30 | N | Y |
| | 3.4.11.9 (K) | 34.9/4E-22 | 50 | 54 | 11 | 78 | N | Y |
| | 1.2.1.2 (K) | 37.5/1E-129 | 2 | 60 | 58 | 51 | Y | Y |
| | 1.1.3.15 (K) | 27.3/4E-10 | 55 | 66 | 76 | 34 | N | Y |
| | 1.6.99.3 (K) | 26.6/3E-25 | 1 | 18 | 26 | 18 | N | Y |
| | 1.8.1.9 (K) | 29.3/1E-21 | 81 | 32 | 35 | 52 | N | Y |
| | 2.3.1.5 (K) | 28.6/6E-13 | 78 | 36 | 77 | 78 | N | N |
| | 1.1.1.215 (K, S) | 47.3/8E-79 | 53 | 59 | 47 | 49 | N | Y |
| | 2.3.2.2 (K) | 31.4/9E-55 | 25 | 85 | 43 | 27 | N | N |
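The percentile ranks in the table follow the definition given in the table caption: the share of alternative network locations where the gene would score better than at its assigned location. A minimal sketch of that calculation is below, assuming a higher score means a stronger correlation (for a distance-like measure the scores would need to be negated first). The function name and example values are hypothetical.

```python
def percentile_rank(assigned_score, alternative_scores):
    """Percent of alternative locations scoring strictly better than the assigned one.

    Assumes higher scores are better. A low rank means few alternative
    locations beat the database assignment, i.e. the assignment fits well.
    """
    if not alternative_scores:
        raise ValueError("need at least one alternative location")
    better = sum(s > assigned_score for s in alternative_scores)
    return 100.0 * better / len(alternative_scores)


# Hypothetical co-expression scores of one gene at five alternative locations.
alternatives = [0.1, 0.2, 0.3, 0.4, 0.9]
print(percentile_rank(0.5, alternatives))  # 1 of 5 alternatives is better: 20.0
```

How ties are treated (strictly better versus better or equal) is a design choice made here for illustration; the source does not specify it.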
Figure 3. Function of genes forming the yng cluster in B. subtilis. a) The genomic positions of the yng genes are shown in green. The detected misannotations are indicated in red. The predicted functions, forming the degradation pathway, are shown in blue. The expression of all yng genes is controlled by the σE transcription factor (ref. 34); the gene mmgA is also under σE control and is responsible for the last step of leucine catabolism. b) Fractional 13C labeling of acetyl-CoA in wild-type sporulating cells and in sporulating yng mutants. The 13C labeling in the figure indicates the fraction of the acetyl-CoA isotopomer generated from leucine in sporulating cells only (see Methods). The error bars in the figure represent SEM. The background acetyl-CoA isotopomer labeling is shown by the dashed line.