Ramy K Aziz, Daniela Bartels, Aaron A Best, Matthew DeJongh, Terrence Disz, Robert A Edwards, Kevin Formsma, Svetlana Gerdes, Elizabeth M Glass, Michael Kubal, Folker Meyer, Gary J Olsen, Robert Olson, Andrei L Osterman, Ross A Overbeek, Leslie K McNeil, Daniel Paarmann, Tobias Paczian, Bruce Parrello, Gordon D Pusch, Claudia Reich, Rick Stevens, Olga Vassieva, Veronika Vonstein, Andreas Wilke, Olga Zagnitko.
Abstract
BACKGROUND: The number of prokaryotic genome sequences becoming available is growing steadily and is growing faster than our ability to accurately annotate them. DESCRIPTION: We describe a fully automated service for annotating bacterial and archaeal genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user. In addition, the annotated genome can be browsed in an environment that supports comparative analysis with the annotated genomes maintained in the SEED environment. The service normally makes the annotated genome available within 12-24 hours of submission, but ultimately the quality of such a service will be judged in terms of accuracy, consistency, and completeness of the produced annotations. We summarize our attempts to address these issues and discuss plans for incrementally enhancing the service.
Year: 2008 PMID: 18261238 PMCID: PMC2265698 DOI: 10.1186/1471-2164-9-75
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Example Tricarballylate Utilization Subsystem. A) The subsystem comprises 4 functional roles. B) The Subsystem Spreadsheet is populated with genes from 5 organisms (simplified from the original subsystem), where each row represents one organism and each column one functional role. Genes performing the specific functional role in the respective organism populate the respective cell. Gray shading of cells indicates proximity of the respective genes on the chromosomes. There are two distinct variants of the subsystem: variant 1, with all 4 functional roles, and variant 2, where the 3rd functional role is missing.
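The spreadsheet layout described in the Figure 1 caption can be sketched as a small data structure. This is a minimal illustration only: the role names, organism names and gene identifiers below are hypothetical placeholders, and the variant rule simply encodes the two variants named in the caption, not the actual SEED implementation.

```python
from typing import Dict, Optional

# Columns of the spreadsheet: one per functional role (placeholder names).
ROLES = ["role_1", "role_2", "role_3", "role_4"]

# One row per organism; each cell holds the gene filling that role, or None.
Row = Dict[str, Optional[str]]


def assign_variant(row: Row) -> Optional[int]:
    """Return 1 if all four roles are filled, 2 if only the 3rd is missing, else None."""
    missing = [role for role in ROLES if not row.get(role)]
    if not missing:
        return 1
    if missing == [ROLES[2]]:
        return 2
    return None  # pattern not covered by the two variants in the caption


# Hypothetical rows; the gene identifiers are invented for illustration.
spreadsheet: Dict[str, Row] = {
    "organism_A": {"role_1": "peg.10", "role_2": "peg.11", "role_3": "peg.12", "role_4": "peg.13"},
    "organism_B": {"role_1": "peg.40", "role_2": "peg.41", "role_3": None, "role_4": "peg.77"},
}

variants = {organism: assign_variant(row) for organism, row in spreadsheet.items()}
```

Keeping each row as a mapping from role to gene makes an empty cell explicit, which is what distinguishes the two variants.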
Figure 2. Genes connected to subsystems and their distribution in different categories. The categories are expandable down to the specific gene (see Secondary Metabolism).
Figure 3. Job Overview page. The colours in the progress bar have the following meaning: gray – not started, blue – queued for computation, yellow – in progress, red – requires user input, brown – failed with an error, green – successfully completed.
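The colour legend in the Figure 3 caption amounts to a small lookup table. The sketch below encodes it as an enumeration; the class name, member names and the `needs_attention` helper are assumptions made for illustration and are not part of the RAST interface.

```python
from enum import Enum


class StageStatus(Enum):
    """Progress-bar colours and their meanings, as listed in the caption."""
    NOT_STARTED = "gray"
    QUEUED = "blue"
    IN_PROGRESS = "yellow"
    REQUIRES_USER_INPUT = "red"
    FAILED = "brown"
    COMPLETE = "green"


def needs_attention(colour: str) -> bool:
    """Hypothetical helper: True for the two states that call for user action."""
    status = StageStatus(colour)  # raises ValueError for an unknown colour
    return status in {StageStatus.REQUIRES_USER_INPUT, StageStatus.FAILED}
```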
Figure 4. Job Detail page. The RAST annotation progress can be monitored by each user.
Figure 5. Genome Browser. The annotated genome can be browsed starting from a whole-genome view and zooming in to a specific feature.
Figure 6. Annotation Overview. For each annotated feature RAST presents an overview page, which includes comparative genomics views and the connections to a subsystem if one was asserted.
Figure 7. Compare Metabolic Reconstruction tool. In the example, the RAST metabolic reconstruction of the submitted genome of S. pyogenes Manfredo was compared to the metabolic reconstruction for S. pyogenes MGAS315, which is part of the comparative environment of the SEED. All three columns of subsystem categories are expandable. In cases where RAST was conservative in the assertion of a subsystem, a manual attempt to retrieve the missing function(s) can be made by clicking the find button.
Figure 8. View Features page. All annotated features can be viewed and downloaded in table format. For each peg the location on the contig, the functional role assignment, its EC number (if present) and GO category, the connection to a subsystem and a KEGG reaction (if appropriate) are given.
Figure 9. View Scenarios page. A genome-specific reaction network can be viewed on a scenario by scenario basis. The scenarios are organized on the left by subsystems, which are themselves organized by categories of metabolic function. If a path through a scenario was found in a given subsystem, the subsystem name is highlighted in blue. In this case, one path was found through the Uroporphyrinogen III generation scenario in the Porphyrin, Heme and Siroheme Biosynthesis subsystem. The table to the right shows the input and output compounds for the scenario, including their stoichiometry, and the reactions that make up the path through the scenario.
Figure 10. Comparison of a set of genomes manually curated in the SEED and automatically annotated in RAST. The number of genes annotated as hypothetical and the number of genes linked to subsystems (our mechanism of manual curation) are shown to provide an initial assessment of the performance of RAST.
Differences in annotation
| Genes | % matched | % identical | Different annotations |
| --- | --- | --- | --- |
| 2814 | 92.8 | 93.7 | 164 |
| 1613 | 91.7 | 87.5 | 185 |
| 550 | 90.0 | 94.9 | 25 |
| 1844 | 90.6 | 81.7 | 306 |
| 2094 | 93.8 | 84.0 | 314 |
The total number of genes (genes) is the number annotated in SEED; the percentage of matched genes (% matched) is determined by sequence-based matching of genes. Of the matched genes, the % identical subset carries an identical annotation. The last column gives the number of predicted genes with different annotations.
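The columns of this table are related by simple arithmetic, which the sketch below checks. It assumes the reading given in the caption: the last column is the number of matched genes whose annotation is not identical. The function name is illustrative, and because the percentages are rounded to one decimal place, the recomputed counts are only expected to agree with the reported ones to within about one gene.

```python
# (genes, % matched, % identical, reported number with different annotations)
ROWS = [
    (2814, 92.8, 93.7, 164),
    (1613, 91.7, 87.5, 185),
    (550, 90.0, 94.9, 25),
    (1844, 90.6, 81.7, 306),
    (2094, 93.8, 84.0, 314),
]


def expected_different(genes: int, pct_matched: float, pct_identical: float) -> float:
    """Matched genes times the fraction whose annotation is not identical."""
    matched = genes * pct_matched / 100
    return matched * (100 - pct_identical) / 100


# Difference between the recomputed and the reported count for each row.
deviations = [
    abs(expected_different(genes, matched, identical) - reported)
    for genes, matched, identical, reported in ROWS
]
```

For the first row, 2814 × 0.928 gives about 2611 matched genes, of which 6.3% (about 164.5) differ, against a reported 164.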
Analysis of the discrepancies in annotation between SEED and RAST for three genomes
| 164 | 111 | 53 | |
| 306 | 153 | 153 | |
| 314 | 159 | 155 |
A detailed manual analysis
| 51 | 113 | 38 | 49 | 16 | |
| 43 | 105 | 25 | 74 | 19 | |
| 45 | 40 | 44 | 98 | 22 | |
A detailed manual analysis of the genes called in RAST and in the SEED sheds some light on the differences in the respective predictions. As the matching was performed on protein sequences, RNAs could not be matched. Genes found in the SEED but not predicted by RAST were split into two categories: those with and those without hypothetical annotations. Additional predictions found in RAST but not in the SEED were also included and likewise split into hypothetical and non-hypothetical.
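The four-way split described above can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure used in the study: genes are matched only when their protein sequences are exactly identical, an annotation counts as hypothetical when it contains the word "hypothetical", and all sequences and annotations shown are invented.

```python
from collections import Counter
from typing import Dict, Tuple


def is_hypothetical(annotation: str) -> bool:
    """Assumed rule: any annotation mentioning 'hypothetical' is hypothetical."""
    return "hypothetical" in annotation.lower()


def categorize_unmatched(seed: Dict[str, str], rast: Dict[str, str]) -> Counter:
    """Count unmatched genes by (source, is_hypothetical).

    Both arguments map a protein sequence to its annotation.
    """
    counts: Counter = Counter()
    for sequence, annotation in seed.items():
        if sequence not in rast:
            counts[("SEED only", is_hypothetical(annotation))] += 1
    for sequence, annotation in rast.items():
        if sequence not in seed:
            counts[("RAST only", is_hypothetical(annotation))] += 1
    return counts


# Invented toy data: one shared gene and two unmatched genes on each side.
seed_genes = {"MKT": "hypothetical protein", "MAA": "Citrate lyase", "MGG": "DNA gyrase"}
rast_genes = {"MGG": "DNA gyrase", "MLL": "Hypothetical protein", "MPP": "Transposase"}

unmatched: Dict[Tuple[str, bool], int] = dict(categorize_unmatched(seed_genes, rast_genes))
```

RNA genes have no protein sequence, so they never appear as keys here, which mirrors why they could not be matched.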