José P Faria1,2, Janaka N Edirisinghe1,3, James J Davis4,5, Terrence Disz1, Anna Hausmann6, Christopher S Henry1,3, Robert Olson1, Ross A Overbeek1,6, Gordon D Pusch6, Maulik Shukla7, Veronika Vonstein1,6, Alice R Wattam7.
Abstract
For many scientific applications, it is highly desirable to be able to compare metabolic models of closely related genomes. In this short report, we attempt to raise awareness of the fact that taking annotated genomes from public repositories and using them for metabolic model reconstruction is far from trivial because of annotation inconsistencies. We propose a protocol for comparative analysis of metabolic models of closely related genomes, using fifteen strains of the genus Brucella, which contains pathogens of both humans and livestock. This study led to the identification and subsequent correction of inconsistent annotations in the SEED database, as well as the identification of 31 biochemical reactions that are common to Brucella and that were not originally identified by automated metabolic reconstructions. We are currently implementing this protocol to improve automated annotations within the SEED database, and these improvements have been propagated into PATRIC, Model-SEED, KBase and RAST. This method is an enabling step for the future creation of consistent annotation systems and high-quality model reconstructions that will support the accurate prediction of phenotypes such as pathogenicity, media requirements or type of respiration.
Year: 2014 PMID: 28324362 PMCID: PMC4327756 DOI: 10.1007/s13205-014-0202-4
Source DB: PubMed Journal: 3 Biotech ISSN: 2190-5738 Impact factor: 2.406
Brucella genomes used in this study with their SEED (Overbeek et al. 2005, 2013) and PATRIC (Gillespie et al. 2011; Wattam et al. 2013) identifiers, sizes, number of contigs and number of protein coding sequences (CDSs)
| Genome name | PubSEED ID | PATRIC genome ID | Genome size (bp) | Number of contigs | Number of CDSs |
|---|---|---|---|---|---|
| | 262698.4 | 15061 | 3,286,445 | 2 | 3,413 |
| | 483179.4 | 25663 | 3,312,769 | 2 | 3,394 |
| | 595497.3 | 28239 | 3,389,269 | 7 | 3,578 |
| | 520460.3 | 83544 | 3,337,230 | 22 | 3,367 |
| | 224914.11 | 92729 | 3,294,931 | 2 | 3,446 |
| | 568815.3 | 92249 | 3,294,931 | 2 | 3,374 |
| | 520456.3 | 114381 | 3,329,623 | 11 | 3,383 |
| | 444178.3 | 136990 | 3,275,590 | 2 | 3,499 |
| | 520462.3 | 74143 | 3,373,519 | 15 | 3,356 |
| | 520449.3 | 75385 | 3,153,851 | 20 | 3,152 |
| | 470735.4 | 109945 | 3,366,774 | 55 | 3,361 |
| | 693750.4 | 146994 | 3,305,941 | 174 | 3,276 |
| | 520448.3 | 103899 | 3,297,137 | 17 | 3,442 |
| | 204722.5 | 107850 | 3,315,175 | 2 | 3,402 |
| | 520489.3 | 73489 | 3,323,676 | 19 | 3,316 |
The consistency of annotations across different resources
| Source | Number of pairs | Number of pairs inconsistently annotated | Percent of pairs inconsistently annotated |
|---|---|---|---|
| RefSeq | 562,597,217 | 383,808,122 | 68.2 |
| IMG | 101,525,838 | 52,434,525 | 51.6 |
| TrEMBL | 112,735,194 | 46,284,849 | 41.1 |
| SwissProt | 803,819 | 42,429 | 5.3 |
| SEED | 271,622,566 | 9,056,551 | 3.3 |
| Original RAST output | 16,349,603 | 102,097 | 0.6 |
| RAST after manual curation | 16,349,603 | 47,504 | 0.3 |
For each protein in a Brucella protein family used in this study, all of the proteins with identical sequences were found in various databases, and the percentage of pairs that were inconsistently annotated was computed. Annotations were collected from RefSeq (Pruitt et al. 2007), the UniProt Knowledgebase (UniProtKB) (Apweiler et al. 2010), the Translated EMBL Nucleotide Sequence Data Library (TrEMBL) (Boeckmann et al. 2003), the Integrated Microbial Genomes (IMG) system (Markowitz et al. 2012) and the SEED (Overbeek et al. 2005, 2013).
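The pair-counting idea behind this table can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `inconsistency_stats`, the `(sequence, annotation)` record layout, the toy sequences and the use of exact string comparison to decide whether two annotations agree are all assumptions made for illustration. The study may have normalized annotation text or grouped proteins differently.

```python
from collections import Counter, defaultdict
from math import comb


def inconsistency_stats(records):
    """Count annotation pairs among proteins with identical sequences.

    records: iterable of (sequence, annotation) tuples from one source
             (hypothetical layout, assumed for this sketch).
    Returns (total_pairs, inconsistent_pairs, percent_inconsistent).
    """
    # Group annotations by exact sequence identity.
    by_seq = defaultdict(Counter)
    for seq, annotation in records:
        by_seq[seq][annotation] += 1

    total = inconsistent = 0
    for counts in by_seq.values():
        n = sum(counts.values())
        pairs = comb(n, 2)  # all unordered pairs of identical sequences
        # Pairs whose two members carry the same annotation string.
        same = sum(comb(c, 2) for c in counts.values())
        total += pairs
        inconsistent += pairs - same

    percent = 100.0 * inconsistent / total if total else 0.0
    return total, inconsistent, percent


# Toy data (invented): three copies of one sequence, two of another.
toy = [
    ("MKTAYIAK", "Glucokinase (EC 2.7.1.2)"),
    ("MKTAYIAK", "Glucokinase (EC 2.7.1.2)"),
    ("MKTAYIAK", "hypothetical protein"),
    ("MSLLNRQV", "Urease alpha subunit"),
    ("MSLLNRQV", "Urease alpha subunit"),
]
total, bad, pct = inconsistency_stats(toy)
print(total, bad, round(pct, 1))  # 4 2 50.0
```

Counting agreeing pairs per annotation and subtracting from the total avoids enumerating every pair explicitly, which matters when a source contributes hundreds of millions of pairs, as RefSeq does in the table.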