Stéphane Descorps-Declère, Matthieu Barba, Bernard Labedan.
Abstract
BACKGROUND: Curated databases of completely sequenced genomes have been designed independently at the NCBI (RefSeq) and EBI (Genome Reviews) to cope with the non-standard annotation found in the versions of sequenced genomes published by the GenBank/EMBL/DDBJ databanks. These curation efforts were expected to review the annotations and to improve their pertinence for annotating newly released genome sequences by homology to previously annotated genomes. However, we observed that such an uncoordinated effort has two unwanted consequences. First, it is not trivial to map the protein identifiers of the same sequence between the two databases. Second, the two reannotated versions of the same genome differ at the level of their structural annotation.
Year: 2008 PMID: 18950477 PMCID: PMC2596144 DOI: 10.1186/1471-2164-9-501
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. The different steps of the CorBank program. The main steps of the pipeline of Perl scripts are distinguished by color. Cross-referencing of exact matches between the RefSeq and Genome Reviews versions of the same gene is shown in yellow. Identification of inexact matches, i.e. genes that have a different structural annotation in the two databases, is shown in blue. Finally, characterization of the nature of the detected structural differences is shown in pink.
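The cross-referencing logic outlined in Figure 1 can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' Perl pipeline. The `CDS` record, the helper names, and in particular the rule used here for an inexact match (same replicon, strand and 3' end, but a different 5' end) are assumptions made for illustration; the paper's own definitions of inexact configurations are given in its Additional file 1. All protein identifiers below are made up.

```python
from typing import NamedTuple


class CDS(NamedTuple):
    """Hypothetical minimal record for one annotated coding sequence."""
    protein_id: str
    replicon: str
    strand: str  # "+" or "-"
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate, inclusive


def span_key(cds):
    """Key used for exact matches: identical replicon, strand and coordinates."""
    return (cds.replicon, cds.strand, cds.start, cds.end)


def stop_key(cds):
    """Key used for inexact matches (assumption): same replicon, strand and 3' end."""
    three_prime = cds.end if cds.strand == "+" else cds.start
    return (cds.replicon, cds.strand, three_prime)


def classify_matches(refseq, genome_reviews):
    """Split two annotations of one genome into exact, inexact and specific CDS."""
    # Step 1: exact matches, which also yield the identifier cross-reference.
    gr_by_span = {span_key(c): c for c in genome_reviews}
    exact, rs_left = [], []
    for c in refseq:
        partner = gr_by_span.pop(span_key(c), None)
        if partner is not None:
            exact.append((c.protein_id, partner.protein_id))
        else:
            rs_left.append(c)

    # Step 2: among the leftovers, pair CDS that share a 3' end.
    gr_by_stop = {stop_key(c): c for c in gr_by_span.values()}
    inexact, rs_only = [], []
    for c in rs_left:
        partner = gr_by_stop.pop(stop_key(c), None)
        if partner is not None:
            inexact.append((c.protein_id, partner.protein_id))
        else:
            rs_only.append(c.protein_id)

    # Step 3: whatever remains is specific to one database.
    gr_only = [c.protein_id for c in gr_by_stop.values()]
    return exact, inexact, rs_only, gr_only


# Toy data with invented identifiers and coordinates.
rs = [CDS("RS_0001", "chr", "+", 100, 400),
      CDS("RS_0002", "chr", "-", 500, 800),
      CDS("RS_0003", "chr", "+", 900, 1200)]
gr = [CDS("GR_A", "chr", "+", 100, 400),
      CDS("GR_B", "chr", "-", 500, 770),
      CDS("GR_C", "chr", "+", 1500, 1800)]

exact, inexact, rs_only, gr_only = classify_matches(rs, gr)
print(exact, inexact, rs_only, gr_only)
```

Keying on coordinates rather than on identifiers reflects the first problem raised in the abstract: the two databases do not share protein identifiers, so position is the only common ground in this sketch.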
Figure 2. Differentiating exact and inexact matches. A partial view of the CorBank output obtained when comparing the two versions of the genome of the archaeon Pyrococcus horikoshii OT3 is detailed in several tables. Table A summarizes the respective database information about this species and its computed label. Table B summarizes the data obtained using CorBank to find what is common to both databases and what is specific to each one. Table C illustrates a few instances of exact matches. Table D shows a few inexact matches, detailing how the structural annotations of the two copies of the same gene differ. The definitions of these inexact configurations are given in Additional file 1.
The reannotated copies of the same genome in independently curated databases (a) are predominantly divergent
| 260 (40.5%) | 321 (50%) | ||
| 10 (1.5%) | 50 (8%) | ||
(a) RefSeq (Release 30) and Genome Reviews (Release 94.0) of July 2008
Top ten organisms with the highest number of CDS specific to the RefSeq (RS) database
| RS1/GR7 | Pyrococcus horikoshii OT3 | 1955 | 2076 | 1806 | 124 | 0 | 270 | |
| RS2 | Neisseria meningitidis Z2491 | 2049 | 1991 | 1897 | 37 | 26 | 68 | |
| RS3/GR4 | Xanthomonas oryzae pv. oryzae KACC10331 | 4144 | 4540 | 4030 | 497 | 2 | 510 | |
| RS4 | Pyrococcus abyssi GE5 | 1896 | 1796 | 1783 | 68 | 0 | 3 | |
| RS5/GR6 | Shewanella oneidensis MR-1 | 4467 | 4779 | 4364 | 34 | 1 | 415 | |
| RS6/GR8 | Escherichia coli O157:H7 str. Sakai | 5318 | 5461 | 5227 | 391 | 2 | 232 | |
| RS7 | Deinococcus radiodurans R1 | 3181 | 1303 | 3099 | 91 | 1 | 4 | |
| RS8 | Pyrococcus furiosus DSM 3638 | 2125 | 2065 | 2065 | 115 | 8 | 0 | |
| RS9 | Lactococcus lactis subsp. lactis Il1403 | 2321 | 2266 | 2263 | 68 | 0 | 3 | |
| RS10 | Thermoplasma volcanium GSS1 | 1499 | 1526 | 1444 | 351 | 1 | 82 | |
The organisms are ranked by the number of CDS found only in the RefSeq database (Release 30). The names of organisms that appear in the top ten lists of both databases (Tables 3 and 4) are in bold.
Figure 3. Differentiating exact and inexact matches (continued). Table E illustrates a few instances of genes found only in RefSeq. Table F shows a few genes specific to Genome Reviews. Table G lists the pseudo-CDS specific to Genome Reviews. Table H re-evaluates the data presented in Table B after identifying, by their positions, the pseudogenes specific to RefSeq and the pseudo-CDS specific to Genome Reviews, and assessing their exactitude.
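The position-based re-evaluation mentioned for Table H could look something like the sketch below. It is only an illustration under stated assumptions: the criterion used here (a database-specific CDS is "explained" when it overlaps, on the same strand, a pseudogene annotated in the other database) is a guess at the general idea, not the rule implemented in CorBank, and the function names, identifiers and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (strand, start, end) features share a strand and at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def split_by_pseudogene(specific_cds, pseudo_features):
    """Separate database-specific CDS into those that coincide with a
    pseudogene of the other database and those that do not (assumed criterion)."""
    explained, unexplained = [], []
    for name, span in specific_cds.items():
        if any(overlaps(span, p) for p in pseudo_features):
            explained.append(name)
        else:
            unexplained.append(name)
    return explained, unexplained


# Invented example: two CDS seen only in Genome Reviews, one RefSeq pseudogene.
gr_specific = {"GR_X": ("+", 1000, 1600), "GR_Y": ("-", 3000, 3300)}
rs_pseudogenes = [("+", 950, 1650)]

explained, unexplained = split_by_pseudogene(gr_specific, rs_pseudogenes)
print(explained, unexplained)
```

In this toy case only `GR_X` coincides with a RefSeq pseudogene, so it would no longer count as a genuinely database-specific gene in a re-evaluated summary.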
Complete distributions of the divergences of curated databases (a) in the case of closely related species
| P. horikoshii | 1955 | 2076 | 1806 | 1680 | 124 | 0 | 149 | 270 |
| P. furiosus | 2125 | 2065 | 2065 | 1942 | 115 | 8 | 60 | 0 |
| P. abyssi | 1896 | 1786 | 1783 | 1715 | 68 | 0 | 113 | 3 |
| T. kodakarensis | 2306 | 2306 | 2306 | 2303 | 2 | 1 | 0 | 0 |
(a) Versions of May 2008
Top ten organisms with the highest number of CDS specific to the Genome Reviews (GR) database
| GR1 | Mycobacterium leprae TN | 1605 | 2723 | 1605 | 77 | 1 | 0 | 1118 |
| GR2 | Orientia tsutsugamushi str. Boryong (Seoul National University) | 1182 | 2143 | 1182 | 3 | 0 | 0 | 961 |
| GR3 | Orientia tsutsugamushi str. Boryong (Kitasato University) | 1562 | 2085 | 1562 | 6 | 0 | 0 | 523 |
| GR4/RS3 | Xanthomonas oryzae pv. oryzae KACC10331 | 4144 | 4540 | 4030 | 497 | 2 | 114 | 510 |
| GR5 | Acinetobacter baumannii ATCC 17978 | 3368 | 3807 | 3368 | 77 | 0 | 0 | 439 |
| GR6/RS5 | Shewanella oneidensis MR-1 | 4467 | 4779 | 4364 | 34 | 1 | 103 | 415 |
| GR7/RS1 | Pyrococcus horikoshii OT3 | 1955 | 2076 | 1806 | 124 | 0 | 149 | 270 |
| GR8/RS6 | Escherichia coli O157:H7 str. Sakai | 5318 | 5461 | 5227 | 391 | 2 | 87 | 232 |
| GR9 | Prochlorococcus marinus subsp. pastoris str. CCMP1986 | 1717 | 1935 | 1714 | 4 | 2 | 3 | 221 |
| GR10 | Prochlorococcus marinus str. MIT 9312 | 1810 | 1962 | 1810 | 10 | 0 | 0 | 152 |
The organisms are ranked by the number of CDS found only in the Genome Reviews database (Release 94.0). The names of organisms that appear in the top ten lists of both databases (Tables 3 and 4) are in bold.
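The ranking rule described above amounts to a descending sort on the database-specific CDS count. The short Python sketch below reproduces it for the ten organisms of this table, assuming that the final column is the number of CDS found only in Genome Reviews (a reading consistent with the descending order of the rows). The variable names are illustrative only.

```python
# (organism, CDS found only in Genome Reviews), taken from the last column
# of the table above (assumed meaning) and listed here in alphabetical order.
gr_specific_counts = [
    ("Acinetobacter baumannii ATCC 17978", 439),
    ("Escherichia coli O157:H7 str. Sakai", 232),
    ("Mycobacterium leprae TN", 1118),
    ("Orientia tsutsugamushi str. Boryong (Kitasato University)", 523),
    ("Orientia tsutsugamushi str. Boryong (Seoul National University)", 961),
    ("Prochlorococcus marinus str. MIT 9312", 152),
    ("Prochlorococcus marinus subsp. pastoris str. CCMP1986", 221),
    ("Pyrococcus horikoshii OT3", 270),
    ("Shewanella oneidensis MR-1", 415),
    ("Xanthomonas oryzae pv. oryzae KACC10331", 510),
]

# Rank = position after sorting by the specific-CDS count, largest first.
ranked = sorted(gr_specific_counts, key=lambda row: row[1], reverse=True)

for rank, (organism, count) in enumerate(ranked, start=1):
    print(f"GR{rank}\t{organism}\t{count}")
```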