Friedhelm Pfeiffer, Dieter Oesterhelt.
Abstract
Genome annotation errors are a persistent problem that impedes research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly crosschecked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publicly available via HaloLex.
Keywords: Gold Standard Protein; Halobacteria; genome annotation; halophilic archaea; manual curation
Year: 2015 PMID: 26042526 PMCID: PMC4500146 DOI: 10.3390/life5021427
Source DB: PubMed Journal: Life (Basel) ISSN: 2075-1729
Genomes under annotation survey.
| Organism | Contribution 1 | Proteins 2 | Locus Tags | UniProt Organism Code | EMBL Accessions | Reference |
|---|---|---|---|---|---|---|
| Halobacterium salinarum | seq, anno | 2845 | OE_ | _HALS3 | AM774415-AM774419 | |
| Natronomonas pharaonis | seq, anno | 2864 | NP_ | _NATPD | CR936257-CR936259 | |
| Natronomonas moolapensis | seq, anno | 2881 | Nmlp_ | _NATM8 | HF582854 | |
| Haloquadratum walsbyi HBSQ001 | seq, anno | 2874 | HQ_ | _HALWD | AM180088-AM180089 | |
| Haloquadratum walsbyi C23 | seq, anno | 2995 | Hqrw_ | _HALWC | FR746099-FR746102 | |
| Haloferax volcanii | anno | 4040 | HVO_ | _HALVD | CP001953-CP001957 | |
| Natrialba magadii | anno | 4295 | Nmag_ | _NATMM | CP001932-CP001935 | |
| Halobacterium hubeiense | anno | 3437 | Hhub_ | - | - | |
| Haloferax mediterranei | 3rd party | 3859 | HFX_ | _HALMT | CP001868-CP001871 | |
| Haloarcula marismortui | 3rd party | 4290 | rrnAC, rrnB, pNG | _HALMA | AY596290-AY596298 | |
| Haloarcula hispanica | 3rd party | 3859 | HAH_ | _HALHT | CP006884-CP006886 | |
| Halohasta litchfieldiae | 3rd party | 3350 | halTADL_ | - | - | |
1 Contribution refers to participation of our group in genome sequencing (seq), annotation (anno), or no participation (3rd party); 2 Based on HaloLex.
Figure 1. Schematic illustration of homology-based function assignment, based on Gold Standard Proteins. Each row represents a set of orthologs from the haloarchaeal genomes under survey (colored dots). These may be absent from some of the genomes (empty places). Other proteins are represented with decreasing sequence similarity. Gold Standard Proteins (yellow) are proteins which have been reported to be functionally characterized. Supposed orthologs of Gold Standard Proteins are indicated in blue, proteins that are not considered to be orthologs in red. Grey dots are homologs for which no decision has been attempted. (a) a protein from the set of haloarchaeal genomes has been experimentally characterized; (b) a closely related Gold Standard Protein; and (c) a more distantly related Gold Standard Protein have been characterized and are considered orthologous to the haloarchaeal proteins; (d) the haloarchaeal proteins are rated to be orthologs in a transitive way. While they are too distant from the Gold Standard Protein to support orthology directly, there are “bridging” proteins (light blue) that are close enough to both; (e) a Gold Standard Protein is too distant to be considered an ortholog and a “bridging” homolog cannot be identified. Only a general annotation can be used in this case; (f) none of the homologs could be identified as a Gold Standard Protein.
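The cases (a) to (f) can be read as a small decision rule. The following is a minimal sketch of that rule, not the curation procedure itself, which relies on manual judgement. The single percent-identity cutoff `MIN_IDENTITY`, the `sim` lookup table, and the function names are all assumptions introduced for illustration; the source states no numeric threshold.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cutoff: the source states no numeric threshold for orthology.
MIN_IDENTITY = 40.0

Pair = Tuple[str, str]


def identity(sim: Dict[Pair, float], a: str, b: str) -> float:
    """Percent identity for an unordered pair; 0.0 when no alignment is recorded."""
    return sim.get((a, b), sim.get((b, a), 0.0))


def annotation_level(
    query: str,
    gold: Optional[str],
    homologs: List[str],
    sim: Dict[Pair, float],
) -> str:
    """Classify the annotation a query protein may receive.

    Returns 'specific', 'specific (bridged)', or 'general'.
    """
    if gold is None:
        return "general"  # case (f): no Gold Standard Protein among the homologs
    if query == gold or identity(sim, query, gold) >= MIN_IDENTITY:
        return "specific"  # cases (a)-(c): direct orthology is supported
    for bridge in homologs:
        if bridge in (query, gold):
            continue
        # case (d): a bridging protein is close enough to both ends
        if (identity(sim, query, bridge) >= MIN_IDENTITY
                and identity(sim, bridge, gold) >= MIN_IDENTITY):
            return "specific (bridged)"
    return "general"  # case (e): too distant and no bridge found
```

In this sketch a protein with 25% identity to the Gold Standard Protein still receives the specific annotation if some homolog is at least 40% identical to both; without such a bridge it falls back to the general annotation.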
Figure 2. Schematic illustration of the procedure to identify publications describing experimental protein characterization. Two major routes to identify Gold Standard Proteins are illustrated. (Left column): Homologs in UniProt/SwissProt are identified by a blastP search. Many of the UniProt/SwissProt entries represent manually curated Gold Standard Proteins, with a reference describing (green triangle) and “cited for” (green half-circle) experimental characterization. If this approach is successful, the evaluated protein can be annotated accordingly (green circle “plus”). UniProt/SwissProt entries lacking such a reference may have been annotated via the HAMAP system, which may allow identification of a Gold Standard Protein ortholog using the “scope:function” search option (not illustrated). On the other hand, the UniProt/SwissProt entry may be incomplete insofar as a reference reports functional characterization but is not flagged accordingly (grey half circle). The haloarchaeal protein is annotated according to this Gold Standard Protein and an update of the UniProt entry is proposed via the UniProt feedback system (blue arrow). The publication may report just the sequence (grey triangle) and cite another publication for experimental characterization. Frequently, sequencing precedes function assignment. The publication reporting experimental characterization commonly cites the sequencing paper and might be detected e.g., via the “cited by” functionality in PubMed.
(Right column): If the UniProt/SwissProt attempt is not successful, homologs in UniProt/TrEMBL are checked for an assigned domain in InterPro. The InterPro annotation may report experimental protein characterizations. If the underlying UniProt entry already reports this characterization (and thus is in UniProt/SwissProt), this protein must be too distant to be considered an ortholog. Otherwise it would already have been identified via the UniProt/SwissProt approach (left column). If characterization of the protein is not reported in UniProt (which will trigger feedback), the protein sequence is used to identify potential haloarchaeal orthologs. If the protein under evaluation is considered an ortholog, it is annotated accordingly. If the InterPro approach also fails, a more extended literature search may be launched. If a Gold Standard Protein ortholog can be identified by literature search, the corresponding UniProt entry is identified and evaluated for orthology. Haloarchaeal orthologs are annotated accordingly (and UniProt/InterPro, which must be incomplete if this approach is required, are informed via feedback). If a literature search is also not successful, only a general annotation is possible.
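The procedure amounts to an ordered fallback: try UniProt/SwissProt, then InterPro, then a literature search, and settle for a general annotation if all fail. The sketch below shows only that control flow. The `GoldStandardHit` record, the `annotate` function, and the lookup callables are hypothetical stand-ins; no real database API is called, and each lookup step is manual work in practice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence, Tuple


@dataclass
class GoldStandardHit:
    accession: str        # identifier of the characterized protein (hypothetical)
    function: str         # specific function reported in the literature
    needs_feedback: bool  # True if the database entry lacks the characterization


# A source is a label plus a lookup that returns a hit or None.
Source = Tuple[str, Callable[[str], Optional[GoldStandardHit]]]


def annotate(
    protein_id: str,
    sources: Sequence[Source],
    general_function: str,
) -> Dict[str, object]:
    """Try each evidence source in order; fall back to a general annotation."""
    for name, lookup in sources:
        hit = lookup(protein_id)
        if hit is not None:
            return {
                "function": hit.function,
                "evidence": name,
                "gold_standard": hit.accession,
                "feedback": hit.needs_feedback,
            }
    # No Gold Standard Protein ortholog found by any route.
    return {
        "function": general_function,
        "evidence": None,
        "gold_standard": None,
        "feedback": False,
    }
```

Passing the sources as an ordered sequence keeps the priority of the routes explicit, and the `feedback` flag records the cases in which the caption says an update is proposed to UniProt or InterPro.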
Figure 3. Schematic illustration of the interaction with public databases. (a) Updating of one member of an ortholog set in HaloLex triggers updating of all the other members, which is validated via consistency checking. Updated genome features are submitted to EMBL. These updates are forwarded to UniProt via inter-database communication, leading to updated UniProt entries a few releases later. (b) Updated and improved annotation at UniProt may affect only a subset of the sequences from the ortholog set. The UniProt update is detected by the database correlation approach and triggers updating of the corresponding HaloLex entries and all haloarchaeal orthologs from the genomes under survey. The improved annotation is forwarded to EMBL (as in (a)) and may lead to updating of additional proteins in UniProt.
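The consistency check and the propagation step in panel (a) can be sketched as two small functions over ortholog sets. This is an illustration under assumptions, not the HaloLex implementation: the data layout (a dict of set identifiers to member lists, and a dict of protein identifiers to names) and both function names are hypothetical.

```python
from typing import Dict, List


def inconsistent_sets(
    ortholog_sets: Dict[str, List[str]],
    annotation: Dict[str, str],
) -> Dict[str, Dict[str, str]]:
    """Return the ortholog sets whose annotated members do not share one name."""
    problems: Dict[str, Dict[str, str]] = {}
    for set_id, members in ortholog_sets.items():
        names = {m: annotation[m] for m in members if m in annotation}
        if len(set(names.values())) > 1:
            problems[set_id] = names
    return problems


def propagate(
    ortholog_sets: Dict[str, List[str]],
    annotation: Dict[str, str],
    updated_member: str,
    new_name: str,
) -> Dict[str, str]:
    """Copy an updated name to every member of the sets containing the protein.

    Returns a new dict; the input annotation is left unchanged.
    """
    result = dict(annotation)
    for members in ortholog_sets.values():
        if updated_member in members:
            for member in members:
                result[member] = new_name
    return result
```

Running `inconsistent_sets` after `propagate` corresponds to the validation step in the caption: a set that was brought up to date should no longer be reported.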