Tania Lima, Andrea H Auchincloss, Elisabeth Coudert, Guillaume Keller, Karine Michoud, Catherine Rivoire, Virginie Bulliard, Edouard de Castro, Corinne Lachaize, Delphine Baratin, Isabelle Phan, Lydie Bougueleret, Amos Bairoch.
Abstract
The growth in the number of completely sequenced microbial genomes (bacterial and archaeal) has generated a need for a procedure that provides UniProtKB/Swiss-Prot-quality annotation to as many protein sequences as possible. We have devised a semi-automated system, HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes), that uses manually built annotation templates for protein families to propagate annotation to all members of manually defined protein families, using very strict criteria. The HAMAP system is composed of two databases, the proteome database and the family database, and of an automatic annotation pipeline. The proteome database comprises biological and sequence information for each completely sequenced microbial proteome, and it offers several tools for CDS searches, BLAST options and retrieval of specific sets of proteins. The family database currently comprises more than 1500 manually curated protein families and their annotation templates that are used to annotate proteins that belong to one of the HAMAP families. On the HAMAP website, individual sequences as well as whole genomes can be scanned against all HAMAP families. The system provides warnings for the absence of conserved amino acid residues, unusual sequence length, etc. Thanks to the implementation of HAMAP, more than 200,000 microbial proteins have been fully annotated in UniProtKB/Swiss-Prot (HAMAP website: http://www.expasy.org/sprot/hamap).
Year: 2008 PMID: 18849571 PMCID: PMC2686602 DOI: 10.1093/nar/gkn661
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Example of a HAMAP protein family annotation template (family rule), MF_00074 (http://www.expasy.org/unirule/MF_00074). Annotation templates contain three sections: ‘General rule information’, ‘Propagated annotation’ and ‘Additional information’. General information comprises: the family identification number (MF_xxxxx), dates of creation and revision, the ‘Data class’, which indicates that the whole protein is annotated by the family rule and not only a specific domain, and ‘Predictors’, which contain the distribution of matches and the alignment that was used to generate the family profile. The ‘Propagated annotation’ section contains the information that is propagated to all members of a protein family, or only to some members if the field is preceded by ‘cases’ or ‘conditions’. For MF_00074, the function field differs depending on the taxonomic origin, but all proteins have ‘Cytoplasm’ as subcellular location and all belong to the family ‘RNA methyltransferase rsmG’. This section also contains cross-references to other protein family databases, such as Pfam and TIGRFAMs, and manually selected GO terms. Additional information includes the size range of members of this family; any related protein families; the list of characterized protein(s) that were used to compile information for the creation of the protein family and its annotation template (for MF_00074, literature is found for the proteins of E. coli, Bacillus subtilis, Mycobacterium tuberculosis and Streptomyces coelicolor); the scope, i.e. the taxonomic groups covered by this family; whether in at least one member this protein is fused to another protein in either the N-terminal or C-terminal region; and whether there are duplicates or whether in some species the protein is encoded on a plasmid. In the ‘UniProtKB rule member sequences’ section, complete sets of member proteins can be retrieved, taxonomic distribution can be browsed, and specific sets of proteins can be retrieved.
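The split between annotation propagated to every member and annotation that depends on taxonomic origin can be illustrated with a minimal Python sketch. This is not the HAMAP rule format or implementation: the class names, the `resolve` method, the "first matching case wins" policy, the taxon labels `TaxonA`/`TaxonB` and the function texts are all hypothetical placeholders chosen for illustration. Only the rule identifier, the subcellular location and the family name come from the caption.

```python
from dataclasses import dataclass, field


@dataclass
class CaseBlock:
    """Annotation applied only when the member's lineage contains `taxon`."""
    taxon: str
    annotation: dict


@dataclass
class FamilyRule:
    rule_id: str
    common: dict                      # propagated to every family member
    cases: list = field(default_factory=list)

    def resolve(self, lineage):
        """Return the annotation for one member, given its taxonomic lineage."""
        result = dict(self.common)
        for case in self.cases:
            if case.taxon in lineage:
                result.update(case.annotation)
                break  # assumption of this sketch: first matching case wins
        return result


# Hypothetical stand-in for a rule such as MF_00074; taxa and function
# texts are placeholders, not the content of the real template.
rule = FamilyRule(
    rule_id="MF_00074",
    common={
        "subcellular_location": "Cytoplasm",
        "family": "RNA methyltransferase rsmG",
    },
    cases=[
        CaseBlock("TaxonA", {"function": "placeholder function for TaxonA"}),
        CaseBlock("TaxonB", {"function": "placeholder function for TaxonB"}),
    ],
)

annotation_a = rule.resolve(["TaxonA", "GenusX"])
annotation_b = rule.resolve(["TaxonB"])
```

In this sketch both members receive the same subcellular location and family, while the function field differs by lineage; a member matching no case simply receives no function field.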
Figure 2. Examples of uses of the conditional statements ‘case’ and ‘conditions’ in family annotation templates (family rules). MF_00112 (http://www.expasy.org/unirule/MF_00112): an example of ID/protein name/gene name propagation depending on taxonomic distinction. In archaea, no gene name has been assigned but enzyme function has been proven in several different species, whereas the gene name pcrB is used only in Bacillales, with a function that has only been suggested for B. subtilis. Note that the reaction catalyzed by the archaeal protein has no biological significance in bacteria, since GGGP is a specific precursor of archaeal membrane lipids. MF_01544 (http://www.expasy.org/unirule/MF_01544): Subcellular location is predicted based on the number of membranes the bacterium possesses. MF_01624 (http://www.expasy.org/unirule/MF_01624): an example of conditions used for active site and disulfide bond feature propagation. If the indicated amino acid(s) are not present in the appropriate position(s) in the sequence, the feature is not propagated and a warning is generated, necessitating manual intervention. MF_01339 (http://www.expasy.org/unirule/MF_01339): an example of active site, metal and modified residue feature propagation. In the last two examples, the template entry used to derive the information is also indicated.
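The residue-dependent behaviour described for feature propagation (propagate the feature only if the expected amino acid is present, otherwise emit a warning) can be sketched as follows. This is an illustrative sketch, not HAMAP code: `FeatureCondition`, `propagate_features`, the toy sequence and the positions are hypothetical, and the sketch assumes the positions have already been mapped from the template entry onto the member sequence.

```python
from dataclasses import dataclass


@dataclass
class FeatureCondition:
    feature: str    # feature label, e.g. "ACT_SITE" (illustrative)
    position: int   # 1-based position in the member sequence
    allowed: str    # residues accepted at that position, e.g. "C"


def propagate_features(sequence, conditions):
    """Return (propagated features, warnings) for one member sequence."""
    features, warnings = [], []
    for cond in conditions:
        in_range = 1 <= cond.position <= len(sequence)
        residue = sequence[cond.position - 1] if in_range else None
        if residue is not None and residue in cond.allowed:
            features.append((cond.feature, cond.position, residue))
        else:
            # Feature is withheld; the entry would need manual review.
            warnings.append(
                f"{cond.feature} at {cond.position}: expected one of "
                f"{cond.allowed!r}, found {residue!r}"
            )
    return features, warnings


# Toy sequence and conditions, invented for illustration only.
features, warnings = propagate_features(
    "MACDE",
    [
        FeatureCondition("ACT_SITE", 3, "C"),   # C at position 3: propagated
        FeatureCondition("DISULFID", 5, "C"),   # E at position 5: warning
    ],
)
```

Here the active site is propagated because the required cysteine is present, while the second feature is dropped and reported as a warning.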
Figure 3. The HAMAP annotation pipeline. UniProtKB/TrEMBL complete proteome entries matching a HAMAP family detection profile (derived from an alignment of manually selected family members; matches to those profiles are stored in a ‘Match database’, allowing assignment of family membership) are passed through a ‘template engine’ that applies the annotation found in the corresponding HAMAP annotation template, resolving its conditional statements, to generate UniProtKB/Swiss-Prot annotation. If the system generates warnings, or if the matching score is low, the entry is channelled to manual annotation; entries without warnings are directly integrated into UniProtKB/Swiss-Prot. UniProtKB entries for which there is available literature are manually annotated.
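The routing decision at the end of the pipeline can be summarized in a short Python sketch. It is a simplification under stated assumptions rather than the actual pipeline: the `Match` record, the `route` function, the `trusted_cutoff` field used to decide that a score is "low", the accessions and the return labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    accession: str               # placeholder entry identifier
    family: str                  # HAMAP family identifier, e.g. "MF_00074"
    score: float                 # profile match score
    trusted_cutoff: float        # assumed threshold below which a score is "low"
    warnings: list = field(default_factory=list)
    has_literature: bool = False


def route(match):
    """Decide where a matched entry goes after the template engine."""
    if match.has_literature:
        return "manual_annotation"   # entries with literature are curated by hand
    if match.warnings or match.score < match.trusted_cutoff:
        return "manual_annotation"   # warnings or a low score need a curator
    return "swiss_prot"              # clean entries are integrated directly


# Invented example entries.
clean = Match("ENTRY1", "MF_00074", score=25.0, trusted_cutoff=20.0)
low_score = Match("ENTRY2", "MF_00074", score=12.0, trusted_cutoff=20.0)
flagged = Match("ENTRY3", "MF_00074", score=25.0, trusted_cutoff=20.0,
                warnings=["unusual sequence length"])
```

With these inputs, only `clean` is routed to direct integration; the low-scoring and flagged entries are sent to manual annotation.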