Arnold Kuzniar, Ke Lin, Ying He, Harm Nijveen, Sándor Pongor, Jack A M Leunissen.
Abstract
Current protein sequence databases employ different classification schemes that often provide conflicting annotations, especially for poorly characterized proteins. ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap) is a web tool designed to help researchers and database annotators assess the coherence of protein groups defined in various databases and thereby facilitate the annotation of newly sequenced proteins. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences, which is mapped to 240,000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF. ProGMap combines the underlying classification schemes via a network of links constructed by a fast and fully automated mapping approach originally developed for document classification. The web interface accepts queries by sequence identifier, gene symbol, protein function, or amino acid or nucleotide sequence. For sequence queries, BLAST similarity search and QuickMatch identity search services have been incorporated to find sequences similar (or identical) to the query sequence. ProGMap is meant to help users of high-throughput methodologies who deal with partially annotated genomic data.
Year: 2009 PMID: 19494185 PMCID: PMC2703891 DOI: 10.1093/nar/gkp462
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Database members and supported identifiers in the ProGMap database
| Database | Supported identifiers | Notes | URL |
|---|---|---|---|
| UniProt | • Protein ID (e.g. HBA_HUMAN) • Primary/secondary protein ACCESSION (e.g. P69905, P01922) | | ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/ |
| RefSeq (proteins) | • Protein ACCESSION (e.g. NP_000549) • Protein GI (e.g. 4504347) | | ftp://ftp.ncbi.nih.gov/refseq/release/ |
| Ensembl (proteins) | • Translation ID (e.g. ENSP00000251595) | | ftp://ftp.ensembl.org/pub/ |
| Ensembl Compara (families) | • Family ID (e.g. ENSF00000005499) | Protein families | |
| HomoloGene | • RefSeq protein ACCESSION (e.g. NP_000549) • Protein GI (e.g. 4504347) • Entrez GeneID (e.g. 3039) • Official gene symbol (e.g. HBA1) • Group ID (e.g. 469) | Orthologous clusters of 20 eukaryotic proteomes | ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/ |
| COG | • COG-specific protein ID (e.g. ampG) • Group ID (e.g. COG0477) | Orthologous clusters of 66 prokaryotic and eukaryotic (unicellular only) proteomes | ftp://ftp.ncbi.nih.gov/pub/COG/KOG/ |
| KOG | • KOG-specific protein ID (e.g. Hs4504345) • Group ID (e.g. KOG3378) | Orthologous clusters of seven eukaryotic proteomes | ftp://ftp.ncbi.nih.gov/pub/COG/KOG/ |
| OrthoMCL-DB | • DB-specific protein ID (e.g. hsa11326) • Group ID (e.g. OG1_7606) | Orthologous clusters of 87 proteomes (both eukaryotes and prokaryotes) | |
| TRIBES | • Tribes specific protein ID (e.g. MMUS-XXX-02-000372) • Group ID (e.g. TR-006821) | Protein families | |
| PIRSF | • UniProt ACCESSION (e.g. P68871) • Group ID (e.g. PIRSF500045) | Protein families, subfamilies and superfamilies | ftp://ftp.pir.georgetown.edu/databases/pirsf/ |
Figure 1. Comparing protein groups using the matrix comparison tool. For an uncharacterized protein from M. jannaschii (RefSeq: NP_247002), ProGMap suggests the annotation ‘RNA polymerase subunit F’ on the basis of the manually curated PIRSF family (PIRSF005053). Three other groups that contain the protein (COG: COG1460; TRIBES: TR-009241; OrthoMCL-DB: OG2_105968) do not provide plausible functional annotations. However, they have more than one member in common and form either perfect (TR-009241 and OG2_105968) or nearly perfect (COG1460) subsets of the PIRSF family. The matrix comparison tool provides detailed information on set-theoretic relations, per-group coverage (CA and CB, bars in red and green) and the Jaccard index (J, bars in blue).
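The quantities reported by the matrix comparison tool can be illustrated with a minimal Python sketch. It is not ProGMap's implementation. The Jaccard index follows its standard definition, J = |A∩B| / |A∪B|. The coverage formulas CA = |A∩B| / |A| and CB = |A∩B| / |B| are an assumption, since the caption does not define them. The function name `compare_groups` and the member lists are hypothetical.

```python
def compare_groups(a, b):
    """Compare two protein groups given as collections of protein IDs.

    Returns the set-theoretic relation, the per-group coverage
    (assumed here to be the shared fraction of each group) and the
    Jaccard index.
    """
    a, b = set(a), set(b)
    shared = a & b
    union = a | b

    if a == b:
        relation = "identical"
    elif a < b:
        relation = "A is a subset of B"
    elif b < a:
        relation = "B is a subset of A"
    elif shared:
        relation = "overlapping"
    else:
        relation = "disjoint"

    return {
        "relation": relation,
        # Assumed definitions: fraction of each group found in the other.
        "coverage_a": len(shared) / len(a) if a else 0.0,
        "coverage_b": len(shared) / len(b) if b else 0.0,
        # Standard Jaccard index.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical member lists, not taken from any database.
small_group = {"p1", "p2"}
curated_family = {"p1", "p2", "p3", "p4"}
result = compare_groups(small_group, curated_family)
```

In this toy case the smaller group is a perfect subset of the larger one, so its coverage is 1.0 while the coverage of the larger group and the Jaccard index are both 0.5.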
Figure 2. Comparing protein groups using the network visualization tool. The network shows the relationships among five orthologous groups of mannose-binding lectins (KOG: KOG4297; OrthoMCL-DB: OG2_78664, OG2_81338; HomoloGene: 55449, 88328). Groups sharing at least one protein are connected by an edge. In this example, the HomoloGene database (yellow) divides the lectins precisely into the two orthologous groups described in the literature (16,17), whereas the other databases either combine them into one group (KOG, blue) or divide them differently (OrthoMCL, orange).
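The edge rule in this caption (two groups are linked when they share at least one protein) can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the function name `build_group_network`, the group labels and the member sets are all hypothetical, chosen only to mimic one group that merges everything and two databases that split the proteins differently.

```python
from itertools import combinations


def build_group_network(groups):
    """Build edges between protein groups that share at least one protein.

    groups: dict mapping a group label to a set of protein IDs.
    Returns a list of (label_a, label_b, shared_count) tuples.
    """
    edges = []
    for label_a, label_b in combinations(sorted(groups), 2):
        shared = groups[label_a] & groups[label_b]
        if shared:  # at least one protein in common
            edges.append((label_a, label_b, len(shared)))
    return edges


# Hypothetical groups and members, for illustration only.
groups = {
    "KOG_x": {"a", "b", "c", "d"},   # one group merging all proteins
    "HG_1": {"a", "b"},              # first of two split groups
    "HG_2": {"c", "d"},              # second of two split groups
    "OMCL_1": {"a"},                 # a different split
    "OMCL_2": {"b", "c", "d"},
}
edges = build_group_network(groups)
```

Groups from the same database that partition the proteins (for example `HG_1` and `HG_2`) never share a member, so no edge joins them. They are connected only indirectly through groups from other databases.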
Figure 3. Finding functional annotations with ProGMap. A hypothetical protein query submitted to the BLAST server shows significant similarity to an uncharacterized protein from M. jannaschii (RefSeq: NP_247002) (output not shown). By submitting this entry to ProGMap, all the synonymous protein identifiers, along with protein descriptions and links to protein groups, are retrieved from the underlying databases. Only one of the databases, PIRSF, assigns this protein to a curated family annotated as ‘RNA polymerase subunit F’. The annotation of the PIRSF group indicates manual curation, which is an argument for accepting this tentative function. Although the group comparison view (Figure 1) shows that the databases are highly consistent with respect to this group (the groups are in nearly perfect agreement in all databases), the functional annotations differ among the groups compared.