Damiano Piovesan, Pier Luigi Martelli, Piero Fariselli, Giuseppe Profiti, Andrea Zauli, Ivan Rossi, Rita Casadio.
Abstract
BACKGROUND: In the genomic era, a key issue is protein annotation, namely how to endow protein sequences, upon translation from the corresponding genes, with structural and functional features. This operation is routinely performed electronically by deriving and integrating information from previous knowledge. The reference database for protein sequences is UniProtKB, which is divided into two sections: UniProtKB/TrEMBL, which is automatically annotated and not reviewed, and UniProtKB/Swiss-Prot, which is manually annotated and reviewed. The annotation process is essentially based on sequence similarity search. The question therefore arises as to what extent annotation based on transfer by inheritance is valuable and, specifically, whether it is possible to statistically validate inherited features when little homology exists between the target sequence and its template(s).
Year: 2013 PMID: 23514411 PMCID: PMC3584929 DOI: 10.1186/1471-2105-14-S3-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Discriminating between validated and not validated BAR+ clusters. The number of clusters containing GO terms of the three main roots and Pfam terms is reported as a function of the Bonferroni-corrected P-value. The black vertical line sets the boundary between validated and not validated terms. It can be proven (data not shown) that a P-value ≤ 0.01 is sufficient to discriminate between the real and the random distribution of each type of GO and Pfam term (for mathematical details see [15]). Green: Pfam terms; blue: Molecular Function (MFO); red: Biological Process (BPO); pale blue: Cellular Component (CCO). For each curve, the number of validated clusters out of the total number of BAR+ clusters is: Pfam 197,826/455,309; MFO 84,506/321,748; BPO 75,147/265,164; CCO 31,042/145,677. The total number of clusters with at least one validated GO term is 100,791.
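The validation boundary in Figure 1 can be illustrated with a minimal sketch of a Bonferroni correction followed by the P-value ≤ 0.01 cut-off. The function names and the `n_tests` parameter are hypothetical, and how the number of tests is counted in BAR+ is an assumption not stated in this record (see [15] for the actual procedure).

```python
# Sketch only: a generic Bonferroni correction, not the BAR+ implementation.
VALIDATION_THRESHOLD = 0.01  # boundary quoted in the Figure 1 caption


def bonferroni_correct(p_value: float, n_tests: int) -> float:
    """Return the Bonferroni-corrected P-value, capped at 1.0.

    n_tests is the number of terms tested together (an assumption here).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return min(1.0, p_value * n_tests)


def is_validated(p_value: float, n_tests: int,
                 threshold: float = VALIDATION_THRESHOLD) -> bool:
    """A term counts as validated if its corrected P-value is <= threshold."""
    return bonferroni_correct(p_value, n_tests) <= threshold
```

With five terms tested together, for instance, a raw P-value of 0.001 stays below the boundary after correction, whereas a raw P-value of 0.004 does not.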
Annotating the CAFA set with BAR+
| 20,532 | 17,389 | 17,131 | 16,430 | 22,733 | 24,038 | 26,378 | 8,054 | ||
| [32,143]^ | 1,448 | ||||||||
| 9,660 | 8,915 | 8,202 | 4,723 | 9,843 | 10,772 | 11,088 | 5,924 | ||
| [12,295]^ | 224 | ||||||||
| 36 | 32 | 32 | 10 | 36 | 50 | 50 | 4 | ||
| [57]^ | 4 | ||||||||
| 30,228 | 26,336 | 25,365 | 21,163 | 32,612 | 34,860 | 13,982 | |||
| [44,495]^ | 2,047* | ||||||||
Cov: coverage, the ratio between the length of the intersection of the aligned regions on the two sequences and the overall length of the alignment (namely, the sum of the lengths of the two sequences minus the intersection length). For both Cov values, sequence identity (SI) is ≥ 40%. MFO: Molecular Function Ontology; BPO: Biological Process Ontology; CCO: Cellular Component Ontology. ALL-O: number of sequences with predicted MFO OR BPO OR CCO terms. Pfam: Pfam terms. ALL-O OR Pfam: the union of ALL-O and Pfam. °PDB: sequences that inherit a structural template from a cluster HMM within BAR+ [20]. ^CAFA/BAR+ set sequences from Eukaryotes, Prokaryotes, and Unknown organisms. *Sequences with a corresponding PDB structure.
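The coverage definition in the footnote above can be written as Cov = I / (L1 + L2 − I), where I is the intersection length and L1, L2 are the two sequence lengths. The following is a minimal sketch of that formula; the function and parameter names are hypothetical, and the Cov thresholds themselves are not given in this record, so none is assumed.

```python
# Sketch only: the Cov ratio as defined in the table footnote.
def alignment_coverage(len_a: int, len_b: int, intersection_len: int) -> float:
    """Coverage = intersection / (len_a + len_b - intersection)."""
    if len_a <= 0 or len_b <= 0:
        raise ValueError("sequence lengths must be positive")
    if not 0 <= intersection_len <= min(len_a, len_b):
        raise ValueError("intersection cannot exceed the shorter sequence")
    # Overall alignment length: sum of both lengths minus the shared part.
    overall_len = len_a + len_b - intersection_len
    return intersection_len / overall_len


if __name__ == "__main__":
    # Two sequences of 100 and 120 residues sharing a 90-residue aligned region.
    print(round(alignment_coverage(100, 120, 90), 3))
```

Under this definition, coverage reaches 1.0 only when both sequences have the same length and are aligned over their full extent.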
Figure 2. Statistically validated GO ontologies of the CAFA/BAR+ set. Histograms of the main statistically validated GO Molecular Function (MFO), Biological Process (BPO) and Cellular Component (CCO) terms are shown after annotation within validated BAR+ clusters. GO terms are grouped into main categories and listed separately for Eukaryotes and Prokaryotes.
Figure 3. Statistically validated Pfam terms of the CAFA/BAR+ set. Histograms of the most populated clans of Pfam terms are shown after annotation within validated BAR+ clusters. A clan is a collection of Pfam-A entries that are judged likely to be homologous [12]. Clans are shown separately for Prokaryotes (a) and Eukaryotes (b).
Comparing UniProtKB direct annotation with BAR+ annotation
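| | CAFA/UniProtKB*: sequences | CAFA/UniProtKB*: terms | BAR+ Validated°: sequences | BAR+ Validated°: terms | BAR+ Validated°: sequences with new validated terms |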
|---|---|---|---|---|---|
| Total° | 34,065 | 10,628 | 34,065 | 13,558 | 3,659§ |
| Pfam^ | 30,767 | 5,293 | 31,190 | 5,365 | 423§ |
| MFO^ | 20,790 | 2,048 | 21,758 | 2,698 | 968§ |
| BPO^ | 19,739 | 2,719 | 21,585 | 4,879 | 1,846§ |
| CCO^ | 16,503 | 568 | 17,589 | 616 | 1,086§ |
| - | - | - | 3,451# | 5,886# | 3,451# |
| PDB+ | 2,047+ | - | 13,084+ | - | 11,935+ |
*The CAFA/UniProtKB set (the CAFA sequences that have a UniProtKB file) comprises 41,003 sequences, 3,767 of which contain no GO or Pfam terms. °Here the CAFA/UniProtKB subset that can be validated in BAR+ is considered (BAR+ Validated). The number of sequences and the number of Pfam and GO terms are listed. Sequences that receive new validated terms are also listed according to Pfam, MFO, BPO and CCO. #Sequences of the CAFA set, out of a total of 7,295 that are not present in UniProtKB, that are annotated in BAR+. +Number of sequences that already have a PDB template and also receive one in BAR+.