Damiano Piovesan, Giuseppe Profiti, Pier Luigi Martelli, Piero Fariselli, Luca Fontanesi, Rita Casadio.
Abstract
Given the relevance of the pig proteome to many studies, including those of complex human diseases, a statistical validation of its annotation is required to better understand the role of specific genes and proteins in the complex networks underlying biological processes in the animal. At present, approximately 80% of the pig proteome is still poorly annotated, and the existence of protein sequences is routinely inferred automatically by sequence alignment to preexisting sequences. In this article, we introduce SUS-BAR, a database that derives information mainly from UniProt Knowledgebase and includes 26 206 pig protein sequences. In SUS-BAR, 16 675 of the pig protein sequences are endowed with statistically validated functional and structural annotation. Statistical validation relies on a cluster-centric annotation procedure that allows transfer of different types of annotation, including structure and function. Each sequence in the database can be associated with a set of statistically validated Gene Ontology (GO) terms from the three main sub-ontologies (Molecular Function, Biological Process and Cellular Component), with Pfam functional domains and, when possible, with a cluster Hidden Markov Model that allows modelling of the 3D structure of the protein. Statistics computed over the database demonstrate the enrichment in both GO and Pfam annotations of the pig proteins as compared with UniProt Knowledgebase annotation. Searching SUS-BAR retrieves the pig protein annotation for further analysis. The search can also be based on specific GO terms, which retrieves all the pig sequences participating in a given biological process after annotation with our system. Alternatively, the search can be based on structural information, which retrieves all the pig sequences with the same structural characteristics.
Year: 2013 PMID: 24065691 PMCID: PMC3781388 DOI: 10.1093/database/bat065
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Annotation of the PIG proteome in UniProtKB and Ensembl
| Dataset | MFO | BPO | CCO | All-GO | Pfam | Pfam and All-GO | PDB |
|---|---|---|---|---|---|---|---|
| SwissProt (1482) | |||||||
| Sequences | 1065 | 1111 | 1331 | 1395 | 1298 | 1402 | 104 |
| Terms | 811 | 2117 | 340 | 3268 | 986 | 4254 | – |
| TrEMBL (24 652) | |||||||
| Sequences | 12 451 | 11 537 | 12 349 | 16 357 | 16 559 | 19 284 | 1 |
| Terms | 1936 | 6443 | 913 | 9292 | 4138 | 13 430 | – |
| Ensembl (72) | |||||||
| Sequences | 47 | 22 | 25 | 49 | 45 | 52 | 2 |
| Terms | 58 | 107 | 35 | 200 | 43 | 243 | – |
| Total (26 206) | |||||||
| Sequences | 13 563 | 12 670 | 13 705 | 17 801 | 17 902 | 20 738 | 107 |
| Terms | 2190 | 6678 | 941 | 9809 | 4225 | 14 034 | – |
UniProtKB release: 2013_01; Ensembl release: Ensembl 70 genebuild based on Sus scrofa 10.2 pig genome assembly.
All-GO: number of sequences with MF OR BP OR CC.
a: Pfam domains. Union of All-GO and Pfam.
b: PDB: pig protein sequences with a corresponding PDB structure.
c: Number of pig protein sequences. Numbering considers only unique GO terms and Pfam domains.
MFO, molecular function ontology; BPO, biological process ontology; CCO, cellular component ontology.
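The All-GO and "Pfam and All-GO" columns count sequences by union, so a sequence annotated in several sub-ontologies is counted once. The following is a minimal sketch of that counting rule on toy data; the sequence names, the `annotations` layout and the helper functions are hypothetical and are not taken from SUS-BAR.

```python
# Toy data (hypothetical): sequence ID -> terms per sub-ontology and Pfam domains.
annotations = {
    "seqA": {"MFO": {"GO:0003677"}, "BPO": set(), "CCO": {"GO:0005634"}, "Pfam": {"PF00046"}},
    "seqB": {"MFO": set(), "BPO": {"GO:0007608"}, "CCO": set(), "Pfam": set()},
    "seqC": {"MFO": set(), "BPO": set(), "CCO": set(), "Pfam": {"PF00001"}},
    "seqD": {"MFO": set(), "BPO": set(), "CCO": set(), "Pfam": set()},
}

GO_KEYS = ("MFO", "BPO", "CCO")


def count_sequences(annotations, keys):
    """Count sequences carrying at least one term under ANY of the given keys."""
    return sum(1 for ann in annotations.values() if any(ann[k] for k in keys))


def count_unique_terms(annotations, keys):
    """Count distinct terms over all sequences for the given keys."""
    terms = set()
    for ann in annotations.values():
        for key in keys:
            terms |= ann[key]
    return len(terms)


all_go_sequences = count_sequences(annotations, GO_KEYS)               # MF OR BP OR CC
pfam_sequences = count_sequences(annotations, ("Pfam",))
pfam_or_go_sequences = count_sequences(annotations, GO_KEYS + ("Pfam",))
all_go_terms = count_unique_terms(annotations, GO_KEYS)

print(all_go_sequences, pfam_sequences, pfam_or_go_sequences, all_go_terms)
```

Because the three sub-ontologies do not share terms, the term counts add up across columns (for SwissProt, 811 + 2117 + 340 = 3268), whereas the sequence counts do not, since one sequence can carry terms from more than one sub-ontology.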
Statistically validated annotation of the pig proteome in SUS-BAR
| Dataset | MFO | MFOa | BPO | BPOa | CCO | CCOa | All-GO | All-GOa | Pfam | Pfam and All-GO | PDB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster (25 989) | |||||||||||
| Sequences | 12 755 | 9147 | 13 611 | 11 323 | 13 749 | 11 480 | 15 500 | 12 918 | 15 488 | 16 675 | 7284 |
| Clusters | 6491 | 3835 | 7064 | 5344 | 7155 | 5496 | 8597 | 6523 | 8598 | 9552 | 3421 |
| Terms | 3902 | 3215 | 12 520 | 12 020 | 1517 | 1370 | 17 939 | 16 605 | 3962 | 21 901 | – |
| Singleton (217) | |||||||||||
| Sequences | 121 | 8 | 131 | 9 | 167 | 9 | 179 | 9 | 133 | 181 | 0 |
| Terms | 132 | 18 | 222 | 19 | 79 | 9 | 433 | 46 | 118 | 551 | – |
| Total (26 206)d | |||||||||||
| Sequences | 12 876 | 9155 | 13 742 | 11 332 | 13 916 | 11 489 | 15 679 | 12 927 | 15 621 | 16 856 | 7284 |
| Terms | 3904 | 3218 | 12 521 | 12 020 | 1517 | 1370 | 17 942 | 16 608 | 3968 | 21 910 | – |
a: Terms that are statistically validated and have an experimental evidence code, with the corresponding number of sequences that inherit them in a given number of clusters.
b: Pig protein sequences in clusters that inherit a structure.
c: Numbering considers only unique GO terms and Pfam domains.
d: Clusters are generated as described in the SUS-BAR section. Singletons are pig sequences that do not belong to clusters and carry along only their original UniProtKB or Ensembl annotation.
SUS-BAR annotation of pig protein sequences not annotated with GO terms and/or Pfam domains in UniProtKB and Ensembl
| Dataset | MFO | MFa | BPO | BPa | CCO | CCa | All-GO | All-GOa | Pfam | Pfam and All-GO | PDB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| UniProtKB (5448) | |||||||||||
| Sequences | 795 | 539 | 908 | 716 | 952 | 732 | 1118 | 859 | 903 | 1209 | 558 |
| Clusters | 541 | 345 | 645 | 492 | 670 | 493 | 806 | 594 | 658 | 892 | 365 |
| Terms | 1306 | 1044 | 6684 | 6303 | 798 | 685 | 8788 | 8032 | 581 | 9369 | – |
| Ensembl (171) | |||||||||||
| Sequences | 10 | 6 | 21 | 10 | 29 | 18 | 34 | 20 | 62 | 74 | 9 |
| Clusters | 10 | 6 | 20 | 9 | 28 | 18 | 32 | 19 | 50 | 62 | 8 |
| Terms | 37 | 20 | 306 | 264 | 117 | 89 | 460 | 373 | 45 | 505 | – |
| Total (5619) | |||||||||||
| Sequences | 805 | 545 | 929 | 726 | 981 | 750 | 1152 | 879 | 965 | 1283 | 567 |
| Clusters | 547 | 349 | 660 | 500 | 692 | 507 | 832 | 609 | 706 | 948 | 372 |
| Terms | 1311 | 1046 | 6694 | 6313 | 811 | 697 | 8816 | 8056 | 621 | 9437 | – |
a: Terms that are statistically validated and have an experimental evidence code, with the corresponding number of sequences that inherit them in a given number of clusters.
b: Inherited with cluster HMMs.
c: Number of pig protein sequences in the two databases.
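To put the counts in proportion, the short sketch below converts a few of them into percentages. The numbers are copied from the abstract and the tables; the variable names and the `percent` helper are illustrative only.

```python
# Counts copied from the abstract and the tables.
TOTAL_PIG_SEQUENCES = 26_206       # all pig sequences in SUS-BAR
VALIDATED_IN_CLUSTERS = 16_675     # sequences in clusters with validated Pfam or GO annotation
PREVIOUSLY_UNANNOTATED = 5_619     # sequences without GO/Pfam annotation in UniProtKB and Ensembl
NEWLY_ANNOTATED = 1_283            # of those, sequences that gain Pfam or GO annotation
WITH_INHERITED_STRUCTURE = 567     # of those, sequences that inherit a PDB structure


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


validated_share = percent(VALIDATED_IN_CLUSTERS, TOTAL_PIG_SEQUENCES)
newly_annotated_share = percent(NEWLY_ANNOTATED, PREVIOUSLY_UNANNOTATED)
structure_share = percent(WITH_INHERITED_STRUCTURE, PREVIOUSLY_UNANNOTATED)

print(f"validated: {validated_share}% of all pig sequences")
print(f"newly annotated: {newly_annotated_share}% of previously unannotated sequences")
print(f"inheriting a structure: {structure_share}% of previously unannotated sequences")
```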
Figure 1. The SUS-BAR interface. A query requires both a search term and the selection of the corresponding search key (http://bar.biocomp.unibo.it/pig).
Pig sequences in clusters with other organisms
| Organism | Number of clusters | Number of Pig sequences | Number of Pig sequences (SI < 30%) | Number of clusters with PDB | Number of Pig sequences inheriting PDBs |
|---|---|---|---|---|---|
| | 9409 | 16 480 | 1288 | 2732 | 6013 |
| | 8804 | 15 771 | 579 | 823 | 2461 |
| | 8413 | 15 340 | 148 | 213 | 758 |
SI, sequence identity.
Figure 2. SUS-BAR sequences endowed with statistically validated GO terms corresponding to the sensory perception of smell. The left panel lists all the pig sequences retrieved when the database is searched with GO:0007608. In all, 1163 pig sequences in SUS-BAR inherit this GO term by entering clusters where the term is statistically validated by computing its Bonferroni-corrected P-value (19). The right panel shows the characteristics of the cluster (#3909) that contains one of these 1163 pig protein sequences (F1S737, red box in the left panel). The structural alignment of the four PDB structures contained in the cluster is computed with Mustang (25) and visualized with PyMOL (http://www.pymol.org). The inset was prepared manually for the figure and is not present in the corresponding page of the Web site. Each pig protein sequence in the cluster can, however, be modelled in house after downloading the corresponding alignment with the templates in the cluster, which is provided with the cluster-specific HMM. See text for further details. Yellow stars indicate that the GO term is statistically validated and endowed with an experimental evidence code. The red dot indicates that the cluster contains SwissProt-annotated protein(s).
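The caption refers to Bonferroni-corrected P-values without describing the underlying test, which is detailed in reference (19). The sketch below illustrates the general idea only. It assumes a one-sided hypergeometric enrichment test and a significance threshold of 0.01; both are assumptions made for illustration, not the documented SUS-BAR procedure, and all counts and term names are toy values.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n sequences are drawn from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def bonferroni(p_values, alpha=0.01):
    """Multiply each P-value by the number of tests (capped at 1) and flag significance."""
    m = len(p_values)
    return {
        term: (min(1.0, p * m), p * m < alpha)
        for term, p in p_values.items()
    }


# Toy setting (hypothetical): background of 1000 sequences, cluster of 10 sequences.
N, n = 1000, 10
raw = {
    # term -> P-value for seeing k annotated cluster members given K in the background
    "rare_term": hypergeom_tail(k=5, n=n, K=20, N=N),     # rare overall, frequent in cluster
    "common_term": hypergeom_tail(k=5, n=n, K=500, N=N),  # half of the background has it
}

validated = bonferroni(raw)
for term, (adjusted_p, significant) in validated.items():
    print(term, f"{adjusted_p:.3g}", "validated" if significant else "not validated")
```

In this toy case the term that is rare in the background but frequent in the cluster passes the corrected threshold, while the term carried by half of the background does not.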