Hugo P Bastos, Lisete Sousa, Luka A Clarke, Francisco M Couto.
Abstract
BACKGROUND: Biological sequences, such as proteins, have been provided with annotations that assign functional information. These functional annotations are associations of proteins (or other biological sequences) with descriptors characterizing their biological roles. However, not all proteins are fully (or even at all) annotated. This annotation incompleteness limits our ability to make sound assertions about the functional coherence within sets of proteins. Annotation incompleteness is a problematic issue when measuring semantic functional similarity of biological sequences, because the available annotations capture only a limited part of the semantic aspects the sequences may encompass.
Year: 2016 PMID: 27338101 PMCID: PMC4917928 DOI: 10.1186/s13326-016-0076-y
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 Hypothetical GO graph. Terms are represented by nodes, where the number within each node is the number of proteins (of a given set of 100) annotated to that term. Three situations are represented: (a) annotation incompleteness, (b) annotation agreement and (c) annotation coherence
2×2 contingency table for Fisher's exact test
| | Set | Background |
|---|---|---|
| Annotated | n_t | m_t - n_t |
| Not annotated | N - n_t | (M - N) - (m_t - n_t) |
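
The contingency table above can be turned into a p-value with the hypergeometric distribution. The following is a minimal sketch, not the authors' implementation. It assumes a one-tailed (over-representation) test, and it reads the four cells as follows: the top-left cell is the number of set proteins annotated to a term, `N` is the set size, `M` is the size of set plus background, and the top row sums to the total number of proteins annotated to the term. The function name and parameter names are hypothetical.

```python
from math import comb


def fisher_enrichment_pvalue(n_t: int, N: int, m_t: int, M: int) -> float:
    """One-tailed Fisher exact test for over-representation of a term.

    n_t: proteins in the set annotated to the term (top-left cell)
    N:   size of the set
    m_t: proteins in set + background annotated to the term (top-row sum)
    M:   size of set + background
    """
    if not (0 <= n_t <= N <= M and n_t <= m_t <= M):
        raise ValueError("counts are inconsistent with a 2x2 contingency table")
    if N - n_t > M - m_t:
        raise ValueError("more unannotated set proteins than exist overall")

    # P(X >= n_t) for X ~ Hypergeometric(M, m_t, N): sum the probability of
    # every table at least as extreme as the observed one, margins fixed.
    upper = min(N, m_t)
    tail = sum(comb(m_t, k) * comb(M - m_t, N - k) for k in range(n_t, upper + 1))
    return tail / comb(M, N)


if __name__ == "__main__":
    # Illustrative numbers only: 3 of 3 set proteins carry the term, and
    # 3 of the 6 proteins overall carry it.
    print(fisher_enrichment_pvalue(n_t=3, N=3, m_t=3, M=6))
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for large backgrounds. A two-sided variant would also need to sum the opposite tail.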
Number of protein UniProt identifiers (size) in each of the classes in the CAZy database (ver. c7-2011)
| Class | Size |
|---|---|
| GH | 70227 |
| GT | 55461 |
| CBM | 10907 |
| CE | 8110 |
| PL | 1766 |
List of the protein families belonging to the PL class in the CAZy database and their respective size (in number of UniProt identifiers)
| Family | Size | Family | Size |
|---|---|---|---|
| PL1 | 491 | PL12 | 80 |
| PL2 | 34 | PL13 | 7 |
| PL3 | 229 | PL14 | 38 |
| PL4 | 45 | PL15 | 10 |
| PL5 | 37 | PL16 | 22 |
| PL6 | 24 | PL17 | 33 |
| PL7 | 82 | PL18 | 5 |
| PL8 | 184 | PL20 | 6 |
| PL9 | 148 | PL21 | 9 |
| PL10 | 84 | PL22 | 42 |
| PL11 | 84 | unassigned | 80 |
Fig. 2 Protein set degeneration procedure. For each set (family), a chosen percentage of the set's original proteins is replaced with proteins drawn randomly from outside the set
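
The degeneration procedure described in this caption can be sketched as below. This is an illustrative reading under stated assumptions, not the authors' code: the function name `degenerate_set`, the sampling without replacement, and the rounding of the replacement count are all hypothetical choices.

```python
import random


def degenerate_set(family, outside_pool, fraction, rng=None):
    """Replace `fraction` of `family` with proteins drawn from outside it.

    family:       identifiers of the original set (e.g. one PL family)
    outside_pool: candidate identifiers; members of `family` are filtered out
    fraction:     share of the set to replace, between 0.0 and 1.0
    rng:          optional random.Random instance for reproducibility
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = rng or random.Random()

    family = list(family)
    members = set(family)
    # Assumption: round to the nearest whole protein (Python's round()).
    n_replace = round(len(family) * fraction)

    candidates = [p for p in outside_pool if p not in members]
    if n_replace > len(candidates):
        raise ValueError("not enough outside proteins to draw from")

    # Keep a random subset of the originals and fill the remaining
    # slots with random outsiders, so the set size stays constant.
    kept = rng.sample(family, len(family) - n_replace)
    drawn = rng.sample(candidates, n_replace)
    return kept + drawn


if __name__ == "__main__":
    fam = [f"P{i}" for i in range(10)]       # toy family
    pool = [f"Q{i}" for i in range(20)]      # toy outside proteins
    for pct in range(0, 101, 10):            # 0 %, 10 %, ..., 100 %
        derived = degenerate_set(fam, pool, pct / 100, rng=random.Random(pct))
        print(pct, sum(p in fam for p in derived))
```

Repeating the call many times per replacement level and averaging a similarity metric over the derived sets would give one point per level on a degeneration curve.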
Fig. 3 Plots of the average similarity, as measured by six different metrics, for the first eight PL protein families (from the CAZy database) and their derived sets. The derived sets were made by replacing the original proteins with increasing amounts of random proteins taken from the CAZy database (10 % increments; 100 iterations)
Difference between the maximum and minimum values reported by each tested metric (Agreement, simUI, simGIC, mUI, mGIC, GS2) for each PL family across its derived sets, which were created by inserting increasing amounts of random proteins (from CAZy) into the original family
| Metrics | PL1 | PL2 | PL3 | PL4 | PL5 |
|---|---|---|---|---|---|
| Agreement | 0.122 | 0.391 | 0.260 | 0.368 | 0.874 |
| simUI | 0.298 | 0.497 | 0.620 | 0.376 | 0.650 |
| simGIC | 0.356 | 0.539 | 0.825 | 0.458 | 0.853 |
| mUI | 0.139 | 0.214 | 0.671 | 0.353 | 0.801 |
| mGIC | 0.147 | 0.216 | 0.672 | 0.343 | 0.802 |
| GS2 | 0.137 | 0.224 | 0.238 | 0.177 | 0.246 |

| Metrics | PL6 | PL7 | PL8 | PL9 | PL10 |
|---|---|---|---|---|---|
| Agreement | 0.405 | 0.201 | 0.180 | 0.058 | 0.178 |
| simUI | 0.386 | 0.432 | 0.548 | 0.207 | 0.368 |
| simGIC | 0.429 | 0.542 | 0.660 | 0.270 | 0.484 |
| mUI | 0.469 | 0.501 | 0.329 | 0.368 | 0.564 |
| mGIC | 0.474 | 0.505 | 0.285 | 0.372 | 0.559 |
| GS2 | 0.175 | 0.129 | 0.224 | 0.080 | 0.146 |

| Metrics | PL11 | PL12 | PL16 | PL17 | PL22 |
|---|---|---|---|---|---|
| Agreement | 0.229 | 0.771 | 0.831 | 0.869 | 0.400 |
| simUI | 0.108 | 0.644 | 0.613 | 0.649 | 0.443 |
| simGIC | 0.122 | 0.838 | 0.829 | 0.853 | 0.521 |
| mUI | 0.378 | 0.744 | 0.903 | 0.831 | 0.494 |
| mGIC | 0.373 | 0.741 | 0.905 | 0.831 | 0.501 |
| GS2 | 0.054 | 0.247 | 0.211 | 0.248 | 0.191 |
Fig. 4 GRYFUN annotation graph. Annotation graph of the GO molecular function ontology generated by the GRYFUN web application for a set of proteins from the PL10 family