Seyed Ziaeddin Alborzi, David W Ritchie, Marie-Dominique Devignes.
Abstract
BACKGROUND: Families of related proteins and their different functions may be described systematically using common classifications and ontologies such as Pfam and GO (Gene Ontology). However, many proteins consist of multiple domains, and each domain, or some combination of domains, can be responsible for a particular molecular function. Therefore, identifying which domains should be associated with a specific function is a non-trivial task.
Keywords: Gene ontology; Protein domain; Protein function; Protein structure; Vector similarity
Year: 2018 PMID: 30453875 PMCID: PMC6245584 DOI: 10.1186/s12859-018-2380-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Graphical representation of the different kinds of relationships that may exist between GO terms and protein domains. S1: A protein with one domain providing one function; S2: Two domains of the same protein provide different functions; S3: A protein with two domains, where one domain provides two different functions, and the second domain has no known function; S4: A protein having one domain that provides one function, and a second domain which acts as a co-factor with the first domain to provide an additional function
Fig. 2 Schematic illustration of edge discovery. In a typical instantiation, X is a set of MF GO terms, Y a set of Pfam domains, and Z a set of UniProtKB/SwissProt sequences. E1 are edges derived from the MF GO annotation of UniProtKB/SwissProt sequences, and E2 are edges derived from the domain contents of UniProtKB/SwissProt sequences. The enriched set of X-Y edges is derived from an initial set that included a limited number of edges (represented here by (x1,y1)), taken from the InterPro manually curated MF GO annotations of Pfam domains. The enriched set contains all newly discovered MF GO-Pfam associations, represented here by (x2,y2)
Fig. 3 Edge enrichment using an ontology. Here, edge (x2,z3) is added (right, dashed link) because z3 has an existing association with x3, and x2 is a parent term of x3 in the ontology (left)
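The enrichment step described in this caption can be sketched as a propagation of each existing association to the ancestor terms of its GO term. The snippet below is only an illustration under stated assumptions: the function names `ancestors` and `enrich_edges`, the child-to-parents dictionary, and the toy identifiers are hypothetical and are not taken from the GODomainMiner implementation.

```python
def ancestors(term, parents):
    """Return every ancestor of `term`, given a child -> set-of-parents map."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def enrich_edges(edges, parents):
    """Add an edge (ancestor, z) for every existing edge (x, z)."""
    enriched = set(edges)
    for x, z in edges:
        for ancestor in ancestors(x, parents):
            enriched.add((ancestor, z))
    return enriched


# Toy data mirroring the caption: x2 is a parent term of x3 (hypothetical values).
parents = {"x3": {"x2"}}
edges = {("x3", "z3")}
enriched = enrich_edges(edges, parents)
print(sorted(enriched))
```

With this toy input the enriched set holds the original edge (x3,z3) plus the inferred edge (x2,z3), which corresponds to the dashed link in the figure.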
Fig. 4 Clustering identical or highly similar items in Z. (a) Clustering of items z1 and z2 of initial degree 1 induces a new association between x and y. (b) Clustering reduces the complexity of initial multiple associations. In both cases, clustering will increase the cosine similarity scores of the associated items x and y
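A small numerical illustration of case (a) may help. The sketch assumes that x and y are represented as binary vectors over the items of Z (stored here as sets), which is a simplification; the vectors used in the actual method may be weighted differently. The names `cosine`, `apply_clustering`, and `cluster_of` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two binary vectors stored as sets of item ids."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def apply_clustering(items, cluster_of):
    """Replace each item by its cluster representative (unclustered items are kept)."""
    return {cluster_of.get(z, z) for z in items}


# Case (a): x is linked only to z1, y only to z2, and z1 and z2 are near-identical.
x_items = {"z1"}
y_items = {"z2"}
cluster_of = {"z1": "c1", "z2": "c1"}  # hypothetical cluster assignment

before = cosine(x_items, y_items)
after = cosine(apply_clustering(x_items, cluster_of),
               apply_clustering(y_items, cluster_of))
print(before, after)
```

Before clustering the two vectors share no item, so their similarity is zero; once z1 and z2 collapse into one cluster, x and y share that cluster and the similarity becomes positive.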
Calculated AUCs, dataset weights, F-measures, and score thresholds for GO-domain associations for the 3 GO ontologies and 3 domain classifications studied here
| Ontology | Dataset | AUC | Weight: SIFTS | Weight: SP | Weight: TR | Weight: SIFTS-IEA | Weight: SP-IEA | Weight: TR-IEA | F-measure (training) | F-measure (test) | Threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MF | GO-Pfam | 0.9605 | 1 | 1 | 6 | 10 | 10 | 10 | 0.926 | 0.924 | 0.005 |
| MF | GO-CATH | 0.9710 | 1 | 1 | 10 | 10 | 1 | 9 | 0.935 | 0.943 | 0.004 |
| MF | GO-SCOP | 0.9693 | 1 | 1 | 10 | 10 | 1 | 2 | 0.954 | 0.931 | 0.004 |
| BP | GO-Pfam | 0.9546 | 1 | 1 | 1 | 10 | 1 | 8 | 0.898 | 0.903 | 0.008 |
| BP | GO-CATH | 0.9726 | 1 | 1 | 1 | 10 | 1 | 5 | 0.922 | 0.938 | 0.007 |
| BP | GO-SCOP | 0.9756 | 1 | 1 | 1 | 10 | 1 | 3 | 0.943 | 0.939 | 0.007 |
| CC | GO-Pfam | 0.9228 | 1 | 1 | 6 | 10 | 1 | 10 | 0.871 | 0.866 | 0.003 |
| CC | GO-CATH | 0.9741 | 1 | 1 | 1 | 10 | 1 | 9 | 0.955 | 0.932 | 0.003 |
| CC | GO-SCOP | 0.9684 | 1 | 1 | 1 | 10 | 1 | 6 | 0.927 | 0.906 | 0.005 |
Data source abbreviations are: SP for UniProtKB/SwissProt and TR for UniProtKB/TrEMBL
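To make the role of the dataset weights and score thresholds concrete, the sketch below combines per-dataset similarity scores into one consensus score and compares it with a threshold. It assumes the consensus score is a weighted average in which a dataset that does not report the association contributes a score of zero; that form is an assumption, not something stated in this record. The weights and threshold are the MF GO-Pfam values from the table, while the per-dataset scores and the name `consensus_score` are hypothetical.

```python
# MF GO-Pfam weights and threshold, copied from the table above.
WEIGHTS = {"SIFTS": 1, "SP": 1, "TR": 6, "SIFTS-IEA": 10, "SP-IEA": 10, "TR-IEA": 10}
THRESHOLD = 0.005


def consensus_score(scores, weights):
    """Weighted average of per-dataset scores (assumed form of the consensus).

    Datasets missing from `scores` are treated as contributing 0.
    """
    weighted_sum = sum(weights[name] * scores.get(name, 0.0) for name in weights)
    return weighted_sum / sum(weights.values())


# Hypothetical similarity scores for one GO-Pfam pair, seen in two datasets only.
example_scores = {"SIFTS-IEA": 0.05, "TR-IEA": 0.03}
score = consensus_score(example_scores, WEIGHTS)
accepted = score >= THRESHOLD
print(round(score, 4), accepted)
```

Under this assumed form, a large weight on a dataset lets an association supported mainly by that dataset clear the threshold even when the other datasets are silent.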
The numbers of given and predicted MF GO-domain associations in thousands (× 10^3)
| Dataset | GO-domain associations (Pfam) | GO-domain associations (CATH) | GO-domain associations (SCOP) | MF GO terms (Pfam) | MF GO terms (CATH) | MF GO terms (SCOP) | Domain entries (Pfam) | Domain entries (CATH) | Domain entries (SCOP) |
|---|---|---|---|---|---|---|---|---|---|
| SIFTS | 31 | 16 | 9.9 | 44 | 22 | 17 | 2.8 | 1.1 | 0.8 |
| SIFTS-IEA | 69 | 36 | 23 | 26 | 29 | 23 | 4.8 | 2.0 | 1.5 |
| SwissProt | 194 | 72 | 73 | 6.3 | 5.4 | 5.6 | 7.4 | 1.2 | 1.1 |
| SwissProt-IEA | 225 | 79 | 79 | 4.8 | 4.2 | 4.3 | 8.1 | 1.4 | 1.2 |
| TrEMBL | 215 | 104 | 96 | 4.0 | 3.4 | 3.5 | 7.4 | 1.2 | 1.0 |
| TrEMBL-IEA | 756 | 240 | 208 | 6.4 | 5.7 | 5.8 | 13 | 1.6 | 1.4 |
| Merged | 917 | 306 | 266 | 7.9 | 7.2 | 7.3 | 14 | 2.5 | 1.8 |
| InterPro | 4.226 | 0.607 | 0.743 | 1.076 | 0.273 | 0.301 | 3.300 | 0.466 | 0.584 |
| Overlap | 3.968 | 0.594 | 0.713 | 1.059 | 0.273 | 0.300 | 3.101 | 0.457 | 0.560 |
The numbers of given and predicted BP GO-domain associations in thousands (× 10^3)
| Dataset | GO-domain associations (Pfam) | GO-domain associations (CATH) | GO-domain associations (SCOP) | BP GO terms (Pfam) | BP GO terms (CATH) | BP GO terms (SCOP) | Domain entries (Pfam) | Domain entries (CATH) | Domain entries (SCOP) |
|---|---|---|---|---|---|---|---|---|---|
| SIFTS | 182 | 90 | 53 | 9.8 | 8.5 | 6.8 | 2.7 | 1.1 | 0.7 |
| SIFTS-IEA | 197 | 109 | 70 | 7.6 | 6.8 | 5.7 | 4.9 | 2.1 | 1.5 |
| SwissProt | 1336 | 461 | 465 | 20 | 18 | 19 | 8.6 | 1.2 | 1.2 |
| SwissProt-IEA | 844 | 267 | 302 | 14 | 12.5 | 13 | 9.4 | 1.4 | 1.3 |
| TrEMBL | 837 | 360 | 337 | 13 | 12 | 12 | 8.3 | 1.2 | 1.1 |
| TrEMBL-IEA | 1756 | 623 | 548 | 18 | 17 | 17 | 12 | 1.6 | 1.3 |
| Merged | 2436 | 872 | 764 | 21 | 20 | 20 | 13 | 2.4 | 1.8 |
| InterPro | 3.829 | 0.461 | 0.586 | 1.094 | 0.206 | 0.244 | 3.265 | 0.388 | 0.491 |
| Overlap | 3.518 | 0.448 | 0.572 | 1.077 | 0.205 | 0.244 | 3.028 | 0.376 | 0.480 |
The numbers of given and predicted CC GO-domain associations in thousands (× 10^3)
| Dataset | GO-domain associations (Pfam) | GO-domain associations (CATH) | GO-domain associations (SCOP) | CC GO terms (Pfam) | CC GO terms (CATH) | CC GO terms (SCOP) | Domain entries (Pfam) | Domain entries (CATH) | Domain entries (SCOP) |
|---|---|---|---|---|---|---|---|---|---|
| SIFTS | 37 | 17 | 10 | 1.4 | 1.1 | 0.9 | 2.6 | 1.0 | 0.7 |
| SIFTS-IEA | 38 | 19 | 13 | 1.0 | 0.8 | 0.7 | 3.9 | 1.6 | 1.2 |
| SwissProt | 251 | 74 | 74 | 2.5 | 2.3 | 2.4 | 8.4 | 1.2 | 1.2 |
| SwissProt-IEA | 185 | 55 | 54 | 1.8 | 1.6 | 1.7 | 10 | 1.4 | 1.3 |
| TrEMBL | 179 | 67 | 61 | 1.7 | 1.6 | 1.6 | 7.9 | 1.2 | 1.1 |
| TrEMBL-IEA | 360 | 111 | 94 | 2.3 | 2.1 | 2.1 | 14 | 1.6 | 1.4 |
| Merged | 479 | 151 | 129 | 2.7 | 2.5 | 2.6 | 15 | 2.3 | 1.8 |
| InterPro | 2.289 | 0.192 | 0.237 | 0.336 | 0.058 | 0.064 | 2.042 | 0.163 | 0.208 |
| Overlap | 2.085 | 0.191 | 0.230 | 0.335 | 0.058 | 0.064 | 1.878 | 0.163 | 0.202 |
Fig. 5 Distribution of GO-Pfam associations for the 3 GO ontologies (MF: top; BP: middle; CC: bottom). (a) Average number of GO-Pfam associations per GO term and per Pfam entry for InterPro (green) and GODomainMiner (purple). (b) Numbers of GO terms (orange) according to their numbers of associations with Pfam entries, and numbers of Pfam entries (blue) according to their numbers of associations with GO terms
The distribution of all most-specific associations from GODomainMiner and their overlap with InterPro, in the Gold, Silver, and Bronze categories
| Class | GODomainMiner (MF) | GODomainMiner (BP) | GODomainMiner (CC) | Overlap with InterPro (MF) | Overlap with InterPro (BP) | Overlap with InterPro (CC) |
|---|---|---|---|---|---|---|
| Gold | 15,605 | 24,782 | 12,967 | 1,815 | 1,378 | 887 |
| Silver | 11,098 | 31,920 | 17,062 | 778 | 865 | 628 |
| Bronze | 6,178 | 18,060 | 8,939 | 64 | 116 | 124 |
| Total | 32,881 | 74,762 | 38,968 | 2,657 | 2,239 | 1,679 |
Fig. 6 Venn diagram showing the intersections between leaf GO-Pfam associations from Pfam2GO (62,779 associations), GODomainMiner (79,589), and manually curated associations from InterPro (2,799). Region A (2,744 associations) is the overlap between GODomainMiner and InterPro. Region B (11,138 associations) is the overlap between GODomainMiner and Pfam2GO. Region C (724 associations) is the overlap between Pfam2GO and InterPro
Selected examples of new one-to-one MF GO-Pfam associations
| MF GO ID | MF GO term | Pfam ID | Pfam description | Consensus Score | Class |
|---|---|---|---|---|---|
| GO:0008437 | thyrotropin-releasing hormone activity | PF05438 | Thyrotropin-releasing hormone (TRH) | 0.0638 | gold |
| GO:0098640 | integrin binding involved in cell-matrix adhesion | PF09085 | Adhesion molecule, immunoglobulin-like | 0.0752 | gold |
| GO:1990919 | nuclear membrane proteasome anchor | PF08559 | Cut8, nuclear proteasome tether protein | 0.0309 | gold |
| GO:0047991 | hydroxylamine oxidase activity | PF13447 | Seven times multi-haem cytochrome CxxCH | 0.2654 | gold |
| GO:1990838 | poly(U)-specific exoribonuclease, activity producing 3’ uridine cyclic phosphate ends | PF09749 | Uncharacterised conserved protein | 0.0235 | gold |
| GO:0030144 | alpha-1,6-mannosylglycoprotein 6-beta-N-acetylglucosaminyl transferase activity | PF15027 | Domain of unknown function (DUF4525) | 0.5273 | silver |
| GO:0030735 | carnosine N-methyltransferase activity | PF07942 | N2227-like protein | 0.2705 | silver |
| GO:0010340 | carboxyl-O-methyltransferase activity | PF04301 | Protein of unknown function (DUF452) | 0.0201 | silver |
| GO:0016772 | transferase activity, transferring phosphorus-containing groups | PF01989 | Protein of unknown function DUF126 | 0.0137 | silver |
| GO:0071617 | lysophospholipid acyltransferase activity | PF10998 | Protein of unknown function (DUF2838) | 0.0072 | silver |
| GO:0015666 | restriction endodeoxyribonuclease activity | PF12102 | Domain of unknown function (DUF3578) | 0.0111 | bronze |
| GO:0016841 | ammonia-lyase activity | PF11807 | Domain of unknown function (DUF3328) | 0.0066 | bronze |
All of these examples are absent from InterPro; additional examples are available from the GODomainMiner website for cases (i) to (iv)