Seyed Ziaeddin Alborzi, Marie-Dominique Devignes, David W Ritchie.
Abstract
BACKGROUND: Many entries in the Protein Data Bank (PDB) are annotated to show their component protein domains according to the Pfam classification, as well as their biological function through the Enzyme Commission (EC) numbering scheme. However, although the biological activity of many proteins arises from specific domain-domain and domain-ligand interactions, current online resources rarely provide a direct mapping from structure to function at the domain level. Since the PDB now contains many tens of thousands of protein chains, and since protein sequence databases can dwarf such numbers by orders of magnitude, there is a pressing need to develop automatic structure-function annotation tools that can operate at the domain level.
Keywords: Content-based filtering; Enzyme commission number; Pfam domain; Protein domain; Protein function
Year: 2017 PMID: 28193156 PMCID: PMC5307852 DOI: 10.1186/s12859-017-1519-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 (a) One domain provides one enzyme function; (b) two domains on the same chain each provide a different enzyme function; (c) one domain provides two different enzyme functions; (d) one domain provides one enzyme function, while a second domain acts as a co-factor with the first domain to provide an additional enzyme function
Fig. 2 A graphical illustration of calculating raw EC-Pfam association scores from existing SIFTS EC-CID and Pfam-CID associations
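As a rough illustration of the idea behind Fig. 2, the sketch below derives a raw EC-Pfam score from two binary association tables that share chain identifiers (CIDs). It assumes a cosine-similarity formulation, in the spirit of the content-based filtering named in the keywords; the exact normalization used by ECDomainMiner is not stated in this section, and the function name `raw_scores` and the toy identifiers are hypothetical.

```python
from math import sqrt

def raw_scores(ec_to_cids, pfam_to_cids):
    """Cosine similarity between each EC number and each Pfam domain,
    treating both as binary vectors over chain identifiers (CIDs).
    Assumption: cosine normalization; the published formula may differ."""
    scores = {}
    for ec, ec_cids in ec_to_cids.items():
        for pfam, pfam_cids in pfam_to_cids.items():
            shared = len(ec_cids & pfam_cids)
            if shared == 0:
                continue  # no chain carries both annotations
            scores[(ec, pfam)] = shared / sqrt(len(ec_cids) * len(pfam_cids))
    return scores

# Toy data with hypothetical identifiers
ec_to_cids = {"EC_A": {"c1", "c2"}, "EC_B": {"c3"}}
pfam_to_cids = {"PF_X": {"c1", "c2"}, "PF_Y": {"c2", "c3"}}
scores = raw_scores(ec_to_cids, pfam_to_cids)
```

In this toy input, `EC_A` and `PF_X` annotate exactly the same chains and therefore receive the highest score, while pairs that never co-occur on a chain receive no score at all.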
Statistics on the source datasets and calculated EC-Pfam associations
| Group | Dataset | EC-Pfam associations | Distinct 4-digit EC numbers | Distinct Pfam entries |
|---|---|---|---|---|
| Source datasets | SIFTS | 6306 | 2648 | 2611 |
| | SwissProt | 18,917 | 4013 | 3101 |
| | TrEMBL | 124,699 | 3751 | 5703 |
| | UniRule | 141,990 | 1020 | 2907 |
| | Merged | 262,571 | 4648 | 6639 |
| Reference | InterPro | 1515 | 688 | 1284 |
| ECDomainMiner results | With CS above threshold | | | |
| | (Overlap with InterPro) | | | |
| | Including low CS | | | |
| | (Overlap with InterPro) | | | |
CS is the Confidence Score
All italicized entries are calculated by ECDomainMiner
Fig. 3 Scale-up factors for ECDomainMiner compared with InterPro. Ratios between the numbers in ECDomainMiner and in InterPro have been calculated for associations (red), EC numbers (yellow), and Pfam domains (green) after dividing the dataset according to each EC branch represented in the associations (1 to 6) and for the whole dataset (All). 1: oxidoreductases; 2: transferases; 3: hydrolases; 4: lyases; 5: isomerases; 6: ligases
Fig. 4 Venn diagram showing the intersection between (a) Pfam2EC (2500 associations) from dcGO, (b) All-Merged (262,571 associations), and (c) ECDomainMiner (20,728 associations). Region I (480 associations) is the portion of (a) for which there is no data in any of our four source datasets. Region II (128 associations) is the portion of (a) that exists in (b) but is not retained in ECDomainMiner (c). Region III (1892 associations) is the overlap between (a) and (c). Region IV (18,836 associations) is the portion of ECDomainMiner associations that are not available from Pfam2EC. Region V (241,363 associations) is the rest of the merged set of EC-Pfam source associations that are absent from (a) and not retained as Gold, Silver, or Bronze associations by ECDomainMiner
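The regions of Fig. 4 can be read as ordinary set operations on three collections of EC-Pfam pairs. The sketch below is one plausible reading of the caption, not the authors' code: the function name `venn_regions` and the toy pairs are hypothetical, and regions I and V are assumed to be plain set differences. The two assertions at the end only check that the counts quoted in the caption agree with this reading.

```python
def venn_regions(pfam2ec, merged, ecdm):
    """Split three sets of EC-Pfam pairs into the five regions of Fig. 4.
    Assumption: each region is a plain set difference or intersection."""
    return {
        "I": pfam2ec - merged - ecdm,     # in (a) only
        "II": (pfam2ec & merged) - ecdm,  # in (a) and (b), not retained in (c)
        "III": pfam2ec & ecdm,            # overlap of (a) and (c)
        "IV": ecdm - pfam2ec,             # in (c), absent from (a)
        "V": merged - pfam2ec - ecdm,     # in (b) only
    }

# Toy pairs (hypothetical labels standing in for EC-Pfam associations)
pfam2ec = {"p1", "p2", "p3"}
merged = {"p2", "p3", "p4", "p5"}
ecdm = {"p3", "p4"}
regions = venn_regions(pfam2ec, merged, ecdm)

# Counts quoted in the caption are consistent with these definitions
assert 480 + 128 + 1892 == 2500   # regions I + II + III make up (a)
assert 20_728 - 1892 == 18_836    # region IV = (c) minus region III
```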
Fig. 5 Distribution of EC numbers (a) and Pfam domains (b) in multiple associations. Numbers (1 to 10 and >10) represent the arity of the association, that is, the number of associations in which a given EC number or Pfam domain is involved. In addition, for each arity, the normalized number of Gold, Silver, and Bronze associations is plotted. For arities equal to or greater than 4, the proportion of Silver associations is always the highest, but significant numbers of Gold associations remain present even at high arities
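A distribution like the one in Fig. 5 can be tabulated from a plain list of EC-Pfam pairs. The sketch below assumes that "arity" means the number of associations in which a given EC number (or Pfam domain) takes part, and groups everything above 10 into a single `>10` bin as in the figure; the function name `arity_histogram` and the toy pairs are hypothetical.

```python
from collections import Counter

def arity_histogram(associations, index, cap=10):
    """Count how many items (EC numbers if index=0, Pfam domains if index=1)
    take part in 1, 2, ..., cap, or more than cap associations."""
    per_item = Counter(pair[index] for pair in associations)
    hist = Counter()
    for n in per_item.values():
        hist[n if n <= cap else f">{cap}"] += 1
    return dict(hist)

# Toy associations with hypothetical identifiers
pairs = [("EC_A", "PF_X"), ("EC_A", "PF_Y"), ("EC_B", "PF_Y"), ("EC_C", "PF_Z")]
ec_hist = arity_histogram(pairs, index=0)    # arities seen from the EC side
pfam_hist = arity_histogram(pairs, index=1)  # arities seen from the Pfam side
```

Splitting each bin further by Gold, Silver, and Bronze would only require carrying the quality label alongside each pair and counting per label.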
(A) Fourteen one-to-one EC-Pfam associations found by ECDomainMiner and involving domains of unknown function, (B) an example of one-to-one EC-Pfam association with very similar EC and Pfam descriptions, and (C) two examples of obligate Pfam pairs associated with an EC number
| Section | EC | Pfam | Score | EC name | Pfam name | Quality | PDBs (SIFTS) |
|---|---|---|---|---|---|---|---|
| A | 2.7.8.28 | PF01933 | 0.972 | 2-phospho-L-lactate transferase | Uncharacterised protein family UPF0052 | Gold | 9/0/11 |
| | 4.1.99.5 | PF11266 | 0.944 | Aldehyde oxygenase (deformylating) | Protein of unknown function DUF3066 | Gold | 18/0/0 |
| | 2.1.1.286 | PF11968 | 0.889 | 25S rRNA (adenine(2142)-N(1))-methyltransferase | Putative methyltransferase DUF3321 | Gold | 0/0/0 |
| | 1.13.99.1 | PF05153 | 0.667 | Inositol oxygenase | Family of unknown function DUF706 | Gold | 4/0/0 |
| | 2.4.1.155 | PF15027 | 0.611 | Alpha-1,6-mannosyl-glycoprotein 6-beta-N-acetylglucosaminyltransferase | Domain of unknown function DUF4525 | Gold | 0/0/0 |
| | 4.2.3.130 | PF10776 | 0.611 | Tetraprenyl-beta-curcumene synthase | Protein of unknown function DUF2600 | Gold | 0/0/0 |
| | 2.3.1.78 | PF07786 | 0.609 | Heparan-alpha-glucosaminide N-acetyltransferase | Protein of unknown function DUF1624 | Gold | 0/0/0 |
| | 3.1.4.45 | PF09992 | 0.584 | N-acetylglucosamine-1-phosphodiester alpha-N-acetylglucosaminidase | Predicted periplasmic protein DUF2233 | Gold | 0/0/1 |
| | 1.13.12.20 | PF08592 | 0.556 | Noranthrone monooxygenase | Domain of unknown function DUF1772 | Gold | 0/0/0 |
| | 2.1.1.312 | PF11312 | 0.556 | 25S rRNA (uracil(2843)-N(3))-methyltransferase | Protein of unknown function DUF3115 | Gold | 0/0/0 |
| | 2.1.1.313 | PF10354 | 0.556 | 25S rRNA (uracil(2634)-N(3))-methyltransferase | Domain of unknown function DUF2431 | Gold | 0/0/0 |
| | 2.5.1.128 | PF01861 | 0.556 | N4-bis(aminopropyl) spermidine synthase | Protein of unknown function DUF43 | Gold | 0/0/1 |
| | 5.2.1.14 | PF13225 | 0.556 | Beta-carotene isomerase | Domain of unknown function DUF4033 | Gold | 0/0/0 |
| | 1.14.99.29 | PF04248 | 0.333 | Deoxyhypusine monooxygenase | Domain of unknown function DUF427 | Silver | 0/0/5 |
| B | 6.3.2.25 | PF03133 | 0.610 | Tubulin–tyrosine ligase | Tubulin-tyrosine ligase family | Gold | 0/2/21 |
| C | | PF00370 | 0.847 | Glycerol kinase | FGGY family of carbohydrate kinases, N-terminal domain | Gold | 85/32/9 |
| | | PF02782 | 0.828 | Glycerol kinase | FGGY family of carbohydrate kinases, C-terminal domain | Gold | 85/32/7 |
| | | PF06973 | 0.997 | Formate-phosphoribosyl-amino-imidazolcarboxamide ligase | DUF1297 | Gold | 16/3/0 |
| | | PF06849 | 0.997 | Formate-phosphoribosyl-amino-imidazolcarboxamide ligase | DUF1246 | Gold | 16/3/0 |
The ‘PDBs (SIFTS)’ column contains 3 counts of PDB chains containing the mentioned Pfam domain and having either the same EC annotation in SIFTS as calculated by ECDomainMiner (first position), or different EC annotations between SIFTS and ECDomainMiner (second position), or no EC annotations in SIFTS (third position). Complete lists of PDB identifiers may be retrieved from the ECDomainMiner web server
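The three-part count in the 'PDBs (SIFTS)' column can be reproduced with a small tally. The sketch below is illustrative only: the function name `sifts_counts` and the chain identifiers are hypothetical, and it assumes that a chain with several SIFTS EC annotations counts as "same" as soon as one of them matches the EC number calculated by ECDomainMiner.

```python
def sifts_counts(predicted_ec, chain_ecs):
    """Return 'same/different/none' counts for chains containing a given Pfam domain.
    chain_ecs maps a chain ID to the set of EC numbers annotated in SIFTS
    (an empty set means the chain has no EC annotation)."""
    same = different = none = 0
    for ecs in chain_ecs.values():
        if not ecs:
            none += 1        # third position: no EC annotation in SIFTS
        elif predicted_ec in ecs:
            same += 1        # first position: SIFTS agrees with the prediction
        else:
            different += 1   # second position: SIFTS has a different EC number
    return f"{same}/{different}/{none}"

# Toy chains (hypothetical IDs and annotations) carrying one Pfam domain
chains = {
    "chain_1": {"2.7.8.28"},
    "chain_2": set(),
    "chain_3": {"1.1.1.1"},
    "chain_4": {"2.7.8.28"},
}
summary = sifts_counts("2.7.8.28", chains)
```

A large third count, as in several rows of the table, marks chains for which an association would supply an EC annotation where SIFTS currently has none.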
The numbers of PDB protein chains that could be annotated by ECDomainMiner associations
| Association type | ECDM associations concerned | PDB chains concerned |
|---|---|---|
| Any | 14,573 | 58,722 |
| Gold | 3591 | 41,246 |
| Silver | 7796 | 44,406 |
| Bronze | 3186 | 34,820 |
| One-to-One | 44 | 1334 |