Alejandro Ochoa, Manuel Llinás, Mona Singh.
Abstract
BACKGROUND: Identifying domains in protein sequences is an important step in protein structural and functional annotation. Existing domain recognition methods typically evaluate each domain prediction independently of the rest. However, the majority of proteins are multidomain, and pairwise domain co-occurrences are highly specific and non-transitive.
Year: 2011 PMID: 21453511 PMCID: PMC3090354 DOI: 10.1186/1471-2105-12-90
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Illustration of the dPUC framework using Pfam to identify initial domains. A. We gather candidate domain predictions using Pfam with a permissive threshold. Domains are arranged along the x-axis by their amino acid coordinates, but the y-axis arrangement is arbitrary (there may be overlapping initial predictions). B. We build a network between candidate domains. Node weights are the normalized Pfam HMM scores of the corresponding domains (raw score minus the domain threshold). Edge weights between non-overlapping domains are set to our context scores. C. The Standard Pfam will make limited predictions, while dPUC may boost weak domains over the thresholds if they are in the correct context. The dPUC solution maximizes the sum of the node and edge weights, without overlaps, and each node must satisfy the Pfam thresholds. The final normalized domain scores are shown for each framework.
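The objective described in panels B and C can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the authors' implementation: the candidate domains, their scores, and the context scores below are hypothetical, context scores are assumed to be symmetric between families, and the additional requirement that each node satisfy the Pfam thresholds is omitted.

```python
from itertools import combinations

# Hypothetical candidates: (family, start, end, normalized score).
# A negative normalized score means the domain is below its Pfam threshold.
candidates = [
    ("A", 1, 100, 5.0),
    ("B", 120, 200, -2.0),
    ("C", 90, 150, 1.0),
]

# Hypothetical context scores (edge weights), assumed symmetric here.
context = {frozenset(("A", "B")): 4.0}


def overlaps(d1, d2):
    """True if two domains share at least one amino acid position."""
    return d1[1] <= d2[2] and d2[1] <= d1[2]


def total_score(subset):
    """Sum of node weights plus edge weights over all pairs in the subset."""
    nodes = sum(d[3] for d in subset)
    edges = sum(
        context.get(frozenset((a[0], b[0])), 0.0)
        for a, b in combinations(subset, 2)
    )
    return nodes + edges


def best_domain_set(cands):
    """Exhaustively search non-overlapping subsets for the highest total score."""
    best, best_score = (), 0.0
    for k in range(1, len(cands) + 1):
        for subset in combinations(cands, k):
            if any(overlaps(a, b) for a, b in combinations(subset, 2)):
                continue  # overlapping domains cannot both be predicted
            score = total_score(subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


best, best_score = best_domain_set(candidates)
print([d[0] for d in best], best_score)
```

In this toy input, domain B is below threshold on its own but is retained because its context score with A outweighs its negative node weight. Exhaustive search is exponential in the number of candidates and is used here only to make the objective explicit.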
Figure 2. dPUC predicts more domains over a range of FDRs. A. Illustration of the FDR estimation procedure. For each original protein sequence, we make predictions on it and on twenty shuffled sequences concatenated to the original sequence, to allow "real" domains (Y, Z) to boost false predictions on the shuffled sequence (domains V, W, X) when using context. The estimated FDR is the ratio of false predictions per protein to the total number of predictions per protein. In this illustration, FDR ≈ (3/20)/(2) = 7.5%. B. The y-axis is the number of predicted domains per protein ("signal"), while the x-axis is the FDR ("noise"), so better performing methods have higher curves (more signal for a given noise threshold). dPUC (green circles) outperforms all non-context Pfam variations tested and the context method CODD.
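The arithmetic in panel A can be reproduced directly. The function and parameter names below are illustrative only; the numbers are those given in the caption (3 false predictions over 20 shuffled sequences, 2 predictions per protein).

```python
def estimated_fdr(false_predictions, n_shuffled, predictions_per_protein):
    """Ratio of false predictions per protein to total predictions per protein."""
    false_per_protein = false_predictions / n_shuffled
    return false_per_protein / predictions_per_protein


# Values from the Figure 2A illustration.
fdr = estimated_fdr(false_predictions=3, n_shuffled=20, predictions_per_protein=2)
print(f"{fdr:.1%}")
```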
Figure 3. dPUC predicts more domains over a range of Ortholog Coherence scores on Plasmodium species. A. Illustration of scores. Domain predictions are made on hypothetical aligned orthologs and in-paralogs (Pf1, Pf2, Pv1, and Pc1). Color denotes domain family. Domain S overlaps T of the same family, so their scores are 1/3 (since they lack predictions in Pv1 and Pc1). In contrast, U is predicted 100% in its orthologs and in-paralogs. Y overlaps V but is not of the same family, so its score is zero. Similarly, Z does not overlap any domains. The score of this method is the average domain score on all proteins, ~0.58, while the average number of domains per protein is 2. B. The y-axis is the number of predicted domains per protein ("signal"), while the x-axis is the ortholog coherence score (inversely related with "noise"), so better performing methods have higher curves (more signal for a given noise threshold). dPUC (green circles) outperforms the other methods. Symbols and colors are as in Figure 2.
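The averages quoted in panel A can be checked with a short calculation. The caption states the scores of S, T, U, Y, and Z only; the sketch below additionally assumes that the illustration contains eight domains named S through Z and that V, W, and X score 1 like U. That reading is an assumption, chosen because it reproduces both the ~0.58 average score and the average of 2 domains per protein.

```python
from fractions import Fraction

# Per-domain ortholog coherence scores.
# S, T, U, Y, Z follow the caption; V, W, X = 1 is an assumption (see lead-in).
scores = {
    "S": Fraction(1, 3),  # coherent prediction in 1 of the 3 other proteins
    "T": Fraction(1, 3),
    "U": Fraction(1),     # predicted in all orthologs and in-paralogs
    "V": Fraction(1),     # assumed
    "W": Fraction(1),     # assumed
    "X": Fraction(1),     # assumed
    "Y": Fraction(0),     # overlaps a domain of a different family
    "Z": Fraction(0),     # overlaps no domain
}
n_proteins = 4  # Pf1, Pf2, Pv1, Pc1

mean_score = sum(scores.values()) / len(scores)
domains_per_protein = Fraction(len(scores), n_proteins)
print(float(mean_score), float(domains_per_protein))
```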
dPUC increases domain predictions and amino acid coverage
| | E. c. | M. t. | P. f. | P. v. | S. c. | C. e. | D. m. | H. s. |
|---|---|---|---|---|---|---|---|---|
| Domains | 4.30 | 6.00 | 10.30 | 11.46 | 6.21 | 8.07 | 9.08 | 7.15 |
| Domains unique families | 2.62 | 2.91 | 5.73 | 5.60 | 3.38 | 3.66 | 4.38 | 3.43 |
| Domains repeated families | 26.56 | 39.61 | 23.83 | 37.48 | 23.79 | 19.95 | 18.26 | 12.34 |
| Amino acids | 2.38 | 4.13 | 7.25 | 7.66 | 3.14 | 4.74 | 5.63 | 3.63 |
| Proteins | 0.16 | 0.08 | 1.80 | 1.31 | 0.38 | 0.70 | 0.59 | 0.56 |
Percent increases of dPUC predictions relative to the Standard Pfam are given for each organism (E. c., E. coli; M. t., M. tuberculosis; P. f., P. falciparum; P. v., P. vivax; S. c., S. cerevisiae; C. e., C. elegans; D. m., D. melanogaster; H. s., H. sapiens) when considering all domains, first appearance of domains in a protein, subsequent occurrences of domains in a protein, amino acids covered by a domain, and all proteins with domain predictions. (See text.)
dPUC predictions lead to novel or more specific Gene Ontology terms on proteins
| | E. c. | M. t. | P. f. | P. v. | S. c. | C. e. | D. m. | H. s. |
|---|---|---|---|---|---|---|---|---|
| Same | 98.51 | 98.27 | 96.86 | 96.01 | 98.28 | 97.07 | 96.26 | 96.68 |
| New or more specific | 1.80 | 1.98 | 5.05 | 5.00 | 2.04 | 3.07 | 3.40 | 3.13 |
| Deleted or less specific | 0.39 | 0.52 | 0.88 | 1.25 | 0.32 | 0.90 | 1.12 | 0.86 |
| Mixed | 0.28 | 0.49 | 1.07 | 1.53 | 0.72 | 1.17 | 1.61 | 1.36 |
Comparison of dPUC-based GO predictions with those based on the Standard Pfam. Values are percentages relative to the number of proteins with GO terms in the Standard Pfam per organism. The categories are mutually exclusive, with "Mixed" specifying that both "new or more specific" and "deleted or less specific" GO terms occurred in the same protein.
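One way to assign a protein to these four categories is to compare the two GO term sets after expanding each with its ancestor terms. The sketch below is an assumption about how such a comparison could be made, not the procedure used in the study; the term names and the `parents` map are hypothetical placeholders rather than real GO identifiers.

```python
def closure(terms, parents):
    """Return the given terms together with all of their ancestor terms."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def classify(standard_terms, dpuc_terms, parents):
    """Classify the change in GO terms from Standard Pfam to dPUC."""
    standard = closure(standard_terms, parents)
    dpuc = closure(dpuc_terms, parents)
    gained = dpuc - standard  # new or more specific terms
    lost = standard - dpuc    # deleted or less specific terms
    if gained and lost:
        return "Mixed"
    if gained:
        return "New or more specific"
    if lost:
        return "Deleted or less specific"
    return "Same"


# Hypothetical toy hierarchy: child -> parent -> root, other -> root.
parents = {"child": ["parent"], "parent": ["root"], "other": ["root"]}

print(classify({"parent"}, {"child"}, parents))
print(classify({"child"}, {"other"}, parents))
```

Because a more specific term implies its ancestors, replacing a term with one of its descendants counts only as a gain, which keeps the four categories mutually exclusive.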
Completely novel P. falciparum dPUC predictions lead to refined protein annotations
| Protein ID | Standard Pfam domains | Additional dPUC domains | Current annotation (PlasmoDB 6.0) | Suggested reannotation (this study) |
|---|---|---|---|---|
| PFL0980w | CwfJ_C_1 | CwfJ_C_2 | conserved | Debranching enzyme-associated ribonuclease (DRN1 ortholog), putative |
| PF13_0222 | Metallophos | DBR1 | phosphatase, putative | RNA lariat debranching enzyme (DBR1 ortholog), putative |
| PF11_0086 | MIF4G | PAM2 | MIF4G domain containing protein | Poly(A)-binding protein-interacting protein 1 (PAIP1 ortholog), putative |
| PFE1390w | DEAD, Helicase_C | zf-CCHC | RNA helicase-1 | Post-translational mRNA regulation (ABSTRAKT ortholog), putative |
| PF08_0130 | WD40 | Utp13 | WD-repeat protein, putative | U3 ribonucleoprotein component (PWP2 ortholog), putative |
| PF14_0456 | WD40 | Utp12 | conserved | U3 ribonucleoprotein component (DIP2 ortholog), putative |
| PF10_0128 | WD40 | Utp13 | WD-repeat protein, putative | U3 ribonucleoprotein component (UTP13 ortholog), putative |
| PFI1025w | RRM_1 | Lsm_interact | RNA binding protein, putative | U4/U6 snRNA-associated-splicing factor (PRP24 ortholog), putative |
| PFL0985c | DUF367 | RLI | conserved protein, unknown function | Ribosome biogenesis regulator (TSR3 ortholog), putative |
| MAL8P1.19 | DEAD, Helicase_C | DBP10CT | RNA helicase, putative | Ribosomal biogenesis RNA helicase protein (DBP10 ortholog), putative |
| PFE0560c | MORN | Avl9 | MORN repeat protein, putative | Atypical Golgi transport protein (AVL9 ortholog) with MORN domains, putative |
| PFL1455w | DUF202, SPX | VTC | conserved | Vacuolar transporter chaperone (VTC2/3/4 ortholog), putative |
| PFL2255w | TPR_2 | F-box | conserved | DNA replication origin binding protein (DIA2 ortholog), putative |
| PFF1070c | UPF0004, Radical_SAM | TRAM | radical SAM protein, putative | Ribosome or tRNA methylthiotransferase (RIMO or MIAB ortholog) or CDK5 regulatory subunit-associated protein 1, putative |
| PFL1045w | DUF814 | FbpA | conserved protein, unknown function | FbpA domain protein, putative |
| MAL13P1.182 | RanBPM_CRA | LisH | conserved | GID8 ortholog, putative |
| MAL13P1.79 | | zf-CCCH, WD40 | conserved | CCCH zinc finger protein, putative |
| MAL13P1.37 | | zf-B_box | conserved | Tripartite motif protein, putative |
These dPUC domain predictions were novel relative to the Standard Pfam, SMART, and Superfamily, and coherent predictions were present in OrthoMCL orthologs or in-paralogs. dPUC predictions always contained the Standard Pfam domains, so only the additional domain predictions are listed. The number of repeats per family is not shown.
Additional P. falciparum dPUC predictions lead to refined protein annotations
| Protein ID | Standard Pfam domains | Additional dPUC domains | Suggested reannotation (this study) |
|---|---|---|---|
| PFE1240w | Radical_SAM, Wyosine_form | Flavodoxin_1 | Wybutosine synthesis protein (TYW1 ortholog), putative |
| PFF1490w | THF_DHG_CYH_C | THF_DHG_CYH | Tetrahydrofolate dehydrogenase/cyclohydrolase (MTD1 ortholog, MIS1/ADE3 homolog without FTHFS domain), putative |
| MAL8P1.139 | DDA1* | WD40 | Regulator of (H+)-ATPase in Vacuolar membrane (RAV1 ortholog), putative |
| PF08_0124 | CactinC_cactus | Cactin_mid | CACTIN homolog, putative |
| PF10_0152 | | NTP_transf_2, PAP_assoc | Non-canonical cytoplasmic specific poly(A) RNA polymerase protein (CID13 ortholog), putative |
| MAL13P1.170 | NTP_transf_2 | PAP_assoc | Non-canonical poly(A) RNA polymerase protein (PAP2/TRF5 ortholog), putative |
| PFI1560c | DUF21 | CBS, cNMP_binding | Required for mitochondrial morphology (MAM3 ortholog), putative |
| PF10_0126 | | WD40 | Phosphoinositide binding protein (HSV2/ATG18 ortholog), putative |
| PFI0510c | BRCT | IMS | DNA repair protein (REV1 ortholog), putative |
| MAL13P1.54 | WD40 | LisH | Alternative splicing regulator (SMU-1 ortholog), putative |
| PF14_0052 | cobW | CobW_C | COBW domain-containing protein 1 (CBWD1 ortholog), putative |
| PF08_0012 | SET, Pre-SET | YDG_SRA | Histone lysine N-methyltransferase, putative |
| PFE1445c | | FG-GAP | T-cell immunomodulatory protein (human TIP homolog), putative |
| PFL0975w | IQ | RCC1 | Unconventional myosin fused to IQ and RCC1 domains, putative |
| PF11_0276 | Abhydro_lipase | Abhydrolase_1 | Steryl ester hydrolase (TGL1/YEH1/YEH2 ortholog), putative |
| PF13_0190 | Aha1_N | TPR_2, TPR_1 | Chaperone binding protein, putative |
| PF11_0287 | CRAL_TRIO | CRAL_TRIO_N | CRAL/TRIO protein, putative |
| PF11_0197 | Ank | ACBP | Acyl-CoA-binding protein, putative |
| PF14_0647 | TLD | TBC | Rab GTPase activator, putative |
| PFL0575w | Amino_oxidase, Thi4* | PHD | PHD finger and flavin containing amine oxidoreductase, putative |
| MAL13P1.246 | E1-E2_ATPase | Cation_ATPase_C | E1-E2 ATPase, putative |
| PF11_0116 | | Nol1_Nop2_Fmu | Nol1/Nop2/Fmu-like protein, putative |
| MAL7P1.127 | | Pkinase | Rab GTPase activator and protein kinase, putative |
| PFC0425w | | zf-C3HC4, PHD | PHD finger protein, putative |
| PFI0975c | | RCC1 | Regulator of chromosome condensation, putative |
| PFD0900w | | RCC1 | Regulator of chromosome condensation, putative |
| MAL7P1.132 | | Pkinase | Protein kinase, putative |
| PFF0810c | | Ras | Ras GTPase, putative |
| PFL1990c | | zf-CCHC, RRM_1 | RNA binding protein, putative |
| PF07_0066 | | RRM_1 | RNA binding protein, putative |
| PF13_0147 | | RRM_1 | RNA binding protein, putative |
| PFF1120c | | EGF | EGF-like membrane protein, putative |
| PF14_0262 | WD40 | TPR_1 | WD40 and TPR repeats protein, putative |
| PFI0275w | | WD40 | WD40 repeat and EF hand protein, putative |
| PF10_0285 | | WD40 | WD40 repeat protein, putative |
| PF11_0195 | | WD40 | WD40 repeat protein, putative |
| PF14_0640 | | WD40 | WD40 repeat protein, putative |
| MAL13P1.308 | | Arm | ARM repeat protein, putative |
These dPUC predictions were novel compared to the Standard Pfam, and were consistent with existing domain predictions from SMART or Superfamily (and often present in orthologs too). The number of repeats per family is not shown.
dPUC predictions always contained the Standard Pfam domains, so only the additional domains are listed, except when marked with an asterisk (*; MAL8P1.139 has DDA1 in Standard Pfam but not in dPUC Pfam; PFL0575w has a Thi4 in Standard Pfam but it is replaced by another Amino_oxidase domain [belonging to the same Pfam clan] in dPUC Pfam).
All proteins have the current PlasmoDB 6.0 annotation of "conserved Plasmodium protein, unknown function" except: MAL8P1.139, PFI1560c, MAL13P1.246, MAL7P1.127 "conserved Plasmodium membrane protein, unknown function"; PFE1240w, PF11_0287 "conserved protein, unknown function"; MAL13P1.170 "nucleotidyltransferase, putative"; PF08_0012 "SET domain protein, putative"; PFF1120c "conserved Apicomplexan protein, unknown function"; PF14_0262 "probable protein, unknown function".