Wing-Cheong Wong, Sebastian Maurer-Stroh, Frank Eisenhaber.
Abstract
Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions on the packing in the hydrophobic core, requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended to them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs), where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited solely to the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples in which the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74, using conservative criteria, we find that for a set of validated domain models at least 2.1% to 13.6% of the annotated Pfam hits per model appear unjustified. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself.
We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.
Year: 2010 PMID: 20686689 PMCID: PMC2912341 DOI: 10.1371/journal.pcbi.1000867
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Summary of predicted/validated non-globular segments and supporting evidence for the 18 SMART version 6 domains.
| Domain name | Type | Predicted segments | Validated Segments | Comments |
| SM00019 : SF_P (Pulmonary surfactant protein) | TM | 33–58 | 1–58# | The N-terminal propeptide 1–58 of NP_003009 forms a TM when induced by a Brichos domain |
| SM00157 : PRP (Major prion protein) | TM | 117–140 | 112–135# | Latent transmembrane region in human prion protein BAG32277 |
| SM00665 : B561 (Cytochrome B561/ferric reductase TM domain) | TM | 4–146 | N/a | Intrinsic membrane protein |
| SM00714 : LITAF (LPS-induced tumor necrosis factor α factor) | TM | 38–61 | N/a | The LITAF domain appears to have a membrane-inserted motif (although without transmembrane segment) |
| SM00724 : TLC (TRAM, LAG1 and CLN8 homology domains) | TM | 10–76; 216–238; 287–307 | N/a | Proof for 8 membrane-spanning segments in Lag1p (NP_011860) and Lac1p (NP_012917) |
| SM00730 : PSN (Presenilin, signal peptide peptidase, family) | TM | 5–27; 113–134; 214–285; 600–649 | 4–25#; 115–133#; 214–231#; 241–257#; 260–283#; 602–621#; 628–644# | Out of 10 TM regions shown for human presenilin-1 (AAB46371), 9 are in the domain alignment out of which 7 are predicted here |
| SM00752 : HTTM Horizontally transferred transmembrane domain | TM | 12–25; 75–95; 275–294; 338–357 | N/a | Domain is known to have 4 TM regions |
| SM00756 : VKc (catalytic subunit of vitamin K epoxide reductase) | TM | 12–30; 104–192 | 13–32#; 142–189# | VKORC1 (Q9BQB6) is a membrane protein |
| SM00780 : PIG-X (Mammalian PIG-X and yeast PBN1) | TM | 230–248 | 230–252# | PBN1 (CAA42392) is a type I transmembrane protein in the endoplasmic reticulum |
| SM00786 : SHR3_chaperone (ER membrane protein SH3) | TM | 7–111; 167–186 | N/a | Shr3p (NP_010069) has 4 membrane segments |
| SM00793 : AgrB (Accessory gene regulator B) | TM | 42–204 | N/a | |
| SM00815 : AMA-1 (Apical membrane antigen 1) | TM | 522–527 | 515–602# | Segment missing in structure 1W81_A |
| SM00831 : Cation_ATPase_N (Cation transporter/ATPase, N-terminus) | TM | 72–90 | 65–94# | Segment maps to a TM helix of the β-domain of 1KJU_A |
| SM00190 : IL4_13 (Interleukin 4/13) | SP | 1–20 | 1–23# | Annotated as secreted. Segment missing in structure 1ITL_A |
| SM00476 : DNaseIc (deoxyribonuclease I) | SP | 1–19 | 1–17# | Annotated as secreted. Segment missing in structure 1DNK_A |
| SM00770 : Zn_dep_PLPC (Zn-dependent phospholipase C, α toxin) | SP | 4–26 | 1–64# | Annotated as secreted. Segment missing in structure 1OLP_A |
| SM00792 : Agouti | SP | 1–19 | 1–89# | Annotated as secreted. Segment missing in structure 1Y7J_A |
| SM00817 : Amelin (Ameloblastin precursor) | SP | 11–28 | 1–26# | Protein AAG27036 |
Both the predicted and, if explicitly available in the literature, the validated segments of TM regions or signal peptides are given in the position numbering of the respective SMART domain alignment. In cases marked with “#”, the sequence positions refer to the reference sequence given in the comments.
Figure 1. Cumulative plots of SMART version 6 and Pfam release 23 problematic domains.
In SMART version 6, the total number of domains with predicted SP/TM segments reaches 18, which is 2.2% of the 809 SMART domains (top panel). Red triangles mark the time points for the years 1998, 2002 and 2009, when the total number of domain models was 86, 600 and 809 respectively. In Pfam, the total number of problematic domains reaches 1214, which is 11.8% of the 10340 Pfam domains (bottom panel). Likewise, red triangles mark the years 1999, 2002 and 2008, with 1465, 3360 and 10340 Pfam entries respectively.
Figure 2. Histograms of average log probability per predicted transmembrane helix and per predicted signal peptide in Pfam release 23.
The top part shows the histogram of average log probability per predicted transmembrane helix; the bottom part shows the same per predicted signal peptide. The log probability on the x-axis is calculated with equations 5 and 6. At the TM cutoff of ≥−12 (false-positive rate 4.67%) and the SP cutoff of ≥−1 (false-positive rate 4.02%), the numbers of predicted TM helices and signal peptides are 3849 and 164 respectively.
Figure 3. Average log probability plot of transmembrane helix and signal peptide predictions per domain.
The top part shows the average log probability per predicted transmembrane helix calculated per domain; the bottom part shows the same per predicted signal peptide. The y-axis shows the log probability in accordance with equation 6 applied over all predicted segments for a given domain, whereas the x-axis represents their cumulative length. At the TM cutoff of ≥−12 and the SP cutoff of ≥−1 (horizontal dashed lines), the numbers of problematic TM and SP domains are 1079 and 164 respectively. The total number of problematic domains is 1214 (1050 TM only, 135 SP only and 29 with both TM and SP).
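The counts in this caption can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the captions (the split into TM-only, SP-only and concurrent domains, and the Pfam release 23 total); the variable names are illustrative.

```python
# Counts quoted in the Figure 3 caption (Pfam release 23).
tm_only = 1050      # domains with predicted TM segments only
sp_only = 135       # domains with predicted SP segments only
both = 29           # domains with concurrent TM and SP segments
total_pfam = 10340  # all domain models in Pfam release 23

tm_domains = tm_only + both             # domains passing the TM cutoff
sp_domains = sp_only + both             # domains passing the SP cutoff
problematic = tm_only + sp_only + both  # each domain counted once
unaffected = total_pfam - problematic   # domains without detected SP/TM region

print(tm_domains, sp_domains, problematic, unaffected)
```

The sums reproduce the 1079 TM and 164 SP domains and the total of 1214, and the remainder matches the 9126 domains without a detected SP/TM region mentioned for Figure 6.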
Figure 4. Examples of domain architectures of false-positive HMM hits caused by TM helices in the fragment-mode search.
We show illustrative examples for six Pfam release 23 models: Herpes_glycop_D (PF01537.9), CDC50 (PF03381.7), Cation_ATPase_N (PF00690.18), GSPII_F (PF00482.11), PAP2 (PF01569.13) and HCV_NS4b (PF01001.11). The black boxes denote the problematic domain annotations in the respective sequences. Additional material such as hmmpfam outputs and alignments are available at the associated BII WWW site for this work. Domain architecture illustrations were created with DOG 1.5 [98].
Figure 5. Examples of domain architectures of false-positive HMM hits caused by TM helices/signal peptides in the global-mode search.
Findings are shown for nine Pfam release 23 models: Pig-P (PF08510.4), PAP2 (PF01569.13), EMP24_GP25L (PF01105.15), PTPLA (PF04387.6), Lamp (PF01299.9), MttA_Hcf106 (PF02416.8), HAMP (PF00672.17), Nodulin_late (PF07127.3) and GRP (PF07172.3). The black boxes denote the problematic domain annotations in the respective sequences. Additional material such as hmmpfam outputs and alignments are available at the associated BII WWW site for this work. Domain architecture illustrations were created with DOG 1.5 [98].
Percentage of unjustified annotations for validated problematic domains in Protein Information Resource (PIR) iProClass v3.74 (global-mode search).
| Domain name | Type, validated region of model (size) | No. of retrieved sequences | No. of FP hits meeting the criteria | No. of annotations without hmmpfam hits (E>10) | Total No. of unjustified hits (%) |
| PF00690.18 : Cation_ATPase_N (Cation transporter/ATPase, N-terminus) | TM, 66–87 (87), ref. | 3684 | 74 | 3 | 77 (2.1%) |
| PF01105.15 : EMP24_GP25L (Endoplasmic reticulum and golgi apparatus trafficking proteins) | TM, 141–167 (167), ref. | 1029 | 8 | 33 | 41 (4.0%) |
| PF01299.9 : Lamp (Lysosome-associated membrane glycoprotein) | TM, 304–340 (340), ref. | 164 | 2 | 12 | 14 (8.5%) |
| PF01544.10 : CorA (CorA-like Mg2+ transporter protein) | TM, 341–407 (407), ref. | 2717 | 15 | 71 | 86 (3.2%) |
| PF01569.13 : PAP2 (type 2 phosphatidic acid phosphatase) | TM, 102–177 (177), ref. | 5231 | 108 | 19 | 127 (2.4%) |
| PF02416.8 : MttA_Hcf106 (sec-independent translocation mechanism protein) | TM, 1–19 (74), refs. | 2085 | 283 | 0 | 283 (13.6%) |
| PF04387.6 : PTPLA (protein tyrosine phosphatase-like protein) | TM, 89–168 (168), refs. | 277 | 3 | 3 | 6 (2.2%) |
| PF04612.4 : Gsp_M (General secretion pathway, M protein) | TM, 1–40 (165), ref. | 401 | 19 | 6 | 25 (6.2%) |
| PF07127.3 : GRP (plant glycine rich proteins) | SP, 1–49 (134), ref. | 207 | 12 | 4 | 16 (7.7%) |
| PF08294.3 : TIM21 (Mitochondrial import protein) | TM, 1–36 (157), ref. | 118 | 7 | 1 | 8 (6.8%) |
| PF08510.4 : PIG-P (phosphatidylinositol N-acetyl-glucosaminyl transferase subunit P) | TM, 1–67 (153), ref. | 143 | 4 | 0 | 4 (2.8%) |
The first column lists selected Pfam domains whose models include TM and/or SP regions, with their accession, identifier, description and gathering score (as in Pfam release 23). The second column gives the region in the domain alignment that includes the validated SP/TM segments (together with interlinking loops as described in Methods) and the corresponding references. The third column gives the number of sequences retrieved from iProClass v3.74 for each domain. The next two columns give the number of unjustified hits for which hmmpfam returned results that also satisfied the criteria, and the number of annotations for which hmmpfam returned no hit (E>10). The last column gives the total number of unjustified hits and their percentage of the retrieved sequences. In addition, the log-odds scores were re-derived from the match/insert/state transition scores provided by the respective HMM model. The reproduced scores deviated from the original scores by 0.57±0.34. The two thresholds defined in equations 19 and 20 are the domain gathering score threshold and the expected non-SP/TM-specific gathering score threshold, respectively.
Additional material such as hmmpfam outputs and alignments are available at the associated BII WWW site for this work.
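The last column of the table follows directly from the three count columns. A minimal sketch in Python, using only the counts listed above; the dictionary layout and the function name `unjustified` are illustrative and not part of the original work.

```python
# Per domain: (retrieved sequences, FP hits meeting the criteria,
#              annotations without hmmpfam hits), copied from the table.
rows = {
    "PF00690.18 Cation_ATPase_N": (3684, 74, 3),
    "PF01105.15 EMP24_GP25L": (1029, 8, 33),
    "PF01299.9 Lamp": (164, 2, 12),
    "PF01544.10 CorA": (2717, 15, 71),
    "PF01569.13 PAP2": (5231, 108, 19),
    "PF02416.8 MttA_Hcf106": (2085, 283, 0),
    "PF04387.6 PTPLA": (277, 3, 3),
    "PF04612.4 Gsp_M": (401, 19, 6),
    "PF07127.3 GRP": (207, 12, 4),
    "PF08294.3 TIM21": (118, 7, 1),
    "PF08510.4 PIG-P": (143, 4, 0),
}


def unjustified(retrieved, fp_hits, no_hits):
    """Return the total unjustified hits and their percentage of retrieved sequences."""
    total = fp_hits + no_hits
    return total, 100.0 * total / retrieved


percentages = {}
for name, counts in rows.items():
    total, pct = unjustified(*counts)
    percentages[name] = round(pct, 1)
    print(f"{name}: {total} ({pct:.1f}%)")

# Range over all listed domain models.
print(min(percentages.values()), max(percentages.values()))
```

The smallest and largest values correspond to the 2.1% to 13.6% range quoted in the abstract.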
Figure 6. Relationship between the gathering score and the corresponding E-value threshold for Pfam domain library release 23.
The y-axis shows the gathering score threshold (GA) for the global-mode search, whereas the x-axis shows the corresponding E-value threshold (in decimal log scale), calculated for this score with the domain-specific extreme-value function using the parameters provided in the corresponding HMM file (for an NR database size of 7365651 sequences). The upper plot represents the distribution for the 9126 domains without a detected SP/TM region; the middle plot shows the same for the 1214 domains with SP/TM problems. Effectively, there is no clear correlation between gathering score and E-value threshold. If E-values close to 0.1 are considered significant, all dots should be close to the “−1” line (horizontal dashed lines) in this graph and, indeed, there is some agglomeration of data points in that area; yet, there are numerous outliers. Note that the E-values are computed using the equation $E = N\left[1 - \exp\left(-e^{-\lambda(\mathrm{GA} - \mu)}\right)\right]$, where $N$ is the database size, and $\mu$ and $\lambda$ are the extreme value distribution (EVD) parameters of the domain model. The bottom plot depicts the histogram for the 10340 domains in Pfam release 23. The median of all log E-values corresponding to the domain-specific GAs is −1.16, which translates to an E-value of 0.07.
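As a rough illustration of how a gathering score maps to an E-value threshold, the sketch below assumes the Gumbel-type extreme value tail commonly used with HMMER2-style models, E = N·[1 − exp(−exp(−λ(x − μ)))]. The function name `evalue` and the example values of μ and λ are hypothetical; real values would be read from the EVD line of each HMM file.

```python
import math


def evalue(score, mu, lam, db_size):
    """E-value of a bit score under an assumed Gumbel (extreme value) tail.

    P(S >= score) = 1 - exp(-exp(-lam * (score - mu))); E = db_size * P.
    """
    # -expm1(-y) equals 1 - exp(-y) and stays accurate for very small y.
    tail = -math.expm1(-math.exp(-lam * (score - mu)))
    return db_size * tail


NR_SIZE = 7365651  # database size quoted in the caption

# Median log10 E-value from the caption converted back to an E-value.
median_e = 10 ** -1.16
print(round(median_e, 2))

# Hypothetical EVD parameters, for illustration only.
mu, lam = -50.0, 0.25
for ga in (15.0, 25.0, 35.0):
    print(ga, math.log10(evalue(ga, mu, lam, NR_SIZE)))
```

With fixed μ and λ, a higher gathering score always gives a smaller E-value; the scatter in the figure arises because μ and λ differ from model to model.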
Figure 7. Histograms of average log probability per predicted transmembrane helix for SCOP v1.75 α-proteins class and membrane protein class.
The top (average log probability per predicted transmembrane helix for SCOP v1.75 α-proteins class) and bottom (average log probability per predicted transmembrane helix for SCOP v1.75 membrane protein class) histograms represent the false-positive and true-positive distributions for TM predictions respectively. The total number of predicted structural and membrane helices is 2293 and 5592 respectively.
FP and FN rates of TM predictions based on different TM cutoffs.
| Average log probability of TM prediction | No. of FP | FP rate (%) | No. of FN | FN rate (%) |
| ≥−6 | 21 | 0.91 | 4519 | 80.81 |
| ≥−7 | 37 | 1.61 | 3401 | 60.82 |
| ≥−8 | 45 | 1.96 | 2520 | 45.06 |
| ≥−9 | 47 | 2.04 | 1593 | 28.49 |
| ≥−10 | 72 | 3.14 | 910 | 16.27 |
| ≥−11 | 84 | 3.66 | 526 | 9.41 |
| ≥−12 | 107 | 4.67 | 418 | 7.48 |
| ≥−13 | 125 | 5.45 | 381 | 6.81 |
| ≥−14 | 206 | 8.98 | 362 | 6.47 |
The first column gives the various cutoffs for the average log probability of TM helix prediction (refer to equations 5 and 6). The next two columns denote the number and percentage of false-positive TM helices with respect to 2293 predicted helices from SCOP α-proteins based on the corresponding cutoff rate. Similarly, the last two columns describe the number and percentage of false-negative TM helices with respect to 5592 predicted helices from SCOP membrane proteins.
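The rates in this table are the raw counts divided by the sizes of the two reference sets. The sketch below recomputes them and then picks the most permissive cutoff whose false-positive rate stays below 5%. That 5% bound is an assumption made here for illustration; the source states only the chosen cutoff (≥−12) and its rates, not the selection rule. All names are illustrative.

```python
N_ALPHA_HELICES = 2293     # predicted helices in SCOP alpha-proteins (FP reference set)
N_MEMBRANE_HELICES = 5592  # predicted helices in SCOP membrane proteins (TP reference set)

# cutoff -> (number of FP, number of FN), copied from the table
counts = {
    -6: (21, 4519),
    -7: (37, 3401),
    -8: (45, 2520),
    -9: (47, 1593),
    -10: (72, 910),
    -11: (84, 526),
    -12: (107, 418),
    -13: (125, 381),
    -14: (206, 362),
}


def rates(cutoff):
    """Return (FP rate, FN rate) in percent for a given cutoff."""
    fp, fn = counts[cutoff]
    return 100.0 * fp / N_ALPHA_HELICES, 100.0 * fn / N_MEMBRANE_HELICES


for cutoff in sorted(counts, reverse=True):
    fp_rate, fn_rate = rates(cutoff)
    print(f">= {cutoff}: FP {fp_rate:.2f}%  FN {fn_rate:.2f}%")

# Assumed rule: lowest FN rate among cutoffs with an FP rate below 5%.
MAX_FP_RATE = 5.0
allowed = [c for c in counts if rates(c)[0] < MAX_FP_RATE]
best = min(allowed, key=lambda c: rates(c)[1])
print("selected cutoff:", best)
```

Under this assumed rule the selected cutoff is −12. A few recomputed rates differ from the table in the last digit, which is consistent with truncation rather than rounding in the original.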
Figure 8. Histograms of average log probability per predicted signal peptide for SCOP v1.75 α- and membrane protein class and SMART version 6.
The top (average log probability per predicted signal peptide for SCOP v1.75 α- and membrane protein class) and bottom (average log probability per predicted signal peptide for SMART version 6) histograms represent the false-positive and true-positive distributions for the SP predictions respectively. The total number of predicted signal peptides for SCOP α- and membrane proteins is 193 and 379 respectively, while the total number for SMART is 45. All except SM00817 Amelin (no available structure) were validated against their respective PDB entries.
FP and FN rates of SP predictions based on different SP cutoffs.
| Average log probability of SP prediction | No. of FP | FP rate (%) | No. of FN | FN rate (%) |
| ≥−0.5 | 20 | 3.50 | 8 | 17.78 |
| ≥−1 | 23 | 4.02 | 1 | 2.22 |
| ≥−2 | 38 | 6.64 | 1 | 2.22 |
| ≥−3 | 38 | 6.64 | 1 | 2.22 |
| ≥−4 | 44 | 7.69 | 1 | 2.22 |
The first column gives the various cutoffs for the average log probability of SP prediction (refer to equation 5). The next two columns denote the number and percentage of false-positive SP with respect to 572 predicted SP from SCOP α- and membrane proteins based on the corresponding cutoff rate. Similarly, the last two columns describe the number and percentage of false-negative SP with respect to 45 predicted SP in seed sequences from SMART version 6 alignments.