Jaina Mistry, Robert D Finn, Sean R Eddy, Alex Bateman, Marco Punta.
Abstract
Detection of protein homology via sequence similarity has important applications in biology, from protein structure and function prediction to reconstruction of phylogenies. Although current methods for aligning protein sequences are powerful, challenges remain, including problems with homologous overextension of alignments and with regions under convergent evolution. Here, we test the ability of the profile hidden Markov model method HMMER3 to correctly assign homologous sequences to >13,000 manually curated families from the Pfam database. We identify problem families using protein regions that match two or more Pfam families not currently annotated as related in Pfam. We find that HMMER3 E-value estimates seem to be less accurate for families that feature periodic patterns of compositional bias, such as the ones typically observed in coiled-coils. These results support the continued use of manually curated inclusion thresholds in the Pfam database, especially on the subset of families that have been identified as problematic in experiments such as these. They also highlight the need for developing new methods that can correct for this particular type of compositional bias.
Year: 2013 PMID: 23598997 PMCID: PMC3695513 DOI: 10.1093/nar/gkt263
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Observed number of overlapping domains (dark grey) and expected number of false positives (light grey) at three different E-value significance thresholds for the 13 356 Pfam families considered here.
Figure 2. (A) Cumulative proportion of overlapping domains in Pfam families. Families are ranked according to the number of their domains that overlap (in descending order) after applying a winner-takes-all greedy algorithm that assigns overlapping domains to families (see ‘Materials and Methods’ section). Data shown for three E-value significance thresholds. (B) Same as 2A red line, with additional plots for families that overlap with two or more, and three or more clans only (E-value = 0.01).
Figure 3. (A) Venn diagram with overlap between families predicted to be coiled-coil, disordered and transmembrane (see ‘Materials and Methods’ section) as observed in 13 356 total families. Coiled-coil: consecutive coiled-coil regions of 20 residues predicted in ≥50% of seed member regions. Disordered: consecutive intrinsically disordered regions of 20 residues predicted in ≥50% of seed member regions. Transmembrane helices: ≥2 transmembrane helices predicted in ≥50% of seed member regions. (B) Overrepresentation of predicted coiled-coil, transmembrane helices and intrinsic disorder when considering Pfam families with overlapping domains versus all Pfam families. Overlaps are calculated with respect to an E-value significance threshold of 0.01. Families are sorted by the number of clans they overlap with (descending) after a winner-takes-all greedy algorithm for assigning overlapping domains to families is applied. Note: two/three or more means two/three or more clans other than the one the family belongs to. Overrepresentation at each point x on the x-axis is obtained by calculating the proportion of families with a given label (e.g. coiled-coil) among the first x families and dividing by the proportion of all families (n = 13 356) with that label. Note that for the sake of simplicity, we truncated the x-axis at 400 families. Labels assigned to families as described in 3A.
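The overrepresentation measure described for Figure 3B can be illustrated with a short sketch. The function name `overrepresentation` and the toy input are hypothetical and are not taken from the paper; the code only assumes that families arrive already ranked and that each carries a boolean label (e.g. coiled-coil or not).

```python
def overrepresentation(ranked_labels, total_labelled, total_families):
    """Return the overrepresentation curve for one label.

    ranked_labels  -- booleans, one per family, in ranked order
                      (True if the family carries the label)
    total_labelled -- number of families with the label in the full set
    total_families -- size of the full set (13 356 in the caption)

    Entry x-1 of the result is the proportion of labelled families among
    the first x ranked families, divided by the background proportion.
    """
    background = total_labelled / total_families
    curve = []
    labelled_so_far = 0
    for x, has_label in enumerate(ranked_labels, start=1):
        labelled_so_far += bool(has_label)
        curve.append((labelled_so_far / x) / background)
    return curve


if __name__ == "__main__":
    # Toy data (hypothetical): 4 ranked families, 10 of 100 families labelled overall.
    for x, value in enumerate(overrepresentation([True, True, False, False], 10, 100), 1):
        print(x, round(value, 2))
```

A value above 1 at position x means the label is more frequent among the top x ranked families than in the full set.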
Figure 4. Comparison between the proportion of residues predicted to be in coiled-coil and in disordered regions (dark and light grey, respectively) in overlaps versus the proportion in UniProtKB (version 2011_06).
Table 1. The 20 Pfam-A families (Pfam 26.0) with the highest number of overlapping clans at an E-value threshold of 0.01, after a winner-takes-all greedy algorithm for assigning overlapping domains to families was applied (see ‘Materials and Methods’ section).
| Pfam-A ID (accession) | Description | Number of domains in the family | Number of domains with overlaps | Number of clans the family overlaps with | TMH | CC | DIS | Proportion of residues in overlap regions predicted to be TMH, CC or DIS (%) | Pfam sequence E-value threshold |
|---|---|---|---|---|---|---|---|---|---|
| IncA (PF04156) | Inclusion membrane protein A | 1231 | 308 | 61 | ✓ | ✓ | ✗ | 62.6 | 1.5e-10 |
| Myosin_tail_1 (PF01576) | This family consists of the coiled-coil myosin heavy chain tail region | 2383 | 271 | 49 | ✗ | ✓ | ✓ | 82.4 | 3.0e-11 |
| AAA_13 (PF13166) | Many of the proteins in this family are conjugative transfer protein | 986 | 170 | 45 | ✗ | ✓ | ✗ | 77.6 | 3.0e-09 |
| Reo_sigmaC (PF04582) | A homo trimer, this family features an α-helical triple coiled-coil | 456 | 116 | 41 | ✗ | ✓ | ✗ | 61.0 | 9.8e-08 |
| Baculo_PEP_C (PF04513) | Baculovirus polyhedron envelope protein, C-terminus | 189 | 52 | 35 | ✗ | ✓ | ✗ | 62.9 | 9.9e-07 |
| EzrA (PF06160) | This family represents proteins involved in bacterial cell division | 669 | 28 | 25 | ✗ | ✓ | ✗ | 72.7 | 1.8e-05 |
| DUF827 (PF05701) | Domain of unknown function | 474 | 172 | 23 | ✗ | ✓ | ✓ | 43.6 | 3.3e-05 |
| Filament (PF00038) | This family represents the coiled-coil central rod domain of intermediate filament proteins | 2315 | 35 | 22 | ✗ | ✓ | ✗ | 76.3 | 2.4e-07 |
| DUF869 (PF05911) | Domain of unknown function | 246 | 15 | 20 | ✗ | ✓ | ✓ | 78.8 | 9.6e-08 |
| MFS_1 (PF07690) | This family belongs to the major facilitator superfamily of membrane transporters | 175 494 | 30 | 19 | ✓ | ✗ | ✗ | 62.1 | 2.6e-05 |
| Tropomyosin (PF00261) | Tropomyosin is an α-helical protein that forms a coiled-coil structure of two parallel helices | 1389 | 39 | 19 | ✗ | ✓ | ✓ | 80.9 | 6.5e-06 |
| ATG16 (PF08614) | Autophagy proteins 16 have a dimeric coiled-coil structure | 298 | 26 | 17 | ✗ | ✓ | ✓ | 74.3 | 7.6e-06 |
| MCPsignal (PF00015) | This family represents the coiled-coil bacterial cytoplasmic domain of chemotaxis receptors | 20 808 | 50 | 17 | ✗ | ✓ | ✗ | 41.2 | 3.0e-04 |
| ERM (PF00769) | This family represents the coiled-coil domain of ERM proteins | 311 | 11 | 15 | ✗ | ✓ | ✓ | 77.7 | 3.1e-04 |
| Cenp-F_leu_zip (PF10473) | This family represents a microtubule-binding protein consisting of long coiled-coil regions | 138 | 76 | 14 | ✗ | ✓ | ✗ | 60.7 | 3.4e-04 |
| Spc7 (PF08317) | This family represents proteins involved in cell division in Fungi | 183 | 24 | 14 | ✗ | ✓ | ✓ | 69.3 | 9.2e-06 |
| Tropomyosin_1 (PF12718) | Tropomyosin_1 is in the same clan as the Tropomyosin family listed above | 964 | 28 | 14 | ✗ | ✓ | ✓ | 92.4 | 2.9e-04 |
| ABC2_membrane_3 (PF12698) | ABC-2 membrane transporter family | 17 995 | 60 | 10 | ✓ | ✗ | ✗ | 34.8 | 6.3e-05 |
| TPR_MLP1_2 (PF07926) | Proteins in this family feature coiled-coil regions | 193 | 12 | 10 | ✗ | ✓ | ✓ | 58.5 | 2.8e-04 |
| DUF3584 (PF12128) | Domain of unknown function | 173 | 14 | 10 | ✓ | ✓ | ✓ | 68.3 | 1.4e-04 |
For the purposes of counting the number of clans that a family overlaps with, a family that is not in a clan was counted as being in a clan by itself. Note that ‘number of domains in the family’ is the total number of regions aligned to the family profile HMM with E-value ≤0.01, before clan competition. Columns ‘TMH’, ‘CC’ and ‘DIS’ show whether the families have ≥2 transmembrane helices (TMH), or consecutive coiled-coil regions of 20 residues (CC), or consecutive disordered regions of 20 residues (DIS), predicted in ≥50% of seed member regions. E-values in the last column are calculated from the sequence gathering threshold, i.e. the family-specific bit score significance threshold (15), according to the following formula: E = N × exp[−λ·(x − τ)], where x is the bit score gathering threshold, λ and τ are parameters derived from the profile-HMM model (λ is the slope parameter, τ is the location parameter) and N is the database size (in this case the size of UniProtKB).
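As a worked illustration of the formula in the table footnote, the sketch below converts a bit score gathering threshold into an E-value and back. The function names and all numeric values (λ, τ, N and the bit score) are hypothetical placeholders chosen for the example; they are not parameters of any actual Pfam model or the real size of UniProtKB.

```python
import math


def evalue_from_bit_score(bit_score, lam, tau, db_size):
    """E = N * exp(-lambda * (x - tau)), as given in the table footnote."""
    return db_size * math.exp(-lam * (bit_score - tau))


def bit_score_for_evalue(evalue, lam, tau, db_size):
    """Invert the formula: x = tau + ln(N / E) / lambda."""
    return tau + math.log(db_size / evalue) / lam


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    lam = 0.7          # slope parameter (lambda)
    tau = -4.0         # location parameter (tau)
    db_size = 1.0e7    # database size N (placeholder, not the UniProtKB size)
    threshold = 25.0   # bit score gathering threshold x

    e = evalue_from_bit_score(threshold, lam, tau, db_size)
    print(f"E-value at x = {threshold}: {e:.3g}")
    print(f"Bit score recovered from E: {bit_score_for_evalue(e, lam, tau, db_size):.3f}")
```

Because the E-value falls exponentially with the bit score, a higher gathering threshold gives a stricter (smaller) E-value for the same model and database size.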
Figure 5. Proportion of residues in predicted transmembrane helices, coiled-coil regions and intrinsically disordered regions in different sets of sequences: UniProtKB (version 2011_06), all domains in the 13 356 Pfam families that we consider in this study, all overlapping regions, overlapping regions of families that overlap with two or more and three or more clans, and overlapping regions of the top 20 families in Table 1. Both Pfam domains and overlapping regions are calculated based on an E-value threshold of 0.01.