Jaina Mistry, Penny Coggill, Ruth Y Eberhardt, Antonio Deiana, Andrea Giansanti, Robert D Finn, Alex Bateman, Marco Punta.
Abstract
It is a worthy goal to completely characterize all human proteins in terms of their domains. Here, using the Pfam database, we asked how far we have progressed in this endeavour. Ninety per cent of proteins in the human proteome matched at least one of 5494 manually curated Pfam-A families. In contrast, human residue coverage by Pfam-A families was <45%, with 9418 automatically generated Pfam-B families adding a further 10%. Even after excluding predicted signal peptide regions and short regions (<50 consecutive residues) unlikely to harbour new families, for ∼38% of the human protein residues, there was no information in Pfam about conservation and evolutionary relationship with other protein regions. This uncovered portion of the human proteome was found to be distributed over almost 25 000 distinct protein regions. Comparison with proteins in the UniProtKB database suggested that the human regions that exhibited similarity to thousands of other sequences were often either divergent elements or N- or C-terminal extensions of existing families. Thirty-four per cent of regions, on the other hand, matched fewer than 100 sequences in UniProtKB. Most of these did not appear to share any relationship with existing Pfam-A families, suggesting that thousands of new families would need to be generated to cover them. Also, these latter regions were particularly rich in amino acid compositional bias such as the one associated with intrinsic disorder. This could represent a significant obstacle toward their inclusion into new Pfam families. Based on these observations, a major focus for increasing Pfam coverage of the human proteome will be to improve the definition of existing families. New families will also be built, prioritizing those that have been experimentally functionally characterized. Database URL: http://pfam.sanger.ac.uk/
Year: 2013 PMID: 23603847 PMCID: PMC3630804 DOI: 10.1093/database/bat023
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Pfam-A coverage of H. sapiens, S. cerevisiae and E. coli. Sequence coverage (blue) is calculated as the percentage of the proteome (Methods) that matches at least one Pfam-A family. Residue coverage (red) is the percentage of amino acids in the proteome that are covered by a Pfam-A family.
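The two coverage measures in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the input layout (protein lengths plus 1-based inclusive match coordinates) and the toy data are assumptions made for illustration. Overlapping matches are merged so that no residue is counted twice.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # hypothetical layout: 1-based, inclusive (start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so residues are counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage(lengths: Dict[str, int],
             matches: Dict[str, List[Interval]]) -> Tuple[float, float]:
    """Return (sequence coverage %, residue coverage %) for a toy proteome."""
    # Sequence coverage: proteins with at least one family match.
    covered_proteins = sum(1 for protein in lengths if matches.get(protein))
    # Residue coverage: residues inside any match, over all residues.
    covered_residues = sum(
        end - start + 1
        for protein in lengths
        for start, end in merge_intervals(matches.get(protein, []))
    )
    sequence_cov = 100.0 * covered_proteins / len(lengths)
    residue_cov = 100.0 * covered_residues / sum(lengths.values())
    return sequence_cov, residue_cov


# Toy data (invented): three proteins, two of them with matches.
toy_lengths = {"P1": 100, "P2": 100, "P3": 200}
toy_matches = {"P1": [(1, 50), (40, 60)], "P2": [(10, 19)]}
sequence_cov, residue_cov = coverage(toy_lengths, toy_matches)
```

In the toy data, two of three proteins have a match but only 70 of 400 residues are covered, which mirrors how sequence coverage can be high while residue coverage stays low.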
Figure 2. (a) Proportion of the human proteome contained in regions that are not part of Pfam-A, Pfam-B or of a predicted signal peptide versus region length. (b) Length distribution of human regions in (a).
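As a rough illustration of how the uncovered regions in Figure 2 might be extracted, the sketch below collects the stretches of a protein that fall outside a set of annotated intervals (standing in for Pfam-A, Pfam-B and predicted signal peptides) and keeps those of at least 50 consecutive residues, the cut-off stated in the abstract. The function name, the 1-based inclusive coordinates and the example data are assumptions, not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # hypothetical layout: 1-based, inclusive (start, end)


def uncovered_regions(length: int,
                      covered: List[Interval],
                      min_length: int = 50) -> List[Interval]:
    """Return gaps between annotated intervals that span >= min_length residues."""
    regions: List[Interval] = []
    next_free = 1  # first residue not yet known to be annotated
    for start, end in sorted(covered):
        if start - next_free >= min_length:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    # Trailing gap between the last annotation and the C-terminus.
    if length - next_free + 1 >= min_length:
        regions.append((next_free, length))
    return regions


# Toy protein of 300 residues with invented annotations.
toy_covered = [(1, 20), (100, 150), (140, 180), (220, 260)]
long_gaps = uncovered_regions(300, toy_covered)                  # default cut-off of 50
all_gaps = uncovered_regions(300, toy_covered, min_length=30)    # looser cut-off
```

With the default cut-off only the 79-residue gap survives; lowering `min_length` to 30 also returns the two shorter gaps.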
Top ten largest clusters of human regions not covered by Pfam
| Cluster number | Number of regions in the cluster | Region length in amino acids (mean) | Phmmer UniProtKB matches (mean) | Number of phmmer matches with overlaps to Pfam-A families (mean) | Likely annotation |
|---|---|---|---|---|---|
| 1 | 395 | 138 | 27 337 | 4023 | Olfactory receptors |
| 2 | 154 | 121 | 58 170 | 57 777 | Zinc fingers |
| 3 | 104 | 134 | 20 299 | 19 748 | Zinc fingers |
| 4 | 85 | 151 | 17 578 | 14 380 | Collagen repeat |
| 5 | 76 | 127 | 3836 | 3033 | YWTD motifs |
| 6 | 62 | 130 | 2190 | 1958 | Leucine-rich repeats |
| 7 | 56 | 123 | 9662 | 8958 | EGF domains |
| 8 | 46 | 158 | 23 652 | 23 285 | Zinc fingers |
| 9 | 41 | 203 | 701 | 297 | PRY domain |
| 10 | 40 | 178 | 677 | 2 | Cadherin cytoplasmic domains |
Mean number of UniProtKB matches is based on running each region in the cluster against UniProtKB with phmmer. The number of matches with E-value <10⁻³ is collected, and the average is taken over all regions in a cluster. Overlaps with existing Pfam-A families are calculated based on sequences that match simultaneously a cluster member (according to alignment co-ordinates in phmmer output) and a Pfam-A family (according to alignment co-ordinates in Pfam 27.0). ‘Likely annotation’ is assigned based on analysis of overlapping Pfam clans (when a family is not in a clan, it is considered as being in a clan by itself) and on manual inspection of region annotation in UniProtKB, InterPro (18) and Pfam.
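The per-cluster averages described in this note could be computed along the following lines. This is a sketch under stated assumptions: hits are supplied as already-parsed tuples rather than read from phmmer output, an overlap is taken to mean at least one shared residue on the target sequence (the note does not state a minimum), and all names and data are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical hit layout: (target id, E-value, target start, target end).
Hit = Tuple[str, float, int, int]
Interval = Tuple[int, int]

E_VALUE_CUTOFF = 1e-3  # threshold quoted in the table note


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two inclusive intervals share at least one residue (assumed criterion)."""
    return a_start <= b_end and b_start <= a_end


def cluster_summary(region_hits: Dict[str, List[Hit]],
                    pfam_a: Dict[str, List[Interval]]) -> Tuple[float, float]:
    """Return (mean significant matches, mean matches overlapping Pfam-A) per region."""
    match_counts = []
    overlap_counts = []
    for hits in region_hits.values():
        significant = [hit for hit in hits if hit[1] < E_VALUE_CUTOFF]
        match_counts.append(len(significant))
        overlap_counts.append(sum(
            1 for target, _, start, end in significant
            if any(overlaps(start, end, p_start, p_end)
                   for p_start, p_end in pfam_a.get(target, []))
        ))
    return mean(match_counts), mean(overlap_counts)


# Invented cluster with two regions and three target sequences.
toy_hits = {
    "region1": [("T1", 1e-10, 5, 60), ("T2", 1e-5, 100, 150), ("T3", 0.5, 1, 40)],
    "region2": [("T1", 1e-4, 200, 250)],
}
toy_pfam_a = {"T1": [(50, 120)], "T2": [(1, 80)], "T3": [(1, 40)]}
mean_matches, mean_overlaps = cluster_summary(toy_hits, toy_pfam_a)
```

A large gap between the two means, as for cluster 10 in the table (677 versus 2), would suggest regions with many homologues that existing Pfam-A families do not yet reach.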
Percentage of residues predicted to be compositionally biased in Pfam-A families, Pfam-B families and in regions that are not Pfam-A, not Pfam-B, not predicted to be signal peptides and of at least 50 consecutive amino acids in length
| | Pfam-A | Pfam-B | Not (Pfam-A, Pfam-B, signal peptide), ≥50 aa |
|---|---|---|---|
| Coiled-coil | 2.1 | 4.9 | 3.8 |
| Disorder | 9.3 | 42.0 | 38.5 |
| Low complexity | 5.1 | 13.8 | 13.2 |
| Signal peptide | 0.2 | 1.2 | 0 |
| Transmembrane | 6.2 | 1.9 | 2.5 |
Figure 3. (a) Pfam-A residue coverage of consensus predicted human disordered regions as a function of region length and number of predictors considered for consensus. Predicted disordered regions of different lengths are considered. Disorder X stands for regions of at least X consecutive disordered amino acids. Only regions of length X that are predicted in full by N predictors are considered (N = 1, ..., 9; x-axis). For example, 15% of the residues found in regions of 10 consecutive predicted disordered residues by at least five different methods are found in Pfam-A families. (b) Comparison of human residue coverage of predicted disordered regions of length 30 for Pfam and not Pfam. ‘not Pfam*’ represents all regions of length ≥50 residues that are not in Pfam-A, not in Pfam-B and not predicted to be signal peptides.
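The consensus scheme in Figure 3(a) can be sketched as follows. The caption leaves open whether the same N predictors must agree along a whole region; this sketch assumes a simpler per-residue vote, in which a residue counts as disordered when at least N predictors flag it, and then keeps runs of at least X consecutive such residues. Function names, 0-based indexing and the toy predictions are all hypothetical.

```python
from typing import List, Optional, Tuple


def consensus_runs(predictions: List[List[int]],
                   min_predictors: int,
                   min_length: int) -> List[Tuple[int, int]]:
    """Return 0-based inclusive runs where >= min_predictors agree on disorder."""
    n_residues = len(predictions[0])
    votes = [sum(p[i] for p in predictions) for i in range(n_residues)]
    runs: List[Tuple[int, int]] = []
    start: Optional[int] = None
    # A trailing sentinel of -1 closes any run that reaches the last residue.
    for i, vote in enumerate(votes + [-1]):
        if vote >= min_predictors:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    return runs


def pfam_coverage_of_runs(runs: List[Tuple[int, int]],
                          pfam_mask: List[int]) -> Optional[float]:
    """Percentage of residues inside the runs that a family covers (None if no runs)."""
    residues = [i for start, end in runs for i in range(start, end + 1)]
    if not residues:
        return None
    return 100.0 * sum(pfam_mask[i] for i in residues) / len(residues)


# Three invented predictors over a 10-residue protein (1 = predicted disordered).
toy_predictions = [
    [1, 1, 1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
]
toy_pfam_mask = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]  # 1 = residue inside a family match
runs_two_of_three = consensus_runs(toy_predictions, min_predictors=2, min_length=3)
runs_all_three = consensus_runs(toy_predictions, min_predictors=3, min_length=2)
toy_coverage = pfam_coverage_of_runs(runs_two_of_three, toy_pfam_mask)
```

Raising either `min_predictors` or `min_length` shrinks the set of residues considered, which is the trade-off plotted along the x-axis and across the curves of panel (a).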