Jaina Mistry, Edda Kloppmann, Burkhard Rost, Marco Punta.
Abstract
High-resolution structural knowledge is key to understanding how proteins function at the molecular level. The number of entries in the Protein Data Bank (PDB), the repository of all publicly available protein structures, continues to increase, with more than 8000 structures released in 2012 alone. The authors of this article have studied how structural coverage of the protein-sequence space has changed over time by monitoring the number of Pfam families that acquired their first representative structure each year from 1976 to 2012. Twenty years ago, for every 100 new PDB entries released, an estimated 20 Pfam families acquired their first structure. By 2012, this decreased to only about five families per 100 structures. The reasons behind the slower pace at which previously uncharacterized families are being structurally covered were investigated. It was found that although more than 50% of current Pfam families are still without a structural representative, this set is enriched in families that are small, functionally uncharacterized or rich in problem features such as intrinsically disordered and transmembrane regions. While these are important constraints, the reasons why it may not yet be time to give up the pursuit of a targeted but more comprehensive structural coverage of the protein-sequence space are discussed.
Keywords: Pfam families; protein-sequence space; structural coverage
Year: 2013 PMID: 24189229 PMCID: PMC3817691 DOI: 10.1107/S0907444913027157
Source DB: PubMed Journal: Acta Crystallogr D Biol Crystallogr ISSN: 0907-4449
Figure 1. The number of structures (black) and the number of chains (grey) released each year in the PDB, from 1976 to 2012.
Figure 2. The number of structurally covered Pfam families and MCL clusters for each year between 1976 and 2012.
Figure 3. (a) The number of Pfam families (dark blue) and the number of MCL clusters (light blue) that gained their first structural representative each year from 1976 to 2012. pfam_scan was used to map Pfam families to PDB chains. (b) As in (a), but using PDBfam to map Pfam families to PDB chains. (c) The number of newly covered families/clusters (i.e. the sum of the two) per 100 structures released in the PDB each year from 1993 to 2012. Newly covered families/clusters were calculated using pfam_scan (blue) and PDBfam (red).
Figure 4. (a) Proportion of Pfam families that have a structural representative (using pfam_scan). (b) Mean size of Pfam families with and without a structural representative. (c) Proportion of Pfam family members that have a structural representative (pfam_scan). (d) Proportion (%) of Pfam families with and without a structural representative that are domains of unknown function (DUFs). (e) Proportion (%) of residues in the seed alignment of Pfam families, with and without a structural representative, that are predicted to be coiled-coil, disordered and transmembrane residues (see §2).
Figure 5. Proportion (%) of X-ray structures that have been solved using molecular replacement each year from 1976 to 2012.
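The normalization described for Figure 3(c), newly covered families plus newly covered clusters per 100 released structures, can be sketched as follows. This is only an illustration of the arithmetic: the function name `newly_covered_per_100` and the counts passed to it are hypothetical and are not taken from the article or its data.

```python
def newly_covered_per_100(new_families: int, new_clusters: int,
                          structures_released: int) -> float:
    """Sum of newly covered families and clusters per 100 released structures."""
    if structures_released <= 0:
        raise ValueError("structures_released must be positive")
    return 100.0 * (new_families + new_clusters) / structures_released


# Toy counts chosen for illustration only (not values from the article).
rate = newly_covered_per_100(new_families=8, new_clusters=2,
                             structures_released=200)
print(rate)  # 5.0
```

Dividing by the number of structures released in the same year makes years with very different deposition volumes comparable.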
The ten families with the highest number of structures released in 2012 (families from Pfam release 27.0, matches according to pfam_scan; see §2).
Note that if a structure had multiple chains that matched the same family then all of these matching chains counted as one structure in the last column of the table.
| Pfam family accession No. | Pfam clan accession No. | Pfam family description | No. of structures released in 2012 |
|---|---|---|---|
| PF00069 | CL0016 | Protein kinase domain | 510 |
| PF07714 | CL0016 | Protein tyrosine kinase | 505 |
| PF07654 | CL0011 | Immunoglobulin C1-set domain | 227 |
| PF07686 | CL0011 | Immunoglobulin V-set domain | 196 |
| PF13895 | CL0011 | Immunoglobulin domain | 174 |
| PF14531 | CL0016 | Kinase-like | 145 |
| PF08205 | CL0011 | CD80-like C2-set immunoglobulin domain | 140 |
| PF13927 | CL0011 | Immunoglobulin domain | 137 |
| PF00089 | CL0124 | Trypsin | 128 |
| PF00047 | CL0011 | Immunoglobulin domain | 105 |
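
The counting rule stated in the table note (several chains of one structure matching the same family count as a single structure) can be sketched as below. This is a minimal illustration, not the authors' pipeline: the input layout of `(structure_id, chain_id, pfam_accession)` tuples, the function name `structures_per_family` and the placeholder structure identifiers are all assumptions made for the example.

```python
from collections import defaultdict


def structures_per_family(chain_matches):
    """Count distinct structures per Pfam family from chain-level matches.

    chain_matches: iterable of (structure_id, chain_id, pfam_accession).
    """
    structures = defaultdict(set)
    for structure_id, _chain_id, family in chain_matches:
        # A set keeps each structure once, however many chains match.
        structures[family].add(structure_id)
    return {family: len(ids) for family, ids in structures.items()}


# Placeholder identifiers, for illustration only (not real PDB entries).
matches = [
    ("struct_1", "A", "PF00069"),
    ("struct_1", "B", "PF00069"),  # second chain, same structure and family
    ("struct_2", "A", "PF00069"),
    ("struct_2", "A", "PF00089"),
]
counts = structures_per_family(matches)
print(counts)  # {'PF00069': 2, 'PF00089': 1}
```

Counting structures rather than chains keeps multi-chain entries, such as those with several identical copies of a domain, from inflating a family's total.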