Bhanu Rekapalli, Kristin Wuichet, Gregory D Peterson, Igor B Zhulin.
Abstract
BACKGROUND: The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its "dark matter".
Year: 2012 PMID: 23157439 PMCID: PMC3557196 DOI: 10.1186/1471-2164-13-634
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Different levels of computational coverage in protein sequences. Three representative proteins from the human genome are shown. (1) A tyrosine kinase (GI: 307508) is comprehensively covered by five Pfam domains (shown as colored rectangles with their respective names); sequence regions that are less than 50 aa long are shown as grey lines. (2) A hypothetical protein (GI: 341913853) has no matches to any known protein domain or region and is considered part of the “dark matter” (shown as a black line with a question mark above). (3) A leucine-rich repeat-containing protein is only partly characterized, by a match to the LRR_8 (leucine-rich repeat) domain; two large portions of its sequence (90% of total amino acid residues) show no matches to any domain or region and should therefore be considered part of the “dark matter” (black lines with question marks above).
Figure 2. Example of Pfam and CDD coverage of a protein sequence. The RcsC protein sequence from Escherichia coli (GI: 145698285) is covered by four Pfam domain profiles: HisKA, HATPase_c, RcsC and Response_reg. Two transmembrane regions (TM) identified in this sequence by the TMHMM program are shown as grey rectangles. Small (<50 aa) interdomain regions are shown as grey lines. Large (>50 aa) interdomain regions are shown as black lines with a question mark. CDD profiles constructed from the corresponding Pfam and SMART [14] domain models are confirmatory (redundant); the only new information is provided by one additional profile, PRK10841, which covers the entire sequence.
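The distinction the two captions draw between covered regions, small interdomain regions and large unmatched regions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the 1-based inclusive coordinates and the example hits are hypothetical, and a region of exactly 50 aa is treated as large because the captions do not say which side it falls on.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); an assumed convention


def merge_intervals(hits: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent domain/region hits."""
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uncovered_regions(seq_len: int, hits: List[Interval]) -> List[Interval]:
    """Return the stretches of the sequence not covered by any hit."""
    gaps: List[Interval] = []
    pos = 1  # next position not yet accounted for
    for start, end in merge_intervals(hits):
        if start > pos:
            gaps.append((pos, start - 1))
        pos = max(pos, end + 1)
    if pos <= seq_len:
        gaps.append((pos, seq_len))
    return gaps


def classify_gaps(seq_len: int, hits: List[Interval], min_dark_len: int = 50):
    """Split uncovered stretches into small gaps and candidate 'dark matter'.

    min_dark_len=50 follows the 50 aa cutoff in the captions; treating a gap
    of exactly 50 aa as large is an assumption.
    """
    small: List[Interval] = []
    dark: List[Interval] = []
    for start, end in uncovered_regions(seq_len, hits):
        length = end - start + 1
        (dark if length >= min_dark_len else small).append((start, end))
    return small, dark


# Hypothetical 400 aa protein with three hits, two of which overlap.
small, dark = classify_gaps(400, [(30, 120), (100, 180), (260, 380)])
print("small gaps:", small)
print("dark matter:", dark)
```

With these made-up coordinates, the two terminal stretches fall below the cutoff and the 79 aa stretch between the second and third hits is reported as a candidate dark-matter region.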
Computational coverage of the protein sequence space
| Space | Unit | Total | LC | TM | CC | SP | LC + TM + CC + SP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Total sequence space | aa | 5.64E+09 | 4.14E+08 | 3.74E+08 | 6.78E+07 | 5.43E+07 | 9.10E+08 |
| | % | 100 | 7.3 | 6.6 | 1.2 | 1.0 | 16.1 |
| Domain space | aa | 2.90E+09 | 2.72E+08 | 1.20E+08 | 4.65E+07 | 4.62E+07 | 4.84E+08 |
| | % | 51.4 | 9.4 (b) | 4.1 (b) | 1.6 (b) | 1.6 (b) | 16.7 (b) |
(a) Data for nr, December 2011, are shown. Abbreviations: LC, regions of low complexity; TM, transmembrane regions; CC, coiled coils; SP, signal peptides.
(b) Shown as a percentage relative to the domain space, which is 51.4% of the total sequence space.
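The percentages in the table can be recomputed from the amino acid counts, which also makes the meaning of footnote b concrete: feature percentages in the domain-space row are taken relative to the domain space itself, not to the total sequence space. The sketch below assumes that the four feature columns follow the order given in the abbreviations footnote (LC, TM, CC, SP) and that the last column is their sum; the variable and function names are illustrative only.

```python
# Amino acid counts copied from the table; the LC/TM/CC/SP column order is
# an assumption based on the order of the abbreviations in the footnote.
total_space = {"all": 5.64e9, "LC": 4.14e8, "TM": 3.74e8, "CC": 6.78e7, "SP": 5.43e7}
domain_space = {"all": 2.90e9, "LC": 2.72e8, "TM": 1.20e8, "CC": 4.65e7, "SP": 4.62e7}

FEATURES = ("LC", "TM", "CC", "SP")


def percent(part: float, whole: float) -> float:
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


def feature_percentages(space: dict) -> dict:
    """Percentage of a space occupied by each feature and by all four combined."""
    result = {name: percent(space[name], space["all"]) for name in FEATURES}
    result["sum"] = percent(sum(space[name] for name in FEATURES), space["all"])
    return result


# Share of the total sequence space covered by domain models.
print("domain coverage:", percent(domain_space["all"], total_space["all"]))

# Feature shares relative to the total space, then relative to the domain space.
print("total space:", feature_percentages(total_space))
print("domain space:", feature_percentages(domain_space))
```

The domain-space sum computed from the rounded counts comes to about 4.85E+08 aa rather than the tabulated 4.84E+08, presumably because the published figure was derived from unrounded counts; the rounded percentages are unaffected.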
Figure 3. Computational domain coverage of the protein sequence space from 2009 to 2011. From April 2009 to December 2011, the NR database doubled in size, from 2.8 to 5.6 billion aa. The three Pfam releases represent both model improvements and an increase in the number of domain models (shown in parentheses).