Chuming Chen, Darren A Natale, Robert D Finn, Hongzhan Huang, Jian Zhang, Cathy H Wu, Raja Mazumder.
Abstract
The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at co-membership thresholds (CMTs) of 75%, 55%, 35% and 15% are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT is also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization.
Year: 2011 PMID: 21556138 PMCID: PMC3083393 DOI: 10.1371/journal.pone.0018910
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Flow chart of the method used to select Representative Proteomes. For details, please see the Materials and Methods section.
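The grouping idea summarized in the abstract (proteomes are grouped by co-membership in UniRef50 clusters, and one representative is chosen per group) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the co-membership formula (shared clusters as a percentage of the smaller proteome's clusters), the single-linkage grouping, the "most clusters" selection rule, and the toy proteome and cluster names. The Materials and Methods section remains the authoritative description.

```python
from itertools import combinations

def co_membership(clusters_a, clusters_b):
    """Shared clusters as a percentage of the smaller proteome's clusters.
    ASSUMED definition for illustration, not the paper's formula."""
    if not clusters_a or not clusters_b:
        return 0.0
    shared = len(clusters_a & clusters_b)
    return 100.0 * shared / min(len(clusters_a), len(clusters_b))

def group_proteomes(proteomes, cmt):
    """Single-linkage grouping (an assumption): two proteomes whose
    co-membership is >= cmt end up in the same group."""
    parent = {name: name for name in proteomes}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(proteomes, 2):
        if co_membership(proteomes[a], proteomes[b]) >= cmt:
            parent[find(a)] = find(b)

    groups = {}
    for name in proteomes:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())

def pick_representative(group, proteomes):
    """Choose the member with the most clusters (hypothetical criterion)."""
    return max(sorted(group), key=lambda name: len(proteomes[name]))

# Toy data: proteome name -> set of (hypothetical) UniRef50 cluster IDs.
toy = {
    "E_coli_K12": {"c1", "c2", "c3", "c4"},
    "E_coli_O157": {"c1", "c2", "c3", "c5", "c6"},
    "B_subtilis": {"c7", "c8", "c9"},
}

groups_55 = group_proteomes(toy, 55)
representatives_55 = [pick_representative(g, toy) for g in groups_55]
```

With the toy data, the two E. coli proteomes share three of the smaller proteome's four clusters, so they fall into one group at a 55% threshold, while a stricter threshold above 75% would keep all three proteomes separate. This mirrors how raising the CMT increases the number of groups.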
Representative Proteomes computed at different thresholds.
| Threshold (CMT, %) | # RPGs | % reduction in # proteomes | % reduction in # sequences | % species in multiple RPGs | % RPGs with proteomes from multiple genera | RP sequence coverage (%) | RP UniRef50 coverage (%) |
| 15 | 278 | 75.6993 | 53.5041 | 0.3713 | 31.6547 | 25.1806 | 51.2873 |
| 35 | 499 | 56.3811 | 45.5558 | 0.8663 | 6.6132 | 43.3308 | 75.8166 |
| 55 | 637 | 44.3182 | 37.9064 | 1.3614 | 1.2559 | 56.8638 | 88.2188 |
| 75 | 763 | 33.3042 | 30.3225 | 3.3416 | 0.3932 | 67.1059 | 93.1352 |
Based on UniProt release 2010_09; # of organisms: 1144; # of species: 808; # of genera: 453; # of sequences: 4335476; # of UniRef50 clusters: 1566987.
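The "% reduction in # proteomes" column can be reproduced from the number of RPGs and the organism count in the table note. The short sketch below assumes that each RPG contributes exactly one RP and that the 1144 organisms are the full set of proteomes being reduced; the function and variable names are illustrative only.

```python
# Counts taken from the table and its note.
N_PROTEOMES = 1144
RPG_COUNTS = {15: 278, 35: 499, 55: 637, 75: 763}

def percent_reduction(original, reduced):
    """Percentage by which `reduced` is smaller than `original`."""
    return 100.0 * (1 - reduced / original)

# One RP per RPG is assumed, so the RPG count is the reduced proteome count.
reductions = {
    cmt: round(percent_reduction(N_PROTEOMES, n_rpg), 4)
    for cmt, n_rpg in RPG_COUNTS.items()
}

for cmt, value in reductions.items():
    print(f"RP{cmt}: {value:.4f}% fewer proteomes")
```

Under these assumptions the computed values agree with the tabulated ones to four decimal places, for example 44.3182% at the 55% threshold.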
Figure 2. Stability and characteristics of RP55.
RPGs and RPs were determined for previous releases of UniProtKB. Histograms show the growth in the number of RPs relative to the number of complete proteomes. The percentage of species with strains found in multiple RPGs is given by the green line, while the percentages of RPGs and RPs that remained unchanged between the indicated release and the 2010_09 release are given by the orange and blue lines, respectively.
Figure 3. Sequence similarity searches against Representative Proteome sets.
3a) Time required to perform phmmer searches on 1000 randomly chosen UniParc sequences against RP15 (purple), RP35 (orange), RP55 (blue) and RP75 (red) or UniParc (green solid lines). The subset of sequences with no Representative Proteome (RP) hits was searched against the whole of UniParc and the two search times were summed (broken lines). 3b) Taxonomic breakdown of the subset of sequences without an RP hit.
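The two-stage strategy described in the caption (search the RP set first, then search the full database only for queries with no RP hit) can be sketched as follows. The `search_rp` and `search_full` callables are hypothetical stand-ins for real searches such as phmmer runs; here they are simple dictionary lookups over toy data so that the example is self-contained.

```python
def two_stage_search(queries, search_rp, search_full):
    """Search the reduced set first; fall back to the full database
    only for queries that return no hits in the first stage."""
    results = {}
    fallback = []
    for query in queries:
        hits = search_rp(query)
        if hits:
            results[query] = ("RP", hits)
        else:
            fallback.append(query)
    for query in fallback:
        results[query] = ("full", search_full(query))
    return results, fallback

# Toy stand-ins for the two databases (hypothetical data, not real hits).
rp_hits = {"q1": ["rp_seq_a"], "q2": []}
full_hits = {"q1": ["rp_seq_a", "other_seq"], "q2": ["rare_seq"]}

results, fallback = two_stage_search(
    ["q1", "q2"],
    search_rp=lambda q: rp_hits.get(q, []),
    search_full=lambda q: full_hits.get(q, []),
)
```

The total cost of this scheme is the first-stage time for all queries plus the full-database time for the fallback subset only, which corresponds to the summed times shown as broken lines in the figure.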
Assessing Representative Proteomes in different ways (phmmer score and coverage of query sequence in terms of amino acid overlap).
| Category | RP15 | RP35 | RP55 | RP75 |
| RP = UniParc | 100 | 147 | 182 | 215 |
| RP good enough | 444 | 468 | 459 | 433 |
| Full search only | 398 | 327 | 301 | 294 |
| | | | | |
| RP = UniParc | 100 | 147 | 182 | 215 |
| RP good enough | 544 | 539 | 526 | 499 |
| Full search only | 298 | 256 | 234 | 228 |
Figure 4. Browsing the Representative Proteome Groups (RPGs) and Representative Proteomes (RPs) at different thresholds.