Emanuele Alpi, Johannes Griss, Alan Wilter Sousa da Silva, Benoit Bely, Ricardo Antunes, Hermann Zellner, Daniel Ríos, Claire O'Donovan, Juan Antonio Vizcaíno, Maria J Martin.
Abstract
In this article, we provide a comprehensive study of the content of the Universal Protein Resource (UniProt) protein data sets for human and mouse. The tryptic search spaces of the UniProtKB (UniProt Knowledgebase) complete proteome sets were compared with other data sets from UniProtKB and with the corresponding International Protein Index (IPI), Reference Sequence (RefSeq), Ensembl, and UniRef100 (where UniRef is UniProt Reference Clusters) organism-specific data sets. All protein forms annotated in UniProtKB (both the canonical sequences and isoforms) were evaluated in this study. In addition, natural and disease-associated amino acid variants annotated in UniProtKB were included in the evaluation. The peptide unicity was also evaluated for each data set. Furthermore, the peptide information in the UniProtKB data sets was compared against the available peptide-level identifications in the main MS-based proteomics repositories. The peptides observed in these repositories are an important source of information for protein databases because they provide supporting evidence for the existence of otherwise predicted proteins. Likewise, the repositories could use the information available in UniProtKB to direct reprocessing efforts on specific sets of peptides/proteins of interest. In summary, we provide comprehensive information about the different organism-specific sequence data sets available from UniProt, together with the pros and cons of each, in terms of search space for MS-based bottom-up proteomics workflows. The aim of the analysis is to provide a clear view of the tryptic search space of UniProt and other protein databases so that scientists can select the data sets most appropriate for their purposes. The Authors. PROTEOMICS Published by Wiley-VCH Verlag GmbH & Co. KGaA.
Keywords: Bioinformatics; Protein isoforms; Sequence redundancy; Trypsin digestion; Variation
Year: 2014 PMID: 25307260 PMCID: PMC4298651 DOI: 10.1002/pmic.201400227
Source DB: PubMed Journal: Proteomics ISSN: 1615-9853 Impact factor: 3.984
Details of the protein data sets used in this study
| DB | Release | Set specification | Abbreviation |
|---|---|---|---|
| UniProtKB | 2012_10 | UniProtKB/Swiss-Prot sequences (canonical and isoforms) | SPI |
| UniProtKB | 2012_10 | UniProtKB/Swiss-Prot sequences (canonical and isoforms) disease-related variant expanded | SPID |
| UniProtKB | 2012_10 | UniProtKB/Swiss-Prot sequences (canonical only, without isoforms) | SP |
| UniProtKB | 2012_10 | UniProtKB/Swiss-Prot sequences (canonical and isoforms) variant expanded | SPIV |
| UniProtKB | 2012_10 | UniProtKB complete proteome set sequences (UniProtKB/Swiss-Prot canonical and isoforms plus UniProtKB/TrEMBL, all with KW-0181). The keyword KW-0181 refers to complete proteomes ( | CPI |
| UniProtKB | 2012_10 | UniProtKB complete proteome set sequences (UniProtKB/Swiss-Prot canonical and isoforms plus UniProtKB/TrEMBL, all with KW-0181) disease-related variant expanded | CPID |
| UniProtKB | 2012_10 | UniProtKB complete proteome set sequences (UniProtKB/Swiss-Prot canonical only, without isoforms plus UniProtKB/TrEMBL, all with KW-0181) | CP |
| UniProtKB | 2012_10 | UniProtKB complete proteome set sequences (UniProtKB/Swiss-Prot canonical and isoforms plus UniProtKB/TrEMBL, all with KW-0181) variant expanded | CPIV |
| UniProtKB | 2012_10 | UniProtKB whole (UniProtKB/Swiss-Prot canonical and isoforms plus UniProtKB/TrEMBL sequences); this DB is equivalent to the merging of SPI + TR | UPI |
| UniProtKB | 2012_10 | UniProtKB whole (UniProtKB/Swiss-Prot canonical and isoforms plus UniProtKB/TrEMBL sequences) disease-related variant expanded; this DB is equivalent to the merging of SPID + TR | UPID |
| UniProtKB | 2012_10 | UniProtKB whole (UniProtKB/Swiss-Prot canonical only, without isoforms plus UniProtKB/TrEMBL sequences); this DB is equivalent to the merging of SP + TR | UP |
| UniProtKB | 2012_10 | UniProtKB whole (UniProtKB/Swiss-Prot canonical and isoforms plus UniProtKB/TrEMBL sequences) variant expanded; this DB is equivalent to the merging of SPIV + TR | UPIV |
| UniProtKB | 2012_10 | UniProtKB/TrEMBL sequences | TR |
| UniRef100 | 2012_10 | UniRef100 clustered sequences from CPI | CPIR |
| UniRef100 | 2012_10 | UniRef100 clustered sequences from CPID | CPIDR |
| UniRef100 | 2012_10 | UniRef100 clustered sequences from CPIV | CPIVR |
| UniRef100 | 2012_10 | UniRef100 sequences from UPI | UPIR |
| UniRef100 | 2012_10 | UniRef100 sequences from UPID | UPIDR |
| UniRef100 | 2012_10 | UniRef100 sequences from UPIV | UPIVR |
The relationships among the UniProtKB data sets are as follows: SP
Pairwise comparisons of UniProt data sets tryptic search spaces for human and mouse
| DB | Peptides | DB | Peptides | DB | Peptides | DB | Peptides | DB | Peptides |
|---|---|---|---|---|---|---|---|---|---|
| SPI | 25 972 (4.7) | SPI | 0 (0) | SPI | 3 (0.00) | SP | 3 (0.00) | SPIV | 61 477 (0.9) |
| SP | 0 (0) | SPIV | 61 830 (1.0) | CPI | 73 687 (1.0) | CPI | 99 659 (2.0) | CPI | 73 331 (0.9) |
| Com. | 596 385 (27.9) | Com. | 622 357 (26.9) | Com. | 622 354 (26.9) | Com. | 596 382 (27.9) | Com. | 622 710 (26.9) |
| CP | 0 (0) | CPIR | 0 (0) | CPIV | 61 474 (0.9) | CPIVR | 0 (0) | UPI | 85 453 (1.2) |
| CPI | 19 012 (3.2) | CPI | 5777 (1.4) | CPI | 0 (0) | CPIV | 9494 (1.4) | CPI | 0 (0) |
| Com. | 677 029 (24.8) | Com. | 690 264 (24.4) | Com. | 696 041 (24.2) | Com. | 748 021 (22.5) | Com. | 696 041 (24.2) |
| UPI | 81 566 (1.0) | UPI | 16 554 (2.5) | UPI | 0 (0) | UP | 85 453 (1.2) | UPIR | 0 (0) |
| CPIV | 57 587 (0.6) | UP | 0 (0) | UPIV | 57 587 (0.6) | CPI | 16 554 (2.5) | UPI | 16 243 (1.5) |
| Com. | 699 928 (24.1) | Com. | 764 940 (22.1) | Com. | 781 494 (21.6) | Com. | 679 487 (24.7) | Com. | 765 251 (22.1) |
| UPIV | 143 040 (1.0) | UPIVR | 0 (0) | TR | 159 137 (1.1) | TR | 154 894 (1.0) | SPI | 0 (0) |
| CPI | 0 (0) | UPIV | 13 859 (1.7) | SPI | 181 615 (17.7) | SPIV | 239 202 (13.6) | SPID | 22 531 (0.4) |
| Com. | 696 041 (24.2) | Com. | 825 222 (20.5) | Com. | 440 742 (30.7) | Com. | 444 985 (30.5) | Com. | 622 357 (26.9) |
| SPID | 22 523 (0.4) | CPID | 22 520 (0.4) | CPIDR | 0 (0) | UPI | 85 271 (1.2) | UPI | 0 (0) |
| CPI | 73 676 (1.0) | CPI | 0 (0) | CPID | 9584 (1.4) | CPID | 22 338 (0.4) | UPID | 22 338 (0.4) |
| Com. | 622 365 (26.9) | Com. | 696 041 (24.2) | Com. | 708 977 (23.7) | Com. | 696 223 (24.2) | Com. | 781 494 (21.6) |
| UPID | 107 791 (1.1) | UPIDR | 0 (0) |  |  |  |  |  |  |
| CPI | 0 (0) | UPID | 13 823 (1.6) |  |  |  |  |  |  |
| Com. | 696 041 (24.2) | Com. | 790 009 (21.4) |  |  |  |  |  |  |
| SPI | 11 533 (5.2) | SPI | 0 (0) | SPI | 0 (0) | SP | 0 (0) | SPIV | 854 (1.0) |
| SP | 0 (0) | SPIV | 953 (2.4) | CPI | 136 411 (6.3) | CPI | 147 944 (6.2) | CPI | 136 312 (6.3) |
| Com. | 502 478 (20.1) | Com. | 514 011 (19.8) | Com. | 514 011 (19.8) | Com. | 502 478 (20.1) | Com. | 514 110 (19.8) |
| CP | 0 (0) | CPIR | 0 (0) | CPIV | 854 (1.0) | CPIVR | 0 (0) | UPI | 58 057 (2.0) |
| CPI | 9222 (4.3) | CPI | 2268 (2.9) | CPI | 0 (0) | CPIV | 3457 (2.5) | CPI | 0 (0) |
| Com. | 641 200 (17.1) | Com. | 648 154 (19.7) | Com. | 650 422 (16.9) | Com. | 647 819 (17.0) | Com. | 650 422 (16.9) |
| UPI | 57 925 (2.0) | UPI | 7998 (3.4) | UPI | 0 (0) | UP | 58 057 (2.0) | UPIR | 0 (0) |
| CPIV | 722 (1.0) | UP | 0 (0) | UPIV | 722 (0.7) | CPI | 7998 (3.4) | UPI | 9263 (1.9) |
| Com. | 650 554 (16.9) | Com. | 700 481 (15.8) | Com. | 708 479 (15.7) | Com. | 642 424 (17.1) | Com. | 699 216 (15.9) |
| UPIV | 58 779 (2.0) | UPIVR | 0 (0) | TR | 194 468 (5.0) | TR | 194 237 (5.0) |  |  |
| CPI | 0 (0) | UPIV | 7691 (1.9) | SPI | 170 033 (15.6) | SPIV | 170 755 (15.6) |  |  |
| Com. | 650 422 (16.9) | Com. | 701 510 (15.8) | Com. | 343 978 (21.8) | Com. | 344 209 (21.8) |  |  |
Each pairwise comparison occupies three consecutive rows within one pair of DB/Peptides columns and ends with a “Com.” row; the two data sets (DB) being compared are indicated next to the numbers of peptides unique to each of them. “Peptides” indicates the number of tryptic peptides in each of the three compartments of a comparison (I, II, and III in Supporting Information Fig. 2). “Com.” indicates the number of tryptic peptides shared by both data sets in each pairwise comparison. The percentage of peptides found in MS proteomics repositories is reported in brackets for each of the three compartments. Mouse comparisons, highlighted with a light gray background in the original table, are the rows from the second SPI versus SP comparison onward.
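The three compartments described above (peptides unique to the first data set, peptides unique to the second, and peptides common to both) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the cleavage rule (after K or R, but not before P), the minimum peptide length of 6, the absence of missed cleavages, and the toy sequences and accessions are all assumptions chosen for illustration.

```python
import re

def tryptic_peptides(sequence, min_len=6):
    """In silico digest of one protein sequence (assumed rule: cut after K/R, not before P)."""
    # Zero-width split after K or R unless the next residue is P; drop empty pieces.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    # Keep only fully cleaved peptides of at least min_len residues (assumed cutoff).
    return {f for f in fragments if len(f) >= min_len}

def search_space(proteins, min_len=6):
    """Union of tryptic peptides over all sequences of a data set (dict: accession -> sequence)."""
    space = set()
    for seq in proteins.values():
        space |= tryptic_peptides(seq, min_len)
    return space

def compare(space_a, space_b):
    """Return the three compartments of a pairwise comparison."""
    return {
        "only_a": space_a - space_b,   # peptides unique to data set A
        "only_b": space_b - space_a,   # peptides unique to data set B
        "common": space_a & space_b,   # the "Com." row
    }

# Hypothetical toy data: a canonical-only set and the same set plus one isoform.
canonical_only = {"P1": "MKTAYIAKQRPLSR"}
with_isoforms = {"P1": "MKTAYIAKQRPLSR", "P1-2": "MKTAYIAKGGSLLEDK"}

result = compare(search_space(canonical_only), search_space(with_isoforms))
print({name: len(peps) for name, peps in result.items()})
# {'only_a': 0, 'only_b': 1, 'common': 2}
```

In this toy case the canonical-only set contributes no unique peptides, which mirrors the pattern of the SPI versus SP comparison in the table, where SP has 0 unique peptides.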
Pairwise comparisons of UniProt CPI data sets tryptic search spaces versus other data sets
| Organism | DB | Peptides | DB | Peptides | DB | Peptides |
|---|---|---|---|---|---|---|
| Human | Ensembl | 3638 (5.6) | IPI | 78 490 (1.1) | RefSeq | 9023 (1.7) |
|  | CPI | 19 479 (4.2) | CPI | 15 437 (0.9) | CPI | 95 201 (1.7) |
|  | Com. | 676 562 (24.7) | Com. | 680 604 (24.7) | Com. | 600 840 (27.7) |
| Mouse | Ensembl | 3072 (7.6) | IPI | 53 994 (2.5) | RefSeq | 19 927 (1.4) |
|  | CPI | 13 140 (4.5) | CPI | 6943 (1.7) | CPI | 51 481 (3.3) |
|  | Com. | 637 282 (17.2) | Com. | 643 479 (17.1) | Com. | 598 941 (18.1) |
Each pairwise comparison occupies three consecutive rows within one pair of DB/Peptides columns and ends with a “Com.” row; the two data sets (DB) being compared are indicated next to the numbers of peptides unique to each of them. “Peptides” indicates the number of tryptic peptides in each of the three compartments of a comparison (I, II, and III in Supporting Information Fig. 2). “Com.” indicates the number of tryptic peptides shared by both data sets in each pairwise comparison. The percentage of peptides found in MS proteomics repositories is reported in brackets for each of the three compartments.
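As a quick consistency check on how the compartments relate, the CPI-only count and the shared count of each comparison should add up to the same CPI total, whichever data set $X$ it is compared with. Using only the figures in the table above:

$$
|\mathrm{CPI}| = |\mathrm{CPI} \setminus X| + |\mathrm{CPI} \cap X|
$$

$$
19\,479 + 676\,562 \;=\; 15\,437 + 680\,604 \;=\; 95\,201 + 600\,840 \;=\; 696\,041
$$

for the first block of three rows, and

$$
13\,140 + 637\,282 \;=\; 6943 + 643\,479 \;=\; 51\,481 + 598\,941 \;=\; 650\,422
$$

for the second block.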
Pairwise comparisons of human and mouse UniProt UPI data sets tryptic search spaces versus other data sets
| Organism | DB | Peptides | DB | Peptides | DB | Peptides |
|---|---|---|---|---|---|---|
| Human | Ensembl | 2605 (3.0) | IPI | 31 888 (0.8) | RefSeq | 6823 (0.8) |
|  | UPI | 103 899 (1.7) | UPI | 54 288 (1.1) | UPI | 178 454 (1.5) |
|  | Com. | 677 595 (24.7) | Com. | 727 206 (23.2) | Com. | 603 040 (27.6) |
| Mouse | Ensembl | 2196 (5.9) | IPI | 19 000 (1.5) | RefSeq | 16 268 (1.0) |
|  | UPI | 70 321 (2.4) | UPI | 30 006 (0.8) | UPI | 105 879 (2.6) |
|  | Com. | 638 158 (17.2) | Com. | 678 473 (16.4) | Com. | 602 600 (18.0) |
Each pairwise comparison occupies three consecutive rows within one pair of DB/Peptides columns and ends with a “Com.” row; the two data sets (DB) being compared are indicated next to the numbers of peptides unique to each of them. “Peptides” indicates the number of tryptic peptides in each of the three compartments of a comparison (I, II, and III in Supporting Information Fig. 2). “Com.” indicates the number of tryptic peptides shared by both data sets in each pairwise comparison. The percentage of peptides found in MS proteomics repositories is reported in brackets for each of the three compartments.
Peptide unicity table for the other UniProt human and mouse data sets
| Organism | CPID | CP | CPIV | CPIR | CPIDR | CPIVR | SPI | SPID | SP |
|---|---|---|---|---|---|---|---|---|---|
| Human | 718 561 (33.5) | 677 029 (52.5) | 757 515 (21.6) | 690 264 (40.6) | 708 977 (40.8) | 748 021 (23.0) | 622 357 (53.5) | 644 888 (47.5) | 596 385 (96.5) |
| Mouse |  | 641 200 (62.2) | 651 276 (49.6) | 648 154 (53.5) |  | 647 819 (53.5) | 514 011 (67.7) |  | 502 478 (97.6) |

| Organism | SPIV | UPI | UPID | UP | UPIV | UPIR | UPIDR | UPIVR | TR |
|---|---|---|---|---|---|---|---|---|---|
| Human | 684 187 (19.8) | 781 494 (33.0) | 803 832 (32.2) | 764 940 (39.5) | 839 081 (24.8) | 765 251 (37.0) | 790 009 (35.8) | 825 222 (25.9) | 599 879 (44.5) |
| Mouse | 514 964 (66.2) | 708 479 (33.0) |  | 700 481 (39.3) | 709 201 (32.8) | 699 216 (42.0) |  | 701 510 (41.6) | 538 446 (46.9) |
For each organism and each data set, the total number of tryptic peptides is reported together with the percentage of unique peptides in brackets.
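How such a unicity figure could be computed is sketched below in Python. The sketch assumes that a peptide counts as "unique" when it occurs in exactly one sequence entry of the data set; the source does not spell out the definition here, so this is an assumption, as are the cleavage rule, the minimum length of 6, and the toy accessions and sequences.

```python
import re
from collections import defaultdict

def digest(sequence, min_len=6):
    """Fully cleaved tryptic peptides (assumed rule: cut after K/R, not before P)."""
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    return {f for f in fragments if len(f) >= min_len}

def peptide_unicity(proteins, min_len=6):
    """Return (total distinct peptides, percentage occurring in exactly one entry)."""
    owners = defaultdict(set)  # peptide -> accessions of the entries that contain it
    for accession, seq in proteins.items():
        for peptide in digest(seq, min_len):
            owners[peptide].add(accession)
    total = len(owners)
    unique = sum(1 for accessions in owners.values() if len(accessions) == 1)
    percent = 100.0 * unique / total if total else 0.0
    return total, percent

# Hypothetical toy data: adding an isoform makes a shared peptide non-unique.
canonical_only = {"P1": "MKTAYIAKQRPLSR"}
with_isoforms = {"P1": "MKTAYIAKQRPLSR", "P1-2": "MKTAYIAKGGSLLEDK"}

print(peptide_unicity(canonical_only))   # (2, 100.0)
total, percent = peptide_unicity(with_isoforms)
print(total, round(percent, 1))          # 3 66.7
```

Under this assumed definition, adding isoforms lowers the unique fraction because isoforms share most of their peptides with the canonical sequence, which is consistent with the drop from SP to SPI in the table.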