Johannes Griss, Richard G Côté, Christopher Gerner, Henning Hermjakob, Juan Antonio Vizcaíno.
Abstract
In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data, we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database as of November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called the logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier. Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnology Information nr database (NCBI nr), and Ensembl) with respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries deleted compared with the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB: two releases per year were used, starting from 2005.
This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.
Year: 2011 PMID: 21700957 PMCID: PMC3186200 DOI: 10.1074/mcp.M111.008490
Source DB: PubMed Journal: Mol Cell Proteomics ISSN: 1535-9476 Impact factor: 5.911
Fig. 1. Flowchart representing the protein mapping algorithm(s). All public experiments available in PRIDE were used as the initial test data set. This data set was then manually curated before the protein identifier mappings were performed. The box represents the protein mapping algorithms and depicts the different steps of the mapping process.
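The two mapping approaches can be illustrated with a minimal sketch. It is not the authors' implementation: the `history` and `current_db` dictionaries, the accessions, and the classification rules are assumptions made for illustration, standing in for the source-database lookups (logical mapping) and the 100% sequence-identity matching (PICR) described in the abstract. The identity-based function does not attempt to detect demerged entries.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    CHANGED = "changed"
    DEMERGED = "demerged"
    DELETED = "deleted"


def logical_status(accession, history):
    """Classify a submitted accession from a hypothetical history table.

    `history` maps an accession to the accessions that currently represent
    it: [itself] if unchanged, one other accession if replaced, several if
    the entry was split, and an empty list if it was removed.
    """
    successors = history.get(accession, [])
    if not successors:
        return Status.DELETED
    if successors == [accession]:
        return Status.ACTIVE
    if len(successors) == 1:
        return Status.CHANGED
    return Status.DEMERGED


def identity_status(accession, submitted_seq, current_db):
    """Classify by exact sequence identity against current entries.

    `current_db` maps current accessions to sequences (toy data here).
    """
    matches = [acc for acc, seq in current_db.items() if seq == submitted_seq]
    if accession in matches:
        return Status.ACTIVE
    if matches:
        return Status.CHANGED
    return Status.DELETED


# Made-up accessions and sequences, for illustration only.
history = {
    "X00001": ["X00001"],
    "X00002": ["X00009"],
    "X00003": ["X00010", "X00011"],
    "X00004": [],
}
current_db = {"X00001": "MKTAYIAK", "X00009": "MSSQLLVK"}

for acc in history:
    print(acc, logical_status(acc, history).value)
print(identity_status("X00002", "MSSQLLVK", current_db).value)
```

The two functions can disagree for the same identifier, for example when an accession is still listed but its sequence has been edited, which is one plausible reason for the differences between the two algorithms reported here.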
Searched database release date curation
| Status | Number of Experiments |
|---|---|
| Successfully mapped | 6956 |
| No database version given in publication | 1349 |
| No publication found | 102 |
| Data not public | 15 |
| Publication not accessible | 2 |
| Custom database used | 76 |
Protein identifier and searched database mapping
| Status | Number of Identifications |
|---|---|
| Successfully Mapped | 1,307,505 |
| Decoy Database Entries | 594 |
| Exotic Search Database | 93,334 |
| Invalid Protein Identifiers | 960 |
Mapped protein identifiers per searched database and mapping algorithm. “PRIDE PICR Mapped Ident.” refers to the mapping service as it is performed in the PRIDE database (see main text).
| Search Database | PICR Service Mapped Ident. | Logical Mapped Ident. | PRIDE PICR Mapped Ident. | Total Number of Ident. |
|---|---|---|---|---|
| ENSEMBL | 73,559 (100%) | – | 67,057 (91.2%) | 73,559 |
| IPI | 777,848 (100%) | 777,848 (100%) | 609,345 (78.3%) | 777,848 |
| NCBI gi | – | 54,225 (97.8%) | – | 55,423 |
| TAIR | 137,958 (100%) | – | 114,410 (82.9%) | 137,958 |
| UniProtKB | 253,658 (96.6%) | 253,658 (96.6%) | 211,111 (80.4%) | 262,717 |
Total number of identifications for every status and database per mapping algorithm
| Database | Active (Logical) | Active (PICR) | Deleted (Logical) | Deleted (PICR) | Changed (Logical) | Changed (PICR) | Demerged (Logical) | Demerged (PICR) |
|---|---|---|---|---|---|---|---|---|
| ENSEMBL | – | 49,597 (67.4%) | – | 12,695 (17.3%) | – | 8,060 (11.0%) | – | – |
| IPI | 605,376 (77.8%) | 508,926 (65.4%) | 78,344 (10.1%) | 233,787 (30.1%) | 94,128 (12.1%) | 35,135 (4.5%) | – | – |
| NCBI gi | 41,945 (77.4%) | – | 4,512 (8.3%) | – | 7,768 (14.3%) | – | – | – |
| UniProtKB | 238,418 (94.0%) | 236,613 (93.3%) | 3,149 (1.2%) | 2,333 (0.9%) | 10,427 (4.1%) | 13,400 (5.3%) | 1,664 (0.7%) | 1,312 (0.5%) |
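As a worked check of how the percentages relate to the counts, the following sketch recomputes the logical-mapping shares for IPI and UniProtKB. The counts are copied from the table above. Treating a "–" cell as zero is an assumption made here, and the function name is illustrative.

```python
# Logical-mapping counts per status, copied from the table above.
# A "–" cell is treated as 0 (an assumption of this sketch).
logical_counts = {
    "IPI": {"active": 605_376, "deleted": 78_344, "changed": 94_128, "demerged": 0},
    "UniProtKB": {"active": 238_418, "deleted": 3_149, "changed": 10_427, "demerged": 1_664},
}

# Numbers of identifications mapped by the logical algorithm.
mapped_totals = {"IPI": 777_848, "UniProtKB": 253_658}


def status_shares(counts):
    """Return each status as a percentage of all classified identifications."""
    total = sum(counts.values())
    return {status: round(100 * n / total, 1) for status, n in counts.items()}


for database, counts in logical_counts.items():
    # The status counts should add up to the number of mapped identifications.
    assert sum(counts.values()) == mapped_totals[database]
    print(database, status_shares(counts))
```

For both databases the status counts add up exactly to the number of mapped identifications, so the percentages are shares of the mapped identifications rather than of all submitted ones.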
Fig. 2. The combined protein identifier mapping result for all species. Some of the outliers were caused by very small submissions consisting of only a limited number of identifications. The number of identifications available from certain database release dates was normalized. Thus, the graph represents the proportion of active, replaced, deleted, and demerged identifications for specific release dates and databases. Different scales were used for the different searched databases to provide a more detailed view of the data.
Fig. 3. The combined protein identifier mapping result for human data only. For a detailed description see Fig. 2. As in Fig. 2, different scales were used for the different searched databases to provide a more detailed view of the data.
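The normalization described in the caption can be sketched as a per-release proportion. This is a minimal illustration on made-up records. The `(release_date, status)` record layout and the function name are assumptions, not the authors' data model.

```python
from collections import Counter, defaultdict


def status_proportions(records):
    """Turn (release_date, status) pairs into per-release status proportions."""
    per_release = defaultdict(Counter)
    for release, status in records:
        per_release[release][status] += 1
    return {
        release: {status: n / sum(counts.values()) for status, n in counts.items()}
        for release, counts in per_release.items()
    }


# Toy records: a release with four identifications and one with only two.
toy_records = (
    [("2008-01", "active")] * 3
    + [("2008-01", "deleted")]
    + [("2009-06", "active"), ("2009-06", "replaced")]
)

proportions = status_proportions(toy_records)
print(proportions)
```

In the toy data, the release with only two identifications reaches a 50% replaced share from a single record, which shows how very small submissions can appear as outliers.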
Portion of peptides fitting the protein sequence for active entries
| Database | Active Entries | Missing Peptide Mapping |
|---|---|---|
| UniProtKB | 248,404 | 18,899 (7.6%) |
| NCBI gi | 49,680 | 10,162 (20.5%) |
| IPI | 698,023 | 42,976 (6.2%) |
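Whether a peptide still "fits" a protein can be sketched as an exact substring test against the current sequence. This is an assumption about the check, not the authors' documented procedure. It ignores modifications and isoleucine/leucine ambiguity, and the sequences below are made up.

```python
def peptide_fits(peptide, protein_sequence):
    """Return True if the peptide occurs verbatim in the protein sequence."""
    return peptide.upper() in protein_sequence.upper()


def missing_fraction(peptides, protein_sequence):
    """Fraction of peptides that no longer match the protein sequence."""
    if not peptides:
        return 0.0
    missing = sum(not peptide_fits(p, protein_sequence) for p in peptides)
    return missing / len(peptides)


# Toy example: the last peptide differs from the protein by one residue.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
peptides = ["AYIAK", "QISFVK", "QISFVR"]

print([peptide_fits(p, protein) for p in peptides])
print(missing_fraction(peptides, protein))
```

Under this check, an entry counts as having a missing peptide mapping when at least one reported peptide can no longer be located in the current sequence of the still-active entry.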
Number of peptides and average peptide scores (± standard deviation) of peptides fitting and not fitting the protein sequence per search engine. Numbers refer to peptides where the search engine score was retrieved successfully
| | Mascot | Peptide Prophet | Sequest | SpectrumMill | X!Tandem |
|---|---|---|---|---|---|
| No. fitting peptides | 1,380,884 | 38,008 | 21,827 | 47,860 | 95,231 |
| No. nonfitting peptides | 48,440 | 4,031 | 21,536 | 498 | 24,168 |
| Av. score fitting peptides | 40.63 ± 19.80 | 7.87 ± 11.56 | 2.40 ± 1.28 | 13.34 ± 3.24 | 21.22 ± 8.81 |
| Av. score nonfitting peptides | 40.78 ± 22.36 | 2.35 ± 6.06 | 1.66 ± 0.52 | 11.66 ± 3.75 | 13.31 ± 5.18 |
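To compare search engines despite their very different peptide counts, the share of nonfitting peptides can be derived from the two count rows. The sketch below uses the counts from the table above. The derived shares are not reported in the source and are computed here only for illustration.

```python
# Peptide counts per search engine, copied from the table above.
fitting = {
    "Mascot": 1_380_884,
    "Peptide Prophet": 38_008,
    "Sequest": 21_827,
    "SpectrumMill": 47_860,
    "X!Tandem": 95_231,
}
nonfitting = {
    "Mascot": 48_440,
    "Peptide Prophet": 4_031,
    "Sequest": 21_536,
    "SpectrumMill": 498,
    "X!Tandem": 24_168,
}

# Share of scored peptides that no longer fit the protein sequence.
nonfitting_share = {
    engine: nonfitting[engine] / (fitting[engine] + nonfitting[engine])
    for engine in fitting
}

for engine, share in sorted(nonfitting_share.items(), key=lambda item: item[1]):
    print(f"{engine}: {share:.1%}")
```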
Fig. 4. Rate of change of identifiers in complete releases of UniProtKB, IPI, and Ensembl for human and mouse (PICR mappings). Two releases per year were considered from 2005. The UniProtKB database contains the species-specific identifiers from UniProtKB/SwissProt and UniProtKB/TrEMBL.