Alba Garin-Muga, Leticia Odriozola, Ana Martínez-Val, Noemí Del Toro, Rocío Martínez, Manuela Molina, Laura Cantero, Rocío Rivera, Nicolás Garrido, Francisco Dominguez, Manuel M Sanchez Del Pino, Juan Antonio Vizcaíno, Fernando J Corrales, Victor Segura.
Abstract
The current catalogue of the human proteome is not yet complete, as experimental proteomics evidence is still elusive for a group of proteins known as the missing proteins. The Human Proteome Project (HPP) has been successfully using technology and bioinformatic resources to improve the characterization of such challenging proteins. In this manuscript, we propose a pipeline starting with the mining of the PRIDE database to select a group of data sets potentially enriched in missing proteins that are subsequently analyzed for protein identification with a method based on the statistical analysis of proteotypic peptides. Spermatozoa and the HEK293 cell line were found to be promising sources of missing proteins and clearly merit further attention in future studies. After the analysis of the selected samples, we found 342 PSMs, suggesting the presence of 97 missing proteins in human spermatozoa or the HEK293 cell line, while only 36 missing proteins were potentially detected in the retina, frontal cortex, aorta thoracica, or placenta. The functional analysis of the missing proteins detected confirmed their tissue specificity, and the validation of a selected set of peptides using targeted proteomics (SRM/MRM assays) further supports the utility of the proposed pipeline. As illustrative examples, DNAH3 and TEPP in spermatozoa, and UNCX and ATAD3C in HEK293 cells were some of the more robust and remarkable identifications in this study. We provide evidence of the value of carefully analyzing the ever-increasing MS/MS data available from PRIDE and other repositories as sources for detecting missing proteins in specific biological matrices, as shown here for HEK293 cells.
Keywords: C-HPP; MS/MS proteomics; PRIDE database; missing proteins
Year: 2016 PMID: 27581094 PMCID: PMC5099979 DOI: 10.1021/acs.jproteome.6b00437
Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466
Figure 1. (A) Overall scheme of the analysis pipeline developed to identify missing proteins using the PRIDE database. (B) Summary of the numbers of proteins and peptides in each step of the analysis pipeline developed.
Figure 2. Number of proteotypic peptides of chromosome 16 missing proteins in the neXtProt database that were detected in the shotgun MS/MS experiments stored in the PRIDE database. The experiments selected for further analyses are highlighted in red.
Project Accessions of the PRIDE Database Selected for the Identification of Missing Proteins
| Project Accession | Tissue | Instrument | no. samples | no. fractions |
|---|---|---|---|---|
| PXD001468 | HEK293 | Q Exactive | 1 | 24 |
| PXD002367 | Spermatozoid | LTQ Orbitrap | 1 | 21 |
| PXD001242 | Retina | LTQ Orbitrap Elite | 5 | 60 |
| PXD000754 | Placenta | LTQ Orbitrap | 2 | 47 |
| PXD000605 | Blood plasma | LTQ Orbitrap | 3 | 146 |
| PXD000004 | Frontal cortex | Q Exactive | 5 | 14 |
| PRD000269 | Aorta thoracica | LTQ Orbitrap | 1 | 108 |
| PXD002145 | Seminal plasma | LTQ Orbitrap Elite | 2 | 96 |
The numbers of samples and fractions analyzed in this study are shown.
Figure 3. (A) Distribution of tryptic peptides deduced from in silico digestion of the neXtProt database (release 20150901) along chromosomes. (B) Distribution of proteotypic peptides deduced from the in silico digestion of the neXtProt database (release 20150901) along chromosomes. (C) Distribution of proteins with at least one tryptic peptide after the in silico digestion of the neXtProt database (release 20150901) along chromosomes. (D) Distribution of proteins with at least one proteotypic peptide after the in silico digestion of the neXtProt database (release 20150901) along chromosomes.
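
The in silico digestion behind these distributions can be illustrated with a minimal sketch. It assumes the common trypsin rule (cleave after K or R unless the next residue is P) and treats a peptide as proteotypic when it maps to exactly one protein entry. The function names, length limits, and toy sequences are hypothetical; the source does not state the digestion parameters, and the default of 2 missed cleavages is borrowed from the Mascot search settings, not from the digestion step.

```python
from collections import defaultdict


def cleavage_pieces(sequence):
    """Split a protein after K or R, except when the next residue is P."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal piece without K/R
    return pieces


def tryptic_digest(sequence, missed_cleavages=2, min_len=7, max_len=30):
    """Return the set of tryptic peptides, joining up to `missed_cleavages`
    adjacent pieces. Length limits are illustrative assumptions."""
    pieces = cleavage_pieces(sequence)
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + missed_cleavages + 1, len(pieces))
        for j in range(i, last):
            peptide = "".join(pieces[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


def proteotypic_peptides(proteins, **digest_options):
    """Map each peptide found in exactly one protein to that protein."""
    owners = defaultdict(set)
    for accession, sequence in proteins.items():
        for peptide in tryptic_digest(sequence, **digest_options):
            owners[peptide].add(accession)
    return {pep: next(iter(accs)) for pep, accs in owners.items() if len(accs) == 1}


# Toy sequences (hypothetical, not real neXtProt entries).
toy_proteins = {"P1": "MAAAKGGGGRTTTTK", "P2": "MAAAKCCCCR"}
unique = proteotypic_peptides(toy_proteins, missed_cleavages=0, min_len=4)
```

In this toy case `MAAAK` occurs in both entries and is therefore excluded, while `GGGGR`, `TTTTK`, and `CCCCR` are kept. The sketch ignores isobaric residues such as I/L, which a real analysis might need to handle.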
Number of PSMs, Peptides, and Proteins Identified Using the HPP Guidelines (PSM FDR < 1%, protein FDR < 1%) in the Samples Selected from PRIDE for the Analysis of the Missing Proteins
| | PXD001468 | PXD002367 | PXD001242 | PXD000754 | PXD000605 | PXD000004 | PRD000269 | PXD002145 | Total |
|---|---|---|---|---|---|---|---|---|---|
| Spectra | 836145 | 114970 | 452880 | 519326 | 1299378 | 357899 | 370218 | 1198042 | 5148858 |
| Total PSMs | 328554 | 48609 | 110624 | 80213 | 19086 | 136506 | 21969 | 6676 | 752237 |
| FP PSMs | 161 | 34 | 136 | 201 | 5 | 154 | 11 | 116 | 818 |
| Total Peptides | 68377 | 9848 | 14413 | 10122 | 1228 | 16679 | 2001 | 199 | 93012 |
| Total Peptides (proteotypic) | 24510 | 3990 | 5393 | 4226 | 788 | 5737 | 746 | 56 | 33756 |
| Total Peptides (nonproteotypic) | 43867 | 5858 | 9020 | 5896 | 440 | 10942 | 1255 | 143 | 59256 |
| FP Peptides | 70 | 12 | 20 | 46 | 2 | 41 | 3 | 8 | 202 |
| Total Proteins | 7206 | 1437 | 2681 | 2127 | 363 | 2340 | 351 | 54 | 8712 |
| Total Conclusive Prot | 4539 | 909 | 1501 | 1140 | 146 | 1069 | 193 | 29 | 5626 |
| FP Proteins | 33 | 8 | 15 | 11 | 2 | 33 | 1 | 8 | 111 |
| Total Assigned Spectra | 191095 | 24736 | 66707 | 51392 | 18091 | 133602 | 14936 | 2495 | 503054 |
| Missing PSMs | 798 | 473 | 117 | 0 | 0 | 0 | 0 | 0 | 1388 |
| Missing Peptides | 83 | 258 | 25 | 0 | 0 | 0 | 0 | 0 | 357 |
| Missing Proteins | 10 | 47 | 5 | 0 | 0 | 0 | 0 | 0 | 60 |
| Missing Assigned Spectra | 479 | 367 | 68 | 0 | 0 | 0 | 0 | 0 | 914 |
| Total Proteins HPP (≥1 peptide) | 4276 | 888 | 1450 | 1115 | 146 | 1053 | 188 | 28 | 5284 |
| Total Proteins HPP (≥2 peptides) | 3326 | 750 | 1260 | 1000 | 120 | 924 | 169 | 22 | 3950 |
| Missing Proteins HPP (≥1 peptide) | 10 | 45 | 5 | 0 | 0 | 0 | 0 | 0 | 58 |
| Missing Proteins HPP (≥2 peptides) | 5 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 32 |
FP = false positives.
Parameters Used in the Mascot Search Engine for the Analysis of Each Downloaded Project from the PRIDE Database
| Project Accession | Precursor mass tolerance (ppm) | Fragment mass tolerance (Da) | Missed cleavages | Fixed modifications | Variable modifications |
|---|---|---|---|---|---|
| PXD001468 | 20 | 0.05 | 2 | Carbamidomethyl (C) | Oxidation (M) |
| PXD002367 | 10 | 0.5 | 2 | Carbamidomethyl (C) | Oxidation (M); Acetyl (Protein N-term) |
| PXD001242 | 20 | 0.05 | 2 | Carbamidomethyl (C) | Oxidation (M) |
| PXD000754 | 20 | 1 | 2 | Carbamidomethyl (C) | Oxidation (M) |
| PXD000605 | 20 | 0.05 | 2 | iTRAQ4plex114 (K); Methylthio (C) | iTRAQ4plex114 (Y); Oxidation (M) |
| PXD000004 | 20 | 0.05 | 2 | Carbamidomethyl (C) | Oxidation (M); Label: 13C(6) (K) |
| PRD000269 | 20 | 0.05 | 2 | Carbamidomethyl (C) | Oxidation (M) |
| PXD002145 | 10 | 0.5 | 2 | Carbamidomethyl (C) | Oxidation (M); Acetyl (Protein N-term) |
Number of PSMs, Peptides, and Proteins Observed Using the Identifications of Proteotypic Peptides from the neXtProt Database (PSM FDR < 1%) in the Samples Selected from PRIDE for the Analysis of the Missing Proteins
| | PXD001468 | PXD002367 | PXD001242 | PXD000754 | PXD000605 | PXD000004 | PRD000269 | PXD002145 | Total |
|---|---|---|---|---|---|---|---|---|---|
| Spectra | 836145 | 114970 | 452880 | 519326 | 1299378 | 357899 | 370218 | 1198042 | 5148858 |
| Total PSMs | 332417 | 49100 | 115861 | 82856 | 19182 | 138704 | 23199 | 6676 | 767995 |
| Total Peptides | 71277 | 10311 | 16329 | 11739 | 1271 | 17570 | 2521 | 199 | 98319 |
| Total Peptides (proteotypic) | 25734 | 4187 | 6259 | 4848 | 804 | 6092 | 988 | 56 | 35922 |
| Total Peptides (nonproteotypic) | 45543 | 6124 | 10070 | 6891 | 467 | 11478 | 1533 | 143 | 62397 |
| Total Proteins (≥1 peptide) | 5341 | 1293 | 2420 | 2208 | 245 | 1929 | 569 | 41 | 6333 |
| Total Proteins (≥2 peptides) | 3326 | 750 | 1260 | 1000 | 120 | 924 | 169 | 22 | 3950 |
| Total Assigned Spectra | 193971 | 25083 | 71118 | 53398 | 18187 | 135285 | 15969 | 2495 | 515506 |
| Missing PSMs | 96 | 246 | 33 | 22 | 0 | 14 | 4 | 0 | 415 |
| Missing Peptides | 48 | 163 | 14 | 10 | 0 | 8 | 4 | 0 | 242 |
| Missing Proteins (≥1 peptide) | 30 | 67 | 14 | 10 | 0 | 8 | 4 | 0 | 122 |
| Missing Proteins (≥2 peptides) | 8 | 30 | 3 | 2 | 0 | 2 | 2 | 0 | 39 |
| Missing Assigned Spectra | 62 | 195 | 29 | 16 | 0 | 14 | 4 | 0 | 320 |
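
The figures quoted in the abstract can be recovered from this table with simple sums. The sketch below copies the "Missing PSMs" and "Missing Proteins (≥1 peptide)" rows and takes the tissue labels from the project accession table; the variable and function names are illustrative only.

```python
# Project order follows the table columns; tissues come from the accession table.
projects = ["PXD001468", "PXD002367", "PXD001242", "PXD000754",
            "PXD000605", "PXD000004", "PRD000269", "PXD002145"]
tissues = ["HEK293", "Spermatozoid", "Retina", "Placenta",
           "Blood plasma", "Frontal cortex", "Aorta thoracica", "Seminal plasma"]

# Rows copied from the table above.
missing_psms = [96, 246, 33, 22, 0, 14, 4, 0]
missing_proteins = [30, 67, 14, 10, 0, 8, 4, 0]  # at least 1 peptide


def subtotal(values, wanted):
    """Sum the per-project values for the requested tissues."""
    return sum(v for v, t in zip(values, tissues) if t in wanted)


enriched = {"HEK293", "Spermatozoid"}
others = {"Retina", "Frontal cortex", "Aorta thoracica", "Placenta"}

psms_enriched = subtotal(missing_psms, enriched)          # 96 + 246
proteins_enriched = subtotal(missing_proteins, enriched)  # 30 + 67
proteins_others = subtotal(missing_proteins, others)      # 14 + 10 + 8 + 4

print(psms_enriched, proteins_enriched, proteins_others)  # 342 97 36
print(sum(missing_psms), sum(missing_proteins))           # 415 133
```

The per-project PSM counts add up to the Total column (415), whereas the per-project protein counts add up to 133 against a reported total of 122. This is consistent with the Total column counting each protein once even when it is observed in more than one sample.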
Figure 4. (A) Distribution of tryptic and proteotypic peptide candidates detected in the analyzed samples along the different chromosomes. (B) Boxplot with the distribution of Mascot ion scores obtained for the PSMs assigned to missing and nonmissing proteins. The difference between these distributions is statistically significant with a p-value < 1 × 10⁻¹². (C) Distribution of missing and nonmissing proteins potentially detected in the analyzed samples using the identification of proteotypic peptides along chromosomes. (D) Venn diagram with the missing proteins observed using the HPP guidelines and the workflow proposed here and with the missing proteins in neXtProt database release 20150901.
Figure 5. (A) Heat map with the missing proteins potentially detected in each sample and the missing proteins shared between each pair of samples analyzed. (B) Network representation of the results obtained for the study of the missing proteins using the PRIDE database. Nodes represent the database of experiments used (green), the tissue (orange), the proteins observed (red), and the identified peptides (blue). (C) Network for the missing proteins potentially observed in the HEK293 sample. Nodes represent the sample selected (green), the chromosome (blue), and the identified protein (red). (D) Network for the missing proteins potentially detected in chromosome 16. Nodes represent the sample (orange), the proteins observed (red), and the identified peptides (blue).
Missing Proteins Potentially Identified Using Proteotypic Peptide Candidates in the HEK293 Cell Line or in Chromosome 16
| Protein | Name | Chr | no. PSMs | no. Peptides | Ion score | HPP guidelines (2 proteotypic peptides) | Sample |
|---|---|---|---|---|---|---|---|
| NX_A6NJT0 | UNCX | 7 | 8 | 4 | 113.22 | | HEK |
| NX_B2RXH8 | HNRNPCL2 | 1 | 276 | 15 | 102.61 | | HEK,Retina |
| NX_Q9BQ87 | TBL1Y | Y | 76 | 10 | 100.66 | | HEK |
| NX_Q2VIQ3 | KIF4B | 5 | 46 | 19 | 99.77 | √ | HEK |
| NX_Q6IS14 | EIF5AL1 | 10 | 298 | 17 | 95.03 | √ | HEK |
| NX_Q5T2N8 | ATAD3C | 1 | 56 | 8 | 85.06 | √ | HEK,Retina |
| NX_Q56UQ5 | - | X | 55 | 4 | 81.79 | √ | HEK |
| NX_Q8TD57 | DNAH3 | 16 | 27 | 25 | 80.77 | √ | Spermatozoa,Retina |
| NX_Q6URK8 | TEPP | 16 | 17 | 10 | 79.62 | | Spermatozoa |
| NX_Q9NRJ5 | PAPOLB | 7 | 10 | 3 | 77.63 | √ | HEK |
| NX_Q6ZR08 | DNAH12 | 3 | 34 | 23 | 75.04 | √ | Placenta,HEK,Spermatozoa |
| NX_A8K0S8 | MEIS3P2 | 17 | 6 | 1 | 58.75 | √ | HEK |
| NX_Q6ZMV8 | ZNF730 | 19 | 3 | 3 | 58.21 | √ | HEK |
| NX_Q14585 | ZNF345 | 19 | 1 | 1 | 57.3 | √ | HEK |
| NX_Q52M93 | ZNF585B | 19 | 1 | 1 | 54.08 | √ | HEK |
| NX_Q9UJN7 | ZNF391 | 6 | 4 | 3 | 53.79 | √ | HEK |
| NX_P58180 | OR4D2 | 17 | 3 | 1 | 52.28 | √ | HEK,Spermatozoa |
| NX_Q8NGL6 | OR4A15 | 11 | 3 | 1 | 52.28 | | Spermatozoa,HEK |
| NX_P59817 | ZNF280A | 22 | 1 | 1 | 48.17 | | HEK |
| NX_A6NHN6 | NPIPB15 | 16 | 7 | 5 | 47.7 | √ | Spermatozoa |
| NX_Q9Y2H8 | ZNF510 | 9 | 1 | 1 | 45.02 | | HEK |
| NX_Q96KX1 | C4orf36 | 4 | 1 | 1 | 44.7 | √ | HEK |
| NX_Q96M86 | DNHD1 | 11 | 1 | 1 | 44.16 | | HEK |
| NX_Q5VTU8 | ATP5EP2 | 13 | 1 | 1 | 43.65 | √ | HEK |
| NX_Q8N0W5 | IQCK | 16 | 1 | 1 | 43.57 | √ | Spermatozoa |
| NX_Q4AC99 | ACCSL | 11 | 1 | 1 | 43.57 | √ | HEK |
| NX_A6NNF4 | ZNF726 | 19 | 2 | 1 | 43.39 | √ | HEK |
| NX_P0CW27 | CCDC166 | 8 | 1 | 1 | 40.88 | √ | HEK |
| NX_A6NCM1 | IQCA1L | 7 | 1 | 1 | 40.63 | √ | HEK |
| NX_Q8NDH2 | CCDC168 | 13 | 1 | 1 | 40.58 | √ | HEK |
| NX_Q6R2W3 | ZBED9 | 6 | 1 | 1 | 40.51 | √ | HEK |
| NX_A6NN73 | GOLGA8CP | 15 | 1 | 1 | 40.35 | √ | HEK |
| NX_Q9H2H0 | CXXC4 | 4 | 1 | 1 | 39.19 | √ | HEK |
| NX_Q9BXX2 | ANKRD30B | 18 | 3 | 2 | 39.01 | √ | Aorta,HEK |
Figure 6. (A) Spectrum assignment of peptide LYSSLLDEIR from protein NX_Q8TD57 (DNAH3, chromosome 16) detected with Mascot ion score 75.99 in spermatozoa. (B) Spectrum assignment of peptide TQTISLGQGQGPIAAK from protein NX_Q8TD57 (DNAH3, chromosome 16) detected with Mascot ion score 80.77 in spermatozoa. (C) Spectrum assignment of peptide DAASCGPGAAVAAVER from protein NX_A6NJT0 (UNCX, chromosome 7) detected with Mascot ion score 113.22 in HEK293 cell line.
Peptides Selected for Validation Using Targeted Proteomics (SRM/MRM)
| Protein | Name | Peptide | Chr | Sample | Ion score | Missing in neXtProt20160111 | HPP guidelines (2 proteotypic peptides) |
|---|---|---|---|---|---|---|---|
| NX_A6NJT0 | UNCX | DAASCGPGAAVAAVER | 7 | HEK | 113.22 | √ | √ |
| NX_Q9BQ87 | TBL1Y | IWTENGNLASTLGQHK | Y | HEK | 93.62 | √ | √ |
| NX_Q8TD57 | DNAH3 | TQTISLGQGQGPIAAK | 16 | Spermatozoa | 80.77 | √ | |
| NX_Q8TD57 | DNAH3 | LYSSLLDEIR | 16 | Spermatozoa | 75.99 | √ | |
| NX_Q2VIQ3 | KIF4B | EMCDMEQVLSK | 5 | HEK | 67.29 | √ | √ |
| NX_Q5T2N8 | ATAD3C | AAGTLFGEGFR | 1 | HEK | 66.45 | √ | |
| NX_Q2VIQ3 | KIF4B | NLELEVINLQK | 5 | HEK | 64.73 | √ | √ |
| NX_A8K0S8 | MEIS3P2 | MVQPMIDQSNR | 17 | HEK | 58.75 | √ | |
| NX_Q8TD57 | DNAH3 | EANVAAAIAQGIK | 16 | Spermatozoa | 49.37 | √ | |
| NX_A6NHN6 | NPIPB15 | ADEVEQSPKPK | 16 | Spermatozoa | 47.7 | √ | |
| NX_Q8N0W5 | IQCK | AGEPFTEFFSIPFVEER | 16 | Spermatozoa | 43.57 | √ | |
| NX_B2RXH8 | HNRNPCL2 | MIASQVAVINLAAEPK | 1 | HEK | 43.42 | √ | √ |
| NX_Q8TD57 | DNAH3 | VESVLFPELK | 16 | Spermatozoa | 39.34 | √ | |
| NX_Q8TD57 | DNAH3 | DFDLEEVMK | 16 | Spermatozoa | 37.96 | √ | |
| NX_Q8TD57 | DNAH3 | AVVFVDDLNMPAK | 16 | Spermatozoa | 36.67 | √ | |
| NX_Q8TD57 | DNAH3 | GNILEDETAIK | 16 | Spermatozoa | 36.09 | √ | |
| NX_Q6URK8 | TEPP | YCLSQNPSLDR | 16 | Spermatozoa | 31.36 | | |
Results of the Mascot Search of the Heavy Peptide Sample Using the neXtProt Database
| Peptide | Protein | Chr | Name | Max ion score | Missing in (neXtProt20160111) |
|---|---|---|---|---|---|
| DAASCGPGAAVAAVER | NX_A6NJT0 | 7 | UNCX | 78.49 | √ |
| VESVLFPELK | NX_Q8TD57 | 16 | DNAH3 | 75.29 | |
| AAGTLFGEGFR | NX_Q5T2N8 | 1 | ATAD3C | 56.93 | √ |
| MIASQVAVINLAAEPK | NX_B2RXH8 | 1 | HNRNPCL2 | 55.83 | √ |
| ADEVEQSPKPK | NX_A6NHN6 | 16 | NPIPB15 | 54.39 | √ |
| EANVAAAIAQGIK | NX_Q8TD57 | 16 | DNAH3 | 52.72 | |
| EMCDMEQVLSK | NX_Q2VIQ3 | 5 | KIF4B | 48.67 | √ |
| YCLSQNPSLDR | NX_Q6URK8 | 16 | TEPP | 46.26 | |
| MVQPMIDQSNR | NX_A8K0S8 | 17 | MEIS3P2 | 45.03 | √ |
| LYSSLLDEIR | NX_Q8TD57 | 16 | DNAH3 | 44.68 | |
| TQTISLGQGQGPIAAK | NX_Q8TD57 | 16 | DNAH3 | 44.15 | |
| AVVFVDDLNMPAK | NX_Q8TD57 | 16 | DNAH3 | 43.37 | |
| DFDLEEVMK | NX_Q8TD57 | 16 | DNAH3 | 39.71 | |
| GNILEDETAIK | NX_Q8TD57 | 16 | DNAH3 | 39.6 | |
Conclusive proteins according to PAnalyzer were selected (PSM FDR < 1%, Protein FDR < 1%).
Figure 7. (A) Comparison of the MS/MS spectrum of peptide EANVAAAIAQGIK from DNAH3 protein obtained in the shotgun experiment (lower) and the MS/MS spectrum for its synthetic heavy peptide (upper) obtained in the LC–SRM experiment (SDPScore = 0.88). (B) Comparison of the MS/MS spectrum of peptide VESVLFPELK from DNAH3 protein obtained in the shotgun experiment (lower) and the MS/MS spectrum for its synthetic heavy peptide (upper) obtained in the LC–SRM experiment (SDPScore = 0.90). (C) Comparison of the MS/MS spectrum of peptide AAGTLFGEGFR from ATAD3C protein obtained in the shotgun experiment (lower) and the MS/MS spectrum for its synthetic heavy peptide (upper) obtained in the LC–SRM experiment (SDPScore = 0.89).
Figure 8. (A) Venn diagram with the peptides selected for detection and the results of the different stages of the validation analysis. (B) Endogenous (upper) and synthetic heavy peptide (lower) LC-SRM signals measured for the peptide AAGTLFGEGFR from ATAD3C in the HEK293 cell line. (C) Endogenous (upper) and synthetic heavy peptide (lower) LC-SRM signals measured for the peptide EANVAAAIAQGIK from DNAH3 in the spermatozoa sample. (D) Endogenous (upper) and synthetic heavy peptide (lower) LC-SRM signals measured for the peptide GNILEDETAIK from DNAH3 in the spermatozoa sample. (E) Endogenous (upper) and synthetic heavy peptide (lower) LC-SRM signals measured for the peptide VESVLFPELK from DNAH3 in the spermatozoa sample.