
pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework.

Hao Yang^1,2, Hao Chi^1,2, Wen-Feng Zeng^1,2, Wen-Jing Zhou^1,2, Si-Min He^1,2.

Abstract

MOTIVATION: De novo peptide sequencing based on tandem mass spectrometry data is the key technology of shotgun proteomics for identifying peptides without any database and assembling unknown proteins. However, owing to the low ion coverage in tandem mass spectra, the order of certain consecutive amino acids cannot be determined if all of their supporting fragment ions are missing, which results in the low precision of de novo sequencing.
RESULTS: In order to solve this problem, we developed pNovo 3, which used a learning-to-rank framework to distinguish similar peptide candidates for each spectrum. Three metrics for measuring the similarity between each experimental spectrum and its corresponding theoretical spectrum were used as important features, in which the theoretical spectra can be precisely predicted by the pDeep algorithm using deep learning. On seven benchmark datasets from six diverse species, pNovo 3 recalled 29-102% more correct spectra, and the precision was 11-89% higher than three other state-of-the-art de novo sequencing algorithms. Furthermore, compared with the newly developed DeepNovo, which also used the deep learning approach, pNovo 3 still identified 21-50% more spectra on the nine datasets used in the study of DeepNovo. In summary, the deep learning and learning-to-rank techniques implemented in pNovo 3 significantly improve the precision of de novo sequencing, and such machine learning framework is worth extending to other related research fields to distinguish the similar sequences.
AVAILABILITY AND IMPLEMENTATION: pNovo 3 can be freely downloaded from http://pfind.ict.ac.cn/software/pNovo/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances: Peptides

Year: 2019 PMID: 31510687 PMCID: PMC6612832 DOI: 10.1093/bioinformatics/btz366

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Shotgun proteomics research based on mass spectrometry data focuses on high-throughput peptide and protein identification. The main method is using dedicated sequence databases to identify peptides and proteins, such as SEQUEST (Eng ), Mascot (Perkins ), MaxQuant/Andromeda (Cox and Mann, 2008), PEAKS DB (Zhang ) and pFind (Chi , 2018). Despite its indisputable popularity, database search still needs reference databases to retrieve peptide candidates, hence it cannot search for species without any proteome databases such as microbial communities (Hettich ) or unknown proteins such as monoclonal antibodies (Reichert ). Even for searching against known sequences, amino acid mutations, post-translational modifications (Chick ) and splice variants (Zhu ) are still hard to be identified by the existing database search strategies. An alternative method for peptide and protein identification is de novo sequencing, which infers amino acid sequences directly from tandem mass spectra. De novo sequencing does not need any reference databases, so it has an irreplaceable advantage for identifying novel protein sequences. For example, many studies have used de novo sequencing methods to assemble monoclonal antibodies (Bogdanoff ; Guthals ; Tran ). Over the past decades, many de novo sequencing algorithms for shotgun proteomics have been proposed, such as PEAKS (Ma ), PepNovo (Frank and Pevzner, 2005), pNovo (Chi , 2010; Yang ) and Novor (Ma, 2015). Although many de novo sequencing tools have been proposed, the precision of de novo sequencing is still questionable. Muth and Renard (2017) stated that only ∼40% of de novo sequencing results were consistent with the database search results, in which the analyses on simulation datasets showed that the low precision of de novo sequencing was mainly caused by the abundant noise peaks and the low fragment ion coverage in tandem mass spectra, especially for the latter. 
When the fragment ion coverage decreased from 100% to 50%, the proportion of correctly sequenced peptides dropped from 80% to only 20%, suggesting that the precision of de novo sequencing is very sensitive to the fragment ion coverage. The fundamental cause is that missing fragment ions make the order of consecutive amino acids indistinguishable. For example, if no supporting fragment ions are detected between the first two amino acids of the peptide AEHDK in the tandem mass spectrum, then EAHDK may be wrongly reported as the de novo sequencing result of this spectrum in the absence of any additional information. To discriminate between such similar peptide candidates, a more powerful scoring method is needed to rerank the de novo sequencing results of each single spectrum; notably, the differences among spectra do not need to be considered. Learning-to-rank models (Bartell ; Liu, 2010), which are widely used in information retrieval, are well suited to this problem. Given one query, all webpages should be ranked by their relevance to the query, which closely resembles ranking peptides (webpages) for each given spectrum (query), regardless of the diversity among different spectra. In addition, deep learning has been increasingly adopted in many research fields, even for hard decision problems such as the game of Go (Silver ). A few deep learning studies in proteomics have also been proposed recently. For example, DeepNovo (Tran ) uses convolutional neural networks and recurrent neural networks (Hochreiter and Schmidhuber, 1997) to learn features of tandem mass spectra for de novo sequencing, and pDeep (Zhou ) uses bidirectional long short-term memory (Graves ) to predict the theoretical spectrum of a given peptide with a median Pearson similarity of over 0.9. In general, research based on deep learning is still not very common in the proteomics community. 
In fact, deep learning can automatically learn high-level representations of complex data without predesigned features based on domain-specific knowledge, so it can be used to learn the fragmentation pattern and other important features of tandem mass spectra and to construct a universal learning-to-rank model that discriminates between very similar de novo sequencing results. In this article, we developed a novel de novo sequencing algorithm, pNovo 3. Unlike DeepNovo (Tran ), which uses deep learning directly, pNovo 3 first generates peptide candidates using the traditional dynamic programming approach (Yang ) and then extracts a few features based on the deep learning predictions of pDeep (Zhou ), as well as other information related to the fragmentation patterns. Finally, a learning-to-rank model, trained by SVM-rank (Joachims, 2002; Joachims ), was built to rerank the previously generated peptide candidates. In addition, a spectrum merging method was proposed to merge the results of spectra with similar precursor ion masses to further improve the performance of pNovo 3. Compared with three other state-of-the-art de novo peptide sequencing tools, the recall of pNovo 3 was increased by 29.4–96.1% at the full-length peptide level and 2.0–20.1% at the amino acid level on seven test datasets from different species. In addition, the recall of pNovo 3 was 20.6–49.8% higher than that of DeepNovo on nine other datasets, demonstrating that deep learning and learning-to-rank significantly improve the precision of de novo sequencing. pNovo 3 can be freely downloaded from the following website: http://pfind.ict.ac.cn/software/pNovo/index.html.

2 Materials and methods

pNovo 3 uses the same approach as pNovo+ and Open-pNovo (Chi ; Yang ) to get top-ranked peptide candidates for each spectrum, and then it has four steps to rerank the preliminary results. First, the theoretical spectrum for each candidate is predicted by pDeep (Zhou ) based on the deep learning approach. Second, features are extracted based on the results of pDeep and other statistics. Third, peptide candidates are reranked by the model trained by learning-to-rank (Joachims, 2002; Joachims ). Last, the results of the whole dataset are updated using the spectrum merging method. The workflow of pNovo 3 is shown in Figure 1. Before the introduction of the pNovo 3 workflow, we will first introduce the seven benchmark datasets, one of which was used in the following steps of model training.

Fig. 1.

The workflow of pNovo 3

2.1 Generating the benchmark datasets

Seven high-resolution datasets are used in this study. The first five datasets were acquired on the Thermo Scientific Q Exactive with the HCD activation mode (Cassidy ; Hu ; Nevo ; Paiva ; Seidel ) and the last two datasets were acquired on the Thermo Scientific Q Exactive HF-X (the latest MS instrument in the benchtop Orbitrap series) with the HCD activation mode (Kelstrup ). These datasets come from a wide variety of species to ensure an unbiased evaluation on different samples. All datasets can be downloaded from the ProteomeXchange website and the details are shown in Supplementary Table S1. The first one (Vigna mungo, V.mungo) is used for training the learning-to-rank model while the other six are used for the performance evaluation. pFind (Chi , 2018) and PEAKS DB (Zhang ) are used to search the seven datasets against the proteins of the corresponding sample downloaded from the UniProt database in September 2017. The search results of pFind and PEAKS DB are filtered at a false discovery rate (FDR) of 1% at the peptide level and the peptide-spectrum match (PSM) level, respectively. The detailed database search parameters of pFind and PEAKS DB are shown in Supplementary Table S2. To build the benchmark datasets, the inconsistent PSMs reported by pFind and PEAKS DB are removed. In addition, as the current version of pDeep cannot predict the theoretical spectra of modified peptides, PSMs with modified peptides are removed, except those whose only modification is carbamidomethylation of cysteine. The retained PSMs consistently reported by pFind and PEAKS DB are used as the ground truth, and the corresponding MS/MS data are extracted from the original datasets for evaluating the performance of different de novo sequencing algorithms in Section 3.
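The benchmark construction described above can be sketched as a simple consensus filter. This is a minimal illustration, not the authors' pipeline: the input layout (spectrum id mapped to a sequence and a tuple of modification names), the function name `build_benchmark` and the label `Carbamidomethyl[C]` are all assumptions made for the example, and both inputs are assumed to be already filtered at 1% FDR.

```python
# Hypothetical modification label; real search engines use their own naming.
CARBAMIDOMETHYL_C = "Carbamidomethyl[C]"


def build_benchmark(pfind_psms, peaks_psms):
    """Keep spectra on which both engines agree, excluding other modifications.

    Each input maps spectrum id -> (peptide sequence, tuple of modification names).
    """
    truth = {}
    for spectrum_id, (sequence, mods) in pfind_psms.items():
        other = peaks_psms.get(spectrum_id)
        if other is None or other[0] != sequence or sorted(other[1]) != sorted(mods):
            continue  # spectrum missing in one engine, or inconsistent PSM
        if any(m != CARBAMIDOMETHYL_C for m in mods):
            continue  # only carbamidomethylated cysteine is allowed
        truth[spectrum_id] = sequence
    return truth


# Toy input: only s1 and s5 are consistent and free of other modifications.
pfind = {
    "s1": ("ACDEFK", (CARBAMIDOMETHYL_C,)),
    "s2": ("PEPTMDEK", ("Oxidation[M]",)),
    "s3": ("AEHDK", ()),
    "s4": ("GTFSK", ()),
    "s5": ("GTFSGLESSSPEVK", ()),
}
peaks = {
    "s1": ("ACDEFK", (CARBAMIDOMETHYL_C,)),
    "s2": ("PEPTMDEK", ("Oxidation[M]",)),
    "s3": ("EAHDK", ()),
    "s5": ("GTFSGLESSSPEVK", ()),
}
benchmark = build_benchmark(pfind, peaks)
```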

2.2 Generating theoretical spectra by pDeep

The intensity information is essential for calculating the similarity between a theoretical spectrum and a real spectrum. However, in most scoring methods for PSMs, the intensities of all peaks in theoretical spectra are set to equal values or assigned by only a few simple rules, which makes it difficult to distinguish among different orders of consecutive amino acids if no fragment ions are observed among them. In order to show the importance of the intensity information, we take two peptide candidates, P1: GTFSGLESSSPEVK and P2: GTFSGLESSSEPVK, for one spectrum, as a running example. If there are no fragment ions observed between PE and between EP, then the scores of the two corresponding PSMs should be identical to each other. However, the fragmentation patterns of the two peptides are quite different: the intensity of the y-ion between E and P should be much higher than that between P and E, which generally holds in CID or HCD activation mode (Snyder, 2000). If such a pattern can be used in generating theoretical spectra, then P1 is more likely to be the result of the spectrum because the absence of the fragment ions between P and E is more consistent with its real fragmentation pattern. The training datasets of pDeep are from several published datasets (Chick ; Kulak ; Sharma ) of a wide variety of species produced by Q Exactive or Q Exactive HF. The theoretical spectrum predicted by pDeep is composed of the masses and intensities of all backbone theoretical ions, including b and y ion series with 1+ and 2+ charge states. Let $x_1, \dots, x_n$ (where $n$ is the number of all ions) be the real intensities of all ions (b1+, b2+, …, b1++, b2++, …, y1+, y2+, …, y1++, y2++, …) and $y_1, \dots, y_n$ the predicted intensities of the corresponding ions. Let $\bar{x}$ be the mean of $x_1, \dots, x_n$ and $\bar{y}$ the mean of $y_1, \dots, y_n$; let $r_1, \dots, r_n$ be the indexes of $x_1, \dots, x_n$ when they are sorted in descending order, and $s_1, \dots, s_n$ the indexes of $y_1, \dots, y_n$ when they are sorted in descending order. Three measures of similarity between the theoretical and real spectra, i.e. cosine, Pearson and Spearman, are computed by formulas 1 to 3, respectively:

$$\mathrm{cosine} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{1}$$

$$\mathrm{Pearson} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2}$$

$$\mathrm{Spearman} = 1 - \frac{6 \sum_{i=1}^{n} (r_i - s_i)^2}{n (n^2 - 1)} \tag{3}$$

The value of cosine similarity is from 0 to 1 and the values of the other two similarities are from −1 to 1.
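As an illustration of the three similarity measures named above, the following sketch computes cosine, Pearson and Spearman similarity between a real and a predicted intensity vector. It uses the textbook definitions; the function names are chosen for this example, and the Spearman version uses the rank-difference formula, which assumes no tied intensities and at least two ions.

```python
import math


def cosine(x, y):
    """Cosine similarity between two intensity vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation coefficient between two intensity vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def ranks_descending(values):
    """Rank of each element (1 = largest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman(x, y):
    """Spearman correlation via the rank-difference formula (no ties assumed)."""
    n = len(x)
    rx, ry = ranks_descending(x), ranks_descending(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy intensities: the prediction is the real spectrum scaled by two.
real = [0.1, 0.5, 1.0, 0.0]
predicted = [0.2, 1.0, 2.0, 0.0]
similarities = (cosine(real, predicted), pearson(real, predicted), spearman(real, predicted))
```

All three measures are scale-invariant, which is why a prediction that differs from the real spectrum only by a constant factor scores 1 on each of them.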

2.3 Extracting gap features

The information on fragmentation gaps in PSMs is used to design features independently of the theoretical spectrum prediction of pDeep. In the running example, when the b and y ions fragmented between two consecutive amino acids (e.g. PE or EP) are not observed in the spectrum, we cannot distinguish between the two peptides without other information. However, we can compute the probabilities of losing fragment ions between PE and EP from statistics on a large amount of existing high-resolution MS/MS data. The probability of losing fragment ions between two consecutive amino acids XZ is defined as the number of XZ pairs between which the ions are missing divided by the total number of XZ pairs in the dataset used for the statistics. This probability is referred to as g1, which ranges from 0 to 1. Specifically, considering that the order of the two N-terminal amino acids reported by de novo sequencing is often more error-prone (Fu and Li, 2005), we also compute the probability of losing fragment ions between the two N-terminal amino acids, which is referred to as g2. Its value also ranges from 0 to 1. In the running example, the probabilities of losing fragment ions between EP and PE are 3.4% and 20.1%, respectively, indicating that the peptide with PE is more credible than the peptide with EP. Then, given one PSM, two features are generated based on g1 and g2 and used in the learning-to-rank model. One is called G1, which is the arithmetic mean of g1 over all gaps found in the PSM. G1 is set to 1.0 if no gaps are found, indicating that the PSM has no gaps. The other one is called G2, which is equal to g2 if there is an N-terminal gap; otherwise it is also set to 1.0. Take one peptide P: GTFSGLESSSPEVK as an example, where three gaps, GT, LE and PE, are detected and the first gap GT is the N-terminal gap. Then, two gap features are computed: $G1 = \frac{g1(\mathrm{GT}) + g1(\mathrm{LE}) + g1(\mathrm{PE})}{3}$ and $G2 = g2(\mathrm{GT})$. It is worth mentioning again that GT is involved in the feature extraction twice, based on two different sets of statistics: one from all possible amino acid pairs and the other only from the N-terminal ones. Therefore, the values of $g1(\mathrm{GT})$ and $g2(\mathrm{GT})$ may not be identical to each other.
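The two gap features can be illustrated with a short sketch. The function names and the lookup-table layout are assumptions for this example. The probability for PE (20.1%) is the value quoted in the text, whereas the values for GT and LE and the N-terminal value for GT are hypothetical placeholders.

```python
def loss_probability(n_missing, n_total):
    """Fraction of occurrences of a residue pair with no fragment ions between them."""
    return n_missing / n_total if n_total else 0.0


def gap_features(gaps, n_term_gap, g1, g2):
    """Return (G1, G2) for one PSM.

    gaps: residue pairs of the peptide with no supporting fragment ions.
    n_term_gap: the N-terminal pair if it is a gap, else None.
    g1, g2: lookup tables of loss probabilities (all pairs / N-terminal pairs).
    """
    G1 = sum(g1[pair] for pair in gaps) / len(gaps) if gaps else 1.0
    G2 = g2[n_term_gap] if n_term_gap is not None else 1.0
    return G1, G2


# PE = 0.201 is taken from the text; the other probabilities are made up.
g1 = {"GT": 0.10, "LE": 0.05, "PE": 0.201}
g2 = {"GT": 0.30}

# Peptide GTFSGLESSSPEVK with gaps GT, LE and PE, where GT is the N-terminal gap.
G1, G2 = gap_features(["GT", "LE", "PE"], "GT", g1, g2)
```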

2.4 Training the learning-to-rank model

Six features are finally extracted before model training, i.e. the original PSM score, the three similarities between the theoretical and real spectrum described in Section 2.2, and the two features, G1 and G2, related to gap information described in Section 2.3. Then, SVM-rank (Joachims ) is used to train the model for reranking top-ranked peptide candidates for each spectrum. All feature values are normalized to [0, 1] according to the value range of the corresponding feature of top-ranked peptide candidates for each spectrum. As mentioned in Section 2.1, the first benchmark dataset of V.mungo is used for training the model. For each spectrum in this benchmark dataset, pNovo+ (Yang ) is used to report de novo sequencing results and top-10 candidate sequences are retained. If the correct peptide, annotated by the database search results, is not contained in the top-10 candidate sequences for one spectrum, then this spectrum cannot be used in the model training; otherwise, the PSM with the correct peptide sequence is regarded as one positive sample, and the other nine PSMs with the incorrect peptide sequences are regarded as nine negative samples. SVM-rank is then trained on all of the positive and negative samples using the regularization parameter of 1000 based on a linear classifier rather than a kernel classifier, owing to the higher speed of the former.
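To make the training setup more concrete, the sketch below normalizes the features of the candidates of one spectrum to [0, 1] and writes them in the plain-text input format read by SVM-rank (`<target> qid:<id> <feature>:<value> ...`). The helper names, the target values 2 and 1 for positive and negative samples, and the reduction to two features are assumptions made for brevity, not details confirmed by the paper.

```python
def minmax_per_spectrum(rows):
    """Scale each feature column to [0, 1] over the candidates of one spectrum."""
    columns = list(zip(*rows))
    scaled = []
    for row in rows:
        out = []
        for value, column in zip(row, columns):
            lo, hi = min(column), max(column)
            out.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        scaled.append(out)
    return scaled


def to_svmrank_lines(qid, rows, correct_index):
    """One line per candidate; the correct peptide gets the higher target value."""
    lines = []
    for i, row in enumerate(minmax_per_spectrum(rows)):
        target = 2 if i == correct_index else 1
        features = " ".join(f"{j}:{v:.4f}" for j, v in enumerate(row, start=1))
        lines.append(f"{target} qid:{qid} {features}")
    return lines


# Three candidates of a hypothetical spectrum 7, two features each
# (e.g. original PSM score and cosine similarity); candidate 1 is correct.
rows = [[10.0, 0.9], [20.0, 0.5], [15.0, 0.7]]
lines = to_svmrank_lines(7, rows, correct_index=1)
```

Grouping the candidates by `qid` is what restricts the comparison to peptides of the same spectrum, so scores are never compared across spectra. With the SVM-rank command-line tool, a file of such lines would typically be passed to `svm_rank_learn` with `-c` setting the regularization parameter; whether the authors used exactly this invocation is not stated in the text.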

2.5 Refining the top-1 results by spectrum merging

After reranking the top-10 candidate sequences for each spectrum by the output scores of SVM-rank, different spectra with similar precursor ion masses within a pre-set tolerance (e.g. ±20 ppm) are further checked to see whether they are generated by the same peptide. In this step, only the top-1 sequence of each spectrum is retained. To avoid merging spectra incorrectly because of random matches of similar precursor ions from different peptides, a few additional measures are needed to evaluate the match quality between the current spectrum and each peptide from other spectra. For example, for one spectrum s1, assume that there is another spectrum s2 with a similar precursor ion mass. The top-1 sequences for s1 and s2 are p1 and p2, respectively. Then the match quality of s1 and p2 is tested. To be more specific, for the PSM between s1 and p2, if the maximum matched tag length is greater than 3, and the summed intensities of matched peaks account for at least 5% of the total in s1, then p2 is considered as a peptide candidate for updating the result of s1. The match quality of s2 and p1 is tested in the same way. Supplementary Figure S1 shows an example to further explain the process of spectrum merging. For spectrum s1, its top-1 result reranked by SVM-rank is ADCEFK with a score of 19. After spectrum merging, another five spectra are found and there are three different sequence candidates in total for the six spectra from s1 to s6. Then, another SVM-rank model is trained, in which two features are considered: one is the number of PSMs supporting each peptide candidate, and the other is the mean SVM-rank score of the supporting PSMs for each peptide candidate. The two features are based on the fact that a peptide candidate is more credible if it is supported by more PSMs and by better scores. Finally, in this example, the incorrect sequence ADCEFK of spectrum s1 is corrected to the true one, ACDEFK.
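The merging step can be sketched in three small helpers: a precursor mass check, the match quality test, and the two features of the second ranking model. All names are illustrative. Reading the 5% intensity criterion as a minimum is an assumption, and apart from the score of 19 for ADCEFK the scores in the toy data are made up.

```python
from collections import defaultdict


def within_ppm(mass_a, mass_b, tol_ppm=20.0):
    """True if two precursor masses differ by at most tol_ppm parts per million."""
    return abs(mass_a - mass_b) / mass_a * 1e6 <= tol_ppm


def is_merge_candidate(max_tag_length, matched_intensity_fraction):
    """Match quality test for a peptide borrowed from another spectrum."""
    return max_tag_length > 3 and matched_intensity_fraction >= 0.05


def merge_features(psms):
    """psms: (peptide, SVM-rank score) pairs of the spectra in one merged group.

    Returns peptide -> (number of supporting PSMs, mean score of those PSMs).
    """
    scores = defaultdict(list)
    for peptide, score in psms:
        scores[peptide].append(score)
    return {pep: (len(vals), sum(vals) / len(vals)) for pep, vals in scores.items()}


# Six hypothetical spectra with three distinct top-1 sequences.
group = [
    ("ADCEFK", 19.0),
    ("ACDEFK", 21.0),
    ("ACDEFK", 23.0),
    ("ACDEFK", 22.0),
    ("ACDEFK", 18.0),
    ("CADEFK", 12.0),
]
features = merge_features(group)
```

In this toy group, ACDEFK is supported by four PSMs with a mean score of 21, which is the kind of evidence the second model uses to overrule the single ADCEFK assignment.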

3 Results

3.1 The effect of different features

First, we investigated the distributions of the three similarities based on the correctly identified PSMs on the seven real datasets (Supplementary Fig. S2). The details about these datasets are shown in Supplementary Table S1. For the V.mungo dataset, as many as 82–87% of the results have similarities larger than 0.9, and the median values of cosine, Pearson and Spearman similarities are as high as 0.97, 0.97 and 0.94, respectively (Table 1), demonstrating the excellent performance of pDeep: the theoretically predicted spectra are very similar to the real ones.

Table 1.

Median values of three similarities on all datasets

	V.mungo	M.musculus	M.mazei	S.cerevisiae	A.mellifera	QE_HF_X1	QE_HF_X2
Cosine	0.97	0.97	0.97	0.96	0.97	0.96	0.96
Pearson	0.97	0.97	0.96	0.96	0.96	0.95	0.95
Spearman	0.94	0.94	0.94	0.92	0.94	0.94	0.93
#PSMs^a	41 721	12 538	67 452	55 163	126 966	104 052	83 313

^a All spectra whose top-10 peptide candidates contain the correct results are considered. This part accounts for 60–76% of the total spectra on the seven datasets: Vigna mungo (V.mungo), Mus musculus (M.musculus), Methanosarcina mazei (M.mazei), Saccharomyces cerevisiae (S.cerevisiae), Apis mellifera (A.mellifera), QE_HF_X1 and QE_HF_X2.

Then, we also tested the performance of reranking with each of the six features separately and with all features merged in one learning-to-rank framework (Supplementary Table S3). If the values of the five features (i.e. cosine, Pearson, Spearman, G1 and G2) of two peptide candidates are the same, the original PSM score is used to rerank these two candidates. Although each of the last five features, used alone, performs worse than the original score, the performance of considering all features in the same learning-to-rank model is significantly better, suggesting that the features considered in our model are effective and complementary. In order to further investigate the validity of the features, we compared the distributions of the correct and incorrect results for each feature (Supplementary Fig. S3). The two distributions shown in each subfigure have a large K-S distance and the corresponding P-value is less than 0.01 based on the two-sample Kolmogorov–Smirnov test, suggesting that the features can effectively discriminate between the correct and incorrect results.
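For readers unfamiliar with the K-S distance mentioned above, the following sketch computes the two-sample Kolmogorov–Smirnov statistic, i.e. the largest vertical gap between the two empirical distribution functions. It is a plain reference implementation with toy feature values and does not compute the P-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Toy feature values for correct and incorrect PSMs (not real data).
correct = [0.90, 0.95, 0.97]
incorrect = [0.20, 0.40, 0.60]
separation = ks_statistic(correct, incorrect)
```

A distance close to 1 means the feature separates the two groups almost completely, while a distance near 0 means the distributions largely overlap.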

3.2 Performance of different de novo sequencing algorithms at the peptide level

pNovo 3 was compared with three other state-of-the-art de novo peptide sequencing tools (Supplementary Table S4), specifically PEAKS (Ma ) (v8.5), Novor (Ma, 2015) (v1.1) and pNovo+ (Yang ) (referred to as pNovo). The seven benchmark datasets described in Section 2.1 were used to measure the accuracy of the results. A PSM was regarded as correct if its peptide was the same as that annotated by database search in the benchmark datasets (regardless of the difference between Ile and Leu). The recalls of the top-1 peptide for each spectrum reported by all four de novo sequencing algorithms were calculated. As shown in Table 2, on the V.mungo dataset, which was also used for model training, the recall of pNovo 3 was 64.6%, which was 50.6% higher in relative terms than that of pNovo (42.9%) and 45.8% higher than that of PEAKS (44.3%). Novor recalled fewer PSMs, which might be because its test version was not trained on high-resolution datasets. On all seven datasets, the recall of pNovo 3 was 35.6–96.1% higher than that of pNovo and 29.4–102.4% higher than that of PEAKS, which demonstrated that the machine learning model generalizes well. Figure 2 and Supplementary Figure S4 show the consistency of the correct results reported by pNovo 3, pNovo and PEAKS, in which pNovo 3 covered 88.1–94.6% of pNovo results and 82.5–89.6% of PEAKS results. Also, pNovo 3 independently reported 20.6–43.3% more PSMs, which were reported by neither pNovo nor PEAKS.

Table 2.

Recall of top-1 peptides identified by different de novo sequencing algorithms

	V.mungo	M.musculus	M.mazei	S.cerevisiae	A.mellifera	QE_HF_X1	QE_HF_X2
pNovo 3	64.6%	50.4%	66.0%	64.7%	62.5%	47.8%	38.3%
pNovo	42.9%	25.7%	42.4%	47.7%	36.7%	29.8%	21.4%
PEAKS	44.3%	24.9%	42.4%	50.0%	38.0%	32.2%	24.6%
Novor	17.4%	9.7%	19.1%	19.1%	13.7%	10.9%	9.3%
#Total PSMs	62 089	25 354	103 959	81 326	217 841	196 759	201 301
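The relative improvements quoted for Table 2 can be checked directly from the tabulated recalls. The short sketch below computes the relative gain of pNovo 3 over pNovo and over PEAKS for each dataset; the dictionary layout and function names are only for illustration.

```python
# Top-1 recalls (%) from Table 2, in dataset order:
# V.mungo, M.musculus, M.mazei, S.cerevisiae, A.mellifera, QE_HF_X1, QE_HF_X2
RECALL = {
    "pNovo 3": [64.6, 50.4, 66.0, 64.7, 62.5, 47.8, 38.3],
    "pNovo": [42.9, 25.7, 42.4, 47.7, 36.7, 29.8, 21.4],
    "PEAKS": [44.3, 24.9, 42.4, 50.0, 38.0, 32.2, 24.6],
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def gain_range(baseline):
    """Smallest and largest relative gain of pNovo 3 over a baseline tool."""
    gains = [relative_gain(a, b) for a, b in zip(RECALL["pNovo 3"], RECALL[baseline])]
    return round(min(gains), 1), round(max(gains), 1)


range_vs_pnovo = gain_range("pNovo")
range_vs_peaks = gain_range("PEAKS")
```

These ranges reproduce the 35.6–96.1% and 29.4–102.4% figures reported in the text, which confirms that the percentages are relative gains rather than differences in percentage points.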

Fig. 2.

Venn diagram of the correct results of pNovo 3, pNovo and PEAKS on the first three datasets: (a) V.mungo, (b) M.musculus and (c) M.mazei

The recalls considering the top-1 to top-10 peptide candidates for each spectrum are also shown in Figure 3 for the first three datasets and Supplementary Figure S5 for the other four datasets. The recall considering top-k (1 ≤ k ≤ 10) candidates was calculated as the number of spectra whose correct peptide was among the top-k sequences divided by the total number of spectra. As Novor only reported the top-1 results, it was not considered in this analysis. As shown in Figure 3 and Supplementary Figure S5, the recall of top-10 results reported by pNovo 3 was ∼20.8% higher than that of pNovo and ∼25.7% higher than that of PEAKS on all datasets. Although the recall of top-10 results reported by pNovo was slightly higher than that of PEAKS, the recall of top-1 results reported by pNovo was slightly worse than that of PEAKS. This meant that the scoring method in pNovo was less effective at distinguishing similar candidates for one spectrum. However, the refined scoring method in pNovo 3 was shown to be much more powerful. As a result, pNovo 3 achieved a much higher recall than pNovo and PEAKS, especially for the top-1 results, which are more important for real biological discoveries.
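The top-k recall defined above can be written in a few lines. This sketch treats Ile and Leu as identical, following the evaluation rule stated earlier; the data layout and names are assumptions for the example.

```python
def normalize(peptide):
    """Treat isoleucine and leucine as the same residue."""
    return peptide.replace("I", "L")


def topk_recall(candidates, truth, k):
    """Fraction of spectra whose correct peptide is among the top-k candidates.

    candidates: spectrum id -> ranked list of peptide sequences.
    truth: spectrum id -> peptide annotated by database search.
    """
    hits = 0
    for spectrum_id, correct in truth.items():
        top = [normalize(p) for p in candidates.get(spectrum_id, [])[:k]]
        if normalize(correct) in top:
            hits += 1
    return hits / len(truth)


# Toy data: s2 differs from its annotation only by I/L, s3 is never recovered.
truth = {"s1": "ACDEFK", "s2": "PEPTLDEK", "s3": "GTFSGLESSSPEVK"}
candidates = {
    "s1": ["ADCEFK", "ACDEFK"],
    "s2": ["PEPTIDEK"],
    "s3": ["GTFSGLESSSEPVK"],
}
recall_top1 = topk_recall(candidates, truth, 1)
recall_top2 = topk_recall(candidates, truth, 2)
```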

Fig. 3.

The recalls of top-1 to top-10 on the first three datasets: (a) V.mungo, (b) M.musculus and (c) M.mazei

The recall difference between pNovo 3 and pNovo considering top-10 candidates demonstrated the effect of the spectrum merging method, the last step of pNovo 3, because the original top-10 peptide candidates were the same for both pNovo 3 and pNovo. Furthermore, we compared the performance of pNovo 3, the same algorithm without spectrum merging (referred to as pNovo 3-NM) and the same algorithm without the SVM-rank model described in Section 2.4 (referred to as pNovo 3-NR). The recall of top-1 results reported by pNovo 3 was 15.0–35.2% higher than that of pNovo 3-NM and 15.6–35.8% higher than that of pNovo 3-NR (Supplementary Table S5). This demonstrated that both SVM-rank models are useful for increasing the number of correct results. Furthermore, pNovo 3 stably covered ∼96% of pNovo 3-NM results on all datasets and independently reported 17.2–30.0% of the total results (Supplementary Fig. S6), which also showed that this strategy rarely replaced a correct PSM from the learning-to-rank model with an incorrect one.

3.3 Examples showing the effect of the features used in pNovo 3

Two examples were selected to explain why pNovo 3 could report more correct results than pNovo and PEAKS. The first one is shown in Figure 4. For this spectrum, only pNovo 3 reported the correct sequence (KYDEIDAAPEER) annotated by database search, while pNovo and PEAKS reported the same incorrect sequence (KYDEIDAAEPER). Considering only the quality of the PSM, both sequences matched the same backbone fragment ions, hence the match scores would be the same if no additional information were considered. However, according to the two theoretical spectra predicted by pDeep, the fragmentation patterns of the two sequences were actually quite different, especially for the intensities of the y2, y3 and y4 ions (Supplementary Figs S7 and S8), which resulted in different similarities. In addition, the probabilities of the existence of the two gaps, PE and EP, were 0.2 and 0.03, respectively, which also helped to distinguish between the two sequences. Another similar example is shown in Supplementary Figures S9–S11.

Fig. 4.

One example shows that the features extracted by pNovo 3 can effectively discriminate between the correct and very similar incorrect results. The real spectrum is from V.mungo dataset and the title of this spectrum is 4723.8552.8552.2.dta. Both the correct (KYDEIDAAPEER, the above subfigure) and incorrect (KYDEIDAAEPER, the below subfigure) peptide sequences are matched to this real spectrum. Five features of the correct and incorrect peptide sequences are labeled with the green and red figures, respectively

3.4 Recalls and precisions of different de novo sequencing algorithms at the amino acid level

PEAKS and Novor also report scores for individual amino acids in the peptide results, which indicate the local confidence level of PSMs and are helpful in assembling entire protein sequences (Tran ). The same function was implemented for pNovo 3 by the newly developed software tool pSite (Yang ). Considering the top-1 results reported by each algorithm, the recalls and precisions at different confidence score thresholds were computed (Fig. 5a–c for the first three datasets and Supplementary Fig. S12 for the others), and the area under the curve (AUC) (Davis, 2006) metric can be used to evaluate the overall accuracy of de novo sequencing at the amino acid level. On all datasets, the precision-recall (PR) curves of pNovo 3 were always higher than those of pNovo, PEAKS and Novor. The AUC of pNovo 3 was 12.1–34.4% higher than that of pNovo, 2.0–20.1% higher than that of PEAKS and 65.7–112.5% higher than that of Novor (Fig. 5d).
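A precision-recall curve over confidence thresholds and its AUC can be sketched as follows. The exact definitions used in the paper are not given in the text, so this example assumes that recall is the number of correct amino acids above the threshold divided by the number of amino acids in the ground truth, that precision is the same count divided by the number of reported amino acids above the threshold, and that the area is obtained with the trapezoidal rule. Tied confidence scores are not treated specially.

```python
def pr_points(scored, total_true):
    """(recall, precision) after each amino acid, in decreasing confidence order.

    scored: (confidence, is_correct) pairs, one per reported amino acid.
    total_true: number of amino acids in the ground-truth peptides.
    """
    points = []
    correct = reported = 0
    for _, is_correct in sorted(scored, key=lambda item: -item[0]):
        reported += 1
        correct += int(is_correct)
        points.append((correct / total_true, correct / reported))
    return points


def pr_auc(points):
    """Trapezoidal area under the PR curve, starting at recall 0."""
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


# Toy example: four reported residues, four residues in the ground truth.
scored = [(0.9, True), (0.8, True), (0.4, False), (0.3, True)]
points = pr_points(scored, total_true=4)
area = pr_auc(points)
```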

Fig. 5.

The precision-recall (PR) curves of pNovo 3, pNovo, PEAKS and Novor on the first three datasets: (a) V.mungo, (b) M.musculus, (c) M.mazei. (d) The AUCs of the three algorithms on the seven datasets

Supplementary Table S6 shows the total recall and precision of amino acids on the seven datasets regardless of the confidence level, i.e. all amino acids reported by each de novo sequencing tool were considered to compute the recall and precision. The recall and precision of pNovo 3 were greater than 80% on the first five datasets in most cases. On the last two datasets produced by Q Exactive HF-X, the recall and precision decreased to 55–73%, mirroring the decrease observed for full-length peptides. Overall, the recall of pNovo 3 was 20.5%, 9.2% and 65.9% higher than those of pNovo, PEAKS and Novor, respectively; meanwhile, the precision of pNovo 3 was 18.4%, 17.5% and 83.8% higher than those of the three algorithms, respectively.

3.5 Comparing the performance of pNovo 3 and DeepNovo

Both pNovo 3 and DeepNovo (Tran ) use deep learning models built on the Google TensorFlow library, and their performance was also compared in this study. As DeepNovo used different models for the nine high-resolution test datasets (Cassidy ; Cypryk ; Hu ; Mata ; Nevo ; Paiva ; Petersen ; Reuß ; Seidel ), we downloaded the original results of DeepNovo rather than re-analyzing the datasets using the unified model in the software, and then tested pNovo 3 on the same nine datasets to make a fair comparison. The benchmark strategy was the same as that described in Section 2.1, in that PSMs generated by the database search results of PEAKS DB at 1% FDR were used as the ground truth, while the PSMs whose corresponding spectra did not appear in the results of DeepNovo were removed. As shown in Supplementary Table S7, the recall of pNovo 3 was still 20.6–49.8% higher than that of DeepNovo. Furthermore, pNovo 3 also yielded higher recall and precision at the amino acid level for the top-1 peptide sequences (Supplementary Table S8). This gap might be owing to the different ways in which the two algorithms use deep learning. DeepNovo combined deep learning and dynamic programming in a unified de novo sequencing workflow, while pNovo 3 divided this workflow into two steps: finding top-ranked candidates by a traditional algorithm, e.g. pNovo+, and then reranking the candidates using several different features extracted by deep learning, which were integrated into a learning-to-rank framework. The first step has been widely investigated in past decades and might be more mature than the newly proposed deep learning approach. However, once the top-ranked peptide candidates were generated, the deep learning approach, which provided more accurate spectrum prediction for the subsequent learning-to-rank model, played a more important role in distinguishing among the similar peptides in pNovo 3.

4 Discussion

In this study, we used the deep learning approach to extract features, and built a learning-to-rank model to rerank the results of de novo sequencing. Until now, the problem of low precision in de novo sequencing has not been solved well because there are no effective methods to distinguish similar peptides if the pivotal peaks in a spectrum are not detected. As a result, a more powerful scoring function is needed, which motivated the use of deep learning to learn the fragmentation pattern of peptides in this study. However, although deep learning models are often learned directly from raw data and do not rely on well-designed features, we still need to find out which features are most useful for discriminating between correct and incorrect peptides in de novo sequencing. In this study, we found that the similarity between the experimental and theoretical spectra, measured by three types of metrics, was very important for reranking de novo sequencing results. As no model comprehensively considering modified peptides has yet been trained in the current version of pDeep, only peptides without variable modifications were used in this study; however, the learning-to-rank model can be easily extended to modified peptides once pDeep is upgraded. We used learning-to-rank models [e.g. SVM-rank (Joachims, 2002; Joachims ) and RankBoost (Freund )], rather than traditional machine learning models [e.g. SVM (Cortes and Vapnik, 1995; Vapnik, 1999) and decision tree (Quinlan, 1999)], in this study. In general, traditional machine learning models are more suitable for learning a global classification function that discriminates between correct and incorrect PSMs from different spectra; however, as the comparison among different spectra is less important in de novo sequencing, learning-to-rank models are better suited to the reranking problem, i.e. reranking the similar sequence candidates for each spectrum. 
On all datasets, the recall and precision of pNovo 3 are always the highest among pNovo, PEAKS, Novor and DeepNovo. However, the top-10 recall of pNovo 3 is still only 60–76% across datasets, which raises the question of why the remaining spectra cannot be sequenced correctly even when as many as ten candidates are considered. For example, a total of 15 067 (24% of 62 089) PSMs in the V. mungo dataset are not recalled in the top-10 results, and 32% (4775/15 067) of these are difficult to recall by de novo sequencing because their maximum gap lengths are greater than 2. We further enumerated the similar peptide candidates based on the correct sequence of each of these low-quality PSMs and matched them to the original spectra. For example, if the correct sequence is ASQEPK with a gapped subsequence ASQ, the similar candidates include ASQEPK, AQSEPK, …, SQAEPK, and the three similarity metrics used in this study are computed for each of them (Supplementary Fig. S13). We find that the similarities are too close to tell which candidate is correct. This means that de novo sequencing algorithms at the current stage may not be able to distinguish among similar results with long gapped subsequences, even with an effective deep learning approach. In this case, the more effective way to improve the accuracy of de novo sequencing is to produce high-quality MS/MS spectra with higher fragment ion coverage.
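
The ASQEPK example can be made concrete with a short Python sketch. It assumes that the similar candidates are the permutations of the gapped subsequence, which is consistent with the listed examples but is not stated explicitly in the text; the function names are hypothetical, and only singly charged b ions are considered. The residue masses are standard monoisotopic values.

```python
from itertools import permutations

# Standard monoisotopic residue masses (Da) for the residues in the example.
RESIDUE_MASS = {
    "A": 71.03711, "S": 87.03203, "Q": 128.05858,
    "E": 129.04259, "P": 97.05276, "K": 128.09496,
}
PROTON = 1.00728


def enumerate_candidates(seq, start, end):
    """All distinct sequences obtained by permuting seq[start:end]."""
    prefix, gap, suffix = seq[:start], seq[start:end], seq[end:]
    return sorted({prefix + "".join(p) + suffix for p in permutations(gap)})


def b_ion_mz(seq):
    """Singly charged b-ion m/z values b1..b(n-1)."""
    mz, total = [], 0.0
    for aa in seq[:-1]:
        total += RESIDUE_MASS[aa]
        mz.append(total + PROTON)
    return mz


def shared_b_ions(cands, tol=1e-6):
    """1-based indices of b ions whose m/z is identical across all candidates."""
    ladders = [b_ion_mz(c) for c in cands]
    return [
        i + 1
        for i in range(len(ladders[0]))
        if all(abs(ladder[i] - ladders[0][i]) < tol for ladder in ladders)
    ]


candidates = enumerate_candidates("ASQEPK", 0, 3)
print(candidates)
print(shared_b_ions(candidates))
```

Under these assumptions the six candidates share the b3, b4 and b5 ions and differ only in b1 and b2. If the fragment ions inside the gap are missing from the spectrum, the peak positions alone carry no information about the order of A, S and Q, and only differences in predicted intensities remain to separate the candidates.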

Funding

This work was supported by the National Key Research and Development Program of China (2016YFA0501300 to S.-M.H), the National Natural Science Foundation of China (31470805), the Youth Innovation Promotion Association CAS (2014091), the National High Technology Research and Development Program of China (863) (2014AA020902 to S.-M.H. and 2014AA020901 to H.C.). Conflict of Interest: none declared.

39 in total

1. Probability-based protein identification by searching sequence databases using mass spectrometry data.

Authors: D N Perkins; D J Pappin; D M Creasy; J S Cottrell
Journal: Electrophoresis Date: 1999-12 Impact factor: 3.535

2. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry.

Authors: Bin Ma; Kaizhong Zhang; Christopher Hendrie; Chengzhi Liang; Ming Li; Amanda Doherty-Kirby; Gilles Lajoie
Journal: Rapid Commun Mass Spectrom Date: 2003 Impact factor: 2.419

3. PepNovo: de novo peptide sequencing via probabilistic network modeling.

Authors: Ari Frank; Pavel Pevzner
Journal: Anal Chem Date: 2005-02-15 Impact factor: 6.986

4. Monoclonal antibody successes in the clinic.

Authors: Janice M Reichert; Clark J Rosensweig; Laura B Faden; Matthew C Dewitz
Journal: Nat Biotechnol Date: 2005-09 Impact factor: 54.908

5. De novo sequencing of neuropeptides using reductive isotopic methylation and investigation of ESI QTOF MS/MS fragmentation pattern of neuropeptides with N-terminal dimethylation.

Authors: Qiang Fu; Lingjun Li
Journal: Anal Chem Date: 2005-12-01 Impact factor: 6.986

6. An overview of statistical learning theory.

Authors: V N Vapnik
Journal: IEEE Trans Neural Netw Date: 1999

7. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification.

Authors: Jürgen Cox; Matthias Mann
Journal: Nat Biotechnol Date: 2008-11-30 Impact factor: 54.908

8. pNovo: de novo peptide sequencing and identification using HCD spectra.

Authors: Hao Chi; Rui-Xiang Sun; Bing Yang; Chun-Qing Song; Le-Heng Wang; Chao Liu; Yan Fu; Zuo-Fei Yuan; Hai-Peng Wang; Si-Min He; Meng-Qiu Dong
Journal: J Proteome Res Date: 2010-05-07 Impact factor: 4.466

9. pNovo+: de novo peptide sequencing using complementary HCD and ETD tandem mass spectra.

Authors: Hao Chi; Haifeng Chen; Kun He; Long Wu; Bing Yang; Rui-Xiang Sun; Jianyun Liu; Wen-Feng Zeng; Chun-Qing Song; Si-Min He; Meng-Qiu Dong
Journal: J Proteome Res Date: 2012-12-28 Impact factor: 4.466

10. PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification.

Authors: Jing Zhang; Lei Xin; Baozhen Shan; Weiwu Chen; Mingjie Xie; Denis Yuen; Weiming Zhang; Zefeng Zhang; Gilles A Lajoie; Bin Ma
Journal: Mol Cell Proteomics Date: 2011-12-20 Impact factor: 5.911
