
Prosit Transformer: A transformer for Prediction of MS2 Spectrum Intensities.

Markus Ekvall¹, Patrick Truong¹, Wassim Gabriel², Mathias Wilhelm², Lukas Käll¹.

Abstract

Machine learning has long been an integral part of interpreting data from mass spectrometry (MS)-based proteomics. Relatively recently, a machine-learning architecture, the Transformer, has proven successful in other areas of bioinformatics. Furthermore, the implementation of Transformers within bioinformatics has become relatively convenient due to transfer learning, i.e., adapting a network trained for other tasks to new functionality. Transfer learning makes these relatively large networks more accessible, as it generally requires less data and substantially shortens the training time. We implemented a Transformer based on the pretrained model TAPE to predict MS2 intensities. TAPE is a general model trained to predict missing residues from protein sequences. Despite it being trained for a different task, we could modify its behavior by adding a prediction head at the end of the TAPE model and fine-tuning it using the spectrum intensities from the training set of the well-known predictor Prosit. We demonstrate that the predictor, which we call Prosit Transformer, outperforms the recurrent neural-network-based predictor Prosit, increasing the median angular similarity on its hold-out set from 0.908 to 0.929. We believe that Transformers will significantly increase prediction accuracy for other types of predictions within MS-based proteomics.

Entities: Chemical

Keywords: MS2 Spectra; Machine Learning; Proteomics; Transformers


Year: 2022 PMID: 35413196 PMCID: PMC9087333 DOI: 10.1021/acs.jproteome.1c00870

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 5.370

Introduction

Just as in many other areas involving the analysis of large and complex data sets, different types of machine learning are tremendously helpful for the modern analysis of mass spectrometry (MS)-based proteomics data.[1,2] For example, we can nowadays use machine learning to predict tryptic digestion,[3] chromatographic retention time,[4−6] collisional cross section,[7] the accuracy of peptide–spectrum matches,[8] and the accuracy of transitions in DIA data.[9] One task that has gained traction in the last couple of years is predicting MS2 spectra from peptide sequences.[10,11] Such predictors can predict the relative intensities of a given peptide sequence’s b- and y-ions. Together with the m/z values of the ions, which one can derive from first principles, one can subsequently form a full MS2 spectrum. MS2 spectrum prediction has in a short time established itself as a means to rescore peptide–spectrum matches,[12] increase the sensitivity in large search spaces,[13] and support target–decoy strategies for DIA interpretation.[14] Many types of frameworks are available for training a predictor; support vector machines and recurrent neural networks (RNNs), for instance, are used within MS-based proteomics. However, in the last couple of years, an architecture first introduced in natural language processing[15] and known as the Transformer[16] has successfully been employed within bioinformatics, e.g., for structure prediction[17,18] and gene expression prediction,[19] and even within MS-based proteomics, e.g., for the peptide detection problem,[20] DIA library generation for the phosphoproteome,[21] and de novo interpretation of MS2 spectra.[22] Transformers are, like RNNs, designed to handle sequential input data and do so through attention mechanisms, i.e., mechanisms that enhance the parts of the input sequence that are essential for the output. However, unlike RNNs, Transformers do not use recurrence, which enables a significant speed-up by parallelizing their training.
The encoder–decoder structure is the basis of the Transformers, where both the encoder and decoder adopt the multiheaded attention mechanism.[16] Notably, the Tasks Assessing Protein Embeddings (TAPE) model[17] is exciting; a Transformer-based autoencoder of protein sequences is formed by withholding one amino acid at a time in a large set of protein sequences and subsequently predicting the missing amino acid. One can subsequently employ the model for higher-level tasks by plugging it into some extra layers of neurons in a process known as transfer learning.[17,18] Here, we argue that Transformers can greatly aid MS-based proteomics. We demonstrate that TAPE’s BERT submodel can predict MS2 spectrum intensities from peptide sequences. We use the training and test sets of the popular Prosit[11] predictor and demonstrate that the transformer-based predictor, which we named Prosit Transformer, drastically outperforms the old implementation of Prosit.

Methods

Data

We downloaded the Prosit training data from https://figshare.com/projects/Prosit/35582. This set is composed of spectra from PXD004732, PXD010595, and PXD021013.[11,13] The Prosit data had to be converted from HDF5 to LMDB to be compatible with the TAPE framework. The LMDB data files used during training and validation are accessible at https://figshare.com/articles/dataset/LMDB_data_Tape_Input_Files/16688905.

Architecture

The TAPE model consists of 12 attention layers with 768 hidden units each, with the attention dropout (DropHead) rate[23] and regular dropout rate set to 0.1. We downloaded weights for the pretrained model, which has been trained on the raw protein sequences in the protein families database (Pfam) to predict the amino acid at each protein position given the preceding and following amino acids.[17] The Prosit-specific transformer has the same parameters but consists of nine attention layers. The metadata layer is a multilayer perceptron (MLP) with two layers of 512 units, each followed by dropout at a rate of 0.1. The final prediction layer has the same structure, except that there is no dropout after the final layer. The activation function is ReLU, except in the prediction layer, where the first layer uses ReLU6,[24] i.e., max(0, min(6, x)), and the final layer is linear.
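As a rough illustration of the prediction layer described above, the sketch below applies ReLU6 after a first linear layer and leaves the final layer linear. The helper names (`relu6`, `linear`, `prediction_head`) and the toy weights are hypothetical, the layers are far smaller than the 512 units used in the model, and dropout is left out, as it would be at inference time.

```python
def relu6(x):
    """ReLU6 activation: max(0, min(6, x))."""
    return max(0.0, min(6.0, x))


def linear(inputs, weights, bias):
    """Dense layer: one output per weight row, plus a bias term."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def prediction_head(x, w1, b1, w2, b2):
    """Two-layer head: linear + ReLU6, followed by a purely linear layer."""
    hidden = [relu6(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy example with made-up weights (illustration only).
w1 = [[4.0, 2.0], [-1.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 1.0], [-1.0, 0.0]]
b2 = [0.5, 0.0]
output = prediction_head([1.0, 2.0], w1, b1, w2, b2)
```

Because the last layer is linear, the head can emit negative values, which is what the postprocessing step later interprets as absent peaks.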

Metrics

We measure angular distance, $d = \frac{2}{\pi}\cos^{-1}\left(\frac{A \cdot B}{\lVert A\rVert \cdot \lVert B\rVert}\right)$, and angular similarity, $s = 1 - d$, as measures of the accuracy of the predicted intensities. Here, A is the vector of predicted intensities, and B is the vector of observed intensities for the ion series included in the prediction. We introduced two extra steps during training to avoid undefined behavior. First, we clipped the input to cos⁻¹ to the interval [−(1 − ϵ), 1 − ϵ]. This was necessary since some predictions were too similar to their target after training, resulting in an undefined loss. There was no clipping during the evaluation, so it does not affect the final result. Second, we introduced a small ϵ in the denominator of the cosine similarity, i.e., max(∥A∥·∥B∥, ϵ), to ensure no undefined behavior during training. The sum of all d for all peptides in the test set was used as a loss function for the training of the networks. We calculated the FDR = FP/(FP + TP) and FNR = FN/(FN + TP) for each predicted spectrum to measure the number of erroneous peak predictions. Here, FP is the number of peaks predicted to be present in a spectrum but absent in the observed spectrum; FN is the number of peaks predicted to be absent in a spectrum but present in the observed spectrum; and TP is the number of peaks correctly predicted to be present, i.e., also present in the observed spectrum.
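The angular metrics and the two safeguards can be sketched in a few lines of plain Python. This is a minimal sketch, not the training code: the function names, the default ϵ of 1e-7, and the `clip` flag are assumptions made for illustration, and the guard against floating-point overshoot in the unclipped branch is an addition needed only because `math.acos` rejects inputs outside [−1, 1].

```python
import math


def angular_distance(pred, obs, eps=1e-7, clip=False):
    """Normalized angle d = 2 * acos(cosine similarity) / pi.

    pred, obs: equal-length sequences of intensities.
    eps:       small constant (assumed value) for the two safeguards.
    clip:      True mimics the training-time clipping to +/-(1 - eps).
    """
    dot = sum(a * b for a, b in zip(pred, obs))
    norm = math.sqrt(sum(a * a for a in pred)) * math.sqrt(sum(b * b for b in obs))
    cos_sim = dot / max(norm, eps)  # eps keeps the denominator non-zero
    if clip:
        cos_sim = max(-(1.0 - eps), min(1.0 - eps, cos_sim))
    else:
        # Assumption: only guard against rounding slightly outside [-1, 1].
        cos_sim = max(-1.0, min(1.0, cos_sim))
    return 2.0 * math.acos(cos_sim) / math.pi


def angular_similarity(pred, obs, eps=1e-7, clip=False):
    """Angular similarity s = 1 - d."""
    return 1.0 - angular_distance(pred, obs, eps=eps, clip=clip)
```

With `clip=True`, two identical vectors give a small but non-zero distance instead of exactly zero, which keeps the derivative of cos⁻¹ finite during training; with `clip=False`, identical vectors score a similarity of 1 and orthogonal vectors a similarity of 0.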

Postprocessing of Predicted Intensities

For the final result, we apply the same postprocessing to the predicted spectrum as used in Prosit.[11] Specifically, we set ions with a predicted negative intensity to zero, i.e., a negative intensity indicates an absent peak. Furthermore, we set the intensity of every ion that cannot be obtained for a given peptide, because of too low a charge state or too short a peptide, to −1. We exclude such peaks from the similarity measurements.
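The postprocessing rules above can be sketched as follows. The function names and the boolean `obtainable` mask are hypothetical; how the mask is derived from charge state and peptide length is not shown here.

```python
def postprocess(pred, obtainable):
    """Apply the two rules: negative -> 0, unobtainable ion -> -1.

    pred:       predicted intensities, one per ion.
    obtainable: booleans, False where the ion cannot exist for this
                peptide (hypothetical mask, supplied by the caller).
    """
    cleaned = []
    for value, ok in zip(pred, obtainable):
        if not ok:
            cleaned.append(-1.0)   # ion not obtainable for this peptide
        elif value < 0.0:
            cleaned.append(0.0)    # negative prediction means absent peak
        else:
            cleaned.append(value)
    return cleaned


def drop_masked(pred, obs):
    """Remove ions marked -1 before computing a similarity."""
    keep = [i for i, p in enumerate(pred) if p != -1.0]
    return [pred[i] for i in keep], [obs[i] for i in keep]


cleaned = postprocess([0.5, -0.2, 0.3], [True, True, False])
kept_pred, kept_obs = drop_masked(cleaned, [0.4, 0.0, -1.0])
```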

Hardware

The model was trained on the Berzelius SuperPOD, a GPU cluster consisting of 60 NVIDIA DGX A100 systems, linked on a 200 Gbit/s NVIDIA Mellanox InfiniBand HDR network.

Results

We set out to test whether Transformers are a technology fit for spectrum intensity predictions, i.e., to predict the intensities of the most commonly observed ion series (b+, b2+, b3+, y+, y2+, and y3+) of product ion spectra from peptide fragmentation. The peptides ranged in length from 7 to 30 amino acids. We used the train/test data and the preprocessing that come with the Prosit predictor as a testbed. Prosit’s scripts for calculating the intensity vectors, adopting metadata, and calculating the angular similarity of predictions have proven robust after years of use. We also found it straightforward to set up a benchmark, as we could reuse the Prosit test sets out of the box. We will refer to the traditional Prosit predictor as Prosit RNN from here on to avoid confusion.

Model

We set out to use the setup previously used for training and testing the Prosit model but with a transformer. We used the pretrained TAPE model[17] and retrofitted it with a Prosit-specific decoder and some additional application-specific code (see Figure 1). The TAPE model encodes the peptide into a 512-dimensional embedding. Furthermore, just as for the original RNN-based Prosit model, we used layers for handling metadata consisting of the charge state of the spectrum and its collision energy (CE). The charge states range from one to six and are represented as a six-dimensional one-hot encoding. Hence, the metadata layer has seven input nodes to account for the charge state and CE. The metadata layer transforms the metadata into a 512-dimensional vector that is subsequently combined with the encoded peptide by element-wise multiplication. A Prosit-specific Transformer then decodes this combined embedding. Lastly, a two-layered multilayer perceptron (MLP) follows the decoding layer, serving as a prediction layer to predict the spectrum intensity. The MLP used a rectified linear activation constrained between 0 and 6 (a ReLU6 function) in the prediction layers to avoid a so-called gradient explosion. For the training, the objective function was to minimize the sum of the angular distances between the observed and predicted spectrum intensity vectors.
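To make the metadata handling concrete, the sketch below builds the seven-value metadata input (six one-hot charge entries plus the CE) and shows the element-wise multiplication used to combine two embeddings. The function names and the ordering of the seven inputs (charge first, CE last) are assumptions, and the vectors in the example are two-dimensional stand-ins for the 512-dimensional ones.

```python
def encode_metadata(charge, collision_energy, max_charge=6):
    """Return the 7 metadata inputs: one-hot charge (1..6) followed by CE."""
    if not 1 <= charge <= max_charge:
        raise ValueError(f"charge must be between 1 and {max_charge}")
    one_hot = [1.0 if c == charge else 0.0 for c in range(1, max_charge + 1)]
    return one_hot + [float(collision_energy)]


def combine(peptide_embedding, metadata_embedding):
    """Element-wise product of two equally sized embeddings."""
    if len(peptide_embedding) != len(metadata_embedding):
        raise ValueError("embeddings must have the same dimension")
    return [p * m for p, m in zip(peptide_embedding, metadata_embedding)]


metadata_input = encode_metadata(charge=2, collision_energy=0.3)
combined = combine([1.0, 2.0], [0.5, 3.0])
```

Element-wise multiplication lets the metadata act as a per-dimension gate on the peptide embedding, which requires both vectors to share the same dimension.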

Figure 1

Architecture of the Prosit Transformer. The model depends on a pretrained encoder from the TAPE project and uses the TAPE design for a Prosit-specific decoder. However, our model implements many of the design features of Prosit RNN, i.e., layers handling metadata and final intensity prediction.

Training of the Model

During the training, we used a batch size of 1024, a learning rate of 0.0001, a gradient accumulation step of 1, and a linear learning rate scheduler with 10000 warmup steps. The training proceeded until there was no further improvement over 10 epochs. To better predict present and absent peaks, we introduced a hyperparameter, δ, setting an artificial offset of the intensities of absent peaks to δp = δ/|number of considered peaks|. This hyperparameter adds an extra penalty if the model predicts intensities for absent peaks. By varying the size of δ, we can control the model’s propensity to predict peaks as absent and thereby tune the model’s false positive and false negative predictions. We measured the false discovery rate (FDR) and the false negative rate (FNR) of each spectrum and then plotted the average angular similarity, the FDR, and the FNR for different choices of δ. We selected δ = 0.34 for the final training (see Figure 2).
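One possible reading of the δ offset is sketched below. Several details are assumptions rather than statements from the text: that absent peaks are those with an observed intensity of exactly zero, that "considered peaks" means all peaks not marked −1, and that the offset is applied with a negative sign (chosen here because negative predictions are later treated as absent peaks). The `sign` parameter makes the last assumption easy to change.

```python
def offset_absent_peaks(obs, delta=0.34, sign=-1.0):
    """Replace zero-intensity targets with sign * delta / n_considered.

    obs:   observed intensities; -1 marks ions that are not considered.
    delta: the hyperparameter delta (0.34 was selected in the text).
    sign:  direction of the offset (assumption: negative).
    """
    n_considered = sum(1 for o in obs if o != -1.0)
    if n_considered == 0:
        return list(obs)
    delta_p = delta / n_considered
    return [sign * delta_p if o == 0.0 else o for o in obs]


targets = offset_absent_peaks([0.5, 0.0, 0.0, -1.0], delta=0.34)
```

In this example three peaks are considered, so each absent peak is moved from 0 to −0.34/3, while the present peak and the masked entry are left unchanged.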

Figure 2

Effect of adjusting the hyperparameter δ on predicting the absence/presence of individual MS2 peaks. To obtain better prediction accuracy of present and absent MS2 peaks, we adjusted the intensities of absent peaks from zero to δ. We measured the false discovery rate (FDR) and the false negative rate (FNR) of each spectrum and then plotted the average angular similarity, the FDR, and the FNR for different choices of δ. We selected δ = 0.34 for the final training. The predicted spectra were not postprocessed for the measurements in this figure (see Methods).

Comparison of Performance to Regular Prosit

To test the performance of our final Prosit Transformer, we evaluated it on the same held-out test set as used when initially training Prosit RNN. We calculated the angular similarity between the predicted and observed intensities for both predictors. Overall, the predictions from Prosit Transformer have a higher angular similarity than those of Prosit RNN and are hence more accurate (Figure 3A). The Prosit Transformer increased the median angular similarity from Prosit RNN’s 0.908 to 0.929. Prosit Transformer obtained a higher angular similarity than Prosit RNN in 75.7% of the spectra, whereas the opposite was true in 24.3% of the spectra. The same pattern held when dividing the PSMs based on their peptides’ lengths (Figure 3B). We also wanted to compare the predictors’ ability to predict present and absent (zero intensity) fragment peaks. Compared to Prosit RNN, our choice of hyperparameter δ for Prosit Transformer resulted in a lower fraction of observed absent peaks among the predicted nonzero intensity peaks (Figure 3C) but a higher fraction of predicted absent peaks among the observed nonzero intensity peaks (Figure 3D).
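The two peak-level error rates compared here follow the definitions FDR = FP/(FP + TP) and FNR = FN/(FN + TP) given in Methods. A minimal per-spectrum sketch is shown below; the function name is hypothetical, a peak is assumed to count as present when its intensity is strictly positive, and returning 0 for an empty denominator is a convention chosen for this example.

```python
def peak_error_rates(pred, obs):
    """Return (FDR, FNR) for one spectrum.

    A peak counts as present when its intensity is > 0 (assumption);
    entries that are <= 0 in both vectors contribute to neither rate.
    """
    fp = fn = tp = 0
    for p, o in zip(pred, obs):
        pred_present = p > 0.0
        obs_present = o > 0.0
        if pred_present and obs_present:
            tp += 1
        elif pred_present:
            fp += 1   # predicted present, observed absent
        elif obs_present:
            fn += 1   # predicted absent, observed present
    fdr = fp / (fp + tp) if (fp + tp) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fdr, fnr


fdr, fnr = peak_error_rates([0.5, 0.2, 0.0, 0.0], [0.4, 0.0, 0.3, 0.0])
```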

Figure 3

Comparison of the accuracy of Prosit Transformer and Prosit RNN. (A) Distribution of angular similarity for the spectra predicted with Prosit Transformer and Prosit RNN, shown as separate histograms smoothed with a kernel density estimator. (B) The same angular similarity stratified by peptide length. We also measured the (C) false discovery rate, i.e., the fraction of observed absent peaks among the predicted nonzero intensity peaks for each spectrum, and (D) false negative rate, i.e., the fraction of predicted absent peaks among the observed nonzero intensity peaks.

Comparison of a Transformer to an Extended RNN for Prediction of Spectra

We set out to rule out explanations for Prosit Transformer’s elevated performance other than the Transformers themselves. A notable difference between Prosit RNN and Prosit Transformer is their size. Prosit RNN contains 3 million parameters, while Prosit Transformer contains 164 million parameters, which gives the Transformer an unfair advantage. Hence, we stacked long short-term memory (LSTM) layers to create RNN models of a size similar to that of the Transformers. This extended RNN gave a median angular similarity of 0.892 compared to Prosit Transformer’s 0.929. Further, Prosit Transformer also outperformed the extended RNN’s encoder in combination with Prosit Transformer’s decoder (median angular similarity of 0.927), as well as Prosit Transformer’s encoder in combination with Prosit RNN’s decoder (median angular similarity of 0.915). See Table 1 for an overview of the permutations of encoder–decoder architectures and their sizes.

Table 1

Extended RNN Model’s Size and Performanceᵃ

architecture for encoder–decoder	encoder size	encoder layers	encoder units	decoder size	decoder layers	decoder units	total size	median angular similarity
Transformer–Transformer	85M	12	768	64M	9	768	164M	0.929
RNN–RNN	77M	5	1028	93M	5	2056	178M	0.892
Transformer–RNN	64M	9	768	94M	10	768	172M	0.9156
RNN–Transformer	53M	6	768	113M	6	768	173M	0.927

ᵃ We trained and tested different permutations of expanded RNNs and Transformers of comparable size and compared their prediction accuracy.

When training the RNN models, the learning rate had to be decreased from 0.0001 to 0.00008 to get the model to learn. Everything else was the same as for the Transformer–Transformer model. We also had to switch the gated recurrent unit of the Prosit RNN to an LSTM to use the TAPE framework, leading to minor differences between the extended Prosit RNN and Prosit RNN. Surprisingly, the extended RNN–RNN model got worse results than regular Prosit. The decrease could be due to the increase from 3 M to 178 M parameters leading to overfitting; such a massive model may require more data for this type of architecture. However, a performance increase was observed in all cases when adding a Transformer to the architecture. The most significant increase in performance appeared when implementing the Transformer as a decoder, i.e., after the peptide has been encoded and combined with the metadata, rather than as the peptide’s encoder, although the latter improves the results as well. At first, the conclusion that the Transformer–Transformer model performed best might seem to contradict the results of others. In particular, DeepPhospho[21] reports a better performance for their LSTM–Transformer model than for their Transformer–Transformer model. However, it is worth noting that the circumstances were different; their LSTM decoder was larger than their Transformer decoder (34 M vs 6 M parameters).[21] One would expect the Transformer’s performance to increase with a larger model, whereas the LSTM would not benefit as much (and might even get worse) with a larger model.

Time Comparison of Spectrum Prediction

The Prosit Transformer was quicker to train than the full RNN model (approximately 3 versus 6 GPU days). However, all of the models in Table 1 were slower than the original Prosit RNN due to their increased size. To demonstrate this, we timed both regular Prosit and Prosit Transformer when predicting 1000, 10000, and 100000 spectra; see Table 2. Prosit Transformer requires roughly 40 times more time, so increasing the model size entails a trade-off between the accuracy and the time requirements of the transformer’s predictions.

Table 2

Larger Transformer Model Needs More Time to Predict Spectraᵃ

number of predicted spectra	1k	10k	100k
Prosit RNN	0.05 s	0.5 s	4.7 s
Prosit Transformer	2 s	18 s	180 s

ᵃ We measured the required time to predict spectra from peptides for both Prosit RNN and Prosit Transformer.
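The "roughly 40 times" figure can be checked directly from the timings in Table 2. The short sketch below only redoes that arithmetic; the variable and function names are chosen for this example.

```python
# Timings in seconds, copied from Table 2.
rnn_seconds = {1_000: 0.05, 10_000: 0.5, 100_000: 4.7}
transformer_seconds = {1_000: 2.0, 10_000: 18.0, 100_000: 180.0}


def slowdown(n_spectra):
    """Ratio of Prosit Transformer time to Prosit RNN time."""
    return transformer_seconds[n_spectra] / rnn_seconds[n_spectra]


for n in sorted(rnn_seconds):
    print(f"{n} spectra: {slowdown(n):.1f}x slower")
```

The ratios fall between roughly 36 and 40 across the three batch sizes, consistent with the stated factor.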

Prosit Transformer’s Ability to Model Collision Energy

We also wanted to test that the improved ability of Prosit Transformer to predict MS2 intensities did not affect the predictor’s ability to model CE’s influence on predicted spectra. We hence isolated batches of spectra with CE = {0.2, 0.25, 0.3, 0.35, 0.4} and measured the median angular similarity when predicting the spectra for a range of different collision energies (Figure 4). The highest angular similarity between the observed and predicted spectra was found when setting CE to the set’s actual specified value.

Figure 4

Mean spectral angle as a function of the collision energy for spectra acquired with different CEs.

Discussion

Here, we have used a Transformer trained to predict a protein sequence and transferred its functionality into predicting intensities of the b- and y-ions of MS2 spectra. The resulting predictor outperformed a predictor built from a classical recurrent neural network. This type of architecture can likely improve other types of peptide property prediction. One interesting finding, when comparing different combinations of RNNs and Transformers as decoders and encoders, was that the most significant improvement came from using a Transformer as the decoder. A possible interpretation of this result is that the Transformer architecture better utilizes the metadata, i.e., the collision energy and charge state information. A future direction of the project could be to investigate the source of the improved accuracy by examining the effects of removing this information from the different decoders. Here, we made use of the framework provided by the original Prosit project. It was essential to have access to the scripts and data sets provided and hardened by the previous team of algorithm designers. In general, it is of utmost importance to keep this type of resource easy to access. If we want to attract the attention of the machine learning community, which often wants a precise problem formulation and does not like to get into the details of how to generate data sets from scratch, we need to help them.

19 in total

1. Semi-supervised learning for peptide identification from shotgun proteomics datasets.

Authors: Lukas Käll; Jesse D Canterbury; Jason Weston; William Stafford Noble; Michael J MacCoss
Journal: Nat Methods Date: 2007-10-21 Impact factor: 28.547

Review 2. Artificial intelligence for proteomics and biomarker discovery.

Authors: Matthias Mann; Chanchal Kumar; Wen-Feng Zeng; Maximilian T Strauss
Journal: Cell Syst Date: 2021-08-18 Impact factor: 10.304

3. DeepDigest: Prediction of Protein Proteolytic Digestion with Deep Learning.

Authors: Jinghan Yang; Zhiqiang Gao; Xiuhan Ren; Jie Sheng; Ping Xu; Cheng Chang; Yan Fu
Journal: Anal Chem Date: 2021-04-07 Impact factor: 8.008

4. Evaluating Protein Transfer Learning with TAPE.

Authors: Roshan Rao; Nicholas Bhattacharya; Neil Thomas; Yan Duan; Xi Chen; John Canny; Pieter Abbeel; Yun S Song
Journal: Adv Neural Inf Process Syst Date: 2019-12

5. MS2PIP prediction server: compute and visualize MS2 peak intensity predictions for CID and HCD fragmentation.

Authors: Sven Degroeve; Davy Maddelein; Lennart Martens
Journal: Nucleic Acids Res Date: 2015-05-18 Impact factor: 16.971

6. Deep learning the collisional cross sections of the peptide universe from a million experimental values.

Authors: Florian Meier; Niklas D Köhler; Andreas-David Brunner; Jean-Marc H Wanka; Eugenia Voytik; Maximilian T Strauss; Fabian J Theis; Matthias Mann
Journal: Nat Commun Date: 2021-02-19 Impact factor: 14.919

7. DeepPhospho accelerates DIA phosphoproteome profiling through in silico library generation.

Authors: Ronghui Lou; Weizhen Liu; Rongjie Li; Shanshan Li; Xuming He; Wenqing Shui
Journal: Nat Commun Date: 2021-11-18 Impact factor: 14.919

8. Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics.

Authors: Mathias Wilhelm; Daniel P Zolg; Michael Graber; Siegfried Gessulat; Tobias Schmidt; Karsten Schnatbaum; Celina Schwencke-Westphal; Philipp Seifert; Niklas de Andrade Krätzig; Johannes Zerweck; Tobias Knaute; Eva Bräunlein; Patroklos Samaras; Ludwig Lautenbacher; Susan Klaeger; Holger Wenschuh; Roland Rad; Bernard Delanghe; Andreas Huhmer; Steven A Carr; Karl R Clauser; Angela M Krackhardt; Ulf Reimer; Bernhard Kuster
Journal: Nat Commun Date: 2021-06-07 Impact factor: 14.919

9. Generating high quality libraries for DIA MS with empirically corrected peptide predictions.

Authors: Brian C Searle; Kristian E Swearingen; Christopher A Barnes; Tobias Schmidt; Siegfried Gessulat; Bernhard Küster; Mathias Wilhelm
Journal: Nat Commun Date: 2020-03-25 Impact factor: 14.919