Abstract
Mass-spectrometry-based proteomics enables quantitative analysis of thousands of human proteins. However, experimental and computational challenges restrict progress in the field. This review summarizes the recent flurry of machine-learning strategies using artificial deep neural networks (or "deep learning") that have started to break barriers and accelerate progress in the field of shotgun proteomics. Deep learning now accurately predicts physicochemical properties of peptides from their sequence, including tandem mass spectra and retention time. Furthermore, deep learning methods exist for nearly every aspect of the modern proteomics workflow, enabling improved feature selection, peptide identification, and protein inference.
Keywords: MS/MS; bioinformatics; deep learning; mass spectrometry; neural networks; peptides; proteomics; retention time
Year: 2021 PMID: 35475237 PMCID: PMC9017218 DOI: 10.1016/j.crmeth.2021.100003
Source DB: PubMed Journal: Cell Rep Methods ISSN: 2667-2375
Figure 1. General proteomics workflow highlighting challenges
Peptides are produced from enzymatic hydrolysis of the isolated proteome and analyzed by liquid chromatography-tandem mass spectrometry (LC-MS/MS). This process involves detecting features and assigning retention times to peptides, detecting precursor peptide charge, and then measuring the fragment ion spectra, or tandem mass spectra, for each peptide. The collection of tandem mass spectra is then subject to peptide-spectrum matching to identify peptides, and the original set of proteins is inferred. Red stars indicate that deep learning tools now facilitate these aspects of the workflow.
Figure 2. Basic neural network background
(A and B) Neural networks are simply collections of math operations that transform an input (x) to an output (y). Inputs and outputs are connected to the neuron by weights, which are linear operators that multiply the previous value. The function in the hidden layer can be anything. In the simplest case with one neuron in the hidden layer (A), the input value is multiplied by the first weight, and then the new value x∗weight1 is input to the function in the neuron. The output of that function is multiplied by weight2 to calculate the output. When a neural network is “trained,” inputs are passed forward through the math to compute the output. The value of the output is compared with the true known value of y, and then the weights are updated slightly to make the output value closer to the true value of y. A simple example is shown in (B), where weight1 is 2, the function is 2∗x, and weight2 is 2. Note that neuron functions are often nonlinear, such as a rectified linear unit (ReLU) or sigmoid.
(C) The output y of this neural network is 16 when the input = 2.
(D) A simple recurrent neural network accepts sequence or time series data and adds a connection between the hidden layer and itself across time points, which allows the network to learn interactions between inputs in the series.
(E) A simple one-dimensional convolutional neural network showing how local patterns are summarized by a filter kernel into a new output vector with fewer dimensions.
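The worked example in panels A-C of Figure 2 can be sketched in a few lines of code. The function and weight values mirror the figure; the code itself is illustrative and not drawn from any tool discussed in this review.

```python
def hidden(x):
    # Hidden-layer function from the Figure 2 example: f(x) = 2 * x.
    # Real networks typically use nonlinear functions here instead,
    # such as ReLU (max(0, x)) or a sigmoid.
    return 2 * x

def forward(x, weight1=2.0, weight2=2.0):
    # Forward pass: scale the input by weight1, apply the neuron's
    # function, then scale the result by weight2 to get the output.
    return weight2 * hidden(weight1 * x)

print(forward(2))  # prints 16.0, matching panel (C)
```

Training would repeat this forward pass, compare the output to the known y, and nudge weight1 and weight2 to shrink the difference.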
Methods for fragment ion intensity prediction
| Year | Name | Neural network details | Comments | Citations |
|---|---|---|---|---|
| 2005 | PeptideART | feedforward network | engineered peptide feature inputs, outputs of fragment probabilities | |
| 2017 | pDeep | bidirectional LSTM, multi-output regression; Keras v1.2.1, TensorFlow 0.12.1 | limited to peptides of up to 20 amino acids | |
| 2018 | DeepMatch | bidirectional LSTM, weak supervision | direct integration with peptide spectrum matching algorithm outperforms COMET | |
| 2018 | Prosit (Latin for “of benefit”) | encoder: bidirectional GRU with dropout and attention, parallel encoding of precursor charge and collision energy; decoder: bidirectional GRU with dropout and time-distributed dense; multi-output regression; Keras 2.1.1 and TensorFlow 1.4.0 | over half a million training peptides and 21 million MS/MS spectra at multiple collision energies, predicts MS/MS spectra and retention time, integration with database search to decrease FDR, integration with Skyline | |
| 2019 | DeepMass | encoder: three bidirectional LSTM with 385 units each; decoder: four fully connected dense layers 768 units each; multi-output regression | predicted fragmentation with accuracy similar to repeated measure of the same peptide's fragmentation. Predicted spectra used for DIA data analysis nearly equivalent to spectral libraries | |
| 2019 | pDeep2 | bidirectional LSTM, multi-output regression | original pDeep model adapted to predict spectra of modified peptides using transfer learning | |
| 2019 | N/A | encoder: bidirectional LSTM with dropout; | predicts retention time, precursor charge state distribution, and fragment ion spectra | |
| 2019 | MS2CNN | basic CNN architecture, engineered peptide features as input with a CNN kernel size of 4 | better than pDeep for prediction of spectra from +3 charge state peptide precursors | |
| 2020 | DeepDIA | hybrid CNN and bidirectional LSTM: CNN first extracts features from pairs of amino acids, then LSTM, then dense layer. Multi-output regression of the b/y ions, including water/ammonia losses. Keras 2.2.4 and TensorFlow 1.11 | predicts MS/MS spectra and indexed retention time (iRT). Slightly more protein identifications from DIA analysis of the HeLa proteome than libraries from DDA or Prosit | |
| 2020 | N/A | sequence-to-sequence CNN | full-spectrum prediction, not only fragment ions |
Abbreviations are as follows: FDR, false discovery rate; N/A, not applicable.
Indicates methods that predict other factors apart from fragment ion spectra.
Figure 3. Concept of LSTM neural network applied to fragment ion spectra prediction for peptides
Each amino acid in the sequence is converted to a string of ones and zeros unique to that amino acid (called “one-hot encoding”). The encoded sequences are fed into one or more bidirectional LSTM layers. The output from the hidden layers is essentially multi-output regression, where a real value is predicted for each possible b and y ion, corresponding to the relative abundances of those fragments. Network weights are learned that accurately convert a given sequence into the correct proportions of fragment ions.
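The one-hot encoding step described above can be sketched as follows. The alphabetical ordering of the 20 standard residues and the example peptide are illustrative assumptions; published tools may order or pad their encodings differently.

```python
import numpy as np

# The 20 standard amino acids, in alphabetical one-letter order
# (an assumed convention for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(peptide):
    """Return a (length x 20) matrix with one row per residue and a
    single 1 in the column corresponding to that amino acid."""
    mat = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for row, aa in enumerate(peptide):
        mat[row, AA_INDEX[aa]] = 1.0
    return mat

encoded = one_hot("PEPTIDE")
print(encoded.shape)  # prints (7, 20)
```

A matrix like this is what the bidirectional LSTM layers consume, one row per sequence position, before the final regression layer outputs a real value per candidate b/y ion.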
Methods for prediction of retention time
| Year | Name | Neural network details | Comment | Citation |
|---|---|---|---|---|
| 2003 | N/A | fully connected neural network with 2 hidden layers, 20 inputs and one output | 95% of retention predictions within 10% of the true value | |
| 2006 | N/A | fully connected neural network with 16 inputs, 4 hidden neurons, and 1 output | mean prediction error ~5.8% | |
| 2006 | N/A | 1,052 input nodes, 24 hidden nodes, 1 output node | average elution time precision of 1.5% | |
| 2017 | DeepRT | feature extraction by LSTM and CNN, retention prediction from bagged ensemble of standard prediction models. Theano (0.9.0 dev1), Keras (1.0.1), and sklearn (0.17.1) | 95% of retention predictions within 28 min versus best benchmark of 45.8 min | |
| 2018 | DeepRT+ | capsule network (a type of CNN) | 95% of retention predictions within 15.7 min versus DeepRT at 24.7 min or best benchmark of 45.8 min | |
| 2019 | Prosit | encoder: bidirectional GRU with dropout and attention, parallel encoding of precursor charge and collision energy; decoder: bidirectional GRU with dropout and time-distributed dense; multi-output regression; Keras 2.1.1 and TensorFlow 1.4.0 | over half a million training peptides and 21 million MS/MS spectra at multiple collision energies, predicts MS/MS spectra and retention time, integration with database search to decrease FDR, integration with Skyline, web tool | |
| 2019 | DeepMass | encoder: three bidirectional LSTM with 385 units each; decoder: four fully connected dense layers 768 units each; multi-output regression | predicted fragmentation with accuracy similar to repeated measure of the same peptide's fragmentation. Predicted spectra used for DIA data analysis nearly equivalent to spectral libraries | |
| 2019 | N/A | encoder: bidirectional LSTM with dropout; | predicts retention time, precursor charge state distribution, and fragment ion spectra | |
| 2020 | DeepDIA | hybrid CNN and bidirectional LSTM: CNN first extracts features from pairs of amino acids, then LSTM, then dense layer. Multi-output regression of the b/y ions, including water/ammonia losses. Keras 2.2.4 and TensorFlow 1.11 | predicts MS/MS spectra and indexed retention time (iRT). Slightly more protein identifications from DIA analysis of the HeLa proteome than libraries from DDA or Prosit | |
| 2020 | DeepLC | hybrid network: three CNN input paths (1) one-hot amino acid sequence, (2) amino acid pairs, and (3) amino acid composition. One dense input of peptide features. Inputs concatenated and processed through dense layers | predicts retention time for previously unseen peptide modifications | |
| 2020 | AutoRT | ensemble of the 10 best CNN and LSTM networks returned by transfer learning. Keras 2.2.4 and TensorFlow 1.13.1 | used predicted retention time as a filter to assess identification strategies for mutated peptides |
Abbreviations are as follows: FDR, false discovery rate; N/A, not applicable.
Indicates methods that predict other factors beyond retention time.
Methods for protein and peptide identification
| Year | Name | Neural network details | Comment | Citation |
|---|---|---|---|---|
| 2012 | Barista | special type of network or tripartite graph where layers represent proteins, peptides, and spectra | protein inference through integration of protein and peptide identification | |
| 2017 | DeepPep | CNN, torch7 framework | predicts peptide probability from binarized protein sequence, protein scored based on change in peptide prediction without each protein | |
| 2017 | DeepNovo | LSTM/CNN hybrid network built with TensorFlow | application to DDA data. Iteratively predicts one amino acid at each step. Up to 64% better than previous algorithms | |
| 2018 | DeepMatch | bidirectional LSTM, weak supervision | spectral prediction integrated with peptide identification | |
| 2019 | DeepNovo | LSTM/CNN hybrid network built with TensorFlow | adapted to DIA data by incorporating the retention time dimension | |
| 2020 | DIA-NN | ensemble of dense, feedforward classifiers. Implemented with Cranium DNN | operates with or without a user-supplied spectral library | |
| 2020 | DeepRescore | uses AutoRT and pDeep2 models | generates new scores derived from comparing observed peptide properties to deep learning-predicted properties. Those scores are input to Percolator |