Yulia V Samukhina, Dmitriy D Matyushin, Oksana I Grinevich, Aleksey K Buryak.
Abstract
Most frequently, the identification of peptides in mass spectrometry-based proteomics is carried out using high-resolution tandem mass spectrometry. To increase the accuracy of analysis, additional information on the peptides, such as chromatographic retention time and collision cross section in ion mobility spectrometry, can be used. An accurate prediction of the collision cross section values allows erroneous candidates to be rejected by comparing the observed values with predictions based on the amino acid sequence. Recently, a massive high-quality data set of peptide collision cross sections was released. This opens up an opportunity to apply the most sophisticated deep learning techniques to this task. Previously, it was shown that a recurrent neural network can predict these values accurately. In this work, we present a deep convolutional neural network that predicts these values more accurately than previous studies. We use a neural network with a complex architecture that contains both convolutional and fully connected layers, together with comprehensive methods of converting a peptide to multi-channel 1D spatial data and a vector. The source code and pre-trained model are available online.
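The candidate-rejection idea in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function `filter_by_ccs`, the `predict_ccs` callable standing in for a trained model, the toy lookup table, and the 4% tolerance are hypothetical and are not taken from the authors' code.

```python
def filter_by_ccs(candidates, observed_ccs, predict_ccs, tolerance_pct=4.0):
    """Keep candidate peptides whose predicted CCS is close to the observed one.

    predict_ccs is any callable mapping a peptide to a predicted CCS value;
    tolerance_pct is a hypothetical cut-off on the relative deviation (%).
    """
    kept = []
    for peptide in candidates:
        predicted = predict_ccs(peptide)
        deviation_pct = 100.0 * abs(predicted - observed_ccs) / observed_ccs
        if deviation_pct <= tolerance_pct:
            kept.append(peptide)
    return kept


# Toy stand-in for a trained predictor (made-up values, illustration only).
toy_predictions = {"PEPTIDEK": 400.0, "PEPTLDEK": 430.0}
survivors = filter_by_ccs(list(toy_predictions), 405.0, toy_predictions.get)
```

In practice the tolerance would be chosen from the error distribution of the predictor rather than fixed in advance.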
Keywords: deep learning; ion mobility spectrometry; peptides; proteomics
Year: 2021 PMID: 34944547 PMCID: PMC8699202 DOI: 10.3390/biom11121904
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Features (descriptors) that characterize a single amino acid residue. Letters in parentheses denote amino acid residues for which the considered feature is 1.
| Feature Indices | Description of Features |
|---|---|
| 0–20 | One-hot encoded amino acid (including oxidized methionine as the 21st amino acid). One of these features is set to 1 and the others are set to 0. |
| 21–25 | Elemental composition of the residue: integer values that describe a number of each element (H, C, N, O, S). |
| 26–31 | Six binary features that show whether the side-chain of the amino acid is acidic (D, E), modified (non-standard), has an amide group (N, Q), is non-polar (G, A, V, L, I, P, M, F, V), is small (P, G, A, S), is uncharged polar (S, T, N, Q, C, Y, oxidized methionine residue). |
| 32–35 | Four binary features that show whether the side-chain of the amino acid is aliphatic non-polar (V, I, L, G, A), is aromatic (W, F, Y), is positively charged (K, R, H), has a hydroxyl group (S, T, Y). |
| 36 | For N, D (Asn, Asp) this feature is 1, 0 elsewhere. |
| 37 | For E, Q (Glu, Gln) this feature is 1, 0 elsewhere. |
| 38–43 | Descriptors of amino acids from references [ |
| 44 | Always 0 for all amino acids, 1 for padding symbol. |
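A minimal sketch of how part of this per-residue encoding might look is given below. It covers only the features that the table specifies completely: the one-hot block (0–20), the binary side-chain features (26–37), and the padding flag (44). The ordering of the amino acid alphabet, the symbol `m` for oxidized methionine, the `max_len` parameter, and the function name are assumptions made for illustration; features 21–25 and 38–43 are left as zeros because their values are not listed here.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "m"  # 'm' = oxidized methionine (hypothetical symbol)
N_FEATURES = 45
PADDING_INDEX = 44

# Feature index -> residues for which the feature is 1 (lists as given in the table).
BINARY_GROUPS = {
    26: "DE",        # acidic
    27: "m",         # modified (non-standard)
    28: "NQ",        # amide group
    29: "GAVLIPMF",  # non-polar
    30: "PGAS",      # small
    31: "STNQCYm",   # uncharged polar
    32: "VILGA",     # aliphatic non-polar
    33: "WFY",       # aromatic
    34: "KRH",       # positively charged
    35: "STY",       # hydroxyl group
    36: "ND",        # Asn, Asp
    37: "EQ",        # Glu, Gln
}


def encode_peptide(sequence, max_len=40):
    """Return max_len rows of N_FEATURES values; rows past the peptide are padding."""
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    rows = []
    for aa in sequence:
        row = [0.0] * N_FEATURES
        row[ALPHABET.index(aa)] = 1.0  # one-hot block; unknown symbols raise ValueError
        for index, members in BINARY_GROUPS.items():
            if aa in members:
                row[index] = 1.0
        rows.append(row)
    for _ in range(max_len - len(sequence)):
        row = [0.0] * N_FEATURES
        row[PADDING_INDEX] = 1.0  # feature 44 marks the padding symbol
        rows.append(row)
    return rows


encoded = encode_peptide("DmK", max_len=5)
```

Each row corresponds to one position of the peptide, so the result can be treated as multi-channel 1D data with 45 channels.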
Figure 1. The deep neural network for CCS prediction and its input features: (A) subsequences used for the creation of input features for dense layers; borders of subsequences with non-integer length are rounded in such a way that at least one residue is included in each subsequence; (B) architecture of the neural network. Numbers in parentheses denote the number of output channels or nodes (only for trainable layers).
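One possible rounding rule that satisfies the constraint in panel (A) is sketched below: the left border is rounded down and the right border is rounded up, so every subsequence contains at least one residue even when the peptide is shorter than the number of parts. The function name, the equal-length split, and the choice of floor/ceiling rounding are assumptions for illustration; the caption does not state which rule the authors use.

```python
def split_into_subsequences(sequence, n_parts):
    """Split a peptide into n_parts pieces of equal (possibly non-integer) length.

    Borders are rounded outwards (floor for the start, ceiling for the end),
    which guarantees at least one residue per piece; pieces may overlap.
    """
    n = len(sequence)
    if n == 0 or n_parts <= 0:
        raise ValueError("need a non-empty sequence and a positive number of parts")
    pieces = []
    for i in range(n_parts):
        start = (i * n) // n_parts                      # floor(i * n / n_parts)
        end = ((i + 1) * n + n_parts - 1) // n_parts    # ceil((i + 1) * n / n_parts)
        pieces.append(sequence[start:end])
    return pieces


thirds = split_into_subsequences("PEPTIDE", 3)
short = split_into_subsequences("AK", 4)
```

With outward rounding, neighbouring pieces share a residue whenever a border falls inside it, which is one way to keep short peptides from producing empty subsequences.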
Figure 2. Learning curves (dependencies of accuracy on the number of training iterations) for (A) a neural network with the full feature set and for (B) a neural network with the reduced feature set.
Accuracy measures after various numbers of training iterations for two different feature sets and three data sets.
| Data Set | Training Iterations | Features | RMSE, Å² | MAE, Å² | MPE, % | MdPE, % | Δ90, % | | |
|---|---|---|---|---|---|---|---|---|---|
| Training set | 14,160 | Full | 12.6 | 7.5 | 1.46 | 1.00 | 3.21 | 0.987 | 0.993 |
| Training set | 26,000 | Full | 11.7 | 7.1 | 1.40 | 0.96 | 3.05 | 0.988 | 0.994 |
| Training set | 76,110 | Full | 9.8 | 6.2 | 1.23 | 0.87 | 2.68 | 0.992 | 0.996 |
| Training set | 30,000 | Reduced | 11.4 | 7.0 | 1.38 | 0.95 | 3.03 | 0.989 | 0.994 |
| Test set | 14,160 | Full | 13.1 | 7.7 | 1.50 | 1.02 | 3.28 | 0.985 | 0.993 |
| Test set | 26,000 | Full | 12.8 | 7.6 | 1.47 | 1.00 | 3.25 | 0.986 | 0.993 |
| Test set | 76,110 | Full | 13.2 | 7.6 | 1.48 | 0.99 | 3.22 | 0.985 | 0.993 |
| Test set | 30,000 | Reduced | 13.3 | 7.8 | 1.52 | 1.03 | 3.32 | 0.985 | 0.992 |
| ProteomeTools test set | 14,160 | Full | 14.9 | 9.4 | 1.84 | 1.28 | 4.02 | 0.983 | 0.991 |
| ProteomeTools test set | 26,000 | Full | 14.6 | 9.3 | 1.82 | 1.28 | 3.95 | 0.983 | 0.992 |
| ProteomeTools test set | 76,110 | Full | 15.3 | 9.5 | 1.86 | 1.27 | 4.06 | 0.982 | 0.991 |
| ProteomeTools test set | 30,000 | Reduced | 15.4 | 9.7 | 1.90 | 1.32 | 4.17 | 0.982 | 0.991 |
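The named accuracy measures in this table can be computed along the following lines. This is a sketch under stated assumptions rather than the authors' evaluation code: MPE and MdPE are taken to be the mean and median of the absolute percentage error, and Δ90 is taken to be the 90th percentile of the absolute percentage error (nearest-rank method). The function name and dictionary keys are hypothetical.

```python
import math
import statistics


def accuracy_measures(reference, predicted):
    """Compute RMSE, MAE, MPE, MdPE and delta90 for paired CCS values."""
    abs_err = [abs(p - r) for r, p in zip(reference, predicted)]
    pct_err = [100.0 * e / r for e, r in zip(abs_err, reference)]
    ranked = sorted(pct_err)
    n = len(ranked)
    # Nearest-rank 90th percentile: index ceil(0.9 * n) - 1, in integer arithmetic.
    k = max(0, (9 * n + 9) // 10 - 1)
    return {
        "RMSE": math.sqrt(sum(e * e for e in abs_err) / n),
        "MAE": sum(abs_err) / n,
        "MPE": sum(pct_err) / n,
        "MdPE": statistics.median(pct_err),
        "delta90": ranked[k],
    }


# Toy data: ten peptides with reference CCS 100 and errors of 1..10 (made-up values).
reference = [100.0] * 10
predicted = [100.0 + i for i in range(1, 11)]
measures = accuracy_measures(reference, predicted)
```

RMSE and MAE are in the units of the input (Å² for CCS), while the three percentage measures are unitless.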
Accuracy for different charge states and CCS ranges for the neural network with the full set of features after 26,000 training iterations and the ProteomeTools test set. The values from reference [16] are given with the same accuracy as in the corresponding work.
| Subset | This Work | | | Previous Results | | |
|---|---|---|---|---|---|---|
| | MdPE, % | Δ90, % | | MdPE, % | Δ90, % | |
| Full ProteomeTools test set | 1.28 | 3.95 | 0.992 | 1.40 | 4.0 | 0.992 |
| Charge state +2 | 1.06 | 3.06 | 0.985 | 1.2 | | |
| Charge state +3 | 1.77 | 5.49 | 0.938 | 1.8 | | |
| Charge state +4 | 1.95 | 5.47 | 0.886 | 2.0 | | |
| CCS values 0–400 Å² | 1.06 | 3.02 | 0.920 | 1.2 | | |
| CCS values 400–800 Å² | 1.38 | 4.32 | 0.986 | 1.5 | | |
| CCS values 800–1200 Å² | 2.30 | 5.71 | 0.695 | 2.2 | | |
Figure 3. Correlation plot between reference CCS values and CCS values predicted using a deep neural network (Å²), and the distribution of relative errors (%) between reference and predicted CCS values for various charge states.
Accuracy for different subsets of the external test set.
| Subset | This Work | | | | | |
|---|---|---|---|---|---|---|
| | RMSE, Å² | MAE, Å² | MdPE, % | Δ90, % | | |
| Human plasma | 26.7 | 20.4 | 3.85 | 7.99 | 0.989 | 0.979 |
| Mouse plasma | 26.9 | 21.7 | 4.53 | 8.55 | 0.990 | 0.980 |
| Shewanella oneidensis | 17.2 | 11.8 | 1.85 | 5.18 | 0.988 | 0.976 |
Figure 4. Correlation plots between reference and predicted CCS values (Å²) for peptides from various organisms (external test set).
Accuracy of drift time prediction for different charge states for the external test set in comparison with the results from reference [13].
| Subset | This Work | | Previous Results | |
|---|---|---|---|---|
| | | MSE, s² | | MSE, s² |
| Charge state +2 | 0.9634 | 0.287 | 0.9620 | 0.290 |
| Charge state +3 | 0.9179 | 0.715 | 0.9062 | 0.813 |
| Charge state +4 | 0.9266 | 0.705 | 0.9308 | 0.556 |
Figure 5. The dependence of median percentage error on the number of neural networks of the same architecture in the ensemble. A horizontal line denotes the accuracy of the previous model described in reference [16].
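The ensemble idea behind this figure can be sketched as follows. How the individual predictions are combined is not stated here, so simple averaging is an assumption; the function name and the toy models (plain callables returning made-up CCS values) are hypothetical stand-ins for independently trained networks of the same architecture.

```python
def ensemble_predict(models, peptide):
    """Average the CCS predictions of several models for one peptide.

    Each model is any callable mapping a peptide to a predicted CCS value.
    Averaging is an assumed combination rule, used here for illustration.
    """
    predictions = [model(peptide) for model in models]
    return sum(predictions) / len(predictions)


# Toy models that disagree slightly around a hypothetical true value of 401.
toy_models = [lambda p: 398.0, lambda p: 404.0, lambda p: 401.0]
single = ensemble_predict(toy_models[:1], "PEPTIDEK")
combined = ensemble_predict(toy_models, "PEPTIDEK")
```

When the errors of the individual networks are not perfectly correlated, averaging tends to cancel part of them, which is consistent with the error decreasing as the ensemble grows.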