| Literature DB >> 35616364 |
Wenrong Chen1, Elijah N McCool2, Liangliang Sun2, Yong Zang3, Xia Ning4,5,6, Xiaowen Liu7,8.
Abstract
Reversed-phase liquid chromatography (RPLC) and capillary zone electrophoresis (CZE) are the two primary proteoform separation methods in mass spectrometry (MS)-based top-down proteomics. Proteoform retention time (RT) prediction in RPLC and migration time (MT) prediction in CZE provide additional information for accurate proteoform identification and quantification. While existing methods focus mainly on peptide RT and MT prediction in bottom-up MS, methods for proteoform RT and MT prediction in top-down MS are still lacking. We systematically evaluated eight machine learning models and a transfer learning method for proteoform RT prediction, and five models and the transfer learning method for proteoform MT prediction. Experimental results showed that a gated recurrent unit (GRU)-based model with transfer learning achieved high accuracy (R = 0.978) for proteoform RT prediction, and that the GRU-based model and a fully connected neural network (FNN) model achieved high accuracies of R = 0.982 and 0.981, respectively, for proteoform MT prediction.
Keywords: machine learning; retention/migration time prediction; top-down mass spectrometry
Year: 2022 PMID: 35616364 PMCID: PMC9250612 DOI: 10.1021/acs.jproteome.2c00124
Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 5.370
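The tables below report the Pearson correlation coefficient (R) and the mean absolute error (MAE) between experimental and predicted normalized times. A minimal sketch of computing these two metrics (the Δ column, likely an error-window width, is not reproduced here):

```python
import numpy as np

def prediction_metrics(experimental, predicted):
    """Pearson correlation coefficient (R) and mean absolute error (MAE)
    between experimental and predicted normalized retention/migration times."""
    y = np.asarray(experimental, dtype=float)
    p = np.asarray(predicted, dtype=float)
    r = float(np.corrcoef(y, p)[0, 1])
    mae = float(np.mean(np.abs(y - p)))
    return r, mae
```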
Figure 1. MT calibration for the CZE-ONE data set with prefractionation. (a) MTs predicted by the semi-empirical model are plotted against experimental MTs in six CZE-MS runs. The Pearson correlation coefficient between predicted and experimental MTs is 0.956 on average for single runs and 0.792 for the combined data of the six runs. (b) The Pearson correlation coefficient between predicted and experimental MTs is improved to 0.954 for the combined data after calibration.
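The calibration in panel (b) maps each run's predicted MTs onto a common experimental scale so that the six runs can be combined. The record does not reproduce the paper's exact calibration function; a per-run least-squares linear fit is one minimal form, sketched here as an assumption:

```python
import numpy as np

def calibrate_mts(predicted, experimental):
    """Per-run linear calibration: least-squares fit mapping predicted MTs
    onto the experimental scale, so multiple CZE-MS runs can be combined.
    A linear map is an assumed, minimal form of the calibration."""
    pred = np.asarray(predicted, dtype=float)
    exp = np.asarray(experimental, dtype=float)
    slope, intercept = np.polyfit(pred, exp, 1)
    return slope * pred + intercept
```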
Benchmarking of Eight Machine Learning Models for Proteoform RT Prediction on the LC-ONE and LC-TEN Data Sets with the 7:3 Training-Test Split
| Model | R (LC-ONE) | Δ (LC-ONE) | MAE (LC-ONE) | R (LC-TEN) | Δ (LC-TEN) | MAE (LC-TEN) |
|---|---|---|---|---|---|---|
| LR | 0.922 | 0.468 | 0.0576 | 0.923 | 0.377 | 0.0576 |
| SVR | 0.911 | 0.518 | 0.0639 | 0.918 | 0.366 | 0.0587 |
| RFR | 0.935 | 0.423 | 0.0531 | 0.920 | 0.379 | 0.0565 |
| GPTime | 0.926 | 0.433 | 0.0535 | 0.938 | 0.337 | 0.0479 |
| FNN | 0.931 | 0.439 | 0.0534 | 0.913 | 0.378 | 0.0595 |
| CNN + capsule | 0.889 | 0.518 | 0.0699 | 0.920 | 0.395 | 0.0540 |
| GRU + FNN | 0.934 | 0.438 | 0.0516 | 0.929 | 0.385 | 0.0508 |
| CNN + LSTM + FNN | 0.913 | 0.443 | 0.0573 | 0.917 | 0.426 | 0.0534 |
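Non-sequential models such as LR, SVR, RFR, and the FNN need a fixed-size input per proteoform. One common choice for such regressors, assumed here for illustration (the paper's exact feature set is not reproduced in this record), is a length-normalized amino-acid composition vector plus the sequence length:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(sequence):
    """Length-normalized amino-acid composition (20 values) plus sequence
    length: one common fixed-size input for non-sequential regressors
    such as LR, SVR, RFR, and the FNN. An illustrative assumption, not
    necessarily the paper's feature set."""
    seq = sequence.upper()
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS] + [float(len(seq))]
```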
Benchmarking of the Semi-Empirical Model and Four Neural Network Models for Proteoform MT Prediction on the CZE-ONE and CZE-ALL Data Sets with the 7:3 Training-Test Split
| Model | R (CZE-ONE) | Δ (CZE-ONE) | MAE (CZE-ONE) | R (CZE-ALL) | Δ (CZE-ALL) | MAE (CZE-ALL) |
|---|---|---|---|---|---|---|
| semi-empirical | 0.953 | 0.185 | 0.0179 | 0.970 | 0.141 | 0.0130 |
| FNN | ||||||
| CNN + capsule | 0.865 | 0.293 | 0.0329 | 0.946 | 0.207 | 0.0206 |
| GRU + FNN | ||||||
| CNN + LSTM + FNN | 0.777 | 0.387 | 0.0445 | 0.969 | 0.145 | 0.0142 |
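The semi-empirical model predicts electrophoretic mobility from a proteoform's charge and mass. A widely used form of this model (Cifuentes-Poppe type) is mobility ∝ ln(1 + 0.35Z)/M^0.411; whether this exact variant and these constants match the paper's model is an assumption:

```python
import math

def semiempirical_mobility(charge, mass, scale=1.0):
    """Semi-empirical electrophoretic mobility model of the
    Cifuentes-Poppe form: mobility ~ ln(1 + 0.35*Z) / M**0.411.
    The exact variant and constants used in the paper are assumptions."""
    return scale * math.log(1.0 + 0.35 * charge) / mass ** 0.411
```

Mobility rises with charge and falls with mass, which is why the model needs per-run calibration (Figure 1) before its predictions can be compared across runs.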
Figure 2. Comparison of the GRU + FNN model with and without transfer learning on the LC-TEN data. (a) An overview of the transfer learning method with the LC-PEPTIDE data for pretraining and the LC-TEN training data set for retraining. (b) Histograms of proteoform RT prediction errors for the model trained with and without transfer learning on the LC-TEN test data. (c) The Pearson correlation coefficient of the GRU + FNN model is 0.929 when it is trained with the LC-TEN training set and tested on the LC-TEN test set. (d) The Pearson correlation coefficient of the GRU + FNN model is 0.978 when it is pretrained using the LC-PEPTIDE data, retrained with the LC-TEN training set, and tested on the LC-TEN test set.
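Transfer learning here means pretraining on abundant peptide data and then retraining (warm-starting) on the much smaller proteoform training set. A toy sketch of the warm-start idea with a linear model fit by gradient descent; all data, dimensions, and hyperparameters are synthetic illustrations, not the paper's setup:

```python
import numpy as np

def fit_linear(X, y, w0=None, lr=0.01, epochs=500):
    """Least-squares linear model fit by gradient descent, optionally
    warm-started from pretrained weights w0 (the transfer-learning step)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1]) if w0 is None else np.array(w0, dtype=float)
    for _ in range(epochs):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# Pretrain on a large "peptide-like" set, then retrain on a small
# "proteoform-like" set starting from the pretrained weights.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0, 2.0])
X_big = rng.normal(size=(500, 3))
w_pre = fit_linear(X_big, X_big @ w_true)             # pretraining
X_small = rng.normal(size=(20, 3))
w_tl = fit_linear(X_small, X_small @ w_true,
                  w0=w_pre, epochs=50)                # retraining
```

The warm start lets the small retraining set refine, rather than relearn, what the large pretraining set already captured.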
FNN, CNN + Capsule, GRU + FNN, and CNN + LSTM + FNN Models Are Assessed on the LC-TEN Test Data Using Three Training Methods: (1) Pretraining Using the LC-PEPTIDE Data Only, (2) Training Using the LC-TEN Training Data Only, and (3) Transfer Learning: Pretraining Using the LC-PEPTIDE Data and Retraining with the LC-TEN Training Data
| Model | R (pretraining) | Δ (pretraining) | MAE (pretraining) | R (training) | Δ (training) | MAE (training) | R (transfer) | Δ (transfer) | MAE (transfer) |
|---|---|---|---|---|---|---|---|---|---|
| FNN | 0.914 | 0.385 | 0.0573 | 0.913 | 0.378 | 0.0595 | |||
| CNN + capsule | 0.767 | 0.756 | 0.0820 | 0.920 | 0.395 | 0.0540 | |||
| GRU + FNN | 0.974 | 0.180 | 0.0279 | 0.929 | 0.385 | 0.0508 | |||
| CNN + LSTM + FNN | 0.845 | 0.607 | 0.0576 | 0.917 | 0.426 | 0.0534 | |||
FNN, CNN + Capsule, GRU + FNN, and CNN + LSTM + FNN Models Are Evaluated on the CZE-ALL Test Data Using Three Training Methods: (1) Pretraining Using the CZE-PEPTIDE Data Only, (2) Training Using the CZE-ALL Training Data Only, and (3) Transfer Learning: Pretraining Using the CZE-PEPTIDE Data and Retraining with the CZE-ALL Training Data
| Model | R (pretraining) | Δ (pretraining) | MAE (pretraining) | R (training) | Δ (training) | MAE (training) | R (transfer) | Δ (transfer) | MAE (transfer) |
|---|---|---|---|---|---|---|---|---|---|
| FNN | 0.965 | 0.152 | 0.0169 | 0.980 | 0.109 | 0.0109 | |||
| CNN + capsule | 0.865 | 0.302 | 0.0314 | 0.946 | 0.207 | 0.0206 | |||
| GRU + FNN | 0.943 | 0.210 | 0.0237 | 0.982 | 0.103 | 0.0104 | |||
| CNN + LSTM + FNN | 0.343 | 0.595 | 0.0651 | 0.969 | 0.145 | 0.0142 | |||
Figure 3. Comparison of the MAEs of SVR, RFR, GPTime, CNN + capsule, and GRU + FNN using four training and test methods. (1) Training with the LC-SHORT training data and testing on the LC-SHORT test data; (2) training with the LC-SHORT training data and testing on the LC-LONG-TEST data; (3) training with the LC-TEN training data and testing on the LC-LONG-TEST data; and (4) transfer learning with the LC-SHORT training data for pretraining and the LC-TEN training data for retraining, testing on the LC-LONG-TEST data. The fourth method is used for CNN + capsule and GRU + FNN only.
Figure 4. Filtering proteoform identifications using the differences between experimental and theoretical RTs reported by the GRU + FNN model. Target and decoy proteoforms identified from the LC-ONE data with an E-value < 1 are filtered with a cutoff value of 0.1 for experimental and theoretical RT differences. The numbers of target and decoy proteoforms are plotted against their E-values with logarithm transformation.
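The filtering step in Figure 4 keeps only identifications whose absolute experimental-vs-predicted normalized RT difference falls below the 0.1 cutoff. A minimal sketch; the record-field names are illustrative, not from the paper:

```python
def filter_identifications(ids, cutoff=0.1):
    """Keep proteoform identifications whose absolute difference between
    experimental and predicted normalized RT is below the cutoff
    (0.1 in Figure 4). Field names "exp_rt"/"pred_rt" are illustrative."""
    return [x for x in ids if abs(x["exp_rt"] - x["pred_rt"]) < cutoff]
```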