Niraj Thapa, Meenal Chaudhari, Anthony A Iannetta, Clarence White, Kaushik Roy, Robert H Newman, Leslie M Hicks, Dukka B Kc.
Abstract
Protein phosphorylation, one of the most important post-translational modifications (PTMs), is involved in regulating myriad cellular processes. Herein, we present a novel deep learning-based approach for organism-specific protein phosphorylation site prediction in Chlamydomonas reinhardtii, a model algal phototroph. An ensemble model combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks achieves the best performance in predicting phosphorylation sites in C. reinhardtii. The model, termed Chlamy-EnPhosSite, achieves a best AUC of 0.90 and a best MCC of 0.64 on a combined dataset of serine (S) and threonine (T) sites in independent testing, higher than the corresponding values for other predictors. When applied to the entire C. reinhardtii proteome (totaling 1,809,304 S and T sites), Chlamy-EnPhosSite yielded 499,411 phosphorylated sites at a cut-off value of 0.5 and 237,949 phosphorylated sites at a cut-off value of 0.7. These predictions were compared, in a blinded study, to an experimental dataset of phosphosites identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS): approximately 89.69% of 2,663 C. reinhardtii S and T phosphorylation sites were successfully predicted by Chlamy-EnPhosSite at a probability cut-off of 0.5, and 76.83% were identified at a more stringent cut-off of 0.7. Interestingly, Chlamy-EnPhosSite also successfully predicted experimentally confirmed phosphorylation sites (e.g., RPS6 S245) in a protein sequence that did not appear in the training dataset, highlighting its prediction accuracy and the power of leveraging predictions to identify biologically relevant PTM sites. These results demonstrate that our method represents a robust and complementary technique for high-throughput phosphorylation site prediction in C. reinhardtii. It has the potential to serve as a useful tool for the community. Chlamy-EnPhosSite will contribute to the understanding of how protein phosphorylation influences various biological processes in this important model microalga.
Year: 2021 PMID: 34131195 PMCID: PMC8206365 DOI: 10.1038/s41598-021-91840-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
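The two cut-offs quoted in the abstract amount to thresholding a per-site probability and then asking what fraction of experimentally known sites survive. The following is a minimal sketch of that comparison, not the authors' pipeline: the function names, the `>=` comparison, and the toy scores are all assumptions made for illustration.

```python
def call_sites(scores, cutoff):
    """Return the site IDs whose predicted probability is at least `cutoff`.

    Whether the original tool uses >= or > at the cut-off is an assumption.
    """
    return {site for site, p in scores.items() if p >= cutoff}


def recovered_fraction(known_sites, predicted_sites):
    """Fraction of experimentally known sites that were also predicted."""
    known = set(known_sites)
    return len(known & predicted_sites) / len(known)


# Hypothetical probabilities keyed by (protein, residue, position); not real data.
scores = {
    ("protA", "S", 12): 0.93,
    ("protA", "T", 40): 0.55,
    ("protB", "S", 7): 0.31,
    ("protB", "S", 88): 0.72,
}
# Hypothetical experimentally identified phosphosites.
known = [("protA", "S", 12), ("protA", "T", 40), ("protB", "S", 7)]

for cutoff in (0.5, 0.7):
    frac = recovered_fraction(known, call_sites(scores, cutoff))
    print(f"cut-off {cutoff}: {frac:.2%} of known sites recovered")
```

Applied to the abstract's figures, 89.69% and 76.83% of 2,663 sites correspond to roughly 2,388 and 2,046 recovered sites, respectively.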
Positive and negative windows for phosphorylation in Chlamydomonas reinhardtii.
| Phosphorylation Sites | Positive | Negative |
|---|---|---|
| S | 17,732 | 361,218 |
| T | 3,951 | 213,802 |
| Y | 167 | 73,538 |
| ST | 21,683 | 434,756 |
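Positive and negative windows of this kind are typically built by centring a fixed-length window on each candidate residue. The sketch below shows one way to do that, under several assumptions not stated in this record: out-of-range positions are padded with `-`, negatives are all unannotated S/T residues in the same sequence, and the helper names are hypothetical. The default window size of 53 is taken from the CNN parameter table.

```python
def extract_windows(sequence, residues="ST", window_size=53, pad="-"):
    """Yield (position, residue, window) for each candidate residue.

    Positions are 1-based. The window is centred on the candidate and padded
    with `pad` where it runs past either end of the sequence.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the site sits at the centre")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    for i, aa in enumerate(sequence):
        if aa in residues:
            # Residue i sits at index i + half in `padded`, so the slice
            # [i, i + window_size) is centred on it.
            yield i + 1, aa, padded[i:i + window_size]


def label_windows(sequence, phosphosites, **kwargs):
    """Split windows into positives (annotated sites) and negatives (the rest)."""
    positives, negatives = [], []
    for pos, _aa, window in extract_windows(sequence, **kwargs):
        (positives if pos in phosphosites else negatives).append(window)
    return positives, negatives


# Toy sequence and annotation, purely illustrative.
seq = "MKSAATLLSPR"
positives, negatives = label_windows(seq, phosphosites={3}, window_size=7)
print(positives)  # ['-MKSAAT']
print(negatives)  # ['SAATLLS', 'TLLSPR-']
```

Because unannotated residues far outnumber annotated ones, as the table shows, a training set built this way is heavily imbalanced unless the negatives are subsampled.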
Parameter description of the CNN model with embedding layer.
| Parameters | Settings |
|---|---|
| Embedding Output Dimension | 21 |
| Learning Rate | 0.001 |
| Batch Size | 256 |
| Epochs | 50 |
| Dropout | 0.4 |
| Conv2d_1 filter (filter size) | 128 (27 × 3 for window size 53) |
| MaxPooling2d_1 | 2 × 2 |
| Conv2d_2 filter (filter size) | 256 (3 × 3) |
| MaxPooling2d_2 | 2 × 2 |
| Flatten_1 | Output = 6144 |
| Dense_1 | 768 |
| Dense_2 | 256 |
| Dense_3 | 64 |
| Output layer activation function | Softmax |
| Checkpointer | Best validation accuracy |
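The `Flatten_1` output of 6144 can be checked with simple shape arithmetic. The record does not state strides or padding, so the sketch below is one combination of assumptions that reproduces the value: the input is a 53 × 21 single-channel map (window size × embedding dimension), both convolutions use stride 1 without padding, and both pooling layers round partial windows up. The helper names are hypothetical.

```python
import math


def conv_valid(size, kernel):
    """Output length of a stride-1 convolution with no padding."""
    return size - kernel + 1


def pool_ceil(size, pool):
    """Output length of a pooling layer that rounds partial windows up."""
    return math.ceil(size / pool)


def flatten_size(window=53, embed_dim=21):
    h, w = window, embed_dim                     # embedded window: 53 x 21
    h, w = conv_valid(h, 27), conv_valid(w, 3)   # Conv2d_1 (27 x 3): 27 x 19
    h, w = pool_ceil(h, 2), pool_ceil(w, 2)      # MaxPooling2d_1:    14 x 10
    h, w = conv_valid(h, 3), conv_valid(w, 3)    # Conv2d_2 (3 x 3):  12 x 8
    h, w = pool_ceil(h, 2), pool_ceil(w, 2)      # MaxPooling2d_2:    6 x 4
    return h * w * 256                           # 256 feature maps from Conv2d_2


print(flatten_size())  # 6144
```

With pooling that rounds down instead, the same arithmetic gives 5 × 3 × 256 = 3840, which does not match the table.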
Parameter description of the LSTM model with embedding layer.
| Parameters | Settings |
|---|---|
| Embedding Output Dimension | 21 |
| Learning Rate | 0.001 |
| Batch Size | 256 |
| Epochs | 50 |
| LSTM layer 1 memory units | 128 |
| LSTM layer 2 memory units | 64 |
| LSTM layer 2 dropout | 0.4 |
| Dense layer 1 | 128 |
| Dropout | 0.4 |
| Dense layer 2 | 64 |
| Dropout | 0.4 |
| Output layer activation function | Softmax |
| Checkpointer | Best validation accuracy |
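For a rough sense of model size, the trainable weights implied by the LSTM table can be estimated with the standard LSTM parameter formula. This is a sketch under assumptions not stated in the record: the first LSTM layer receives the 21-dimensional embedding, the second LSTM layer passes only its final 64-dimensional state to the dense layers, and all layers use bias terms. The embedding and output layers are left out because the vocabulary size and number of output units are not given here.

```python
def lstm_params(input_dim, units):
    """Trainable weights in a standard LSTM layer (four gates, with bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Trainable weights in a fully connected layer with bias."""
    return input_dim * units + units


layers = [
    ("LSTM layer 1", lstm_params(21, 128)),
    ("LSTM layer 2", lstm_params(128, 64)),
    ("Dense layer 1", dense_params(64, 128)),
    ("Dense layer 2", dense_params(128, 64)),
]

for name, count in layers:
    print(f"{name}: {count:,} parameters")
print(f"Total (excluding embedding and output layers): {sum(c for _, c in layers):,}")
```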
Figure 1. Multi-window model Chlamy-MwPhosSite combining features from five models with different window sizes.
Figure 2. Ensemble model Chlamy-EnPhosSite combining CNN and LSTM models with stacking.
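Stacking, in general, trains a small meta-learner on the outputs of the base models rather than averaging them. The sketch below illustrates the idea with a logistic-regression meta-learner fitted by plain gradient descent on hypothetical CNN and LSTM probabilities. The choice of meta-learner, the hyperparameters, and the toy data are assumptions; this record does not say how Chlamy-EnPhosSite combines its base models.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(base_probs, labels, lr=0.5, epochs=2000):
    """Fit logistic regression on base-model probabilities by gradient descent."""
    n_features = len(base_probs[0])
    weights, bias = [0.0] * n_features, 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(base_probs, labels):
            # Gradient of the log-loss with respect to the linear score.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def stacked_predict(weights, bias, x):
    """Combine base-model probabilities into one stacked probability."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical held-out predictions: (CNN probability, LSTM probability).
held_out = [(0.9, 0.8), (0.7, 0.9), (0.8, 0.6), (0.2, 0.3), (0.4, 0.1), (0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_meta_learner(held_out, labels)
print(stacked_predict(weights, bias, (0.85, 0.75)))  # both base models say "site"
print(stacked_predict(weights, bias, (0.15, 0.25)))  # both base models say "no site"
```

In practice the meta-learner should be fitted on predictions for data the base models did not see during training, otherwise it learns to trust overfitted outputs.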
Performance metrics of different models using an independent test dataset for general phosphorylation sites S and T.
| Models | Sensitivity | Specificity | Accuracy | AUC | MCC |
|---|---|---|---|---|---|
| DeepPhos | 0.72 | 0.73 | 0.75 | 0.82 | 0.52 |
| LSTM with embedding | 0.80 | 0.76 | 0.78 | 0.78 | 0.56 |
| CNN with embedding (Trained on MusiteDeep Dataset) | 0.83 | 0.75 | 0.79 | 0.87 | 0.58 |
| Ensemble (CNN Multi-window) | 0.84 | 0.76 | 0.80 | 0.88 | 0.60 |
| Ensemble-stacking (CNN + LSTM) | 0.86 | 0.73 | 0.79 | 0.88 | 0.59 |
Performance metrics of different models applying manually extracted features using an independent test dataset for S and T.
| Models | Sensitivity | Specificity | Accuracy | AUC | MCC |
|---|---|---|---|---|---|
| Random Forest (RF) | 0.84 | 0.61 | 0.72 | 0.80 | 0.46 |
| CNN | 0.88 | 0.58 | 0.73 | 0.81 | 0.48 |
Figure 3. Tenfold cross-validation mean MCC of S, T, ST and Y for different window sizes.
Performance metrics of different models using an independent test dataset for S.
| Models | SN | SP | ACC | AUC | MCC |
|---|---|---|---|---|---|
| LSTM with embedding | 0.87 | 0.72 | 0.79 | 0.87 | 0.59 |
| CNN with embedding | 0.89 | 0.69 | 0.79 | 0.87 | 0.60 |
| Chlamy-MwPhosSite | 0.89 | 0.71 | 0.80 | 0.88 | 0.61 |
| Chlamy-EnPhosSite | 0.89 | 0.72 | 0.80 | 0.89 | 0.62 |
Figure 4. ROC curve for different DL models for S.
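The AUC values reported alongside these ROC curves can be read as the probability that a randomly chosen positive site scores higher than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the labels and scores are made up, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities; not data from the paper.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]
print(roc_auc(labels, scores))  # 8 of 9 positive/negative pairs are ordered correctly
```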
Performance metrics of different models using an independent test dataset for T.
| Models | SN | SP | ACC | AUC | MCC |
|---|---|---|---|---|---|
| LSTM with embedding | 0.83 | 0.69 | 0.76 | 0.84 | 0.53 |
| CNN with embedding | 0.86 | 0.66 | 0.76 | 0.84 | 0.53 |
| Chlamy-MwPhosSite | 0.76 | 0.79 | 0.78 | 0.84 | 0.55 |
| Chlamy-EnPhosSite | 0.92 | 0.61 | 0.77 | 0.86 | 0.56 |
Figure 5. ROC curve for different DL models for T.
Performance metrics of different models using an independent test dataset for S and T.
| Models | SN | SP | ACC | AUC | MCC |
|---|---|---|---|---|---|
| DeepPhos | 0.83 | 0.77 | 0.81 | 0.88 | 0.61 |
| LSTM with embedding | 0.87 | 0.74 | 0.81 | 0.88 | 0.61 |
| CNN with embedding | 0.91 | 0.69 | 0.81 | 0.88 | 0.61 |
| Chlamy-MwPhosSite | 0.86 | 0.78 | 0.82 | 0.90 | 0.64 |
| Chlamy-EnPhosSite | 0.90 | 0.73 | 0.82 | 0.90 | 0.64 |
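The SN, SP, ACC and MCC columns all follow from a single confusion matrix. The sketch below uses the standard definitions on a hypothetical balanced test set of 100 positives and 100 negatives, with counts chosen to mirror the SN and SP of the Chlamy-EnPhosSite row. The counts and the balanced split are assumptions for illustration; the actual test-set composition is not given in this record.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts: 100 positives (90 found), 100 negatives (73 rejected).
metrics = binary_metrics(tp=90, fp=27, tn=73, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts the accuracy is 0.815 and the MCC is about 0.639, in line with the 0.82 and 0.64 reported in the table.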
Figure 6. ROC curve for different DL models for S and T combined.