Md Mehedi Hasan, Mst Shamima Khatun, Md Nurul Haque Mollah, Cao Yong, Guo Dianjing.
Abstract
Nitrotyrosine is a product of tyrosine nitration mediated by reactive nitrogen species. As an indicator of cell damage and inflammation, protein nitrotyrosine reveals biological changes associated with various diseases or oxidative stress. Accurate identification of nitrotyrosine sites provides an important foundation for further elucidating the mechanism of protein nitrotyrosination. However, experimental identification of nitrotyrosine sites through traditional methods is laborious and expensive. In silico prediction of nitrotyrosine sites based on protein sequence information is thus highly desired. Here, we report a novel predictor, NTyroSite, for accurate prediction of nitrotyrosine sites using sequence evolutionary information. The generated features were optimized using a Wilcoxon rank-sum test. A random forest classifier was then trained using these features to build the predictor. The final NTyroSite predictor achieved an area under the receiver operating characteristic curve (AUC) of 0.904 in a 10-fold cross-validation test. It also significantly outperformed other existing implementations in an independent test. In addition, for a better understanding of our prediction model, the predominant rules and informative features were extracted from the NTyroSite model to explain the prediction results. We expect that the NTyroSite predictor may serve as a useful computational resource for high-throughput nitrotyrosine site prediction. The online interface of the software is publicly available at https://biocomputer.bio.cuhk.edu.hk/NTyroSite/.
Keywords: Wilcoxon rank-sum test; post-translational modification; random forest; rule extraction
Year: 2018 PMID: 29987232 PMCID: PMC6099560 DOI: 10.3390/molecules23071667
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. A computational framework of the proposed NTyroSite. NR: non-redundant; PSI-BLAST: position-specific iterated BLAST; PSSM: position-specific scoring matrix; pbCKSAAP: profile-based composition of k-spaced amino acid pairs.
Figure 2. Comparison of sequence information between nitrotyrosine and non-nitrotyrosine sites. (A) A two-sample logo of the compositional bias in the sequences flanking nitrotyrosine sites in the NTyroSite training dataset; (B) the average PSSM value (APV) of each position (i.e., the average of each row of the PSSM) in the flanking sequences of each nitrotyrosine (green) and non-nitrotyrosine (red) site, calculated to investigate the evolutionary conservation of nitrotyrosine and non-nitrotyrosine sites. Because the optimal window size in this study was 41, the APVs of the positions [+1, +20] were averaged to obtain the APV of the downstream sites, while the APVs of the positions [−20, −1] were averaged to obtain the APV of the upstream sites. p-values were also calculated using the Kruskal–Wallis test and corrected using the Bonferroni method (see Table S1).
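The APV calculation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming the PSSM for one site is available as a list of 41 rows (one per window position) with 20 scores each; the function names and the toy matrix are hypothetical and are not taken from the NTyroSite software.

```python
from statistics import mean

HALF_WINDOW = 20  # window size 41 = 20 upstream + central tyrosine + 20 downstream


def average_pssm_values(pssm_window):
    """Per-position APV: the mean of each PSSM row."""
    return [mean(row) for row in pssm_window]


def upstream_downstream_apv(pssm_window, half_window=HALF_WINDOW):
    """Average the APVs over positions [-half_window, -1] and [+1, +half_window]."""
    if len(pssm_window) != 2 * half_window + 1:
        raise ValueError("window must contain 2 * half_window + 1 rows")
    apv = average_pssm_values(pssm_window)
    upstream = mean(apv[:half_window])        # positions -20 .. -1
    downstream = mean(apv[half_window + 1:])  # positions +1 .. +20
    return upstream, downstream


# Toy data (assumption): row i holds the value i in all 20 columns.
toy_window = [[float(i)] * 20 for i in range(41)]
up, down = upstream_downstream_apv(toy_window)
```

The central row (the tyrosine itself) is excluded from both averages, which matches the position ranges given in the caption.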
Performance of the proposed models trained with different positive-to-negative sample ratios, based on a 10-fold cross-validation (CV) test. MCC: Matthews correlation coefficient.
| The Ratio (P/N) | Sp | Sn | Pr | Ac | MCC |
|---|---|---|---|---|---|
| 1:1 | 0.901 | 0.616 | 0.865 | 0.759 | 0.537 |
| 1:2 | 0.900 | 0.567 | 0.757 | 0.789 | 0.511 |
| 1:3 | 0.902 | 0.548 | 0.612 | 0.813 | 0.474 |
| 1:total | 0.904 | 0.523 | 0.438 | 0.856 | 0.416 |
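The table's columns are not defined in this record. Assuming the standard readings (Sp specificity, Sn sensitivity, Pr precision, Ac accuracy), they can be computed from a confusion matrix as sketched below. The counts are toy values chosen for illustration, not data from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary classifier."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sp": tn / (tn + fp),                   # specificity
        "Sn": tp / (tp + fn),                   # sensitivity (recall)
        "Pr": tp / (tp + fp),                   # precision
        "Ac": (tp + tn) / (tp + fp + tn + fn),  # accuracy
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy counts (assumption): same Sp and Sn, but three times as many negatives.
balanced = binary_metrics(tp=60, fp=10, tn=90, fn=40)     # P/N = 1:1
imbalanced = binary_metrics(tp=60, fp=30, tn=270, fn=40)  # P/N = 1:3
```

With Sp and Sn held fixed, adding negatives adds false positives, so precision falls while accuracy rises. This is the same direction as the trend across the rows of the table.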
NTyroSite prediction performance without and with feature selection.
| Measurement | Training Test, without Feature Selection | Training Test, with Feature Selection | Independent Test, without Feature Selection | Independent Test, with Feature Selection |
|---|---|---|---|---|
| Sp | 0.901 | 0.900 | 0.806 | 0.801 |
| Sn | 0.616 | 0.675 | 0.479 | 0.609 |
| Pr | 0.865 | 0.884 | 0.197 | 0.231 |
| Ac | 0.759 | 0.787 | 0.778 | 0.782 |
| MCC | 0.537 | 0.601 | 0.196 | 0.272 |
Figure 3. Performance comparison without and with feature selection using receiver operating characteristic (ROC) curves: (A) performance based on the 10-fold cross-validation (CV) test; (B) performance based on the independent dataset. AUC: area under the ROC curve.
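The AUC reported for these ROC curves equals the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site. A minimal sketch of that pairwise definition follows; the scores are toy values, and the function name is hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy classifier scores (assumption).
toy_pos = [0.9, 0.8, 0.4]
toy_neg = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc(toy_pos, toy_neg)  # 11 of the 12 pairs are ranked correctly
```

The pairwise form is quadratic in the number of samples, so it suits small examples; a rank-based computation gives the same value for larger datasets.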
Figure 4. Top 30 amino acid residue pairs selected by the Wilcoxon rank-sum (WR)-based feature selection from the pbCKSAAP scheme. Blue denotes nitrotyrosine sites and dark red denotes non-nitrotyrosine sites. In the radar diagram, the length shown for each residue pair is proportional to the composition of the corresponding pbCKSAAP feature.
Comparison of NTyroSite with existing predictors using an independent test set.
| Measurement | GPS-YNO2 | iNitro-Tyr | NTyroSite |
|---|---|---|---|
| Sp | 0.791 | 0.796 | 0.801 |
| Sn | 0.211 | 0.211 | 0.609 |
| Pr | 0.087 | 0.089 | 0.231 |
| Ac | 0.741 | 0.745 | 0.782 |
| MCC | 0.002 | 0.004 | 0.274 |
For GPS-YNO2, the medium threshold was used. For iNitro-Tyr, the threshold defined by its server was used. For the proposed NTyroSite predictor, the threshold was fixed at the value that gives a specificity of 90% for the training model.
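Fixing a threshold at a target specificity, as described for NTyroSite in the note above, can be sketched as follows. This assumes a site is predicted positive when its score is at or above the cutoff; the scores and function names are hypothetical, and the actual procedure used by the authors is not given in this record.

```python
def specificity(neg_scores, threshold):
    """Fraction of negatives scored strictly below the threshold."""
    return sum(s < threshold for s in neg_scores) / len(neg_scores)


def sensitivity(pos_scores, threshold):
    """Fraction of positives scored at or above the threshold."""
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)


def threshold_at_specificity(neg_scores, target_sp=0.90):
    """Smallest candidate cutoff whose specificity reaches target_sp."""
    candidates = sorted(set(neg_scores)) + [float("inf")]
    for t in candidates:
        if specificity(neg_scores, t) >= target_sp:
            return t


# Toy training scores (assumption).
train_neg = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.80]
train_pos = [0.85, 0.90, 0.50, 0.95]

cutoff = threshold_at_specificity(train_neg, target_sp=0.90)
sn_at_cutoff = sensitivity(train_pos, cutoff)
```

Comparing predictors at a matched specificity makes the sensitivity values directly comparable, which is presumably why the independent-test Sp values in the table are close to one another.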
Figure 5. AUC values after sequence redundancy removal. Blue and orange denote without and with feature selection, respectively.
Rules extracted from the NTyroSite model.
| No. | Individual Reports of Rule Extraction | No. of Samples Covered by Rule |
|---|---|---|
| 1 | | 172 |
| 2 | | 112 |
| 3 | | 90 |
| 4 | | 78 |
| 5 | | 67 |
| 6 | | 49 |
| 7 | | 47 |
| 8 | | 44 |
| 9 | | 39 |
| 10 | | 35 |
For each rule, I(n, w) indicates the amino acid n at position w, and "&" denotes the logical conjunction (and).
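The rule notation can be read as a conjunction of position-specific residue checks. The sketch below assumes that w is an offset relative to the central tyrosine (position 0), which this record does not state; the example window and conditions are invented for illustration and are not rules from the NTyroSite model.

```python
def indicator(window, residue, position):
    """I(n, w): True if amino acid `residue` occupies offset `position`.

    Offsets are assumed to be relative to the central residue of the window.
    """
    index = len(window) // 2 + position
    if not 0 <= index < len(window):
        return False  # offset falls outside the window
    return window[index] == residue


def rule_matches(window, conditions):
    """A rule is the conjunction ("&") of its I(n, w) conditions."""
    return all(indicator(window, n, w) for n, w in conditions)


# Hypothetical 11-residue window with tyrosine (Y) at the center.
toy_window = "AAAKGYDEAAA"
toy_rule = [("G", -1), ("D", +1)]  # I(G, -1) & I(D, +1)

matched = rule_matches(toy_window, toy_rule)
```

Counting how many training windows satisfy `rule_matches` for a given rule would give a quantity analogous to the "No. of Samples Covered by Rule" column.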
Performance comparison of different feature selection schemes. IG: information gain; mRMR: minimum-redundancy maximum-relevance; WR: Wilcoxon rank-sum.
| Methods | IG | mRMR | WR |
|---|---|---|---|
| Sp | 0.897 | 0.899 | 0.900 |
| Sn | 0.601 | 0.596 | 0.675 |
| Pr | 0.861 | 0.855 | 0.884 |
| Ac | 0.749 | 0.748 | 0.787 |
| MCC | 0.511 | 0.507 | 0.601 |
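The WR scheme in this table ranks features by a Wilcoxon rank-sum test between positive and negative samples. A minimal standard-library sketch follows, using the normal approximation without tie or continuity correction. The significance cutoff `alpha` and the toy feature matrix are assumptions; the selection criterion actually used in NTyroSite is not given in this record.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    w = sum(average_ranks(list(x) + list(y))[:n1])  # rank sum of sample x
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def select_features(pos_rows, neg_rows, alpha=0.05):
    """Indices of features whose p-value falls below alpha (assumed cutoff)."""
    n_features = len(pos_rows[0])
    return [
        f for f in range(n_features)
        if rank_sum_p_value([r[f] for r in pos_rows],
                            [r[f] for r in neg_rows]) < alpha
    ]


# Toy data (assumption): feature 0 separates the classes, feature 1 does not.
toy_pos = [[0.90, 0.50], [0.80, 0.40], [0.85, 0.60], [0.95, 0.50], [0.70, 0.45]]
toy_neg = [[0.10, 0.50], [0.20, 0.60], [0.15, 0.40], [0.30, 0.45], [0.25, 0.50]]
selected = select_features(toy_pos, toy_neg)
```

The normal approximation is rough for samples this small; it is used here only to keep the example free of third-party dependencies.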
Performance comparison of different sequence-based features. AAindex: amino acid index; BE: binary encoding; pbCKSAAP: profile-based composition of k-spaced amino acid pairs.
| Methods | Sp | Sn | Pr | Ac | MCC |
|---|---|---|---|---|---|
| AAindex | 0.896 | 0.442 | 0.802 | 0.669 | 0.424 |
| BE | 0.899 | 0.435 | 0.799 | 0.667 | 0.402 |
| KSAAP | 0.900 | 0.587 | 0.857 | 0.744 | 0.501 |
| pbCKSAAP | 0.901 | 0.617 | 0.865 | 0.759 | 0.538 |
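To illustrate the k-spaced amino acid pair encoding compared in this table, the sketch below computes the plain sequence-based composition for a single spacing k. It is not the profile-based pbCKSAAP variant, which draws on PSSM profiles and whose exact formulation is not given in this record; the function name and the toy sequence are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def k_spaced_pair_composition(window, k):
    """Composition of residue pairs separated by k intervening residues.

    Returns a dict with 400 entries (20 x 20 pairs); each value is the
    pair count divided by the number of pairs in the window.
    """
    total = len(window) - k - 1
    if total <= 0:
        raise ValueError("window is too short for this spacing")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = window[i] + window[i + k + 1]
        if pair in counts:  # skip pairs with non-standard symbols
            counts[pair] += 1
    return {pair: c / total for pair, c in counts.items()}


# Toy sequence (assumption): with k = 1 the pairs are AA, YY, AA.
features = k_spaced_pair_composition("AYAYA", k=1)
```

A full encoding would typically concatenate the 400-dimensional vectors for several values of k, so the feature count grows linearly with the number of spacings considered.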