Sumeet Patiyal, Piyush Agrawal, Vinod Kumar, Anjali Dhall, Rajesh Kumar, Gaurav Mishra, Gajendra P S Raghava.
Abstract
N-acetylglucosamine (NAG) is one of the eight essential saccharides required to maintain optimal health and precise functioning of systems ranging from bacteria to humans. In the present study, we developed a method, NAGbinder, which predicts the NAG-interacting residues in a protein from its primary sequence. We extracted 231 nonredundant NAG-interacting protein chains from the Protein Data Bank, such that no two sequences share more than 40% sequence identity. All prediction models were trained, validated, and evaluated on these 231 protein chains. First, prediction models were developed on balanced data consisting of 1,335 NAG-interacting and noninteracting residues, using various window sizes. The model developed by implementing Random Forest on binary profiles with a window size of 9 performed best among the models. It achieved the highest Matthews Correlation Coefficient (MCC) of 0.31 and 0.25, and Area Under the Receiver Operating Characteristic curve (AUROC) of 0.73 and 0.70, on the training and validation data sets, respectively. We also developed prediction models on a realistic data set (1,335 NAG-interacting and 47,198 noninteracting residues) using the same principle, where the model achieved MCC of 0.26 and 0.27, and AUROC of 0.70 and 0.71, on the training and validation data sets, respectively. The success of our method can be appraised by the fact that, if a sequence of 1,000 amino acids is analyzed with our approach, 10 residues will be predicted as NAG-interacting, of which five are correct. The best models were incorporated in the standalone version and in the webserver available at https://webs.iiitd.edu.in/raghava/nagbinder/.
Keywords: Binary profile; Machine learning techniques; N-acetylglucosamine; NAG; PSSM profile
Year: 2019 PMID: 31654438 PMCID: PMC6933864 DOI: 10.1002/pro.3761
Source DB: PubMed Journal: Protein Sci ISSN: 0961-8368 Impact factor: 6.725
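The abstract describes models built on binary profiles of sequence windows, with a window size of 9 for the best model. A minimal sketch of how such a window-based binary profile could be computed is given here. The 21-symbol alphabet (20 amino acids plus a padding symbol `X`), the padding of chain termini, and the function names `one_hot` and `binary_profiles` are assumptions made for illustration; they are not taken from the NAGbinder code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed padding symbol for positions beyond the chain termini


def one_hot(residue):
    """Return a 21-dimensional binary vector: 20 amino acids plus one extra slot.

    The extra slot is used for padding and, in this sketch, for any
    non-standard residue as well (an assumption).
    """
    vec = [0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1
    return vec


def binary_profiles(sequence, window=9):
    """Return one flattened binary profile per residue, centred on that residue."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    # Pad both termini so that every residue can sit at the window centre.
    padded = PAD * half + sequence.upper() + PAD * half
    profiles = []
    for i in range(len(sequence)):
        pattern = padded[i:i + window]
        features = []
        for residue in pattern:
            features.extend(one_hot(residue))
        profiles.append(features)
    return profiles
```

Under these assumptions, each residue of a chain yields a vector of `window * 21` binary features (189 for window size 9), which could then be passed to a classifier such as Random Forest.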
Performance of the machine learning classifiers using binary profile on balanced data set for various window sizes
| Pattern (classifier) | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| Pat5(SVC) | 67.42 | 60.90 | 64.16 | 0.28 | 0.71 | 67.18 | 55.69 | 61.43 | 0.23 | 0.68 |
| Pat7(SVC) | 66.52 | 64.12 | 65.32 | 0.31 | 0.72 | 66.26 | 60.00 | 63.13 | 0.26 | 0.70 |
| Pat9(RF) | 65.39 | 65.77 | 65.58 | 0.31 | 0.73 | 65.69 | 59.69 | 62.69 | 0.25 | 0.70 |
| Pat11(RF) | 66.07 | 65.17 | 65.62 | 0.31 | 0.72 | 69.08 | 60.62 | 64.85 | 0.30 | 0.71 |
| Pat13(RF) | 65.62 | 65.77 | 65.69 | 0.31 | 0.72 | 69.69 | 62.31 | 66.00 | 0.32 | 0.71 |
| Pat15(RF) | 66.52 | 65.24 | 65.88 | 0.32 | 0.72 | 68.00 | 59.23 | 63.62 | 0.27 | 0.71 |
| Pat17(RF) | 67.64 | 61.12 | 64.38 | 0.29 | 0.71 | 68.15 | 58.92 | 63.54 | 0.27 | 0.69 |
| Pat19(RF) | 66.37 | 62.47 | 64.42 | 0.29 | 0.71 | 67.69 | 59.54 | 63.62 | 0.27 | 0.70 |
| Pat21(RF) | 67.87 | 61.57 | 64.72 | 0.29 | 0.71 | 67.38 | 60.15 | 63.77 | 0.28 | 0.70 |
| Pat23(RF) | 67.57 | 62.02 | 64.79 | 0.30 | 0.71 | 66.00 | 59.85 | 62.92 | 0.26 | 0.69 |
Note: Various classifiers were used to build models; for each window size, the performance of the best classifier (named in parentheses) is reported.
Figure 1. AUROC plots obtained for window length 9 using binary profiles on the balanced data set (binary_balanced), PSSM profiles on the balanced data set (pssm_balanced), and binary profiles on the realistic data set (binary_realistic) for (a) the training data set and (b) the validation data set
The performance of the machine learning classifiers developed using PSSM profile on balanced data set for various window sizes
| Pattern (classifier) | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| Pat5(RF) | 61.27 | 61.57 | 61.42 | 0.23 | 0.67 | 52.00 | 64.46 | 58.23 | 0.17 | 0.64 |
| Pat7(RF) | 61.27 | 61.87 | 61.57 | 0.23 | 0.68 | 56.46 | 66.92 | 61.69 | 0.24 | 0.66 |
| Pat9(RF) | 62.47 | 61.87 | 62.17 | 0.24 | 0.69 | 55.38 | 66.92 | 61.15 | 0.22 | 0.66 |
| Pat11(RF) | 62.92 | 62.55 | 62.73 | 0.25 | 0.68 | 56.15 | 66.31 | 61.23 | 0.23 | 0.66 |
| Pat13(RF) | 64.27 | 62.17 | 63.22 | 0.26 | 0.68 | 56.92 | 66.46 | 61.69 | 0.23 | 0.66 |
| Pat15(RF) | 62.47 | 62.17 | 62.32 | 0.25 | 0.68 | 56.62 | 64.92 | 60.77 | 0.22 | 0.66 |
| Pat17(RF) | 63.67 | 61.80 | 62.73 | 0.25 | 0.68 | 54.15 | 63.23 | 58.69 | 0.17 | 0.65 |
| Pat19(ETree) | 64.04 | 62.77 | 63.41 | 0.27 | 0.68 | 53.85 | 65.38 | 59.62 | 0.19 | 0.65 |
| Pat21(ETree) | 65.02 | 62.25 | 63.63 | 0.27 | 0.69 | 54.31 | 65.69 | 60.00 | 0.20 | 0.66 |
| Pat23(ETree) | 63.45 | 63.00 | 63.22 | 0.26 | 0.68 | 54.62 | 66.77 | 60.69 | 0.22 | 0.65 |
Note: Various classifiers were used to build models; for each window size, the performance of the best classifier (named in parentheses) is reported.
The performance of the various machine learning classifiers developed using binary profile on realistic dataset for window size 9
| Classifier | Main Sen (%) | Main Spc (%) | Main Acc (%) | Main MCC | Main AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| SVC | 14.91 | 99.47 | 97.15 | 0.25 | 0.71 | 18.95 | 99.43 | 97.59 | 0.28 | 0.72 |
| RF | 16.70 | 99.41 | 97.14 | 0.26 | 0.70 | 19.69 | 99.35 | 97.53 | 0.27 | 0.71 |
| ETree | 17.30 | 99.22 | 96.97 | 0.25 | 0.70 | 19.69 | 99.26 | 97.44 | 0.26 | 0.70 |
| KNN | 8.61 | 98.88 | 96.40 | 0.11 | 0.61 | 10.92 | 98.99 | 96.97 | 0.13 | 0.63 |
| MLP | 13.78 | 98.94 | 96.60 | 0.18 | 0.71 | 17.85 | 98.78 | 96.92 | 0.20 | 0.72 |
| Ridge | 13.11 | 99.11 | 96.74 | 0.18 | 0.70 | 16.62 | 99.20 | 97.31 | 0.22 | 0.71 |
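The abstract states that, for a sequence of 1,000 amino acids, about 10 residues will be predicted as NAG-interacting and about five of them will be correct. The sketch below works through that arithmetic from the RF row of the table above (main data set: Sen 16.70, Spc 99.41) and the class counts of the realistic data set quoted in the abstract (1,335 interacting, 47,198 noninteracting). It assumes that a new sequence has the same proportion of interacting residues as that data set; the function name `expected_predictions` is hypothetical.

```python
def expected_predictions(sen_pct, spc_pct, n_pos, n_neg, length=1000):
    """Expected predicted positives, true positives, and precision for a sequence.

    Assumes the sequence has the same fraction of interacting residues
    as the data set described by n_pos and n_neg.
    """
    prevalence = n_pos / (n_pos + n_neg)
    positives = length * prevalence
    negatives = length - positives
    true_pos = positives * sen_pct / 100          # interacting residues found
    false_pos = negatives * (1 - spc_pct / 100)   # noninteracting residues flagged
    predicted = true_pos + false_pos
    precision = true_pos / predicted if predicted else 0.0
    return predicted, true_pos, precision


predicted, correct, precision = expected_predictions(16.70, 99.41, 1335, 47198)
print(f"predicted: {predicted:.1f}, correct: {correct:.1f}, precision: {precision:.2f}")
```

Under these assumptions the estimate is about 10.3 predicted residues, of which about 4.6 are true positives (precision near 0.44), in line with the "10 predicted, five correct" summary.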
Figure 2. Architecture of NAGbinder