Piyush Agrawal, Gaurav Mishra, Gajendra P S Raghava.
Abstract
MOTIVATION: S-adenosyl-L-methionine (SAM) is an essential cofactor in biological systems and plays a key role in many diseases. Designing drugs against SAM-associated diseases requires a method for predicting SAM binding sites in a protein. To the best of our knowledge, no existing method can predict the SAM binding site in a given protein sequence.
RESULT: This manuscript describes SAMbinder, a method developed for predicting SAM-interacting residues in a protein from its primary sequence. All models were trained, tested, and evaluated on 145 SAM binding protein chains, no two of which share more than 40% sequence similarity. First, models were developed using different machine learning techniques on a balanced data set containing 2,188 SAM-interacting residues and an equal number of non-interacting residues. Our random forest based model developed using the binary profile feature achieved a maximum Matthews Correlation Coefficient (MCC) of 0.42 with an area under the receiver operating characteristic curve (AUROC) of 0.79 on the validation data set. Performance improved significantly, from MCC 0.42 to 0.61, when evolutionary information in the form of the position-specific scoring matrix (PSSM) profile was used as a feature. We also developed models on a realistic data set containing 2,188 SAM-interacting and 40,029 non-interacting residues and obtained a maximum MCC of 0.61 with an AUROC of 0.89. To evaluate the performance of our models, we used internal as well as external cross-validation techniques.
Keywords: PSSM profile; S-adenosine-L-methionine; cancer; in silico prediction; machine learning technique (MLT)
Year: 2020 PMID: 32082172 PMCID: PMC7002541 DOI: 10.3389/fphar.2019.01690
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.810
Figure 1. Percentage composition of S-adenosyl-L-methionine (SAM) interacting and non-interacting residues.
The performance of random forest model developed using amino acid sequence (binary pattern) for individual window size on balanced data set.
| Pattern size | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| Pat5 | 65.71 | 61.67 | 63.69 | 0.27 | 0.70 | 66.06 | 63.21 | 64.64 | 0.29 | 0.71 |
| Pat7 | 69.87 | 64.31 | 67.09 | 0.34 | 0.74 | 69.43 | 66.58 | 68.01 | 0.36 | 0.74 |
| Pat9 | 72.05 | 65.66 | 68.86 | 0.38 | 0.76 | 69.95 | 68.65 | 69.30 | 0.39 | 0.76 |
| Pat11 | 69.19 | 70.26 | 69.73 | 0.39 | 0.77 | 65.03 | 71.76 | 68.39 | 0.37 | 0.76 |
| Pat13 | 73.06 | 66.05 | 69.56 | 0.39 | 0.77 | 70.98 | 65.03 | 68.01 | 0.36 | 0.77 |
| Pat15 | 69.58 | 70.71 | 70.15 | 0.40 | 0.78 | 66.06 | 72.02 | 69.04 | 0.38 | 0.78 |
| Pat17 | 70.37 | 71.10 | 70.74 | 0.41 | 0.78 | 67.36 | 71.50 | 69.43 | 0.39 | 0.78 |
| Pat19 | 70.54 | 71.32 | 70.93 | 0.42 | 0.78 | 67.36 | 73.83 | 70.60 | 0.41 | 0.79 |
| Pat21 | 70.37 | 71.21 | 70.79 | 0.42 | 0.78 | 67.62 | 74.09 | 70.85 | 0.42 | 0.79 |
| Pat23 | 70.76 | 70.99 | 70.88 | 0.42 | 0.78 | 68.39 | 71.76 | 70.08 | 0.40 | 0.79 |
Sen, sensitivity; Spc, specificity; Acc, accuracy; MCC, Matthews Correlation Coefficient; AUROC, area under the receiver operating characteristic curve.
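The metrics in the footnote above are tied together by the confusion matrix, so a row of the table can be checked for internal consistency. The following is a minimal sketch, not the authors' evaluation code: the function name `metrics_from_rates` is hypothetical, and the class sizes are passed as equal placeholders because only their ratio matters on a balanced data set.

```python
from math import sqrt

def metrics_from_rates(sen, spc, n_pos, n_neg):
    """Return (accuracy, MCC) from sensitivity and specificity given as fractions.

    n_pos and n_neg are the numbers of interacting and non-interacting
    residues; on a balanced data set only their ratio (1:1) matters.
    """
    tp = sen * n_pos          # true positives
    fn = n_pos - tp           # false negatives
    tn = spc * n_neg          # true negatives
    fp = n_neg - tn           # false positives
    acc = (tp + tn) / (n_pos + n_neg)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

# Pat21, validation data set: Sen 67.62, Spc 74.09 (balanced, so 1:1 classes).
acc, mcc = metrics_from_rates(0.6762, 0.7409, n_pos=1, n_neg=1)
print(f"Acc = {acc:.4f}, MCC = {mcc:.2f}")  # MCC rounds to 0.42, as tabulated
```

On a balanced set the accuracy is simply the mean of sensitivity and specificity, which is why the Acc column tracks the two rates so closely in this table.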
Figure 2. Area under the receiver operating characteristic (AUROC) curves obtained for various window lengths using the binary profile on the balanced data set for (A) the training data set and (B) the validation data set.
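The pattern sizes Pat5 to Pat23 correspond to odd window lengths centred on the residue being classified. A binary profile for such windows might be built as sketched below. This is an illustration under stated assumptions rather than the SAMbinder implementation: the 21-symbol alphabet, the use of `X` to pad the sequence termini, and the function name `binary_windows` are all assumptions not given in the source.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus 'X' for terminal padding.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"

def binary_windows(seq, window):
    """One-hot encode the window centred on each residue of `seq`.

    Returns an array of shape (len(seq), window * 21); each window position
    contributes one block of 21 bits with exactly one bit set.
    """
    if window % 2 != 1:
        raise ValueError("window length must be odd")
    half = window // 2
    padded = "X" * half + seq + "X" * half
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    n_sym = len(ALPHABET)
    feats = np.zeros((len(seq), window * n_sym), dtype=np.int8)
    for pos in range(len(seq)):
        for k, aa in enumerate(padded[pos:pos + window]):
            # Unknown symbols are mapped to 'X' (an assumption).
            feats[pos, k * n_sym + index.get(aa, index["X"])] = 1
    return feats

X = binary_windows("MKTAYIAK", window=5)
print(X.shape)  # (8, 105): 8 residues, 5 positions x 21 symbols
```

Each row of `X` would then be paired with a label (interacting or non-interacting) for the central residue before training a classifier.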
The performance of ExtraTree classifier model developed using PSSM profile for individual window size on balanced data set.
| Pattern size (classifier) | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| Pat5 | 79.18 | 75.76 | 77.47 | 0.55 | 0.86 | 73.83 | 76.17 | 75.00 | 0.50 | 0.83 |
| Pat7 | 79.85 | 79.12 | 79.49 | 0.59 | 0.87 | 73.32 | 79.53 | 76.42 | 0.53 | 0.85 |
| Pat9 | 81.48 | 76.49 | 78.98 | 0.58 | 0.87 | 76.68 | 77.46 | 77.07 | 0.54 | 0.85 |
| Pat11 | 81.54 | 76.32 | 78.93 | 0.58 | 0.87 | 75.39 | 78.50 | 76.94 | 0.54 | 0.86 |
| Pat13 | 79.29 | 80.42 | 79.85 | 0.60 | 0.88 | 74.87 | 81.61 | 78.24 | 0.57 | 0.86 |
| Pat15 | 81.59 | 77.10 | 79.35 | 0.59 | 0.88 | 76.68 | 78.24 | 77.46 | 0.55 | 0.86 |
| Pat17 | 79.24 | 81.54 | 80.39 | 0.61 | 0.88 | 72.54 | 81.61 | 77.07 | 0.54 | 0.86 |
| Pat19 | 82.21 | 76.04 | 79.12 | 0.58 | 0.88 | 76.68 | 78.24 | 77.46 | 0.55 | 0.86 |
| Pat21 | 79.85 | 81.59 | 80.72 | 0.61 | 0.88 | 74.87 | 81.87 | 78.37 | 0.57 | 0.86 |
| Pat23 | 79.91 | 81.14 | 80.53 | 0.61 | 0.88 | 73.06 | 81.61 | 77.33 | 0.55 | 0.86 |
PSSM, position-specific scoring matrix.
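Window features from a PSSM profile can be formed in the same way as for the binary profile, by flattening the rows of the matrix that fall inside each window. The sketch below assumes an already computed PSSM of shape (L, 20) and zero-padding at the termini. Both choices, along with the function name `pssm_windows`, are assumptions, and any score normalisation the authors applied is not described in this section and is therefore omitted.

```python
import numpy as np

def pssm_windows(pssm, window):
    """Flatten the PSSM rows in the window centred on each residue.

    pssm:   array of shape (L, 20), one row of scores per residue.
    window: odd window length, e.g. 17.
    Returns an array of shape (L, window * 20); rows beyond the sequence
    termini are filled with zeros (an assumption).
    """
    if window % 2 != 1:
        raise ValueError("window length must be odd")
    pssm = np.asarray(pssm, dtype=float)
    half = window // 2
    padded = np.pad(pssm, ((half, half), (0, 0)), mode="constant")
    return np.stack([padded[i:i + window].ravel() for i in range(len(pssm))])

# Random integers stand in for a real PSSM here; they are not real data.
rng = np.random.default_rng(0)
toy_pssm = rng.integers(-5, 6, size=(30, 20))
F = pssm_windows(toy_pssm, window=17)
print(F.shape)  # (30, 340): 30 residues, 17 positions x 20 scores
```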
Figure 3. AUROC plots obtained for various window lengths using the evolutionary profile on the balanced data set for (A) the training data set and (B) the validation data set.
The performance of PSSM profile based models developed using different machine learning techniques for window size 17 on realistic data set.
| Machine learning technique | Main Sen (%) | Main Spc (%) | Main Acc (%) | Main MCC | Main AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
|---|---|---|---|---|---|---|---|---|---|---|
| SVC# | 53.65 | 98.94 | 96.64 | 0.61 | 0.89 | 36.53 | 99.40 | 95.99 | 0.52 | 0.87 |
| SVC* | 81.48 | 80.14 | 80.21 | 0.32 | 0.89 | 77.72 | 79.88 | 79.76 | 0.31 | 0.87 |
| RF# | 43.27 | 99.44 | 96.59 | 0.58 | 0.86 | 36.53 | 98.97 | 95.58 | 0.48 | 0.86 |
| RF* | 80.25 | 74.64 | 74.93 | 0.27 | 0.86 | 76.94 | 79.73 | 79.58 | 0.30 | 0.86 |
| ETree# | 46.13 | 99.42 | 96.72 | 0.60 | 0.87 | 47.67 | 97.93 | 95.20 | 0.50 | 0.86 |
| ETree* | 78.90 | 79.79 | 79.74 | 0.31 | 0.87 | 74.09 | 83.70 | 83.18 | 0.33 | 0.86 |
| KNN# | 46.91 | 99.27 | 96.61 | 0.59 | 0.84 | 34.72 | 99.36 | 95.85 | 0.50 | 0.79 |
| KNN* | 76.04 | 79.72 | 79.53 | 0.29 | 0.84 | 68.13 | 80.56 | 79.89 | 0.27 | 0.79 |
| MLP# | 37.49 | 98.59 | 95.49 | 0.45 | 0.85 | 34.46 | 98.01 | 94.55 | 0.39 | 0.83 |
| MLP* | 78.84 | 75.84 | 76.00 | 0.27 | 0.85 | 65.28 | 83.26 | 82.28 | 0.28 | 0.83 |
| Ridge# | 38.61 | 97.55 | 94.55 | 0.39 | 0.83 | 37.31 | 96.13 | 92.93 | 0.33 | 0.80 |
| Ridge* | 79.57 | 67.37 | 67.99 | 0.22 | 0.83 | 77.46 | 67.42 | 67.97 | 0.21 | 0.80 |
*, threshold giving balanced sensitivity and specificity; #, threshold giving maximum MCC; SVC, support vector classifier; RF, random forest; ETree, ExtraTree; KNN, K-nearest neighbors; MLP, multilayer perceptron.
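Each classifier appears twice in the table because the same scores are cut at two different operating points. This explains why the `#` rows have high MCC but low sensitivity on the imbalanced data, while the `*` rows trade MCC for balanced rates. A sketch of how two such thresholds could be picked from predicted scores follows. The scanning procedure, the function name `scan_thresholds`, and the synthetic scores are assumptions made for illustration, not the authors' procedure or data.

```python
import numpy as np

def scan_thresholds(y_true, y_score):
    """Scan every distinct score as a threshold.

    Returns two tuples (threshold, sensitivity, specificity, MCC): one for
    the threshold with maximum MCC and one for the threshold where
    sensitivity and specificity are closest. Both classes must be present.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    rows = []
    for thr in np.unique(y_score):
        pred = y_score >= thr
        tp = int(np.sum(pred & y_true))
        fp = int(np.sum(pred & ~y_true))
        fn = int(np.sum(~pred & y_true))
        tn = int(np.sum(~pred & ~y_true))
        sen = tp / (tp + fn)
        spc = tn / (tn + fp)
        denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        rows.append((float(thr), sen, spc, mcc))
    max_mcc = max(rows, key=lambda r: r[3])
    balanced = min(rows, key=lambda r: abs(r[1] - r[2]))
    return max_mcc, balanced

# Synthetic, imbalanced scores (50 positives, 950 negatives); not real data.
rng = np.random.default_rng(0)
y = np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)]
s = np.r_[rng.normal(1.5, 1.0, 50), rng.normal(0.0, 1.0, 950)]
max_mcc, balanced = scan_thresholds(y, s)
print("max MCC  (thr, sen, spc, mcc):", max_mcc)
print("balanced (thr, sen, spc, mcc):", balanced)
```

Because AUROC does not depend on the threshold, the two rows for a given classifier share the same AUROC value, as they do in the table.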
Figure 4. AUROC plots obtained for window length 17 using the evolutionary profile on the realistic data set for (A) the training data set and (B) the validation data set.
Figure 5. Architecture of SAMbinder.