Indra Kundu, Goutam Paul, Raja Banerjee.
Abstract
There is a pressing need to transform the enormous amount of biological data available in various forms into meaningful knowledge. We applied machine learning (ML) models to existing protein-ligand binding affinity data in order to predict the binding affinity of unknown complexes. ML methods are appreciably faster and cheaper than traditional experimental methods or computational scoring approaches. Such prediction requires sufficient and unbiased features in the training data and a prediction model that fits the data well. In our study, we applied the Random forest and Gaussian process regression algorithms from the Weka package to protein-ligand binding affinity data comprising protein and ligand binding information from the PDBbind database. The models are trained on selected fundamental information about both protein and ligand, which can be readily fetched from online databases or calculated when the structure is available. The models were assessed on the basis of the correlation coefficient (R²) and root mean square error (RMSE). The Random forest model gave an R² of 0.76 and an RMSE of 1.31. We also applied our features and prediction models to datasets used by others and found that our model with our features outperformed the existing ones. This journal is © The Royal Society of Chemistry.
Year: 2018 PMID: 35539386 PMCID: PMC9079328 DOI: 10.1039/c8ra00003d
Source DB: PubMed Journal: RSC Adv ISSN: 2046-2069 Impact factor: 4.036
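The abstract assesses the models by a correlation coefficient and by RMSE. The following is a minimal sketch of how these two quantities can be computed from actual and predicted affinities. It assumes the correlation coefficient is the Pearson correlation (the record does not say whether the reported value is a Pearson coefficient, its square, or a coefficient of determination), and the function names and sample values are hypothetical, not taken from the study.

```python
import math


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_p)


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    if n != len(predicted) or n == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


# Hypothetical affinities for illustration only (not data from the study).
actual_affinity = [4.0, 5.5, 6.2, 7.8, 9.1]
predicted_affinity = [4.5, 5.0, 6.8, 7.1, 8.7]

r = pearson_r(actual_affinity, predicted_affinity)
print("r =", round(r, 3), "r^2 =", round(r * r, 3))
print("RMSE =", round(rmse(actual_affinity, predicted_affinity), 3))
```

Reporting both r and its square side by side makes it easier to compare against published figures that use either convention.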
Features details and their source
| Molecule | Features | Source |
|---|---|---|
| Protein | Amino acid percentage | Calculated from FASTA files |
| Protein | Accessible surface of protein | DSSP |
| Protein | Number of hydrogen bonds in antiparallel bridges and parallel bridges | DSSP |
| Protein | Number of hydrogen bonds of type O(I) → H–N(I-5), O(I) → H–N(I-4), O(I) → H–N(I-3), O(I) → H–N(I-2), O(I) → H–N(I-1), O(I) → H–N(I+0), O(I) → H–N(I+1), O(I) → H–N(I+2), O(I) → H–N(I+3), O(I) → H–N(I+4), O(I) → H–N(I+5) | DSSP |
| Protein | Number of chains | DSSP |
| Protein | Number of SS bridges | DSSP |
| Protein | Number of residues | DSSP |
| Ligand | Atom count: C, N, O, H, S, P, Cl, F, Br, I | PaDEL descriptors |
| Ligand | Bond count: number of single, double and triple bonds, including and excluding hydrogens | PaDEL descriptors |
| Ligand | Ring count: number of 3, 4, 5, 6, 7, 8, 9 atom/carbon rings, aromatic rings, fused hetero rings, fused homo rings | PaDEL descriptors |
| Ligand | Physicochemical properties: complexity, log | PubChem |
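The first protein feature in the table, amino acid percentage, can be derived directly from a FASTA file. The sketch below shows one way to do this. It assumes that all chains in the file are pooled into a single composition and that only the 20 standard residues are counted; the record does not say how the authors handled multiple chains or non-standard residues. The function names and the example sequence are hypothetical.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta_sequences(fasta_text):
    """Concatenate the residues of every record in a FASTA-formatted string."""
    residues = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        residues.append(line.upper())
    return "".join(residues)


def amino_acid_percentages(sequence):
    """Return the percentage of each standard amino acid in `sequence`."""
    counts = Counter(ch for ch in sequence if ch in STANDARD_AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in STANDARD_AMINO_ACIDS}


# Hypothetical two-chain FASTA text for illustration only.
example_fasta = ">example|chain A\nMKVLAAGA\n>example|chain B\nGGAA\n"
composition = amino_acid_percentages(read_fasta_sequences(example_fasta))
for amino_acid, percent in composition.items():
    if percent > 0:
        print(amino_acid, round(percent, 2))
```

Returning a value for all 20 residues, including zeros, keeps the feature vector the same length for every protein.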
Fig. 1 (a) Random forest algorithm's performance analysis. Change in correlation coefficient with change in number of trees. (b) Random forest algorithm's performance analysis. Change in correlation coefficient with change in number of features with number of iterations fixed at 400.
Fig. 2 Scatter plot for actual vs. predicted binding affinity of v2015 dataset using Random forest with 400 iterations having 30 features in each.
Fig. 3 (a) Gaussian process algorithm's performance analysis. Change in the correlation coefficient with change in exponent value of normalised polykernel. (b) Change in the correlation coefficient with change in γ value of RBF kernel.
Fig. 4 A comparison of percentage of relative error among algorithms used.
Fig. 5 (a) Scatter plot for actual vs. predicted log K of Xue's dataset using SMO utilising RBF kernel over 10-fold cross-validation. (b) Scatter plot for actual vs. predicted binding affinity of Deng's dataset using Random forest over 10-fold cross-validation. (c) Scatter plot for actual vs. predicted binding affinity of Wang's dataset using Random forest. (d) Scatter plot for actual vs. predicted binding affinity of Kramer's dataset using Random forest.
Comparison table
| Authors | Training instances | Test/method | Their result | Our result |
|---|---|---|---|---|
| C. X. Xue | Human serum albumin; 95 drugs | Training set | | |
| C. X. Xue | Human serum albumin; 95 drugs | Supplied test set | | |
| C. X. Xue | Human serum albumin; 95 drugs | Cross validation | | |
| Wei Deng | 105 (diverse) complexes | Cross validation | | |
| Christian Kramer and Peter Gedeck | PDBbind v2009; 1387 complexes | Cross validation | | |
| Yu Wang | HIV protease 136 | Supplied test set 34 | RF: | |
| Yu Wang | Trypsin 88 | Supplied test set 22 | RF: | |
| Yu Wang | Carbonic anhydrase 100 | Supplied test set 26 | | |
| Yu Wang | v2012 2318 | Supplied test set 579 | | |
Fig. 6 Bar graph for the correlation coefficient and RMSE (both rounded to two decimal places) of the prediction models. A comparison of results using our prediction model and the results published by the respective authors.
Fig. 7 Scatter plot for actual vs. predicted binding affinity of external dataset using Random forest.
Result of prediction of feasibility of protein–ligand interaction
| Metric | Value |
|---|---|
| TP rate | 0.968 |
| FP rate | 0.056 |
| Precision | 0.968 |
| Recall | 0.968 |
| F-measure | 0.967 |
| MCC | 0.927 |
| ROC area | 0.994 |
| PRC area | 0.994 |
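The threshold-based metrics in the table above follow standard definitions over a binary confusion matrix. A minimal sketch of those definitions is given here for reference. The counts in the usage example are hypothetical and are not taken from the study, and the record does not state whether the reported values are per-class figures or class-weighted averages. ROC and PRC areas need ranked prediction scores and are therefore not covered.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics for one positive class from confusion-matrix counts."""
    if min(tp + fn, fp + tn, tp + fp, tn + fn) == 0:
        raise ValueError("a row or column of the confusion matrix is empty")
    tp_rate = tp / (tp + fn)      # also called recall or sensitivity
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    if precision + tp_rate == 0:
        raise ValueError("F-measure is undefined when tp is zero")
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not data from the study).
for name, value in binary_metrics(tp=90, fp=10, tn=80, fn=20).items():
    print(name, round(value, 3))
```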
Change in true positive rate of protein binding prediction using Random forest algorithm with respect to decrease in number of features
| Number of attributes | TP rate |
|---|---|
| 127 | 0.968 |
| 93 | 0.969 |
| 82 | 0.965 |
| 72 | 0.966 |
| 64 | 0.967 |
| 58 | 0.966 |
| 54 | 0.968 |
| 48 | 0.965 |
| 38 | 0.963 |
| 28 | 0.96 |
| 18 | 0.96 |
| 10 | 0.93 |
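The table shows that the TP rate stays nearly flat as attributes are removed and only drops sharply at 10 attributes. One way to read such a table is to look for the smallest attribute set whose TP rate stays within a chosen tolerance of the full-feature result. The sketch below does this with the values from the table; the tolerance and the function name are hypothetical choices for illustration, not criteria used in the study.

```python
# (number of attributes, TP rate) pairs copied from the table above.
TP_RATE_BY_ATTRIBUTES = [
    (127, 0.968), (93, 0.969), (82, 0.965), (72, 0.966),
    (64, 0.967), (58, 0.966), (54, 0.968), (48, 0.965),
    (38, 0.963), (28, 0.96), (18, 0.96), (10, 0.93),
]


def smallest_feature_set(results, tolerance):
    """Fewest attributes whose TP rate is within `tolerance` of the
    TP rate obtained with the largest attribute set."""
    baseline = max(results)[1]  # TP rate of the entry with the most attributes
    eligible = [n for n, rate in results if baseline - rate <= tolerance]
    return min(eligible)


# Hypothetical tolerance of 0.01 on the TP rate.
print(smallest_feature_set(TP_RATE_BY_ATTRIBUTES, tolerance=0.01))
```

With a tolerance of 0.01 the selection lands on 18 attributes (TP rate 0.96 against 0.968 for all 127), while a tolerance of zero selects 54 attributes.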