Shardul Paricharak1,2, Isidro Cortés-Ciriano3, Adriaan P IJzerman2, Thérèse E Malliavin3, Andreas Bender1.
Abstract
The rampant increase of public bioactivity databases has fostered the development of computational chemogenomics methodologies to evaluate potential ligand-target interactions (polypharmacology) both in a qualitative and quantitative way. Bayesian target prediction algorithms predict the probability of an interaction between a compound and a panel of targets, thus assessing compound polypharmacology qualitatively, whereas structure-activity relationship techniques are able to provide quantitative bioactivity predictions. We propose an integrated drug discovery pipeline combining in silico target prediction and proteochemometric modelling (PCM) for the respective prediction of compound polypharmacology and potency/affinity. The proposed pipeline was evaluated on the retrospective discovery of Plasmodium falciparum DHFR inhibitors. The qualitative in silico target prediction model comprised 553,084 ligand-target associations (a total of 262,174 compounds), covering 3,481 protein targets and used protein domain annotations to extrapolate predictions across species. The prediction of bioactivities for plasmodial DHFR led to a recall value of 79% and a precision of 100%, where the latter high value arises from the structural similarity of plasmodial DHFR inhibitors and T. gondii DHFR inhibitors in the training set. Quantitative PCM models were then trained on a dataset comprising 20 eukaryotic, protozoan and bacterial DHFR sequences, and 1,505 distinct compounds (in total 3,099 data points). The most predictive PCM model exhibited $R^2_{0,\mathrm{test}}$ and $\mathrm{RMSE}_{\mathrm{test}}$ values of 0.79 and 0.59 pIC50 units, respectively, which was shown to outperform models based exclusively on compound ($R^2_{0,\mathrm{test}}$/$\mathrm{RMSE}_{\mathrm{test}}$ = 0.63/0.78) and target information ($R^2_{0,\mathrm{test}}$/$\mathrm{RMSE}_{\mathrm{test}}$ = 0.09/1.22), as well as inductive transfer knowledge between targets, with respective $R^2_{0,\mathrm{test}}$ and $\mathrm{RMSE}_{\mathrm{test}}$ values of 0.76 and 0.63 pIC50 units.
Finally, both methods were integrated to predict the protein targets and the potency on plasmodial DHFR for the GSK TCAMS dataset, which comprises 13,533 compounds displaying strong anti-malarial activity. 534 of those compounds were identified as DHFR inhibitors by the target prediction algorithm, while the PCM algorithm identified 25 compounds, and 23 compounds (predicted pIC50 > 7) were identified by both methods. Overall, this integrated approach simultaneously provides target and potency/affinity predictions for small molecules.
Graphical abstract: Proteochemometric modelling coupled to in silico target prediction.
Keywords: DHFR; QSAR; Target prediction; Chemogenomics; Plasmodium falciparum; Proteochemometrics
Year: 2015 PMID: 25926892 PMCID: PMC4413554 DOI: 10.1186/s13321-015-0063-9
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Schematic overview of in silico target prediction and domain-based extrapolation workflow. The conventional in silico target prediction approach [10] is extended in this study by using protein domain annotations to extrapolate from non-plasmodial target predictions to protein target predictions in P. falciparum. This concept is generally applicable across organisms, in particular to those for which little bioactivity data is currently available.
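The domain-based extrapolation step can be illustrated with a minimal sketch. It assumes a simple rule (a predicted target maps onto any protein of the organism of interest that shares at least one annotated domain); the matching rule actually used in the study is not stated here, and the function name, protein names and domain labels below are hypothetical placeholders.

```python
from typing import Dict, List, Set


def extrapolate_by_domain(
    predicted_targets: List[str],
    target_domains: Dict[str, Set[str]],
    organism_domains: Dict[str, Set[str]],
) -> List[str]:
    """Map predicted targets onto proteins of another organism.

    Assumption: sharing at least one domain annotation is enough
    to transfer a prediction (illustrative rule only).
    """
    hits: List[str] = []
    for target in predicted_targets:
        domains = target_domains.get(target, set())
        for protein, protein_domains in organism_domains.items():
            # Keep each matching protein once, in order of discovery.
            if domains & protein_domains and protein not in hits:
                hits.append(protein)
    return hits


# Hypothetical annotations, for illustration only.
target_domains = {
    "T. gondii DHFR": {"DHFR_domain"},
    "Human kinase X": {"kinase_domain"},
}
plasmodial_domains = {
    "P. falciparum DHFR": {"DHFR_domain", "TS_domain"},
    "P. falciparum protein Y": {"other_domain"},
}

extrapolated = extrapolate_by_domain(
    ["T. gondii DHFR"], target_domains, plasmodial_domains
)
print(extrapolated)
```

With these toy annotations, a prediction for T. gondii DHFR is transferred to the plasmodial protein carrying the same domain label, whereas a kinase prediction transfers to nothing.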
Figure 2. PCA of the compounds annotated as actives against plasmodial DHFR (green) as well as T. gondii DHFR (red). Overall, plasmodial DHFR inhibitors cover a substantial portion of the chemical space occupied by T. gondii DHFR inhibitors. However, some clusters of T. gondii DHFR inhibitors are located in additional chemical space not covered by the plasmodial inhibitors (red boxes). These clusters contain compounds with bicyclic ring systems. By contrast, plasmodial inhibitors only contain unfused rings (green boxes). These observations explain why recall is low (~35%) when plasmodial DHFR inhibitors are excluded from the training set: T. gondii inhibitors do not cover all relevant chemical space, particularly the space occupied by compounds with unfused ring systems.
Figure 3. Performance of the DHFR target prediction model compared across a number of parameters. 145 data points annotated against plasmodial DHFR were used as a test set to assess the performance of the target prediction model. The top n predicted non-plasmodial targets were considered (n was varied between 1 and 12), after which these targets were extrapolated to plasmodial targets. As n increases, recall rises to a maximum of 36% (with recall values of ~35% for n = 3 and n = 4), whereas precision is 100% for n ≥ 2. The high precision is likely explained by the fact that plasmodial DHFR inhibitors and T. gondii DHFR inhibitors occupy the same chemical space. In addition to varying the parameter n, we performed a 2-fold cross validation (averaged over 20 randomizations), which raised recall to 79% (standard deviation 10.1%, shown as an error bar). These results show that domain-based extrapolation adds value to the prediction algorithm (correct predictions are made even when bioactivity data on plasmodial DHFR is not present in the training set) and that including plasmodial DHFR bioactivity data in the training set can drastically improve recall.
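As a rough illustration of how recall and precision depend on the number n of top-ranked targets considered, the sketch below counts a compound as predicted for a target when that target appears among its first n ranked predictions. The exact evaluation protocol of the study is not given here, so the function, compound IDs and target labels are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def recall_precision_at_n(
    ranked: Dict[str, List[str]],
    actives: Set[str],
    target: str,
    n: int,
) -> Tuple[float, float]:
    """Recall and precision when only the top-n ranked targets count."""
    # Compounds for which the target of interest is ranked within the top n.
    predicted = {cid for cid, targets in ranked.items() if target in targets[:n]}
    true_positives = len(predicted & actives)
    recall = true_positives / len(actives) if actives else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


# Toy ranked predictions (best first); all names are illustrative.
ranked = {
    "c1": ["DHFR", "TS"],
    "c2": ["TS", "DHFR"],
    "c3": ["TS", "KIN"],
    "c4": ["KIN", "TS"],
}
actives = {"c1", "c2", "c3"}  # compounds known to be active on DHFR

for n in (1, 2):
    recall, precision = recall_precision_at_n(ranked, actives, "DHFR", n)
    print(n, round(recall, 2), round(precision, 2))
```

In this toy case, widening n from 1 to 2 recovers one more active compound without adding a false positive, so recall increases while precision stays at 1.0.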
PCM, Family QSAR and Family QSAM performance on the PCM dataset
| Model | | | $R^2_{0,\mathrm{test}}$ | $\mathrm{RMSE}_{\mathrm{test}}$ (pIC50 units) |
|---|---|---|---|---|
| GBM PCM | 0.75 | 0.64 | 0.79 | 0.59 |
| GP PCM | 0.75 | 0.65 | 0.76 | 0.63 |
| RF PCM | 0.74 | 0.66 | 0.77 | 0.62 |
| SVM PCM | 0.76 | 0.63 | 0.77 | 0.62 |
| Family QSAM | 0.07 | 1.24 | 0.09 | 1.22 |
| Family QSAR | 0.61 | 0.80 | 0.63 | 0.78 |
| Inductive Transfer | 0.72 | 0.68 | 0.76 | 0.63 |
Abbreviations: QSAM, Quantitative Structure-Activity Modelling; QSAR, Quantitative Structure-Activity Relationship; GBM, Gradient Boosting Machine; GP, Gaussian Process; RF, Random Forest; SVM, Support Vector Machine.
The best PCM model (GBM), with $R^2_{0,\mathrm{test}}$ and $\mathrm{RMSE}_{\mathrm{test}}$ values of 0.79 and 0.59 pIC50 units, respectively, outperforms both Family QSAR (0.63 and 0.78 pIC50 units) and Family QSAM (0.09 and 1.22 pIC50 units).
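For readers unfamiliar with the two metrics, the following sketch computes RMSE and a coefficient of determination for regression through the origin. The formula for the latter is an assumption: it follows one common convention (observed values regressed on predictions with zero intercept), which may differ from the definition used in the study. The pIC50 values are made up for illustration.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r2_through_origin(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """R^2 for a zero-intercept fit of observed on predicted (assumed convention)."""
    # Slope of the least-squares line through the origin.
    k = sum(o * p for o, p in zip(observed, predicted)) / sum(p * p for p in predicted)
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - k * p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical pIC50 values for a small test set.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.5, 6.5, 7.5]

print(rmse(observed, predicted))
print(r2_through_origin(observed, predicted))
```

A lower RMSE and a value of the second metric closer to 1 both indicate better agreement between predicted and observed pIC50 values, which is the sense in which the PCM figures above are compared with the Family QSAR and Family QSAM figures.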
Figure 4. Complementarity between in silico target prediction and PCM. The target prediction algorithm predicted 534 compounds of the GSK TCAMS dataset to interact with DHFR, representing 3.95% of the total number of compounds in this dataset. Out of these 534 compounds, the PCM model predicted 23 compounds to have a pIC50 value of 7 or greater. Therefore, the combination of both methods permits the assessment of compound polypharmacology and provides quantitative bioactivity predictions.
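The way the two methods are combined can be sketched as a set operation: compounds flagged by the target prediction step are intersected with compounds whose predicted pIC50 passes a potency cut-off. The sketch uses "7 or greater" as worded in this caption (the abstract words the cut-off as pIC50 > 7); the function name and compound IDs are hypothetical, and only the counts 534 and 13,533 come from the text.

```python
from typing import Dict, Set


def combine_predictions(
    target_hits: Set[str],
    pcm_pic50: Dict[str, float],
    cutoff: float = 7.0,
) -> Dict[str, Set[str]]:
    """Split compounds by agreement between the two prediction methods."""
    # Compounds the quantitative model predicts to be potent (assumed >= cutoff).
    potent = {cid for cid, value in pcm_pic50.items() if value >= cutoff}
    return {
        "both": target_hits & potent,
        "pcm_only": potent - target_hits,
        "target_only": target_hits - potent,
    }


# Toy data with made-up compound IDs and predicted pIC50 values.
target_hits = {"cpd_A", "cpd_B", "cpd_C"}
pcm_pic50 = {"cpd_A": 7.4, "cpd_B": 6.1, "cpd_C": 7.0, "cpd_D": 8.2}

groups = combine_predictions(target_hits, pcm_pic50)
print(sorted(groups["both"]))

# Share of the dataset flagged by target prediction, using the reported counts.
share_percent = 100 * 534 / 13533
print(round(share_percent, 2))
```

The last two lines reproduce the arithmetic behind the reported percentage: 534 of 13,533 compounds is about 3.95%.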
Figure 5. Compounds predicted to interact with DHFR by the target prediction algorithm, and predicted by the PCM model to have a pIC50 value higher than 7. Compound IDs correspond to the TCMDC identifier given in the original dataset. The 23 compounds whose IDs are accompanied by an upward-pointing arrow were identified by both methods. The two compounds predicted to have a pIC50 value higher than 7 by the PCM model, but not predicted to interact with DHFR by the target prediction algorithm, are accompanied by a downward-pointing arrow. The 23 compounds predicted to be high-affinity DHFR inhibitors (upward-pointing arrows) share a common scaffold: a 5-methylpyrido[2,3-d]pyrimidine-2,4-diamine ring with an aryl substituent in the 6-position. Overall, these data indicate high agreement between the target prediction algorithm and the PCM model in identifying high-affinity DHFR inhibitors.