Daniel S Murrell, Isidro Cortes-Ciriano, Gerard J P van Westen, Ian P Stott, Andreas Bender, Thérèse E Malliavin, Robert C Glen.
Abstract
BACKGROUND: In silico predictive models have proved to be valuable for the optimisation of compound potency, selectivity and safety profiles in the drug discovery process.
Keywords: Ensemble; Learning; PCM; Package; QSAR; QSPR; R; Workflow
Year: 2015 PMID: 26322135 PMCID: PMC4551546 DOI: 10.1186/s13321-015-0086-2
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Overview of camb functionalities. camb provides an open and seamless framework for bioactivity/property modelling (QSAR, QSPR, QSAM and PCM) including: (1) compound standardisation, (2) molecular and protein descriptor calculation, (3) pre-processing and feature selection, model training, visualisation and validation, and (4) bioactivity/property prediction for new molecules. In the first instance, compound structures are subjected to a common representation with the function StandardiseMolecules. Proteins are encoded with 8 types of amino acid and/or 13 types of full protein sequence descriptors, whereas camb enables the calculation of 905 1D physicochemical descriptors for small molecules, and 14 types of fingerprints, such as Morgan or Klekota fingerprints. Molecular descriptors are statistically pre-processed, e.g., by centering their values to zero mean and scaling them to unit variance. Subsequently, single or ensemble machine learning models can be trained, visualised and validated. Finally, the camb function PredictExternal allows the user (1) to read an external set of molecules with a trained model, (2) to apply the same processing to these new molecules, and (3) to output predictions for this external set. This ensures that the same standardisation options and descriptor types are used when a model is applied to make predictions for new molecules.
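The pre-processing step described above (centering to zero mean, scaling to unit variance, and reusing the same processing for new molecules) can be illustrated with a minimal sketch. It is written in plain Python rather than with camb itself; the function names `fit_scaler` and `apply_scaler` and the descriptor values are hypothetical and serve only to show why the training-set statistics must be stored and reapplied to an external set.

```python
from statistics import mean, stdev

def fit_scaler(rows):
    """Return per-descriptor (mean, sample standard deviation) from the training set."""
    columns = list(zip(*rows))
    return [(mean(col), stdev(col)) for col in columns]

def apply_scaler(rows, params):
    """Centre and scale each descriptor using the stored training statistics."""
    return [
        [(value - mu) / sd if sd > 0 else 0.0 for value, (mu, sd) in zip(row, params)]
        for row in rows
    ]

# Hypothetical descriptor values (two descriptors per molecule), for illustration only.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
external = [[2.0, 40.0]]

params = fit_scaler(train)                        # statistics come from the training set only
train_scaled = apply_scaler(train, params)        # zero mean, unit variance per descriptor
external_scaled = apply_scaler(external, params)  # same transformation for new molecules
```

Scaling the external set with its own mean and variance would place the new molecules on a different scale from the one the model was trained on, which is the inconsistency a single prediction entry point is meant to prevent.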
Fig. 2 PCA output from PCM. PCA of the binding site amino acid descriptors corresponding to the 11 mammalian cyclooxygenases considered in the second case study (Proteochemometrics). Binding site amino acid descriptors (5 Z-scales) were input to the function PCA. The first two principal components (PCs) explained more than 80% of the variance. This indicates that there are mainly two sources of variability in the data. To generate the plot, we used the function PCAPlot with the default options. Cyclooxygenases cluster into two distant groups, which correspond to the isoenzyme type, i.e. COX-1 and COX-2. Given that small molecules tend to display similar binding profiles within orthologues [43], we hypothesised that merging bioactivity data from paralogues and orthologues would lead to more predictive PCM models [28].
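The statement that the first two PCs explain a given share of the variance corresponds to the ratio of the leading eigenvalues of the descriptor covariance matrix to its trace. The sketch below shows that calculation in plain Python with power iteration and deflation. It is not the camb PCA function; the helper names and the four data points are hypothetical, and power iteration assumes well-separated leading eigenvalues.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the descriptor columns."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenvalues(matrix, k, iterations=200):
    """Top-k eigenvalues of a symmetric positive semi-definite matrix
    (power iteration followed by deflation)."""
    p = len(matrix)
    work = [row[:] for row in matrix]
    eigenvalues = []
    for _component in range(k):
        # Arbitrary normalised start vector; assumed not orthogonal to the target.
        start_norm = math.sqrt(sum((i + 1) ** 2 for i in range(p)))
        vec = [(i + 1) / start_norm for i in range(p)]
        for _step in range(iterations):
            product = [sum(work[i][j] * vec[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in product))
            if norm == 0.0:
                break
            vec = [x / norm for x in product]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(p) for j in range(p))
        eigenvalues.append(value)
        # Deflation: remove the component that was just found.
        for i in range(p):
            for j in range(p):
                work[i][j] -= value * vec[i] * vec[j]
    return eigenvalues

def explained_variance_ratio(rows, k):
    """Fraction of the total variance captured by each of the first k PCs."""
    cov = covariance_matrix(rows)
    total = sum(cov[i][i] for i in range(len(cov)))
    return [value / total for value in leading_eigenvalues(cov, k)]

# Hypothetical two-descriptor data, for illustration only.
points = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
ratios = explained_variance_ratio(points, 2)
```

In this toy example the variance along the first axis is four times that along the second, so the first component accounts for 80% of the total.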
Cross-validation and testing metrics for the single and ensemble QSPR models trained on the compound solubility dataset
| Algorithm | R²CV | RMSECV | R²test | RMSEtest |
|---|---|---|---|---|
| A: single models | | | | |
| GBM | 0.90 | 0.59 | 0.93 | 0.52 |
| RF | 0.89 | 0.62 | 0.91 | 0.59 |
| SVM radial | 0.88 | 0.63 | 0.91 | 0.60 |
| B: model ensembles | | | | |
| Greedy | – | 0.57 | 0.93 | 0.51 |
| Linear stacking | 0.90 | 0.57 | 0.93 | 0.51 |
| RF stacking | 0.89 | 0.62 | 0.92 | 0.55 |
The lowest RMSE value on the test set, namely 0.51, was obtained with the greedy and with the linear stacking ensembles.
GBM Gradient Boosting Machine, RF Random Forest, RMSE root mean square error, SVM Support Vector Machine.
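For readers who want to reproduce metrics of this kind, the following sketch computes RMSE and a coefficient of determination from observed and predicted values. The exact R² variant used for the table is not stated in this excerpt, so the definition below (one minus the ratio of residual to total sum of squares) is an assumption, and the example values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```

A single large residual dominates both metrics here, which is why RMSE and R² are usually reported together with an observed-versus-predicted plot.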
Fig. 3 Observed vs predicted for both case studies. Observed against predicted values on the test set corresponding to (a) the compound solubility (LogS) dataset (case study 1: QSPR), and (b) the cyclooxygenase (COX) inhibition dataset (case study 2: PCM). Both (a) and (b) were generated with the function CorrelationPlot. The area defined by the blue lines comprises 1 LogS unit (a) and 1 pIC50 unit (b). Both plots were generated using the predictions on the test set calculated with the Linear Stacking ensembles (Tables 1, 3). Overall, high predictive power is attained on the test set for both datasets, with respective RMSE/R² values of 0.51/0.93 (a), and 0.73/0.63 (b). Taken together, these data indicate that ensemble modelling leads to higher predictive power, although this increase might be marginal for some datasets (b).
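Linear stacking can be illustrated as an ordinary least-squares fit in which the predictions of the single models act as the input variables. The sketch below is a generic plain-Python formulation and not the camb implementation; the function names and the numbers are hypothetical. In practice the meta-model is usually fitted on cross-validated (held-out) predictions of the single models to limit overfitting.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_linear_stack(observed, base_predictions):
    """Least-squares coefficients [intercept, w_1, ..., w_k] via the normal equations."""
    design = [[1.0] + list(preds) for preds in zip(*base_predictions)]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, observed)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict_linear_stack(coefficients, base_predictions):
    """Combine single-model predictions with the fitted coefficients."""
    return [
        coefficients[0] + sum(w * p for w, p in zip(coefficients[1:], preds))
        for preds in zip(*base_predictions)
    ]

# Hypothetical held-out predictions from two single models, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
model_a = [1.5, 2.5, 3.0, 4.5]
model_b = [0.5, 1.5, 3.0, 3.5]

coefficients = fit_linear_stack(observed, [model_a, model_b])
stacked = predict_linear_stack(coefficients, [model_a, model_b])
```

Here the two models err in opposite directions, so the fitted combination weights them equally and cancels their errors.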
Cyclooxygenase inhibition dataset ("Results and discussion" section, case study 2)
| UniProt ID | Isoenzyme | Organism | Number of datapoints |
|---|---|---|---|
| P23219 | 1 | | 1,346 |
| O62664 | 1 | | 48 |
| P22437 | 1 | | 50 |
| O97554 | 1 | | 11 |
| P05979 | 1 | | 442 |
| Q63921 | 1 | | 23 |
| P35354 | 2 | | 2,311 |
| O62698 | 2 | | 21 |
| Q05769 | 2 | | 305 |
| P79208 | 2 | | 341 |
| P35355 | 2 | | 39 |
We extracted the bioactivity data for 11 mammalian cyclooxygenases from ChEMBL 16 [2]. The final bioactivity selection comprised 3,228 distinct compounds.
Cross-validation and testing metrics for the single and ensemble PCM models trained on the COX dataset
| Algorithm | R²CV | RMSECV | R²test | RMSEtest |
|---|---|---|---|---|
| A: single models | | | | |
| GBM | 0.59 | 0.77 | 0.60 | 0.76 |
| RF | 0.60 | 0.78 | 0.61 | 0.79 |
| SVM | 0.61 | 0.75 | 0.60 | 0.76 |
| B: model ensembles | | | | |
| Greedy ensemble | – | 0.73 | 0.63 | 0.73 |
| Linear stacking | 0.63 | 0.73 | 0.63 | 0.73 |
| EN stacking | 0.63 | 0.72 | 0.62 | 0.72 |
| SVM linear stacking | 0.63 | 0.73 | 0.62 | 0.73 |
| SVM radial stacking | 0.63 | 0.73 | 0.63 | 0.73 |
| RF stacking | 0.61 | 0.76 | 0.58 | 0.77 |
Combining single models trained with different algorithms into model ensembles can increase predictive ability. We obtained the highest R²test value, 0.63, together with an RMSEtest of 0.73 pIC50 units, with the greedy ensemble and with the following model stacking techniques: (1) linear, and (2) SVM radial.
EN Elastic Net, GBM Gradient Boosting Machine, RF Random Forest, RMSE root mean square error in prediction, SVM Support Vector Machines.
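A greedy ensemble can be sketched as forward selection with replacement: at each round, the single model whose addition most reduces the RMSE of the averaged prediction is added, and the final weights are the selection frequencies. This is one common formulation and is shown here as an assumption; the excerpt does not specify the exact procedure behind the greedy ensemble in the table, and the function names and data below are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def greedy_ensemble_weights(observed, model_predictions, rounds=10):
    """Forward selection with replacement; returns a weight per model (summing to 1)."""
    names = list(model_predictions)
    counts = {name: 0 for name in names}
    running_sum = [0.0] * len(observed)
    picked = 0
    for _round in range(rounds):
        best_name, best_error = None, float("inf")
        for name in names:
            # Averaged prediction if this model were added once more.
            candidate = [
                (total + pred) / (picked + 1)
                for total, pred in zip(running_sum, model_predictions[name])
            ]
            error = rmse(observed, candidate)
            if error < best_error:
                best_name, best_error = name, error
        running_sum = [
            total + pred for total, pred in zip(running_sum, model_predictions[best_name])
        ]
        counts[best_name] += 1
        picked += 1
    return {name: counts[name] / picked for name in names}

def ensemble_predict(weights, model_predictions):
    """Weighted average of the single-model predictions."""
    n = len(next(iter(model_predictions.values())))
    return [
        sum(weights[name] * model_predictions[name][i] for name in weights)
        for i in range(n)
    ]

# Hypothetical held-out predictions from two single models, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "model_a": [1.5, 2.5, 3.5, 4.5],  # systematically too high
    "model_b": [0.5, 1.5, 2.5, 3.5],  # systematically too low
}

weights = greedy_ensemble_weights(observed, predictions, rounds=10)
combined = ensemble_predict(weights, predictions)
```

Because the two hypothetical models have opposite biases, the selection alternates between them and the resulting ensemble has a lower RMSE than either model alone.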