Maris Lapins, Staffan Arvidsson, Samuel Lampa, Arvid Berg, Wesley Schaal, Jonathan Alvarsson, Ola Spjuth.
Abstract
Lipophilicity is a major determinant of ADMET properties and overall suitability of drug candidates. We have developed large-scale models to predict the water-octanol distribution coefficient (logD) for chemical compounds, aiding drug discovery projects. Using ACD/logD data for 1.6 million compounds from the ChEMBL database, models are created and evaluated by a support-vector machine with a linear kernel using conformal prediction methodology, outputting prediction intervals at a specified confidence level. The resulting model shows a predictive ability of [Formula: see text], and the best performing nonconformity measure gives a median prediction interval of [Formula: see text] log units at 80% confidence and [Formula: see text] log units at 90% confidence. The model is available as an online service via an OpenAPI interface and a web page with a molecular editor, and we also publish predictive values at 90% confidence level for 91 M PubChem structures in RDF format for download and as a URI resolver service.
Keywords: Conformal prediction; LogD; Machine learning; QSAR; RDF; Support-vector machine
Year: 2018 PMID: 29616425 PMCID: PMC5882484 DOI: 10.1186/s13321-018-0271-1
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Workflow of 10-fold cross-conformal predictor. The training set is randomly permuted and split into ten non-overlapping folds. An inductive conformal predictor (pink area) is trained for each split, using a single fold as its calibration set and the remaining nine folds as its proper training set. Proper training sets are used for fitting the endpoint and error models. Calibration sets are used to evaluate the predictive ability of the model and to accumulate a list of compound nonconformity values. For any new prediction, each inductive predictor gives an endpoint prediction (single-value prediction) and produces a prediction interval based on the predicted error, the desired confidence and the list of nonconformity values. The final prediction is computed by aggregating the individual predictions using the median midpoint and median interval width
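The workflow in Fig. 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: an ordinary least-squares fit stands in for the linear-kernel support-vector machine, the error model is fitted to absolute residuals on the proper training set, and a normalized nonconformity measure is used. The function names (`fit_linear`, `predict_linear`, `cross_conformal_predict`) and the `floor` parameter are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept (stand-in for the SVM; an assumption)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def cross_conformal_predict(X, y, X_new, confidence=0.8, n_folds=10,
                            seed=0, floor=1e-6):
    """Return (lower, upper) interval bounds for each row of X_new."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    midpoints, widths = [], []
    for k in range(n_folds):
        calib = folds[k]  # one fold is the calibration set
        proper = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Endpoint model and error model are both fitted on the proper training set.
        endpoint = fit_linear(X[proper], y[proper])
        resid = np.abs(y[proper] - predict_linear(endpoint, X[proper]))
        error = fit_linear(X[proper], resid)
        # Nonconformity values on the calibration set: |error| / predicted error.
        err_cal = np.maximum(predict_linear(error, X[calib]), floor)
        alphas = np.sort(np.abs(y[calib] - predict_linear(endpoint, X[calib])) / err_cal)
        # Nonconformity value at the requested confidence level.
        n = len(alphas)
        rank = int(np.ceil(confidence * (n + 1)))
        alpha_q = alphas[rank - 1] if rank <= n else np.inf
        err_new = np.maximum(predict_linear(error, X_new), floor)
        midpoints.append(predict_linear(endpoint, X_new))
        widths.append(2.0 * alpha_q * err_new)
    # Aggregate the inductive predictors: median midpoint, median interval width.
    mid = np.median(midpoints, axis=0)
    width = np.median(widths, axis=0)
    return mid - width / 2.0, mid + width / 2.0
```

Calling `cross_conformal_predict(X, y, X_new, confidence=0.9)` with NumPy arrays returns one lower and one upper bound per new compound; raising the confidence level widens the intervals.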
Predictive ability of models, expressed as squared correlation coefficient between acd_logd values in the ChEMBL database and predicted logD values for 100,000 test set compounds
| Epsilon | Cost 0.001 | Cost 0.01 | Cost 0.1 | Cost 1 | Cost 10 |
|---|---|---|---|---|---|
| | 0.509 | 0.509 | 0.510 | | |
| | 0.820 | 0.821 | 0.821 | 0.821 | |
| | 0.918 | 0.943 | 0.949 | 0.952 | 0.952 |
| | 0.923 | 0.958 | 0.970 | | |
| | 0.958 | 0.971 | 0.972 | | |
Bolditalic values indicate models with the highest predictive ability
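For readers who want to reproduce a figure of merit like the one tabulated above, the sketch below computes a squared Pearson correlation between observed and predicted values. The symbol in the caption did not survive extraction, so reading "squared correlation coefficient" as the squared Pearson correlation is an assumption; the function name `squared_correlation` is hypothetical.

```python
import numpy as np

def squared_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(observed, predicted)[0, 1]
    return float(r ** 2)
```

Note that this measure is insensitive to a constant offset or scaling of the predictions, which is one reason it is usually reported together with an error measure such as RMSEP.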
Fig. 2 Predictive ability of the best model. Plotted are acd_logd values (x-axis) versus the predicted logD values (y-axis) for 100,000 test set compounds. The root mean square error of prediction (RMSEP) is 0.41 log units
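The RMSEP quoted in the caption is the square root of the mean squared difference between observed and predicted values over the test set. A minimal sketch, with the hypothetical function name `rmsep`:

```python
import math

def rmsep(observed, predicted):
    """Root mean square error of prediction over paired observed/predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equally long")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few compounds with large deviations raise RMSEP more than many compounds with small ones.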
Median prediction interval width at confidence levels from 10 to 99%
| Epsilon | Nonconformity measure | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 95% | 99% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Abs-diff | 0.109 | 0.221 | 0.336 | 0.462 | 0.604 | 0.771 | 0.986 | 1.284 | 1.813 | 2.237 | 3.841 |
| | Normalized | 0.122 | 0.243 | 0.362 | 0.478 | 0.595 | | | | | | |
| | Log-normalized, | 0.155 | 0.257 | 0.387 | 0.560 | 0.801 | 1.171 | 1.812 | 3.291 | 5.273 | 10.879 | |
| | Log-normalized, | 0.074 | 0.159 | 0.260 | 0.763 | 1.080 | 1.599 | 2.689 | 4.031 | 7.676 | | |
| | Abs-diff | 0.069 | 0.139 | 0.211 | 0.288 | 0.378 | 0.629 | 0.843 | 1.245 | | | |
| | Normalized | 0.079 | 0.157 | 0.233 | 0.311 | 0.395 | 0.491 | 1.918 | 7.194 | | | |
| | Log-normalized, | 0.519 | 0.772 | 1.223 | 2.311 | 3.918 | 10.157 | | | | | |
| | Log-normalized, | 0.044 | 0.097 | 0.163 | 0.245 | 0.356 | 0.509 | 0.741 | 1.137 | 2.030 | 3.233 | 7.204 |
| | Abs-diff | 0.065 | 0.132 | 0.201 | 0.270 | 0.354 | | | | | | |
| | Normalized | 0.075 | 0.148 | 0.220 | 0.293 | 0.376 | 0.474 | 0.605 | 0.824 | 1.445 | 2.664 | 12.199 |
| | Log-normalized, | 0.341 | 0.495 | 0.738 | 1.171 | 2.205 | 3.747 | 10.007 | | | | |
| | Log-normalized, | 0.042 | 0.095 | 0.158 | 0.235 | 0.486 | 0.710 | 1.095 | 1.963 | 3.156 | 7.247 | |
Shown are median prediction interval (MPI) widths at confidence levels (validity) from 10 to 99%. A smaller median prediction interval indicates higher efficiency of a nonconformity measure. Results are grouped by epsilon value. Italics mark the best model at each epsilon value and confidence level; bold italics mark the overall best model at each confidence level
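The quantities compared in this table can be illustrated with a short sketch. The exact definitions of the nonconformity measures are not given in this section, so the forms below are assumptions taken from common usage in conformal regression: absolute error, absolute error divided by a predicted error, and absolute error divided by the exponential of a predicted log-error. The smoothing parameter `beta` and all function names are hypothetical. The last two helpers compute efficiency (median interval width) and empirical validity (fraction of true values inside their intervals).

```python
import numpy as np

def abs_diff(y, y_hat):
    """Nonconformity as plain absolute error (assumed form)."""
    return np.abs(np.asarray(y) - np.asarray(y_hat))

def normalized(y, y_hat, err_hat, beta=0.0):
    """Absolute error scaled by a predicted error plus beta (assumed form)."""
    return abs_diff(y, y_hat) / (np.asarray(err_hat) + beta)

def log_normalized(y, y_hat, log_err_hat, beta=0.0):
    """Absolute error scaled by exp(predicted log-error) plus beta (assumed form)."""
    return abs_diff(y, y_hat) / (np.exp(np.asarray(log_err_hat)) + beta)

def median_interval_width(lower, upper):
    """Efficiency: median width of the prediction intervals (smaller is better)."""
    return float(np.median(np.asarray(upper) - np.asarray(lower)))

def empirical_validity(y, lower, upper):
    """Fraction of true values that fall inside their prediction interval."""
    y, lower, upper = np.asarray(y), np.asarray(lower), np.asarray(upper)
    return float(np.mean((lower <= y) & (y <= upper)))
```

For a valid predictor, `empirical_validity` on held-out data should be close to the requested confidence level, so nonconformity measures are compared on `median_interval_width` at equal confidence.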
Fig. 3 Example of prediction intervals. Shown are intervals at 80% confidence level for 29 reference compounds from [36] and 72 compounds from [37]. Grey arrows mark the compounds exemplified in Fig. 4
Fig. 4 Examples of molecule gradients for the prediction of logD. Shown are gradients for the four compounds indicated by arrows in Fig. 3. Upper row: atenolol and sotalol. Lower row: tolnaftate and amiodarone
Fig. 5 Example of a compound with a large prediction interval as seen in the prediction service user interface. One compound that gives rise to a large prediction interval in Fig. 3 is strychnine (upper bound of the prediction interval: 1.498). Here it is drawn using the prediction service available at http://predict-cplogd.os.pharmb.io/