Jiazhen He, Eva Nittinger, Christian Tyrchan, Werngard Czechtizky, Atanas Patronov, Esben Jannik Bjerrum, Ola Engkvist.
Abstract
Molecular optimization aims to improve the drug profile of a starting molecule. It is a fundamental problem in drug discovery but challenging due to (i) the requirement of simultaneous optimization of multiple properties and (ii) the large chemical space to explore. Recently, deep learning methods have been proposed to solve this task by mimicking the chemist's intuition in terms of matched molecular pairs (MMPs). Although MMP analysis is widely used by medicinal chemists, it offers limited capability for exploring the space of structural modifications and therefore does not cover the complete space of solutions. Often, more general transformations beyond MMPs are feasible and/or necessary, e.g. simultaneous modifications of the starting molecule at different places, including the core scaffold. This study aims to provide a general methodology that offers more general structural modifications beyond MMPs. In particular, the same Transformer architecture is trained on different datasets, each consisting of molecular pairs that reflect a different type of transformation. Beyond MMP transformations, datasets reflecting general structural changes are constructed from ChEMBL using two approaches: Tanimoto similarity (which allows multiple modifications) and scaffold matching (which allows multiple modifications but keeps the scaffold constant). We investigate how the model behavior can be altered by tailoring the dataset while using the same model architecture. Our results show that models trained on differently prepared datasets transform a given starting molecule in a way that reflects the nature of the dataset used for training. These models could complement each other and give chemists different options for improving a starting molecule.
Keywords: ADMET; Matched molecular pairs; Molecular optimization; Scaffold; Tanimoto similarity; Transformer
Year: 2022 PMID: 35346368 PMCID: PMC8962145 DOI: 10.1186/s13321-022-00599-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Input and output of the Transformer model (following [25]). The input is the concatenation of property change tokens and the SMILES of the starting molecule. During training, the output is the target molecule with the desirable properties, while during inference the output is generated token by token and is expected to satisfy the property constraint in the input
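The input construction described in the caption above can be sketched as follows. This is a minimal illustration, not the authors' code: the regular expression is a commonly used SMILES tokenization pattern assumed here, and the names `tokenize_smiles` and `build_source_sequence` as well as the ASCII spelling of the property tokens are hypothetical.

```python
import re

# Regex tokenizer for SMILES; an assumed, commonly used pattern, not
# necessarily the one used by the authors.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[=#\-+\\/:~@?>*$().]|%\d{2}|\d)"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"cannot tokenize SMILES: {smiles!r}")
    return tokens


def build_source_sequence(property_tokens, smiles):
    """Concatenate property change tokens with the tokenized starting molecule."""
    return list(property_tokens) + tokenize_smiles(smiles)


# Hypothetical example: three property change tokens followed by the SMILES tokens.
source = build_source_sequence(
    ["LogD_change_(-0.1, 0.1]", "Solubility_no_change", "Clearance_no_change"],
    "CC(=O)Nc1ccccc1",
)
```

Each property change is a single token, so the desired change occupies a fixed number of positions at the start of the sequence regardless of the size of the molecule.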
Table 1 Property change encoding
| Property | Measured unit | Threshold | Threshold in log10 | Designed property change tokens |
|---|---|---|---|---|
| LogD | - | - | - | LogD_change_(−inf, −6.9] |
| | | | | ... |
| | | | | LogD_change_(−0.3, −0.1] |
| | | | | LogD_change_(−0.1, 0.1] |
| | | | | LogD_change_(0.1, 0.3] |
| | | | | ... |
| | | | | LogD_change_(6.9, inf] |
| Solubility | | low: ≤50 | low: ≤1.7 | Solubility_low |
| | | high: >50 | high: >1.7 | Solubility_high |
| | | | | Solubility_no_change |
| Clearance | | low: ≤20 | low: ≤1.3 | Clearance_low |
| | | high: >20 | high: >1.3 | Clearance_high |
| | | | | Clearance_no_change |
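The encoding in the table above can be sketched in a few lines. This is an illustration under stated assumptions: the LogD intervals are taken to be 0.2 wide with edges at odd tenths (as the listed tokens suggest), a value above the threshold counts as high and anything else as low, and the arrow-style token names for a category switch (`low->high`, `high->low`) are hypothetical spellings. The function names are not from the source.

```python
from bisect import bisect_left

# Interval edges -6.9, -6.7, ..., 6.7, 6.9 (assumed 0.2-wide bins).
LOGD_EDGES = [(2 * k + 1) / 10 for k in range(-35, 35)]


def logd_change_token(delta):
    """Map a LogD change to the token of the half-open interval (lo, hi] containing it."""
    i = bisect_left(LOGD_EDGES, delta)  # index of the first edge >= delta
    lo = "-inf" if i == 0 else f"{LOGD_EDGES[i - 1]:.1f}"
    hi = "inf" if i == len(LOGD_EDGES) else f"{LOGD_EDGES[i]:.1f}"
    return f"LogD_change_({lo}, {hi}]"


def category_change_token(name, start, target, threshold):
    """Encode a low/high category change for solubility or clearance."""
    before = "high" if start > threshold else "low"
    after = "high" if target > threshold else "low"
    if before == after:
        return f"{name}_no_change"
    return f"{name}_{before}->{after}"  # hypothetical token spelling
```

For example, `logd_change_token(0.2)` yields `LogD_change_(0.1, 0.3]`, and `category_change_token("Solubility", 30, 80, 50)` yields `Solubility_low->high`.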
Fig. 2 Tanimoto similarity distribution over all possible unique pairs within the same publication
Table 2 Dataset
| Datasets | Training (2000-2017) | Validation (2018) | Test (2019-2020) |
|---|---|---|---|
| MMPs | 2,287,588 | 143,978 | 166,582 |
| Similarity (≥ 0.5) | 6,543,684 | 418,180 | 475,070 |
| Similarity ([0.5,0.7)) | 4,543,472 | 286,682 | 327,606 |
| Similarity (≥ 0.7) | 2,000,212 | 131,498 | 147,464 |
| Scaffold | 2,850,180 | 171,914 | 199,786 |
| Scaffold generic | 4,127,058 | 255,580 | 289,034 |
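How a pair is assigned to the similarity-based datasets can be sketched as below. The sketch assumes fingerprints are given as sets of on-bit indices (in practice they would come from a cheminformatics toolkit, which the source does not name here); `tanimoto` and `similarity_datasets` are illustrative names. The final assertions check that, in the table above, the two similarity bands add up to the full Similarity (≥ 0.5) dataset in every split.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_datasets(sim):
    """Return the similarity-based datasets a pair with this similarity belongs to."""
    if sim < 0.5:
        return []
    band = "Similarity ([0.5, 0.7))" if sim < 0.7 else "Similarity (>= 0.7)"
    return ["Similarity (>= 0.5)", band]


# Consistency check on the dataset sizes: the two bands partition the >= 0.5 set.
assert 4_543_472 + 2_000_212 == 6_543_684  # training
assert 286_682 + 131_498 == 418_180        # validation
assert 327_606 + 147_464 == 475_070        # test
```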
Table 3 Property prediction model performance on in-house data
| | LogD | Solubility | Clearance |
|---|---|---|---|
| Train size | 186,575 | 197,988 | 155,652 |
| Train RMSE | 0.295 | 0.489 | 0.271 |
| Train NRMSE | 0.025 | 0.056 | 0.053 |
| Train R² | 0.942 | 0.775 | 0.760 |
| Test size | 20,731 | 21,999 | 17,295 |
| Test RMSE | 0.395 | 0.600 | 0.352 |
| Test NRMSE | 0.038 | 0.076 | 0.091 |
| Test R² | 0.897 | 0.659 | 0.555 |
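For readers who want to reproduce this kind of evaluation, the sketch below computes RMSE, NRMSE and the coefficient of determination, which is assumed here to be the third reported metric. Normalizing RMSE by the range of the observed values is also an assumption, since the source does not state which normalization was used. The function name `regression_metrics` is illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, range-normalized RMSE and R^2 for paired observations."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    return {
        "rmse": rmse,
        # Assumption: NRMSE = RMSE / (max - min) of the observed values.
        "nrmse": rmse / (max(y_true) - min(y_true)),
        "r2": 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5])
```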
Fig. 3 Overlap of training molecular pairs among different datasets. Exemplar molecular pairs are shown for data found only in the Similarity (≥ 0.5), Scaffold generic and MMP datasets, respectively
Table 4 Performance comparison of Transformer and baselines in terms of successful property constraints, successful structure constraints and both metrics simultaneously
| Dataset | Model | Successful property constraints (%) | Successful structure constraints (%) | Successful property and structure constraints (%) |
|---|---|---|---|---|
| MMP | Transformer | | 91.55 | |
| | Transformer-U | 33.67 | 93.25 | 31.85 |
| | Random | 13.44±0.43 | 100 | 13.44±0.43 |
| Similarity (≥ 0.5) | Transformer | | 82.30 | |
| | Transformer-U | 29.04 | 83.63 | 25.32 |
| | Random | 15.17±0.27 | 100 | 15.17±0.27 |
| Similarity ([0.5,0.7)) | Transformer | | 68.09 | |
| | Transformer-U | 26.23 | 69.13 | 18.72 |
| | Random | 14.57±0.37 | 100 | 14.57±0.37 |
| Similarity (≥ 0.7) | Transformer | | 82.68 | |
| | Transformer-U | 39.57 | 84.83 | 34.70 |
| | Random | 11.48±0.29 | 100 | 11.48±0.29 |
| Scaffold | Transformer | | 95.32 | |
| | Transformer-U | 37.16 | 95.69 | 36.26 |
| | Random | 17.22±0.74 | 100 | 17.22±0.74 |
| Scaffold generic | Transformer | | 96.01 | |
| | Transformer-U | 32.55 | 96.30 | 31.69 |
| | Random | 16.48±0.41 | 100 | 16.48±0.41 |
The results in bold indicate the best values; higher values are better
Each model is trained on the corresponding dataset for that row
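The three metrics in the table can be computed from per-molecule flags, as in the following sketch. It assumes each generated molecule has already been checked against the desired property change and against the structure constraint of the dataset; the record layout and the name `success_rates` are hypothetical.

```python
def success_rates(records):
    """Percentage of generated molecules meeting property, structure and both constraints."""
    n = len(records)
    prop = sum(1 for r in records if r["property_ok"])
    struct = sum(1 for r in records if r["structure_ok"])
    both = sum(1 for r in records if r["property_ok"] and r["structure_ok"])
    return {
        "property": 100.0 * prop / n,
        "structure": 100.0 * struct / n,
        "property_and_structure": 100.0 * both / n,
    }


# Toy example with four generated molecules.
rates = success_rates([
    {"property_ok": True, "structure_ok": True},
    {"property_ok": True, "structure_ok": False},
    {"property_ok": False, "structure_ok": True},
    {"property_ok": False, "structure_ok": True},
])
```

The combined rate can never exceed either individual rate, and when every molecule satisfies the structure constraint it equals the property rate, which is the pattern seen in the Random rows (structure 100%, identical first and last columns).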
Fig. 4 Tanimoto similarity distribution for the Similarity (≥ 0.5), Similarity ([0.5,0.7)), Similarity (≥ 0.7), MMP, Scaffold and Scaffold generic datasets. Legend: Train, the molecular pairs from the training set; Generated desirable property, the pairs between the generated molecules that fulfil the property constraints and their starting molecules from the test set; Generated desirable property+structure, the pairs between the generated molecules that fulfil both the property and structure constraints and their starting molecules from the test set; Generated desirable propertystructure, the pairs between the generated molecules that fulfil the property but not the structure constraints and their starting molecules from the test set
Table 5 Performance comparison of the Transformer models trained on different types of molecular pairs on the restricted intersection test set (numbers in brackets represent the absolute increase or decrease compared to the corresponding Transformer model performance on the original test set in Table 4)
| Test set | Type of molecular pairs where Transformer is trained | Successful property constraints (%) | Successful structure constraints (%) | Successful property and structure constraints (%) |
|---|---|---|---|---|
| Restricted intersection | MMP | | 91.68 ( | |
| | Similarity (≥ 0.5) | 55.55 ( | 84.47 ( | 48.97 ( |
| | Similarity ([0.5,0.7)) | | | |
| | Similarity (≥ 0.7) | 65.39 ( | 81.49 ( | 55.55 ( |
| | Scaffold | 62.91 ( | 94.42 ( | 60.70 ( |
| | Scaffold generic | 59.07 ( | | 57.68 ( |
The extremes (best/worst performance or largest/smallest change) are highlighted in bold
Fig. 5 Comparison of heatmaps for training set and test set. The more similar, the better. a Relationship between the training molecular pairs of different datasets, e.g. the number 0.2 with Similarity ([0.5, 0.7)) as row and MMP as column from the training set means that 20% of the pairs with Similarity ([0.5, 0.7)) are also MMPs. b Each row represents the model trained on the corresponding dataset, and each column represents the corresponding structure constraints. The number 0.22 with Similarity ([0.5, 0.7)) as row and MMP as column from the Restricted intersection test set means that, among the molecules generated by the Transformer model trained on the Similarity ([0.5, 0.7)) dataset that fulfil the property constraints and structure constraints (i.e. Similarity ([0.5, 0.7))), 22% are MMPs. The diagonal for the Restricted intersection is always 1 because we only look at the generated molecules that already fulfil the property constraints and structure constraints
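The row-normalized overlap shown in panel a can be sketched as below, assuming each dataset is available as a set of (starting molecule, target molecule) pairs. The name `overlap_matrix` and the toy data are illustrative only.

```python
def overlap_matrix(datasets):
    """For each row dataset, the fraction of its pairs that also occur in each column dataset."""
    names = list(datasets)
    return {
        row: {
            col: len(datasets[row] & datasets[col]) / len(datasets[row])
            for col in names
        }
        for row in names
    }


# Toy example: every MMP pair is also a similarity pair, but not vice versa.
toy = {
    "MMP": {("A", "B"), ("C", "D")},
    "Similarity": {("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")},
}
matrix = overlap_matrix(toy)
```

The matrix is not symmetric because each row is normalized by its own dataset size: here `matrix["Similarity"]["MMP"]` is 0.5 while `matrix["MMP"]["Similarity"]` is 1.0, and the diagonal is always 1.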
Table 6 Test sets where big property changes (logD change is above 1; solubility and clearance change is either low→high or high→low) are desired
| Test set | Size | Percentage (%) |
|---|---|---|
| MMP | 6,180 | 3.7 |
| Similarity (≥ 0.5) | 18,546 | 3.9 |
| Similarity ([0.5, 0.7)) | 15,130 | 4.6 |
| Similarity (≥ 0.7) | 3,416 | 2.3 |
| Scaffold | 6,252 | 3.1 |
| Scaffold generic | 10,514 | 3.6 |
| Merged | 21,652 | - |
Size indicates the number of data points where big property changes are desired; Percentage indicates the fraction of the original test set in Table 2 with data points that have big property changes, e.g. 6180/166582 ≈ 3.7%
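The percentages can be recomputed directly from the two tables, as in this short check. The test set sizes are copied from the dataset table and the subset sizes from the table above; the dictionary keys are plain-text labels chosen for this sketch.

```python
# (big-change subset size, original test set size) per dataset
sizes = {
    "MMP": (6_180, 166_582),
    "Similarity (>= 0.5)": (18_546, 475_070),
    "Similarity ([0.5, 0.7))": (15_130, 327_606),
    "Similarity (>= 0.7)": (3_416, 147_464),
    "Scaffold": (6_252, 199_786),
    "Scaffold generic": (10_514, 289_034),
}

# Percentage of each test set where a big property change is desired, to one decimal.
percentages = {
    name: round(100.0 * subset / total, 1)
    for name, (subset, total) in sizes.items()
}
```

The Merged row has no percentage because it pools the subsets across datasets rather than being a fraction of a single test set.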
Table 7 Performance comparison of Transformer models trained on different types of molecular pairs on the Merged dataset where big property changes are desired (numbers in brackets represent the absolute increase/decrease compared to the corresponding Transformer model performance on the original test set in Table 4)
| Test set | Type of molecular pairs where Transformer is trained | Successful property constraints (%) | Successful structure constraints (%) | Successful property and structure constraints (%) |
|---|---|---|---|---|
| Merged | MMP | | 83.89 ( | |
| | Similarity (≥ 0.5) | 39.81 ( | 75.00 ( | 30.70 ( |
| | Similarity ([0.5,0.7)) | 38.33 ( | | 25.94 ( |
| | Similarity (≥ 0.7) | | 68.57 ( | |
| | Scaffold | 36.50 ( | 89.17 ( | 33.60 ( |
| | Scaffold generic | 37.78 ( | | 35.26 ( |
The extremes (best/worst performance or largest/smallest change) are highlighted in bold
Fig. 6 Example of diverse molecules with desirable properties generated by models trained on b MMPs, c pairs with Similarity (≥ 0.5) and d pairs with Similarity ([0.5, 0.7)). The changes in the generated molecules compared with the starting molecule are highlighted in red. Sim represents Tanimoto similarity
Fig. 7 Example of diverse molecules with desirable properties generated by models trained on b pairs with Similarity (≥ 0.7), c pairs sharing scaffold and d pairs sharing generic scaffold. The changes in the generated molecules compared with the starting molecule are highlighted in red. Sim represents Tanimoto similarity