Oleksii Prykhodko, Simon Viet Johansson, Panagiotis-Christos Kotsias, Josep Arús-Pous, Esben Jannik Bjerrum, Ola Engkvist, Hongming Chen.
Abstract
Deep learning methods applied to drug discovery have been used to generate novel structures. In this study, we propose a new deep learning architecture, LatentGAN, which combines an autoencoder and a generative adversarial…
Keywords: Autoencoder networks; Deep learning; Generative adversarial networks; Molecular design
Year: 2019 PMID: 33430938 PMCID: PMC6892210 DOI: 10.1186/s13321-019-0397-9
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Workflow of the LatentGAN. The latent vectors generated by the encoder part of the heteroencoder are used as the input for the GAN. Once the GAN has been trained, new compounds are generated by first sampling the generator network of the GAN and then converting the sampled latent vector into a molecular structure using the decoder component of the heteroencoder
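The two-stage sampling step described in the Fig. 1 caption can be sketched as follows. This is only an illustration under stated assumptions: `sample_compounds`, `generator`, `decoder`, `noise_dim` and the Gaussian noise source are hypothetical names and choices introduced here, not the authors' code, and the toy stand-ins at the bottom merely make the sketch runnable.

```python
import random


def sample_compounds(generator, decoder, n, noise_dim, seed=0):
    """Sample n structures: noise -> generator -> latent vector -> decoder -> SMILES.

    generator and decoder are caller-supplied callables (hypothetical stand-ins
    for a trained GAN generator and the heteroencoder's decoder).
    """
    rng = random.Random(seed)
    smiles = []
    for _ in range(n):
        # Assumption: the generator is fed standard Gaussian noise.
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        latent = generator(noise)        # sampled latent vector
        smiles.append(decoder(latent))   # latent vector -> molecular structure
    return smiles


# Toy stand-ins so the sketch runs; real models would replace these.
def toy_generator(noise):
    return [2.0 * x for x in noise]


def toy_decoder(latent):
    return "C" * (1 + int(abs(latent[0])) % 3)


samples = sample_compounds(toy_generator, toy_decoder, n=5, noise_dim=4)
```

Passing the trained models in as plain callables keeps the sampling loop independent of any particular deep learning framework.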
Targeted data set and the performance of the SVM models
| Target | Training set | Test set | SVM model ROC-AUC | SVM model kappa value |
|---|---|---|---|---|
| EGFR | 2949 | 2326 | 0.850 | 0.56 |
| HTR1A | 48,283 | 23,048 | 0.993 | 0.90 |
| S1PR1 | 49,381 | 23,745 | 0.995 | 0.91 |
Training set size (training set), test set size (test set), receiver operating characteristic area under the curve (ROC-AUC), kappa value
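For readers unfamiliar with the two SVM metrics in the table above, here is a minimal pure-Python sketch of how ROC-AUC and Cohen's kappa can be computed for binary labels. The function names and the toy data are illustrative assumptions; the source does not state which implementation the authors used.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of items on which both labelings agree.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if p_e == 1.0:
        return 1.0  # both labelings are constant and identical
    return (p_o - p_e) / (1.0 - p_e)


# Toy example (made-up labels and scores, not data from the study).
labels = [1, 1, 0, 0]
auc = roc_auc(labels, [0.9, 0.4, 0.5, 0.1])
kappa = cohen_kappa(labels, [1, 0, 0, 0])
```

The pairwise ROC-AUC is quadratic in the number of compounds, so it is suited to illustration rather than to test sets of the size listed in the table.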
The performance of the heteroencoder on both the training and test sets
| Dataset | # compounds | Validity (%) | Reconstruction error (%) |
|---|---|---|---|
| Training set | 974,105 | 99 | 18 |
| Test set | 10,823 | 98 | 20 |
Percent of valid SMILES strings generated by the decoder (validity), percent of molecules not reconstructed correctly from valid SMILES (reconstruction error)
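The two heteroencoder metrics defined above can be sketched as below. This assumes that reconstruction error is measured only over decoder outputs that are valid, and that `is_valid` and `canonical` are caller-supplied helpers (for example backed by a cheminformatics toolkit); both helpers and the function name are hypothetical.

```python
def heteroencoder_report(inputs, decoded, is_valid, canonical):
    """Return (validity %, reconstruction error %) for paired SMILES lists.

    inputs:    original SMILES strings
    decoded:   decoder outputs, in the same order as inputs
    is_valid:  hypothetical callable, True if a SMILES string parses
    canonical: hypothetical callable mapping a SMILES to a canonical form
    """
    if len(inputs) != len(decoded) or not inputs:
        raise ValueError("inputs and decoded must be non-empty and equal length")
    # Keep only pairs whose decoded string is a valid SMILES.
    valid_pairs = [(src, out) for src, out in zip(inputs, decoded) if is_valid(out)]
    validity = 100.0 * len(valid_pairs) / len(inputs)
    if not valid_pairs:
        return validity, None  # error is undefined without valid outputs
    # Assumption: a molecule counts as reconstructed when canonical forms match.
    wrong = sum(canonical(src) != canonical(out) for src, out in valid_pairs)
    return validity, 100.0 * wrong / len(valid_pairs)


# Toy example with trivial helpers (an empty string stands for an invalid SMILES).
validity, error = heteroencoder_report(
    ["CCO", "CCN", "CCC", "CCCl"],
    ["CCO", "", "CCN", "CCCl"],
    is_valid=lambda s: s != "",
    canonical=lambda s: s,
)
```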
Fig. 2 Plot of the first two PCA components (explained variance 74.1%) of a set of 200,000 generated molecules from the ChEMBL LatentGAN model using the MQN fingerprint
Metrics obtained from a 50,000 SMILES sample of all the models trained
| Dataset | Arch. | Valid (%) | Unique (%) | Novel (%) | Active (%) | Recovered actives/total actives (%) | Recovered neighbors |
|---|---|---|---|---|---|---|---|
| EGFR | GAN | 86 | 56 | 97 | 71 | 5.26 | 196 |
| EGFR | RNN | 96 | 46 | 95 | 65 | 7.74 | 238 |
| HTR1A | GAN | 86 | 66 | 95 | 71 | 5.05 | 284 |
| HTR1A | RNN | 96 | 50 | 90 | 81 | 7.28 | 384 |
| S1PR1 | GAN | 89 | 31 | 98 | 44 | 0.93 | 24 |
| S1PR1 | RNN | 97 | 35 | 97 | 65 | 3.72 | 43 |
Dataset used (Dataset), Architecture used (Arch.), Percent of valid molecules in the sampled set (Valid), Percent of valid unique compounds (Unique), Percent of unique novel (not present in the training set) compounds (Novel), Percent of unique active compounds (Active), Recovered actives from the test set given the entire number of actives in the test set (Recovered actives/Total Actives), Recovered neighbors of active compounds using FCFP6 fingerprint with 2048 bits and a threshold Tanimoto similarity of 0.7
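As a rough illustration of the definitions above, the sketch below computes the Valid, Unique and Novel percentages and a Tanimoto-based neighbor count. Several points are assumptions, not statements from the source: Unique is taken relative to the valid compounds and Novel relative to the unique ones, SMILES are assumed to be already canonical, fingerprints are represented as sets of on-bit indices, a neighbor is counted when similarity is at or above the threshold, and `is_valid` is a hypothetical caller-supplied check.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sample_metrics(sampled, training, is_valid):
    """Percent valid, unique (of valid) and novel (of unique) in a sample."""
    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    valid = [s for s in sampled if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)  # not present in the training set
    return {
        "valid": pct(len(valid), len(sampled)),
        "unique": pct(len(unique), len(valid)),
        "novel": pct(len(novel), len(unique)),
    }


def count_recovered_neighbors(generated_fps, active_fps, threshold=0.7):
    """Count actives with at least one generated fingerprint at or above threshold."""
    return sum(
        any(tanimoto(active, gen) >= threshold for gen in generated_fps)
        for active in active_fps
    )


# Toy example ("bad" stands for an unparsable SMILES string).
metrics = sample_metrics(
    ["CCO", "CCO", "bad", "CCN"], training=["CCO"], is_valid=lambda s: s != "bad"
)
neighbors = count_recovered_neighbors([{1, 2, 3, 4}], [{1, 2, 3}, {7, 8}])
```

In practice the fingerprints would come from a cheminformatics toolkit (the table legend specifies FCFP6 with 2048 bits); the set-based representation here only keeps the example dependency-free.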
Fig. 3 Venn diagram of LatentGAN (red) and RNN (blue) active compounds/scaffolds
Fig. 4 The distribution of Murcko scaffold similarity (left) and FCFP6 Tanimoto compound similarity (right) to the training set of molecules generated by LatentGAN models for a EGFR, b S1PR1 and c HTR1A
Fig. 5 PCA analysis for the a EGFR (explained variance 82.8%), b HTR1A (explained variance 75.0%) and c S1PR1 (explained variance 79.3%) datasets. The red dots are the training set, the blue dots are the predicted inactive compounds in the sampled set, and the other dots are the predicted actives in the sampled set with different levels of probability of being active
Fig. 6 The same PCA analysis, showing the Murcko scaffold similarities of the predicted active compounds for a EGFR (explained variance 80.2%), b HTR1A (explained variance 74.1%) and c S1PR1 (explained variance 71.3%). Note that, because of the smaller number of compounds in the outlier region of c, the image has been rotated slightly. No significant relationship between the scaffold similarities and the regions was found. For a separation of the generated points by similarity interval, see Additional file 1
Fig. 7 Examples generated by the LatentGAN. Compounds 1–3 are generated by the EGFR model, 4–6 by the HTR1A model and 7–9 by the S1PR1 model
Fig. 8 QED distributions of sampled molecules from EGFR (a), HTR1A (b) and S1PR1 (c)
Fig. 9 SA distributions of sampled molecules from EGFR (a), HTR1A (b) and S1PR1 (c)