Rustam Zhumagambetov, Daniyar Kazbek, Mansur Shakipov, Daulet Maksut, Vsevolod A Peshkov, Siamac Fazli.
Abstract
Several recent ML algorithms for de novo molecule generation have been used to create an open-access database of virtual molecules. The algorithms were trained on samples from ZINC, a free database of commercially available compounds. Generated molecules from 10 different ML frameworks, along with their calculated properties, were merged into a database and coupled to a web interface that lets users browse the data in a user-friendly and convenient manner. ML-generated molecules with desired structures and properties can be retrieved with the help of a drawing widget. When a specific search returns insufficient results, users can create new molecules on demand. These newly created molecules are added to the existing database, so both the content and the diversity of the database keep growing in line with users' requirements. This journal is © The Royal Society of Chemistry.
Year: 2020 PMID: 35516285 PMCID: PMC9058596 DOI: 10.1039/d0ra07820d
Source DB: PubMed Journal: RSC Adv ISSN: 2046-2069 Impact factor: 4.036
Fig. 1 Graphical representations of the machine learning algorithms used to create the database.
Fig. 2 Screenshots of the two main pages of the application.
Qualitative comparison of the algorithms
| Model | Architecture | Learning technique | Molecule representation | Property targeting | Computational costs | Training dataset size |
|---|---|---|---|---|---|---|
| JT-VAE | VAE | Autoencoder | Graph | Yes | Medium | 250k |
| RNN | RNN | Direct flow | SMILES | No | Low | 250k |
| GrammarVAE | VAE | Autoencoder | SMILES | No | High | 250k |
| ChemVAE | VAE | Autoencoder | SMILES | Yes | Medium | 250k |
| MolCycleGAN | GAN | Direct flow | Latent vector | Yes | Medium | 250k |
| ORGAN | GAN | RL | SMILES | Yes | High | 1 million |
| ORGANIC | GAN | RL | SMILES | Yes | High | 250k |
| SSVAE | VAE | Autoencoder | SMILES | Yes | Medium | 310k |
| CDN | VAE | Autoencoder | SMILES | No | Low | 250k |
| CVAE | VAE | Autoencoder | SMILES | Yes | Medium | 500k |
Fig. 3 (Left) The diagonal of the matrix shows the total number of molecules generated by each method. Entries below the diagonal show the number of identical molecules generated by both methods. (Right) Each entry shows the proportion of molecules shared between the two methods.
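The kind of overlap matrix described in the Fig. 3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `overlap_matrices`, the method names, and the toy molecule strings are hypothetical; molecules are assumed to be stored as canonicalized strings (for example canonical SMILES) so that identical molecules compare equal; and, because the caption does not define the proportion, it is assumed here to be the shared count divided by the size of the smaller of the two sets.

```python
def overlap_matrices(generated):
    """Build count and proportion matrices for sets of generated molecules.

    `generated` maps a method name to an iterable of canonical molecule
    strings. Assumption: string equality means molecule identity.
    """
    names = list(generated)
    sets = {name: set(generated[name]) for name in names}
    n = len(names)
    counts = [[0] * n for _ in range(n)]
    proportions = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(names):
        # Diagonal: total number of unique molecules from one method.
        counts[i][i] = len(sets[a])
        proportions[i][i] = 1.0 if sets[a] else 0.0
        for j in range(i):
            b = names[j]
            # Below the diagonal: molecules produced by both methods.
            shared = len(sets[a] & sets[b])
            counts[i][j] = shared
            # Assumed normalization: divide by the smaller set.
            smaller = min(len(sets[a]), len(sets[b]))
            p = shared / smaller if smaller else 0.0
            proportions[i][j] = proportions[j][i] = p
    return names, counts, proportions


# Toy input with made-up method names and molecules (not data from the paper).
toy = {
    "method_a": ["CCO", "CCN", "c1ccccc1"],
    "method_b": ["CCO", "CCC"],
    "method_c": ["c1ccccc1", "CCO", "CCN", "CC"],
}
names, counts, proportions = overlap_matrices(toy)
for name, row in zip(names, counts):
    print(name, row)
```

Other normalizations, such as the Jaccard index (shared count divided by the size of the union), would fit the caption equally well; only the line computing `p` would change.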
General information about the three databases
| Database | Number of molecules | Exact molecular mass | Number of atoms | Number of chiral centers | Number of rings | Number of bridgehead atoms | Number of heterocycles |
|---|---|---|---|---|---|---|---|
| CheML | 2 899 276 | 373 ± 213 | 26.3 ± 15 | 0.23 ± 0.60 | 2.55 ± 1.02 | 0.023 ± 0.248 | 1.18 ± 0.93 |
| eMolecules | 26 394 586 | 331 ± 87.7 | 22.8 ± 6.46 | 0.0944 ± 0.489 | 2.63 ± 1.16 | 0.036 ± 0.319 | 1.35 ± 0.97 |
| ZINC | 30 000 000 | 321 ± 24.5 | 22.7 ± 2.0 | 1.399 ± 1.017 | 2.55 ± 0.90 | 0.076 ± 0.391 | 1.72 ± 0.95 |
| CheML JT-VAE | 1 399 265 | 323 ± 55 | 22.7 ± 3.9 | 0.0 ± 0.0 | 2.69 ± 0.93 | 0.024 ± 0.25 | 1.41 ± 0.89 |
| CheML RNN | 962 245 | 475 ± 336 | 33.8 ± 23.8 | 0.30 ± 0.66 | 2.37 ± 1.12 | 0.0176 ± 0.23 | 0.80 ± 0.84 |
| CheML GrammarVAE | 239 206 | 326 ± 59 | 22.7 ± 4.25 | 0.86 ± 0.90 | 2.63 ± 0.93 | 0.0238 ± 0.247 | 1.37 ± 0.92 |
| CheML ChemVAE | 99 273 | 333 ± 63 | 22.9 ± 4.6 | 0.81 ± 0.9 | 2.72 ± 1.01 | 0.0540 ± 0.37 | 1.48 ± 0.95 |
| CheML MolCycleGAN | 60 856 | 330 ± 61 | 23.0 ± 4.4 | 0.94 ± 1.00 | 2.75 ± 0.98 | 0.0400 ± 0.32 | 1.45 ± 0.96 |
| CheML ORGAN | 50 262 | 273 ± 58 | 18.6 ± 4.1 | 0.28 ± 0.57 | 2.20 ± 0.75 | 0.029 ± 0.26 | 0.77 ± 0.72 |
| CheML ORGANIC | 42 609 | 222 ± 60 | 15.8 ± 4.39 | 0.74 ± 0.82 | 1.39 ± 0.58 | 0.0022 ± 0.068 | 0.46 ± 0.55 |
| CheML SSVAE | 42 606 | 355 ± 70 | 24.9 ± 4.8 | 0.0 ± 0.0 | 2.89 ± 0.92 | 0.034 ± 0.26 | 1.18 ± 0.87 |
| CheML CDN | 2415 | 385 ± 413 | 26.8 ± 27.8 | 0.44 ± 0.74 | 2.70 ± 1.28 | 0.167 ± 0.7 | 1.35 ± 1.07 |
| CheML CVAE | 539 | 304 ± 19 | 22.3 ± 1.48 | 0.87 ± 1.26 | 2.19 ± 0.52 | 0.0074 ± 0.122 | 0.69 ± 0.69 |
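Each property cell in the table above has the form "mean ± spread". A minimal sketch of how such a cell could be produced from per-molecule values follows. It assumes the spread is a standard deviation and uses the population form (`statistics.pstdev`), since the source does not say which estimator was used. The helper name `summarize` and the toy ring counts are hypothetical and are not data from the paper.

```python
from statistics import mean, pstdev


def summarize(values, digits=2):
    """Format per-molecule values as 'mean ± standard deviation'.

    Assumption: population standard deviation; swap in statistics.stdev
    for the sample estimator.
    """
    return f"{mean(values):.{digits}f} ± {pstdev(values):.{digits}f}"


# Toy ring counts for five hypothetical molecules.
ring_counts = [2, 3, 3, 2, 4]
print(summarize(ring_counts))  # prints "2.80 ± 0.75"
```

For sets with millions of molecules the difference between the population and sample estimators is negligible; it would only matter for the smallest subsets.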