AkshatKumar Nigam1,2, Robert Pollice1,2, Mario Krenn1,2,3, Gabriel Dos Passos Gomes1,2, Alán Aspuru-Guzik1,2,3,4.
Abstract
Inverse design allows the generation of molecules with desirable physical quantities using property optimization. Deep generative models have recently been applied to tackle inverse design, as they possess the ability to optimize molecular properties directly through structure modification using gradients. While the ability to carry out direct property optimizations is promising, the use of generative deep learning models to solve practical problems requires large amounts of data and is very time-consuming. In this work, we propose STONED, a simple and efficient algorithm to perform interpolation and exploration in the chemical space, comparable to deep generative models. STONED bypasses the need for large amounts of data and training times by using string modifications in the SELFIES molecular representation. First, we achieve non-trivial performance on typical benchmarks for generative models without any training. Additionally, we demonstrate applications in high-throughput virtual screening for the design of drugs, photovoltaics, and the construction of chemical paths, allowing for both property and structure-based interpolation in the chemical space. Overall, we anticipate our results to be a stepping stone for developing more sophisticated inverse design models and benchmarking tools, ultimately helping generative models achieve wider adoption. This journal is © The Royal Society of Chemistry.
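The string modifications described in the abstract can be illustrated with a minimal sketch. It assumes a molecule is already available as a list of SELFIES symbols; the small alphabet and the helper names `mutate_tokens` and `sample_mutants` are hypothetical, chosen only for illustration, and this is not the authors' implementation. Decoding the mutated strings back to molecules (for example with the `selfies` package) is left out so that the sketch stays dependency-free.

```python
import random

# Hypothetical toy alphabet of SELFIES-style symbols (an assumption for this sketch).
ALPHABET = ["[C]", "[N]", "[O]", "[F]", "[=C]", "[=N]", "[=O]", "[Branch1]", "[Ring1]"]


def mutate_tokens(tokens, rng):
    """Return a copy of `tokens` with one random replace, insert, or delete."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    operation = rng.choice(["replace", "insert", "delete"])
    if not tokens:
        operation = "insert"  # nothing to replace or delete in an empty list
    elif operation == "delete" and len(tokens) == 1:
        operation = "replace"  # never delete the last remaining symbol
    if operation == "replace":
        index = rng.randrange(len(tokens))
        tokens[index] = rng.choice(ALPHABET)
    elif operation == "insert":
        index = rng.randrange(len(tokens) + 1)
        tokens.insert(index, rng.choice(ALPHABET))
    else:
        index = rng.randrange(len(tokens))
        del tokens[index]
    return tokens


def sample_mutants(tokens, n_samples, seed=0):
    """Collect the unique single-step mutants found in `n_samples` random draws."""
    rng = random.Random(seed)
    mutants = set()
    for _ in range(n_samples):
        mutants.add("".join(mutate_tokens(tokens, rng)))
    return mutants


start = ["[C]", "[C]", "[O]"]  # SELFIES symbols for ethanol (SMILES: CCO)
mutants = sample_mutants(start, n_samples=200)
```

Collecting the results in a set mirrors the idea of counting unique structures after many random mutations; a replacement that happens to pick the original symbol simply reproduces the starting string.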
Year: 2021 PMID: 34123336 PMCID: PMC8153210 DOI: 10.1039/d1sc00231g
Source DB: PubMed Journal: Chem Sci ISSN: 2041-6520 Impact factor: 9.825
Fig. 1 Illustration of string manipulations within STONED to form local chemical subspaces (a, see Section II B) for a given structure, discovering median molecules on the chemical path between two structures (b, see Section II C) and formation of generalized chemical paths between more than two molecules (c, see Section II D).
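One simple way to picture a chemical path between two structures is a symbol-by-symbol morph of one SELFIES string into the other. The sketch below is an assumption about how such a path could be built, not a reproduction of the authors' code: the function name `chemical_path` is hypothetical, the shorter string is padded with the SELFIES no-operation symbol `[nop]`, and differing positions are overwritten in random order.

```python
import random

PAD = "[nop]"  # SELFIES no-operation symbol, used here only as padding


def chemical_path(start, target, seed=0):
    """Morph `start` into `target` one symbol at a time.

    Both inputs are lists of SELFIES symbols. The shorter list is padded
    with `[nop]`, then each differing position is overwritten with the
    target symbol, visiting the positions in random order.
    """
    rng = random.Random(seed)
    length = max(len(start), len(target))
    current = list(start) + [PAD] * (length - len(start))
    goal = list(target) + [PAD] * (length - len(target))

    differing = [i for i in range(length) if current[i] != goal[i]]
    rng.shuffle(differing)

    path = [list(current)]  # the path begins at the (padded) start
    for index in differing:
        current[index] = goal[index]
        path.append(list(current))
    return path


start = ["[C]", "[C]", "[O]"]
target = ["[C]", "[N]", "[C]", "[=O]"]
path = chemical_path(start, target)
```

Each intermediate list is itself a SELFIES string and therefore a candidate structure; different seeds give different orderings and hence different paths between the same two endpoints.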
Number and percentage of unique molecules obtained within different fingerprint-based similarity thresholds (δ) of the starting structures. The molecules in each experiment were generated from 250 000 random string mutations of the starting structures. For celecoxib, we additionally formed the local chemical space with a scaffold constraint.
| Starting structure (method) | Fingerprint | Number of molecules (and percentage) | | |
|---|---|---|---|---|
| Aripiprazole (SELFIES, random) | ECFP4 | 513 (0.25%) | 4206 (2.15%) | 34 416 (17.66%) |
| Albuterol (SELFIES, random) | FCFP4 | 587 (0.32%) | 4156 (2.33%) | 16 977 (9.35%) |
| Mestranol (SELFIES, random) | AP | 478 (0.22%) | 4079 (1.90%) | 45 594 (21.66%) |
| Celecoxib (SELFIES, random) | ECFP4 | 198 (0.10%) | 1925 (1.00%) | 18 045 (9.44%) |
| Celecoxib (SELFIES, terminal 10%) | ECFP4 | 864 (2.02%) | 9407 (21.99%) | 34 187 (79.91%) |
| Celecoxib (SELFIES, central 10%) | ECFP4 | 111 (0.08%) | 1767 (1.32%) | 15 348 (11.45%) |
| Celecoxib (SELFIES, initial 10%) | ECFP4 | 368 (0.53%) | 7345 (10.53%) | 34 702 (49.74%) |
| Celecoxib (SMILES, random) | ECFP4 | 122 (18.43%) | 515 (77.49%) | 662 (100.00%) |
| Celecoxib (SMILES, terminal 10%) | ECFP4 | 90 (20.79%) | 368 (84.99%) | 433 (100.00%) |
| Celecoxib (SMILES, central 10%) | ECFP4 | 114 (22.18%) | 419 (81.52%) | 514 (100.00%) |
| Celecoxib (SMILES, initial 10%) | ECFP4 | 122 (19.71%) | 490 (79.16%) | 619 (100.00%) |
| Celecoxib (DeepSMILES, random) | ECFP4 | 132 (4.43%) | 953 (31.99%) | 2793 (93.76%) |
| Celecoxib (DeepSMILES, terminal 10%) | ECFP4 | 106 (9.73%) | 513 (47.11%) | 1083 (99.45%) |
| Celecoxib (DeepSMILES, central 10%) | ECFP4 | 53 (6.54%) | 162 (19.98%) | 658 (81.13%) |
| Celecoxib (DeepSMILES, initial 10%) | ECFP4 | 105 (9.28%) | 609 (53.80%) | 1106 (97.70%) |
| Celecoxib (SELFIES, scaffold constraint) | ECFP4 | 354 (0.44%) | 6311 (7.79%) | 53 479 (66.07%) |
| Celecoxib (CReM, ChEMBL: SCScore ≤ 2.5) | ECFP4 | 239 (0.58%) | 5547 (13.47%) | 14 887 (36.14%) |
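The counts and percentages in a table of this kind can be computed along the following lines. This is a minimal sketch under stated assumptions: fingerprints are represented as plain Python sets of "on" bit indices rather than real ECFP4 vectors, the threshold values are arbitrary placeholders, the strict `>` comparison is an assumption, and the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def count_within_thresholds(reference, candidates, thresholds):
    """Return {threshold: (count, percentage)} for similarity > threshold."""
    similarities = [tanimoto(reference, c) for c in candidates]
    total = len(similarities)
    summary = {}
    for threshold in thresholds:
        count = sum(1 for s in similarities if s > threshold)
        percentage = 100.0 * count / total if total else 0.0
        summary[threshold] = (count, percentage)
    return summary


# Toy fingerprints (sets of bit indices); the values are illustrative only.
reference = {1, 2, 3, 4}
candidates = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 9}, {7, 8}]
summary = count_within_thresholds(reference, candidates, thresholds=[0.7, 0.3])
```

Here the percentage is taken relative to all candidates passed in, so deduplicating the generated molecules beforehand gives figures relative to the number of unique structures.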
Fig. 2 Systematic local chemical space exploration of celecoxib using mutations of different SELFIES representations. The similarity is calculated using the Tanimoto distance of the ECFP4 fingerprint between celecoxib and the generated structures.
Fig. 3 Systematic local exploration of the 3D similarity space of celecoxib.
Fig. 4 (a) log P and QED values of molecules encountered along chemical paths between tadalafil and sildenafil. (b) Ligand binding affinities of molecules encountered along chemical paths between dihydroergotamine and prinomastat. For both subfigures, the corresponding reference properties are indicated by black lines.
Fig. 5 (Top) Example of molecules along a chemical path between tadalafil and sildenafil, with their corresponding log P and QED scores. (Bottom) Example of a chemical path between dihydroergotamine (binder for 5-HT1B) and prinomastat (binder for CYP2D6). Docking scores for the intermediate structures on both proteins and their joint similarity to the reference structures are provided in the diagram to the right.
Fig. 6 Multi-objective property optimization of potential molecules of interest for photovoltaics. Structural (left) and property similarity (right) of generated median molecules compared to specific sets of three molecules taken from the Harvard Clean Energy (HCE) database. Bar plots for the mean, and error bars for the standard deviation of the mean (2 standard deviations) are shown for the joint similarity and the normalized property distance of the 100 median structures with highest joint similarities to the references, with (Filtered Median) and without (Unfiltered Median) a bridgehead atom filter. They are compared to Random SELFIES and to molecules from the HCE database (Random HCE and Best HCE). The obtained median molecules are very close to Best HCE in joint similarity and slightly better in the properties.
Comparison of algorithms for the generation of molecules. ✓ and ✗ indicate the presence and absence of a feature, respectively. ∼ indicates that implementation of a feature within the algorithm is, in principle, possible but not straightforward or has not been carried out yet.
| Feature | ES | VAE | GAN | RL | STONED |
|---|---|---|---|---|---|
| Expert rule-free | ✗ | ✓ | ✓ | ✓ | ✓ |
| Structure coverage | ∼ | ∼ | ∼ | ∼ | ✓ |
| Interpolatability | ✗ | ✓ | ✓ | ✗ | ✓ |
| Property-based navigation | ∼ | ✓ | ✓ | ✓ | ∼ |
| Training-free | ✓ | ✗ | ✗ | ✗ | ✓ |
| Data independence | ✓ | ✗ | ✗ | ✗ | ✓ |