AkshatKumar Nigam, Robert Pollice, Alán Aspuru-Guzik.
Abstract
Inverse molecular design involves algorithms that sample molecules with specific target properties from a multitude of candidates and can be posed as an optimization problem. High-dimensional optimization tasks in the natural sciences are commonly tackled via population-based metaheuristic optimization algorithms such as evolutionary algorithms. However, the often unavoidable expense of property evaluation can limit the widespread use of such approaches because the associated cost can become prohibitive. Herein, we present JANUS, a genetic algorithm inspired by parallel tempering. It propagates two populations, one for exploration and another for exploitation, improving optimization by reducing the number of property evaluations. JANUS is augmented by a deep neural network that approximates molecular properties and relies on active learning for enhanced molecular sampling. It uses the SELFIES representation and the STONED algorithm for the efficient generation of structures, and it outperforms other generative models in common inverse molecular design tasks, achieving state-of-the-art target metrics across multiple benchmarks. As neither most of the benchmarks nor the structure generator in JANUS accounts for synthesizability, a significant fraction of the proposed molecules is synthetically infeasible, demonstrating that this aspect needs to be considered when evaluating the performance of molecular generative models. This journal is © The Royal Society of Chemistry.
Year: 2022 PMID: 36091415 PMCID: PMC9358752 DOI: 10.1039/d2dd00003b
Source DB: PubMed Journal: Digit Discov ISSN: 2635-098X
Fig. 1 Schematic depiction of the architecture of JANUS. Two populations are propagated in parallel with distinct sets of genetic operators. The exploitative population uses molecular similarity as selection pressure, whereas the explorative population uses a deep neural network estimating molecular properties as selection pressure.
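The two-population idea in the caption above can be illustrated with a toy sketch. This is not the JANUS implementation: bitstrings stand in for molecules, `fitness` stands in for an expensive property evaluation, `surrogate` is a hypothetical cheap estimator playing the role of the deep neural network, and `similarity` is a simple positional match rather than a molecular similarity. Seeding the exploitative population from the best individual of both populations is likewise an assumption made for illustration.

```python
import random

def fitness(bits):
    """Toy stand-in for an expensive property evaluation: counts the 1-bits."""
    return sum(bits)

def surrogate(bits):
    """Hypothetical cheap estimator standing in for the DNN: sees only half the bits."""
    half = len(bits) // 2
    return 2 * sum(bits[:half])

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return tuple(b ^ (rng.random() < rate) for b in bits)

def similarity(a, b):
    """Fraction of matching positions (stand-in for molecular similarity)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def top(candidates, keep):
    """Keep the `keep` fittest unique candidates (elitist truncation)."""
    return sorted(set(candidates), key=lambda c: (fitness(c), c), reverse=True)[:keep]

def step(explore, exploit, rng, keep):
    # Exploration: large mutations, pre-screened by the cheap surrogate.
    children = [mutate(p, 0.2, rng) for p in explore for _ in range(4)]
    screened = sorted(children, key=surrogate, reverse=True)[: 2 * keep]
    # Exploitation: small mutations of the overall best, filtered by similarity to it.
    best = max(explore + exploit, key=fitness)
    local = [mutate(best, 0.02, rng) for _ in range(4 * keep)]
    local = sorted(local, key=lambda c: similarity(c, best), reverse=True)[: 2 * keep]
    return top(explore + screened, keep), top(exploit + local, keep)

def run(generations=30, n_bits=32, keep=10, seed=0):
    rng = random.Random(seed)
    explore = [tuple(rng.randint(0, 1) for _ in range(n_bits)) for _ in range(keep)]
    exploit = list(explore)
    history = []
    for _ in range(generations):
        explore, exploit = step(explore, exploit, rng, keep)
        history.append(max(fitness(c) for c in explore + exploit))
    return history
```

Calling `run()` returns the best fitness after each generation. Because `top` is elitist, that trace never decreases. The point of the pre-screening step is that only candidates surviving the surrogate filter would need the expensive evaluation in a real setting.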
Comparison of JANUS against literature baselines in the maximization of the penalized logarithm of the octanol–water partition coefficient scores. All literature baselines except “EvoMol” were taken directly from other work. The entry denoted as “JANUS” does not use additional selection pressure for the exploration population; “JANUS + P” uses a DNN predictor as additional selection pressure; and the two “JANUS + C” entries use a DNN classifier as additional selection pressure.
| Algorithm | Average of best | Single best | Algorithm | Average of best | Single best |
|---|---|---|---|---|---|
| GVAE | 2.87 ± 0.06 | — | GA + D | 13.31 ± 0.63 | 14.57 |
| SD-VAE | 3.60 ± 0.44 | — | GB-GA | 15.76 ± 5.76 | — |
| ORGAN | 3.52 ± 0.08 | — | EvoMol | 17.71 ± 0.41 | 18.59 |
| CVAE + BO | 4.85 ± 0.17 | — | GA + D | 20.72 ± 3.14 | 23.93 |
| JT-VAE | 4.90 ± 0.33 | — | GEGL | 31.40 ± 0.00 | 31.40 |
| ChemTS | 5.6 ± 0.5 | — | | | |
| GCPN | 7.87 ± 0.07 | — | JANUS | 18.4 ± 4.4 | 20.96 |
| MRNN | — | 8.63 | JANUS + P | 21.0 ± 1.3 | 21.92 |
| MolDQN | — | 11.84 | JANUS + C (50%) | 23.6 ± 6.9 | |
| GraphAF | — | 12.23 | JANUS + C (20%) | 21.9 ± 0.0 | 21.92 |
Average of 10 separate runs with 500 molecules of up to 81 SMILES characters per generation and 100 generations.
Average of 5 separate runs with 500 molecules of up to 81 SMILES characters per generation and 1000 generations.
Average of 5 separate runs with 16 384 molecules of up to 81 SMILES characters per generation and 200 generations.
Average of 15 separate runs with 500 molecules of up to 81 SMILES characters per generation and 100 generations.
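For orientation, the penalized log P objective is commonly written in the molecular design literature as the partition coefficient minus a synthetic accessibility score and a penalty for large rings. The form below is that common convention, stated here as an assumption because this record does not spell it out; individual baselines may normalize the terms differently.

```latex
J(m) = \log P(m) - \mathrm{SA}(m) - \mathrm{RingPenalty}(m)
```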
Fig. 2 Optimization progress of JANUS with four variations of selection pressure in the maximization of the penalized logarithm of the octanol–water partition coefficient (penalized log P). (a) Progress of the median of the highest fitness in each generation across 15 independent runs. (b) Progress of the median-of-medians fitness in each generation of the exploration population across 15 independent runs. The semi-transparent areas in both (a) and (b) depict the fitness intervals between the corresponding first and third quartiles of each generation.
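The summary statistics named in this caption can be computed along the following lines. This is a minimal sketch using only the Python standard library; the function names and the nested-list data layout are assumptions for illustration, and the quartile interpolation method used for the original figure is not stated in the source.

```python
from statistics import median, quantiles

def summarize_runs(best_per_run):
    """best_per_run[r][g] is the highest fitness of run r in generation g.

    Returns one (first quartile, median, third quartile) tuple per generation.
    """
    summary = []
    for gen_values in zip(*best_per_run):
        q1, _, q3 = quantiles(gen_values, n=4, method="inclusive")
        summary.append((q1, median(gen_values), q3))
    return summary

def median_of_medians(population_per_run):
    """population_per_run[r][g] is the list of fitness values of run r in generation g.

    Returns, per generation, the median across runs of each run's population median.
    """
    return [
        median(median(population) for population in generation)
        for generation in zip(*population_per_run)
    ]
```

For example, `summarize_runs([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]])` yields one band per generation that could be plotted as the median line with a shaded interquartile area.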
Comparison of the number of property evaluations needed by JANUS and other molecular design algorithms to reach three threshold property values in the benchmark task of unconstrained maximization of the penalized logarithm of the octanol–water partition coefficient. The entry denoted as “JANUS” does not use additional selection pressure for the exploration population; “JANUS + P” uses a DNN predictor as additional selection pressure; and the two “JANUS + C” entries use a DNN classifier as additional selection pressure.
| Algorithm | Number of evaluations (threshold 1) | Number of evaluations (threshold 2) | Number of evaluations (threshold 3) |
|---|---|---|---|
| GA | 40 500 | >50 000 | >50 000 |
| GA + D | 11 500 | >50 000 | >50 000 |
| EvoMol | 17 500 | 33 500 | >50 000 |
| JANUS | 15 000 | 32 000 | >50 000 |
| JANUS + P | 5000 | 8500 | 16 500 |
| JANUS + C (50%) | 10 000 | 26 000 | 29 500 |
| JANUS + C (20%) | | | |
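A count of evaluations needed to reach a threshold, as reported in this table, can be derived from a best-so-far trace roughly as follows. This is a sketch under the assumption that every generation spends a fixed number of property evaluations; the function name and arguments are hypothetical, and an actual implementation may count differently, for instance by caching repeated molecules.

```python
def evaluations_to_threshold(best_so_far, evals_per_generation, threshold):
    """Return the cumulative number of evaluations at which `threshold` is first reached.

    best_so_far[g] is the best property value seen up to and including generation g.
    Returns None if the threshold is never reached within the trace.
    """
    for generation, best in enumerate(best_so_far, start=1):
        if best >= threshold:
            return generation * evals_per_generation
    return None
```

With a trace of `[1.0, 4.5, 6.2, 6.2, 11.0]` and 500 evaluations per generation, a threshold of 6 is first met in the third generation, that is, after 1500 evaluations.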
Fig. 3 Molecules discovered by JANUS with the highest penalized log P score of 36.62 that are within the 81 SMILES character limit.
Comparison of JANUS against literature baselines for the four imitated inhibition benchmark tasks of the targets GSK3β and JNK3 (A: GSK3β, B: JNK3, C: GSK3β + JNK3, D: GSK3β + JNK3 + QED + SAscore). These benchmarks are evaluated based on 5000 molecules generated by the models. All literature baselines, including GA + D, were taken directly from other papers. The entry denoted as “JANUS” does not use additional selection pressure for the exploration population; “JANUS + P” uses a DNN predictor as additional selection pressure; and the “JANUS + C (50%)” entry uses a DNN classifier as additional selection pressure.
| Method | Success A | Success B | Success C | Success D | Novelty A | Novelty B | Novelty C | Novelty D | Diversity A | Diversity B | Diversity C | Diversity D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JTVAE | 32.2% | 23.5% | 3.3% | 1.3% | 11.8% | 2.9% | 7.9% | — | 0.901 | 0.882 | | — |
| GCPN | 42.4% | 32.3% | 3.5% | 4.0% | 11.6% | 4.4% | 8.0% | — | | 0.884 | 0.874 | — |
| GVAE-RL | 33.2% | 57.7% | 40.7% | 2.1% | 76.4% | 62.6% | 80.3% | — | 0.874 | 0.832 | 0.783 | — |
| REINVENT | 99.3% | 98.5% | 97.4% | 47.9% | 61.0% | 31.6% | 39.7% | 56.1% | 0.733 | 0.729 | 0.595 | 0.621 |
| RationaleRL | | | | 74.8% | 53.4% | 46.2% | 97.3% | 56.8% | 0.888 | 0.862 | 0.824 | 0.701 |
| GA + D | 84.6% | 52.8% | 84.7% | 85.7% | | | | | 0.714 | 0.726 | 0.424 | 0.363 |
| JANUS (no fragments) | 90.6% | 86.4% | 90.4% | 90.2% | 57.9% | 15.9% | 74.9% | 22.8% | 0.850 | 0.807 | 0.681 | 0.728 |
| JANUS | | | | | 80.9% | 40.6% | 77.9% | 32.4% | 0.821 | | 0.876 | 0.831 |
| JANUS + P | | | | | 84.1% | 43.1% | 78.8% | 17.4% | 0.881 | 0.883 | 0.857 | 0.822 |
| JANUS + C (50%) | | | | | 82.9% | 40.4% | 74.4% | 18.4% | 0.884 | | 0.877 | |
Result obtained from using 500 molecules per generation and 278 generations in total.
Result obtained from using 500 molecules per generation and up to 100 generations in total.
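The diversity column can be read with the following sketch in mind. It assumes that diversity is one minus the mean pairwise Tanimoto similarity of the generated molecules, a convention common for this kind of benchmark but not defined in this record. Fingerprints are represented here as plain Python sets of on-bit indices, which is a hypothetical simplification; in practice they would come from a cheminformatics toolkit.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity; 0.0 for fewer than two molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```

Under this definition, a value near 1 means the generated molecules share few fingerprint bits, while a value near 0 means they are close variants of one another.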
Fig. 4 Histograms based on the SYBA scores of the molecules that fulfilled all the respective benchmark conditions generated by JANUS with three variations of selection pressure and two different types of mutations in the four imitated inhibition tasks (a–d). The training dataset provided by the authors of the benchmark and taken from the ChEMBL database (labelled ChEMBL) was used to estimate the reference synthesizability scores.
Comparison of JANUS against literature baselines for the minimization of molecular docking scores to the protein targets 5HT1B, 5HT2B, ACM2 and CYP2D6, respectively. All literature baselines except “GA + D” were taken directly from other papers. The first value corresponds to the docking score; the value in parentheses is the diversity of the 250 molecules with the highest docking scores generated. The entry denoted as “JANUS” does not use additional selection pressure for the exploration population; “JANUS + P” uses a DNN predictor as additional selection pressure; and the “JANUS + C (50%)” entry uses a DNN classifier as additional selection pressure. The ZINC (n%) entries indicate the highest, i.e. worst, docking score of the n% of molecules from the ZINC dataset that have the highest docking scores in that dataset. The Train (n%) entries indicate the highest, i.e. worst, docking score of the n% of molecules from the training set provided for this benchmark that have the highest docking scores in that dataset.
| Method | 5HT1B | 5HT2B | ACM2 | CYP2D6 |
|---|---|---|---|---|
| ZINC (10%) | −9.894 (0.862) | −9.228 (0.851) | −8.282 (0.860) | −8.787 (0.853) |
| ZINC (1%) | −10.496 (0.861) | −9.833 (0.838) | −8.802 (0.840) | −9.291 ( |
| Train (10%) | −10.837 (0.749) | −9.769 (0.831) | −8.976 (0.812) | −9.256 (0.869) |
| Train (1%) | −11.493 (0.849) | −10.023 (0.746) | −10.003 (0.773) | −10.131 (0.763) |
| CVAE | −4.647 ( | −4.188 ( | −4.836 ( | — (—) |
| GVAE | −4.955 (0.901) | −4.641 (0.887) | −5.422 (0.898) | −7.672 (0.714) |
| REINVENT | −9.774 (0.506) | −8.657 (0.455) | −9.775 (0.467) | −8.759 (0.626) |
| GA + D | −8.3 ± 0.5 (0.123) | −8.1 ± 0.9 (0.122) | −7.9 ± 0.3 (0.136) | −8.3 ± 0.5 (0.149) |
| JANUS | −9.6 ± 0.9 (0.126) | −9.8 ± 0.7 (0.133) | −8.1 ± 0.5 (0.112) | −9.1 ± 0.4 (0.166) |
| JANUS + P | −9.9 ± 0.9 (0.132) | −9.8 ± 1.5 (0.166) | −8.0 ± 0.5 (0.125) | −9.3 ± 0.6 (0.194) |
| JANUS + C (50%) | | | | |
Result obtained from using 500 molecules per generation and 25 generations in total.
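The ZINC (n%) and Train (n%) reference rows can be understood through the sketch below. It assumes that "highest docking scores" in the caption means the best, that is, most negative scores, and that each entry is the worst score within that best n% subset. This reading is consistent with the table, where the 1% cut-offs are more negative than the 10% cut-offs. The function name and the rounding of the subset size are assumptions.

```python
import math

def worst_score_of_best_fraction(scores, fraction):
    """Return the least favourable docking score among the best `fraction` of `scores`.

    Lower (more negative) docking scores are treated as better.
    """
    ranked = sorted(scores)  # most negative, i.e. best, first
    k = max(1, math.ceil(fraction * len(ranked)))  # size of the best subset
    return ranked[k - 1]
```

A generated molecule with a docking score below this cut-off would then rank among the best n% of the reference set.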
Fig. 5 Optimization progress of JANUS with three variations of selection pressure in the minimization of the docking scores to the protein targets (a) 5HT1B and 5HT2B, and (b) ACM2 and CYP2D6. Progress is depicted via the median of the highest fitness in each generation across 3 independent runs. The semi-transparent areas in both (a) and (b) depict the fitness intervals between the corresponding 10% and 90% quantiles of each generation.
Fig. 6 Progress of the median-of-medians synthesizability scores (a and b) SYBA and (c and d) SCScore of the molecules generated by JANUS with three variations of selection pressure in the minimization of the docking scores to the protein targets (a and c) 5HT1B and 5HT2B, and (b and d) ACM2 and CYP2D6. Progress is depicted via the median of the corresponding median synthesizability scores in each generation across 3 independent runs. The semi-transparent areas depict the synthesizability score intervals between the corresponding 10% and 90% quantiles of each generation. The training dataset provided by the authors of the benchmark and taken from the ChEMBL database (labelled ChEMBL) was used to estimate the reference synthesizability scores.
Fig. 7 Histograms based on the SYBA scores of the molecules generated by JANUS with three variations of selection pressure in the minimization of the docking scores to the protein targets (a) 5HT1B, (b) 5HT2B, (c) ACM2 and (d) CYP2D6. The training dataset provided by the authors of the benchmark and taken from the ChEMBL database (labelled ChEMBL) was used to estimate the reference synthesizability scores.