Abstract
This paper presents a comparison of a graph-based genetic algorithm (GB-GA) and machine learning (ML) results for the optimization of log P values with a constraint for synthetic accessibility and shows that the GA is as good as or better than the ML approaches for this particular property. The molecules found by the GB-GA bear little resemblance to the molecules used to construct the initial mating pool, indicating that the GB-GA approach can traverse a relatively large distance in chemical space using relatively few (50) generations. The paper also introduces a new non-ML graph-based generative model (GB-GM) that can be parameterized using very small data sets and combined with a Monte Carlo tree search (MCTS) algorithm. The results are comparable to previously published results (Sci. Technol. Adv. Mater., 2017, 18, 972-976) using a recurrent neural network (RNN) generative model, and the GB-GM-based method is several orders of magnitude faster. The MCTS results seem more dependent on the composition of the training set than the GA approach for this particular property. Our results suggest that the performance of new ML-based generative models should be compared to that of more traditional, and often simpler, approaches such as a GA.
Year: 2019 PMID: 30996948 PMCID: PMC6438151 DOI: 10.1039/c8sc05372c
Source DB: PubMed Journal: Chem Sci ISSN: 2041-6520 Impact factor: 9.825
Fig. 1 Two equally likely kinds of crossovers are considered: at non-ring (a–c) and at ring positions (d–f). Two equally likely kinds of ring cuts are considered: adjacent bonds and bonds separated by one bond. For ring crossovers, fragments can be mated using both single and double bonds. (c) and (e) each show two examples of children made by the mating process. Methyl fluoride is discarded because it is too small and the cycloheptene ring is discarded because the ring is too large.
Fig. 2 Overview of mutation operations and their associated probabilities, e.g. if an “append atom” mutation is chosen, then a single bond is added 60% of the time. “∼” indicates an arbitrary bond order.
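The probability-weighted choice described in the Fig. 2 caption can be illustrated with a minimal sketch. Only the 60% weight for single bonds comes from the caption; the split of the remaining 40% between double and triple bonds, and the names `APPEND_ATOM_BOND_WEIGHTS` and `sample_bond_order`, are assumptions made for illustration and are not taken from the paper's implementation.

```python
import random

# 60% for single bonds is stated in the Fig. 2 caption. The 30%/10% split
# between double and triple bonds is an ASSUMPTION for illustration only.
APPEND_ATOM_BOND_WEIGHTS = {"single": 0.60, "double": 0.30, "triple": 0.10}


def sample_bond_order(rng, weights=APPEND_ATOM_BOND_WEIGHTS):
    """Draw one bond order according to the given probability weights."""
    orders = list(weights)
    return rng.choices(orders, weights=[weights[o] for o in orders], k=1)[0]


# Draw many samples with a fixed seed and check the empirical frequency.
rng = random.Random(0)
draws = [sample_bond_order(rng) for _ in range(10_000)]
single_fraction = draws.count("single") / len(draws)
print(f"fraction of single bonds: {single_fraction:.3f}")
```

The same pattern would apply to choosing among the mutation operations themselves, given their probabilities from the figure.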
Probability of the 15 most common 3-atom combinations in rings in the first 1000 structures of the ZINC data set (“ZINC”), and in the 1000 structures generated by the GB-GM method using the ZINC probabilities (“GB-GM (62%)”) and a probability set where the probability of [*][*]–[*] type bonding is increased to 80% (“GB-GM (80%)”)
| Bonding | ZINC | GB-GM (62%) | GB-GM (80%) |
| C | 45% | 41% | 53% |
| C–C–C | 15% | 23% | 21% |
| C–C–N | 9% | 9% | 6% |
| C–N–C | 6% | 7% | 5% |
| C | 4% | 6% | 4% |
| N | 3% | 2% | 2% |
| C | 2% | 2% | 1% |
| C–C–O | 2% | 2% | 2% |
| N | 2% | 0% | 0% |
| C | 2% | 0% | 0% |
| C–O–C | 1% | 1% | 1% |
| C–N–N | 1% | 1% | 1% |
| C | 1% | 1% | 0% |
| C–S–C | 1% | 1% | 1% |
| C | 1% | 1% | 1% |
Maximum J(m) scores averaged over 10 runs, the number of molecules evaluated per run, and the required CPU time per run. See the text for an explanation of the methods. Results for the non-GB methods are taken from the study of Yang et al. (ref. 2), where the number of molecules evaluated per run is estimated from the average number of molecules generated per minute and the CPU time.
| Method | Average | No. molecules | CPU time |
| GB-GA (50%) | 6.8 ± 0.7 | 1000 | 30 seconds |
| GB-GA (1%) | 7.4 ± 0.9 | 1000 | 30 seconds |
| GB-GM-MCTS (62%) | 2.6 ± 0.6 | 1000 | 90 seconds |
| GB-GM-MCTS (80%) | 3.4 ± 0.6 | 1000 | 90 seconds |
| GB-GM-MCTS (80%) | 4.3 ± 0.6 | 5000 | 9 minutes |
| ChemTS | 4.9 ± 0.5 | ∼5000 | 2 hours |
| ChemTS | 5.6 ± 0.5 | ∼20 000 | 8 hours |
| RNN + BO | 4.5 ± 0.2 | ∼4000 | 8 hours |
| Only RNN | 4.8 ± 0.2 | ∼20 000 | 8 hours |
| CVAE + BO | 0.0 ± 0.9 | ∼100 | 8 hours |
| GVAE + BO | 0.2 ± 1.3 | ∼1000 | 8 hours |
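To make the cost comparison in the table above easier to read, the following sketch converts a few rows into molecules evaluated per CPU second. The numbers are copied from the table; the molecule counts for the non-GB methods are estimates, and the timings may come from different hardware, so the ratios are indicative only. The names `runs` and `molecules_per_second` are introduced here for illustration.

```python
# (number of molecules evaluated, CPU time in seconds), copied from the table.
# Counts for ChemTS are the approximate values reported there.
runs = {
    "GB-GA (1%)": (1000, 30),
    "GB-GM-MCTS (80%), 5000 molecules": (5000, 9 * 60),
    "ChemTS, ~5000 molecules": (5000, 2 * 3600),
    "ChemTS, ~20 000 molecules": (20000, 8 * 3600),
}


def molecules_per_second(n_molecules, cpu_seconds):
    """Average number of molecules evaluated per CPU second."""
    return n_molecules / cpu_seconds


throughput = {name: molecules_per_second(n, t) for name, (n, t) in runs.items()}

for name, rate in throughput.items():
    print(f"{name}: {rate:.2f} molecules/s")

# Ratio of throughputs at the same number of evaluated molecules (~5000).
mcts_vs_chemts = (
    throughput["GB-GM-MCTS (80%), 5000 molecules"]
    / throughput["ChemTS, ~5000 molecules"]
)
print(f"GB-GM-MCTS vs ChemTS at ~5000 molecules: {mcts_vs_chemts:.1f}x")
```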
Fig. 3 Highest scoring molecules from the GB-GA (a and b) and GB-GM-MCTS (c and d) searches.
Fig. 4 Plot of the highest J(m) value found as a function of generations for 10 different GB-GA runs with a mutation rate of 1%.
Number of occurrences of different ring-types in the first 1000 structures of the ZINC data set (“ZINC”), and in the 1000 structures generated by the GB-GM method using the ZINC probabilities (“GB-GM (62%)”) and a probability set where the probability of [*][*]–[*] type bonding is increased to 80% (“GB-GM (80%)”)
| Ring-type | ZINC | GB-GM (62%) | GB-GM (80%) |
| [*]1–[*]–[*]1 | 57 | 104 | 57 |
| [*]1–[*]–[*]–[*]1 | 17 | 33 | 10 |
| [*]1–[*]–[*]–[*]–[*]1 | 280 | 15 | 4 |
| [*]1 | 120 | 396 | 408 |
| [*]1 | 470 | 132 | 221 |
| [*]1–[*]–[*]–[*]–[*]–[*]1 | 409 | 64 | 2 |
| [*]1 | 77 | 363 | 104 |
| [*]1 | 100 | 591 | 405 |
| [*]1 | 7 | 321 | 202 |
| [*]1 | 1206 | 479 | 850 |
| 7-Membered ring | 24 | 0 | 0 |
| 8-Membered ring | 1 | 0 | 0 |
| Total | 2768 | 2498 | 2263 |
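Because the three columns have different totals, raw counts are easier to compare once converted to shares. The sketch below copies the counts from the table in row order, checks that each column sums to the reported total, and computes the share of the row labelled `[*]1–[*]–[*]–[*]–[*]–[*]1` in each set. The variable names are illustrative; rows are identified by position only, since several ring-type labels are not distinguishable in this record.

```python
# Counts copied from the table, one tuple per row in table order:
# (ZINC, GB-GM (62%), GB-GM (80%)).
rows = [
    (57, 104, 57),
    (17, 33, 10),
    (280, 15, 4),
    (120, 396, 408),
    (470, 132, 221),
    (409, 64, 2),      # row labelled [*]1–[*]–[*]–[*]–[*]–[*]1
    (77, 363, 104),
    (100, 591, 405),
    (7, 321, 202),
    (1206, 479, 850),
    (24, 0, 0),        # 7-membered ring
    (1, 0, 0),         # 8-membered ring
]
reported_totals = (2768, 2498, 2263)

# Sum each column and compare with the "Total" row of the table.
column_sums = tuple(sum(column) for column in zip(*rows))
print("column sums:", column_sums, "match totals:", column_sums == reported_totals)

# Share of the six-membered, single-bond-only ring row in each column.
six_ring_share = tuple(
    count / total for count, total in zip(rows[5], column_sums)
)
for label, share in zip(("ZINC", "GB-GM (62%)", "GB-GM (80%)"), six_ring_share):
    print(f"{label}: {share:.1%}")
```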