Bowen Liu, Bharath Ramsundar, Prasad Kawthekar, Jade Shi, Joseph Gomes, Quang Luu Nguyen, Stephen Ho, Jack Sloane, Paul Wender, Vijay Pande.
Abstract
We describe a fully data-driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence mapping problem. The end-to-end trained model has an encoder-decoder architecture consisting of two recurrent neural networks, an architecture that has previously shown great success in other sequence-to-sequence prediction tasks such as machine translation. The model is trained on 50,000 experimental reaction examples from the United States patent literature, which span 10 broad reaction types commonly used by medicinal chemists. We find that our model performs comparably with a rule-based expert system baseline model, and that it also overcomes certain limitations associated with rule-based expert systems and with any machine learning approach that contains a rule-based expert system component. Our model provides an important first step toward solving the challenging problem of computational retrosynthetic analysis.
Year: 2017 PMID: 29104927 PMCID: PMC5658761 DOI: 10.1021/acscentsci.7b00303
Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 18.728
Figure 1. Phenylalanine synthetic scheme.
Figure 2. Retrosynthetic reaction prediction task and an example of a possible retrosynthetic disconnection for a target molecule.
Distribution of Major Reaction Classes within the Processed Reaction Data Set
| reaction class | reaction name | no. of examples |
|---|---|---|
| 1 | heteroatom alkylation and arylation | 15122 |
| 2 | acylation and related processes | 11913 |
| 3 | C–C bond formation | 5639 |
| 4 | heterocycle formation | 900 |
| 5 | protections | 650 |
| 6 | deprotections | 8353 |
| 7 | reductions | 4585 |
| 8 | oxidations | 814 |
| 9 | functional group interconversion (FGI) | 1834 |
| 10 | functional group addition (FGA) | 227 |
Figure 3. Seq2seq model architecture.
Figure 4. A partially completed beam search procedure with a beam width of 5 for an example input. Note that only the top 5 candidate sequences are retained at each time step. The visualization was produced using the seq2seq model library from Britz et al.[51]
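The beam search procedure in the caption above can be illustrated with a minimal, generic sketch. This is not the authors' implementation, nor the Britz et al. library: the function `beam_search`, the scoring callback `toy_next_token_log_probs`, and the tiny three-token vocabulary are hypothetical stand-ins for a trained decoder. The sketch only shows the pruning rule the caption describes, namely that the top `beam_width` candidates are retained at each time step.

```python
import math

def beam_search(next_log_probs, start, end, beam_width=5, max_len=10):
    """Generic beam search. `next_log_probs(seq)` returns {token: log_prob}."""
    beams = [(0.0, [start])]   # (cumulative log-probability, token sequence)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (score + lp, seq + [tok])
            for score, seq in beams
            for tok, lp in next_log_probs(seq).items()
        ]
        # Keep only the top `beam_width` candidates at this time step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_width]

def toy_next_token_log_probs(seq):
    # Hypothetical stand-in for a decoder: a fixed next-token distribution.
    return {"C": math.log(0.5), "O": math.log(0.3), "</s>": math.log(0.2)}

results = beam_search(toy_next_token_log_probs, "<s>", "</s>", beam_width=5)
for score, seq in results:
    print(round(score, 3), "".join(seq[1:-1]))
```

In a real seq2seq decoder the callback would condition on the encoder state and the tokens generated so far; here it ignores its input to keep the example self-contained.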
Comparison of Top-N Accuracies between the Baseline and Seq2seq Models
| model | top-1 | top-3 | top-5 | top-10 | top-20 | top-50 |
|---|---|---|---|---|---|---|
| baseline | 35.4 | 52.3 | 59.1 | 65.1 | 68.6 | 69.5 |
| seq2seq | 37.4 | 52.4 | 57.0 | 61.7 | 65.9 | 70.7 |
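A top-N accuracy of the kind reported above can be computed as in the following sketch. It assumes that a prediction counts as correct when the ground-truth string appears anywhere in the first N ranked candidates, matched by exact string comparison; the matching criterion, the function name `top_n_accuracy`, and the toy strings are assumptions made for illustration, not details taken from the source.

```python
def top_n_accuracy(ranked_predictions, ground_truths, n):
    """Percentage of examples whose ground truth is among the first n candidates."""
    hits = sum(
        truth in preds[:n]
        for preds, truth in zip(ranked_predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)

# Hypothetical ranked candidate lists (best first) for four examples.
ranked = [
    ["A", "B", "C"],
    ["D", "E", "F"],
    ["G", "H", "I"],
    ["J", "K", "L"],
]
truths = ["A", "E", "X", "L"]

for n in (1, 2, 3):
    print(f"top-{n}: {top_n_accuracy(ranked, truths, n):.1f}%")
```

By construction, top-N accuracy cannot decrease as N grows, which matches the monotone rows of the table.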
Figure 5. Representative examples of correct seq2seq model predictions for each reaction class.
Breakdown of the Top-10 Accuracy of the Baseline and Seq2seq Models by Reaction Class
| reaction class | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline top-10 accuracy (%) | 77.2 | 84.9 | 53.4 | 54.4 | 6.2 | 26.9 | 74.7 | 68.4 | 46.7 | 73.9 |
| seq2seq top-10 accuracy (%) | 57.5 | 74.6 | 46.1 | 27.8 | 80.0 | 62.8 | 67.8 | 69.1 | 47.3 | 56.5 |
| no. of examples | 1512 | 1191 | 564 | 90 | 65 | 835 | 459 | 81 | 184 | 23 |
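As a consistency check, the per-class accuracies can be combined into an overall top-10 accuracy by weighting each class by its number of examples. The sketch below uses only the numbers from the table above; the variable names are illustrative. The weighted averages come out at 65.1% for the baseline and 61.7% for the seq2seq model, matching the overall top-10 figures reported in the top-N comparison.

```python
# Values copied from the per-class table (reaction classes 1 to 10).
baseline = [77.2, 84.9, 53.4, 54.4, 6.2, 26.9, 74.7, 68.4, 46.7, 73.9]
seq2seq = [57.5, 74.6, 46.1, 27.8, 80.0, 62.8, 67.8, 69.1, 47.3, 56.5]
n_examples = [1512, 1191, 564, 90, 65, 835, 459, 81, 184, 23]

def weighted_accuracy(per_class_acc, counts):
    """Example-weighted mean of the per-class accuracies (in percent)."""
    return sum(a * n for a, n in zip(per_class_acc, counts)) / sum(counts)

total_examples = sum(n_examples)
overall_baseline = weighted_accuracy(baseline, n_examples)
overall_seq2seq = weighted_accuracy(seq2seq, n_examples)

print(total_examples)               # 5004
print(round(overall_baseline, 1))   # 65.1
print(round(overall_seq2seq, 1))    # 61.7
```

The weighting matters because the classes are strongly imbalanced: the large gap in favor of the seq2seq model on class 5 (65 examples) moves the overall figure far less than the smaller gaps on classes 1 and 2.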
Breakdown of the Grammatically Invalid SMILES Error for Different Beam Sizes
| beam size | 1 | 3 | 5 | 10 | 20 | 50 |
|---|---|---|---|---|---|---|
| no. of valid SMILES | 4393 | 12438 | 19751 | 37311 | 71462 | 167281 |
| no. of invalid SMILES | 611 | 2242 | 4450 | 10544 | 24912 | 74605 |
| % error | 12.2 | 15.3 | 18.4 | 22.0 | 25.8 | 30.8 |
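The "% error" row follows from the two rows above it as invalid / (valid + invalid). A short sketch using only the table's counts (variable names are illustrative):

```python
beam_sizes = [1, 3, 5, 10, 20, 50]
valid = [4393, 12438, 19751, 37311, 71462, 167281]
invalid = [611, 2242, 4450, 10544, 24912, 74605]

# Share of generated SMILES strings that are grammatically invalid, in percent.
percent_error = [
    round(100.0 * bad / (ok + bad), 1) for ok, bad in zip(valid, invalid)
]

for size, err in zip(beam_sizes, percent_error):
    print(f"beam size {size:>2}: {err}% invalid")
```

At beam size 1 the valid and invalid counts sum to 5004, one candidate per example in the per-class breakdown. The invalid share rises steadily with beam size, indicating that lower-ranked candidates are more likely to be grammatically invalid.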
Figure 6. Examples of reactant SMILES that are grammatically invalid: (a) reaction class 3 (C–C bond formation); (b) reaction class 7 (reductions).
Figure 7. Examples of reactant SMILES that are grammatically valid, but the overall reaction is chemically implausible: (a) reaction class 2 (acylation and related processes); (b) reaction class 7 (reductions).
Figure 8. Examples of reactant SMILES that are grammatically valid and the overall reaction is chemically plausible: (a) reaction class 1 (heteroatom alkylation and arylation); (b) reaction class 2 (acylation and related processes).
Figure 9. Histogram of the highest rank assigned to the ground truth match in the top-10 predictions of the seq2seq and baseline models for each example. Note that the relative total counts across all the ranks for the seq2seq and baseline models are proportional to their relative top-10 accuracies shown in Table .