Chaochao Yan, Peilin Zhao, Chan Lu, Yang Yu, Junzhou Huang.
Abstract
The goal of retrosynthesis is to recursively decompose a desired molecule into available building blocks. Existing template-based retrosynthesis methods follow a template-selection paradigm and are limited to the templates seen in training, which prevents them from discovering novel reactions. To overcome this limitation, we propose a retrosynthesis prediction framework that can compose novel templates beyond the training templates. As far as we know, this is the first method that uses machine learning to compose reaction templates for retrosynthesis prediction. In addition, we propose an effective reactant candidate scoring model that captures atom-level transformations, which helps our method outperform previous methods on the USPTO-50K dataset. Experimental results show that our method can produce novel templates for 15 USPTO-50K test reactions that are not covered by the training templates. We have released our source implementation.
Keywords: drug discovery; graph neural network; machine learning; reaction template; recurrent neural network; retrosynthesis
Year: 2022 PMID: 36139164 PMCID: PMC9496376 DOI: 10.3390/biom12091325
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. A retrosynthesis example from the USPTO-50K dataset and its template extracted using an open-source toolkit. Note that the product and reactant are atom-mapped. The product and reactant subgraphs in (b) are highlighted in pink within the product and reactant molecule graphs in (a), respectively.
Figure 2. The overall pipeline of our proposed method. Given the desired product shown at the top left, single-step retrosynthesis finds the ground-truth reactant shown at the bottom left. The numbers in blue are the log-likelihoods produced by our models; the log-likelihoods of the template composer model (TCM) and the reactant scoring model (RSM) are combined to obtain the final ranking of the reactants. In this example, combining the log-likelihoods of TCM and RSM helps to find the correct Top-1 reactant.
Figure 3. The workflow of our template composer model: (a) selecting a proper product subgraph from the product subgraph candidates with PSSM, (b) selecting reactant subgraphs sequentially from the reactant subgraph vocabulary with RSSM, and (c) annotating atom mappings between the product and reactant subgraphs to obtain a template.
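The three steps in Figure 3 can be illustrated with a minimal sketch. Everything here is hypothetical: `compose_template`, the `<end>` token, the toy scoring function standing in for RSSM, and the greedy choice at each step are illustrative assumptions, not the released implementation, and the atom-mapping step (c) is only indicated by a comment.

```python
import math

END = "<end>"  # hypothetical end-of-sequence token


def compose_template(product_scores, next_reactant_logprobs, max_steps=5):
    """Greedy sketch of template composition.

    product_scores: {product_subgraph: log-likelihood}, standing in for PSSM.
    next_reactant_logprobs: callable(product, chosen) -> {token: log-prob},
        standing in for RSSM.
    """
    # (a) pick the highest-scoring product subgraph candidate
    product, total = max(product_scores.items(), key=lambda kv: kv[1])

    # (b) pick reactant subgraphs one at a time until the end token wins
    reactants = []
    for _ in range(max_steps):
        logprobs = next_reactant_logprobs(product, tuple(reactants))
        token, logprob = max(logprobs.items(), key=lambda kv: kv[1])
        total += logprob
        if token == END:
            break
        reactants.append(token)

    # (c) atom-mapping annotation is omitted; the pieces are simply joined
    # in an assumed "product>>reactants" layout
    return product + ">>" + ".".join(reactants), total


def toy_rssm(product, chosen):
    """Toy stand-in for RSSM with hand-picked probabilities."""
    if len(chosen) == 0:
        return {"C(=O)O": math.log(0.7), "C(=O)Cl": math.log(0.2), END: math.log(0.1)}
    if len(chosen) == 1:
        return {"N": math.log(0.6), END: math.log(0.4)}
    return {END: math.log(0.9), "N": math.log(0.1)}


product_scores = {"C(=O)N": math.log(0.8), "C=O": math.log(0.2)}
template, loglik = compose_template(product_scores, toy_rssm)
print(template, loglik)
```

Because the reactant side is assembled token by token, the composed template is not restricted to combinations that appeared together in training, which is the property the abstract describes.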
Retrosynthesis evaluation results (%) on USPTO-50K. Existing methods are grouped into two categories. Our method RetroComposer belongs to the template-based methods. The best results in each column are highlighted in bold. RetroXpert* results have been updated by the authors in their GitHub repository (https://github.com/uta-smile/RetroXpert (accessed on 20 March 2022)).
| Methods | Without types, Top-1 | Without types, Top-3 | Without types, Top-5 | Without types, Top-10 | With types, Top-1 | With types, Top-3 | With types, Top-5 | With types, Top-10 |
|---|---|---|---|---|---|---|---|---|
| Template-free methods | | | | | | | | |
| SCROP | 43.7 | 60.0 | 65.2 | 68.7 | 59.0 | 74.8 | 78.1 | 81.1 |
| G2Gs | 48.9 | 67.6 | 72.5 | 75.5 | 61.0 | 81.3 | 86.0 | 88.7 |
| MEGAN | 48.1 | 70.7 | 78.4 | 86.1 | 60.7 | 82.0 | 87.5 | |
| RetroXpert* | 50.4 | 61.1 | 62.3 | 63.4 | 62.1 | 75.8 | 78.5 | 80.9 |
| RetroPrime | 51.4 | 70.8 | 74.0 | 76.1 | 64.8 | 81.6 | 85.0 | 86.9 |
| AT | 53.5 | - | 81.0 | 85.7 | - | - | - | - |
| GraphRetro | 53.7 | 68.3 | 72.2 | 75.5 | 63.9 | 81.5 | 85.2 | 88.1 |
| Dual model | 53.6 | 70.7 | 74.6 | 77.0 | 65.7 | 81.9 | 84.7 | 85.9 |
| Template-based methods | | | | | | | | |
| RetroSim | 37.3 | 54.7 | 63.3 | 74.1 | 52.9 | 73.8 | 81.2 | 88.1 |
| NeuralSym | 44.4 | 65.3 | 72.4 | 78.9 | 55.3 | 76.0 | 81.4 | 85.1 |
| GLN | 52.5 | 69.0 | 75.6 | 83.7 | 64.2 | 79.1 | 85.2 | 90.0 |
| Ours | | | **83.2** | **87.7** | | **85.8** | **89.5** | 91.5 |
| TCM only | 49.6 | 71.7 | 80.8 | 86.4 | 60.9 | 82.3 | 87.5 | 90.9 |
| RSM only | 51.8 | 75.7 | 82.4 | 87.3 | 64.3 | 84.8 | 88.9 | 91.4 |
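The Top-k figures in this table are exact-match accuracies over ranked reactant predictions. As a minimal sketch of how such numbers are typically computed, assuming each prediction and ground truth has already been reduced to a comparable canonical string (the function name and toy data are hypothetical):

```python
def top_k_accuracy(ranked_predictions, ground_truths, ks=(1, 3, 5, 10)):
    """Return {k: percentage of reactions whose ground truth is in the top k}."""
    hits = {k: 0 for k in ks}
    for predictions, truth in zip(ranked_predictions, ground_truths):
        for k in ks:
            # A hit at rank k means the true reactant set is among the first k.
            if truth in predictions[:k]:
                hits[k] += 1
    total = len(ground_truths)
    return {k: 100.0 * hits[k] / total for k in ks}


# Toy data: four "test reactions", each with three ranked candidates.
ranked = [["A", "B", "C"], ["X", "Y", "Z"], ["P", "Q", "R"], ["M", "N", "O"]]
truths = ["A", "Z", "Q", "not predicted"]
accuracy = top_k_accuracy(ranked, truths, ks=(1, 3))
print(accuracy)
```

Top-k accuracy can only stay equal or grow as k increases, which is why every row in the table is non-decreasing from Top-1 to Top-10.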
Ablation study results (%) of two different PSSM loss functions: our proposed Equation (6) and BCE. The bold indicates the best results.
| Types | Loss | Methods | Top-1 | Top-3 | Top-5 | Top-10 |
|---|---|---|---|---|---|---|
| Without | Equation (6) | Ours | | | 83.2 | 87.7 |
| Without | Equation (6) | TCM only | 49.6 | 71.7 | 80.8 | 86.4 |
| Without | Equation (6) | RSM only | 51.8 | 75.7 | 82.4 | 87.3 |
| Without | BCE | Ours | 53.1 | 77.1 | | |
| Without | BCE | TCM only | 46.5 | 69.9 | 78.5 | 86.9 |
| Without | BCE | RSM only | 51.2 | 75.7 | 82.9 | 88.6 |
| With | Equation (6) | Ours | | 85.8 | 89.5 | 91.5 |
| With | Equation (6) | TCM only | 60.9 | 82.3 | 87.5 | 90.9 |
| With | Equation (6) | RSM only | 64.3 | 84.8 | 88.9 | 91.4 |
| With | BCE | Ours | 65.3 | | | |
| With | BCE | TCM only | 58.5 | 81.8 | 87.6 | 91.5 |
| With | BCE | RSM only | 64.2 | 85.4 | 89.6 | 92.4 |
Top-1 accuracy (%) for different values of the weight that combines the TCM and RSM log-likelihoods; a weight of 0 reproduces the RSM-only results and a weight of 1.0 the TCM-only results. The bold indicates the best results.
| Weight | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Without types | 51.8 | 53.3 | 53.9 | | | 54.4 | 54.1 | 53.6 | 53.0 | 52.3 | 49.6 |
| With types | 64.3 | 65.2 | 65.6 | 65.7 | | | 65.6 | 65.1 | 64.7 | 64.4 | 60.9 |
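The table above varies a value between 0 and 1.0, and its end points coincide with the RSM-only and TCM-only Top-1 results reported earlier. One combination rule consistent with that behaviour is a convex mix of the two log-likelihoods. The sketch below assumes that form; the exact formula is not given in this record, and the function name and toy scores are hypothetical.

```python
def rank_reactants(candidates, weight):
    """Rank candidates by weight * TCM + (1 - weight) * RSM log-likelihood.

    candidates: list of (reactant, tcm_loglik, rsm_loglik) tuples.
    weight: 0.0 uses RSM only, 1.0 uses TCM only (assumed convention).
    """
    scored = [
        (weight * tcm + (1.0 - weight) * rsm, reactant)
        for reactant, tcm, rsm in candidates
    ]
    # Higher combined log-likelihood ranks first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [reactant for _, reactant in scored]


# Toy log-likelihoods for three candidate reactant sets.
candidates = [
    ("reactant_a", -0.4, -2.0),
    ("reactant_b", -0.9, -0.3),
    ("reactant_c", -1.5, -1.6),
]
tcm_only = rank_reactants(candidates, 1.0)
rsm_only = rank_reactants(candidates, 0.0)
mixed = rank_reactants(candidates, 0.4)
print(tcm_only, rsm_only, mixed)
```

In the toy data the two models disagree on the best candidate, so the ranking changes with the weight; an intermediate setting draws on both signals, which matches the table's pattern of intermediate values beating both end points.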
Figure 4. Our method successfully finds valid templates for two test reactions that are not covered by training data. The matched product subgraphs are highlighted in pink for better visualization.
Distribution of 10 recognized reaction types.
| Type | Reaction Type Name | Number of Reactions |
|---|---|---|
| 1 | Heteroatom alkylation and arylation | 15,204 |
| 2 | Acylation and related processes | 11,972 |
| 3 | C-C bond formation | 5667 |
| 4 | Heterocycle formation | 909 |
| 5 | Protections | 672 |
| 6 | Deprotections | 8405 |
| 7 | Reductions | 4642 |
| 8 | Oxidations | 822 |
| 9 | Functional group interconversion | 1858 |
| 10 | Functional group addition (FGA) | 231 |
Statistics of templates and reactions. # is short for “number”.
| Statistic | Value |
|---|---|
| # total templates | 10,386 |
| # unique product subgraphs | 7766 |
| # unique reactant subgraphs | 4391 |
| Test reactions coverage by training templates | 94.08% |
| Average # contained product subgraphs per mol | 35.19 |
| Average # applicable product subgraphs per mol | 2.02 |
| Average # templates per reaction | 2.23 |
| Average # reactants per reaction | 1.71 |
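The coverage statistic can be read as the share of test reactions for which at least one extracted template also occurs among the training templates. A minimal sketch under that assumption follows; the function name and the toy template identifiers are hypothetical, and real templates would be canonicalized strings rather than short labels.

```python
def template_coverage(train_templates, test_reaction_templates):
    """Percentage of test reactions with at least one template seen in training.

    train_templates: iterable of template strings from the training set.
    test_reaction_templates: one list of template strings per test reaction
        (a reaction can have more than one extracted template).
    """
    seen = set(train_templates)
    covered = sum(
        1
        for templates in test_reaction_templates
        if any(template in seen for template in templates)
    )
    return 100.0 * covered / len(test_reaction_templates)


train = ["T1", "T2", "T3"]
test = [["T1"], ["T9", "T2"], ["T7"], ["T3", "T8"]]
coverage = template_coverage(train, test)
print(coverage)
```

Reactions outside this coverage are the ones a pure template-selection method cannot solve, and they are where composing new templates can help.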
Bond features used in our method. These features are one-hot encoded.
| Feature | Description | Size |
|---|---|---|
| Bond type | Single, double, triple, or aromatic. | 4 |
| Conjugation | Whether the bond is conjugated. | 1 |
| In ring | Whether the bond is part of a ring. | 1 |
| Stereo | None, any, E/Z or cis/trans. | 6 |
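The sizes in this table add up to a 12-dimensional bond vector. The sketch below shows one way to build it in plain Python. The category names and their order are assumptions (the six stereo labels follow the none/any/E/Z/cis/trans reading of the table), and the helper names are hypothetical.

```python
BOND_TYPES = ["single", "double", "triple", "aromatic"]   # size 4
STEREO_TYPES = ["none", "any", "Z", "E", "cis", "trans"]  # size 6 (assumed order)


def one_hot(value, choices):
    """Return a one-hot list for value; raise if the value is unknown."""
    if value not in choices:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == choice else 0.0 for choice in choices]


def bond_features(bond_type, conjugated, in_ring, stereo):
    """Concatenate bond type, conjugation flag, ring flag and stereo."""
    return (
        one_hot(bond_type, BOND_TYPES)
        + [float(conjugated)]
        + [float(in_ring)]
        + one_hot(stereo, STEREO_TYPES)
    )


# An aromatic ring bond: conjugated, in a ring, no stereo annotation.
bond_vector = bond_features("aromatic", True, True, "none")
print(len(bond_vector), bond_vector)
```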
Atom features used in our method. All features are one-hot encoded, except the atomic mass, which is a real number scaled to the same order of magnitude as the other features. The reaction type feature applies only in the type-conditional setting.
| Feature | Description | Size |
|---|---|---|
| Atom type | Type of atom (e.g., C, N, O), by atomic number. | 17 |
| # Bonds | Number of bonds the atom is involved in. | 6 |
| Formal charge | Integer electronic charge assigned to atom. | 5 |
| Chirality | Unspecified, tetrahedral CW/CCW, or other. | 4 |
| # Hs | Number of bonded hydrogen atoms. | 5 |
| Hybridization | sp, sp2, sp3, sp3d, or sp3d2. | 5 |
| Aromaticity | Whether this atom is part of an aromatic system. | 1 |
| Atomic mass | Mass of the atom, divided by 100. | 1 |
| Reaction type | The specified reaction type if it exists. | 10 |
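Summing the sizes in this table gives the atom feature dimension for each setting. The sketch below assumes the features are simply concatenated in table order, which this record does not state, and uses hypothetical key names.

```python
# Sizes copied from the table; the key names are illustrative.
ATOM_FEATURE_SIZES = {
    "atom_type": 17,
    "num_bonds": 6,
    "formal_charge": 5,
    "chirality": 4,
    "num_hs": 5,
    "hybridization": 5,
    "aromaticity": 1,
    "atomic_mass": 1,
    "reaction_type": 10,
}


def feature_slices(sizes, with_reaction_type):
    """Return ({name: (start, stop)}, total_dim) for a concatenated vector."""
    slices = {}
    start = 0
    for name, size in sizes.items():
        # The reaction type block is only present in the type-conditional setting.
        if name == "reaction_type" and not with_reaction_type:
            continue
        slices[name] = (start, start + size)
        start += size
    return slices, start


slices_without, dim_without = feature_slices(ATOM_FEATURE_SIZES, with_reaction_type=False)
slices_with, dim_with = feature_slices(ATOM_FEATURE_SIZES, with_reaction_type=True)

# The mass entry is divided by 100, e.g. carbon: 12.011 / 100.
carbon_mass_feature = 12.011 / 100
print(dim_without, dim_with, carbon_mass_feature)
```

Under this layout each atom has 44 features without reaction types and 54 with them, and dividing the mass by 100 keeps that single real-valued entry close to the 0 to 1 range of the one-hot entries.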