Renren Bai, Chengyun Zhang, Ling Wang, Chuansheng Yao, Jiamin Ge, Hongliang Duan.
Abstract
Effective computational prediction of syntheses for complex or novel molecules can greatly assist organic and medicinal chemistry. Retrosynthetic analysis is a method chemists use to predict synthetic routes to target compounds: the target is converted step by step into simpler compounds until commercially available starting materials are reached. However, predictions based on small chemical datasets often have low accuracy because there are too few samples. To address this limitation, we introduced transfer learning to retrosynthetic analysis. Transfer learning is a machine learning approach that trains a model on one task and then applies the model to a related but different task, which helps when data are scarce. The large, unclassified USPTO-380K dataset was first used to pretrain the models so that they acquired basic chemical knowledge, such as the chirality of compounds, reaction types and the SMILES representation of chemical structures. Both USPTO-380K and USPTO-50K (the latter also used by Liu et al.) were originally derived from Lowe's patent mining work. Liu et al. further processed these data and divided the reaction examples into 10 categories, which we did not do. The acquired knowledge was then transferred to the small, classified USPTO-50K dataset for continued training and retrosynthetic reaction testing, and the accuracy of the pretrained models was compared with that of models without pretraining. The transfer learning concept was combined with either the sequence-to-sequence (seq2seq) or the Transformer model for prediction and verification. Both models are based on an encoder-decoder architecture and were originally developed for language translation tasks. The two algorithms translate the SMILES representations of reactants into the SMILES representations of products, also taking into account other relevant chemical information (chirality, reaction types and conditions). The results demonstrated that pretraining significantly improved the accuracy of retrosynthetic analysis by both the seq2seq and Transformer models. The top-1 accuracy (the rate at which the first prediction matches the actual result) increased from 52.4% for the Transformer model to 60.7% for the Transformer-transfer-learning model, a gain of 8.3 percentage points. The model's top-20 accuracy (the rate at which the top 20 predictions contain the actual result) was 88.9%, which represents fairly good prediction performance in retrosynthetic analysis. In summary, this study shows that transferring learning between models working with different chemical datasets is feasible. Introducing transfer learning significantly improved prediction accuracy and was especially helpful for reaction prediction and retrosynthetic analysis based on small datasets.
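As a rough illustration of how a SMILES string can be presented to a translation-style model, the sketch below splits a SMILES string into tokens and optionally prepends a reaction-class token. The regular expression, the `<RX_n>` token format and the function names `tokenize_smiles` and `encode_source` are assumptions made for this example; they are not taken from the authors' preprocessing code.

```python
import re

# Simplified SMILES token pattern (an assumption, not the authors' tokenizer).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms, e.g. [C@@H] or [NH3+]
    r"|Br|Cl"                # two-letter atoms of the organic subset
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|[A-Za-z]"             # single-letter atoms (aromatic atoms are lowercase)
    r"|\d"                   # single-digit ring-closure labels
    r"|[=#\-+\\/().:~*$@]"   # bonds, branches, the dot separator, etc.
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is left over."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def encode_source(smiles, reaction_class=None):
    """Return a space-separated token sequence, optionally tagged with a class.

    The <RX_n> tag is a hypothetical way to pass the reaction type to the model.
    """
    tokens = tokenize_smiles(smiles)
    if reaction_class is not None:
        tokens = [f"<RX_{reaction_class}>"] + tokens
    return " ".join(tokens)


if __name__ == "__main__":
    # Acetyl chloride, tagged with a hypothetical class 2.
    print(encode_source("CC(=O)Cl", reaction_class=2))
    # Alanine with a chiral centre kept as a single bracket-atom token.
    print(tokenize_smiles("C[C@@H](N)C(=O)O"))
```

Keeping bracket atoms such as `[C@@H]` as single tokens is one way to preserve chirality information in the input sequence.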
Keywords: SMILES structure; artificial intelligence; products; reactants; retrosynthesis; seq2seq; transfer learning; transformer
Year: 2020 PMID: 32438572 PMCID: PMC7287934 DOI: 10.3390/molecules25102357
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. An example of a retrosynthetic prediction. The target molecule is shown to the left of the arrow and the predicted reactants are displayed to the right. The SMILES code for each compound is also indicated.
Figure 2. Design concept and process of transfer-learning-aided retrosynthetic analysis.
Distribution and description of the major reaction classes within the processed reaction dataset [18].
| Class | Description | No. of Examples | Percentage of Dataset (%) |
|---|---|---|---|
| 1 | heteroatom alkylation and arylation | 15122 | 30.3 |
| 2 | acylation and related processes | 11913 | 23.8 |
| 3 | C−C bond formation | 5639 | 11.3 |
| 4 | heterocycle formation | 900 | 1.8 |
| 5 | protection | 650 | 1.3 |
| 6 | deprotection | 8353 | 16.5 |
| 7 | reduction | 4585 | 9.2 |
| 8 | oxidation | 814 | 1.6 |
| 9 | functional group interconversion (FGI) | 1834 | 3.7 |
| 10 | functional group addition (FGA) | 227 | 0.5 |
Figure 3. The top-1 accuracies of the seq2seq/seq2seq-transfer-learning and Transformer/Transformer-transfer-learning models as a function of time.
Comparison of the top-N accuracies (a) of the seq2seq and seq2seq-transfer-learning models.
| Model | Top-1 | Top-2 | Top-3 | Top-5 | Top-10 | Top-20 |
|---|---|---|---|---|---|---|
| seq2seq (b) | 37.4% | -- | 52.4% | 57.0% | 61.7% | 65.9% |
| seq2seq-transfer-learning | 44.6% | 54.8% | 59.4% | 64.1% | 68.8% | 72.1% |
(a) Data are for the test set containing 5004 reactions. (b) The testing results are from the seq2seq model of Liu et al. [8].
Comparison of the top-N accuracies (a) of the Transformer and Transformer-transfer-learning models.
| Model | Top-1 | Top-2 | Top-3 | Top-5 | Top-10 | Top-20 |
|---|---|---|---|---|---|---|
| Transformer | 52.4% | 63.3% | 67.1% | 70.8% | 73.2% | 74.3% |
| Transformer-transfer-learning | 60.7% | 74.0% | 79.4% | 83.5% | 87.6% | 88.9% |
(a) Data are for the test set containing 5004 reactions.
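The top-N accuracy reported in these tables can be sketched as follows: a test reaction counts as a hit if the reference answer appears among the model's first N ranked predictions. This is a minimal sketch that assumes predictions and references are already canonicalized SMILES strings, so that exact string comparison is meaningful. The function name `top_n_accuracy` and the toy data are hypothetical and do not come from the study.

```python
def top_n_accuracy(ranked_predictions, references, n):
    """Percentage of test cases whose reference is among the first n predictions.

    ranked_predictions: one list of candidate strings per test case, best first.
    references: the expected string for each test case.
    """
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one test case is required")
    hits = sum(
        reference in candidates[:n]
        for candidates, reference in zip(ranked_predictions, references)
    )
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    # Toy data for illustration only (not taken from the test set).
    predictions = [
        ["CCO", "CCN"],        # reference is ranked first
        ["CCN", "CCC"],        # reference is ranked second
        ["c1ccccc1", "CCO"],   # reference is not predicted at all
    ]
    references = ["CCO", "CCC", "CO"]
    for n in (1, 2):
        print(f"top-{n}: {top_n_accuracy(predictions, references, n):.1f}%")
```

By construction the value cannot decrease as N grows, which is consistent with the monotonic rows in the tables above.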
True positives (TPs), true negatives (TNs), false positives (FPs) and false negatives (FNs) in top-1 predictions (a) by the Transformer-transfer-learning model.
| | Positive (exp.) | Negative (exp.) |
|---|---|---|
| Positive (pred.) | 2021 | 776 |
| Negative (pred.) | 1191 | 1016 |
| Total | 5004 | |
(a) Data are for the test set (N = 5004 reactions).
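The four counts in this table can be checked against the reported top-1 accuracy. The sketch below assumes that the first data row holds the predicted positives, so that 2021 and 1016 are the TP and TN counts and 776 and 1191 are the FP and FN counts. Under that reading, the overall accuracy reproduces the 60.7% top-1 figure. The function name `summarize` is hypothetical, and the precision and recall values are derived here for illustration rather than quoted from the source.

```python
def summarize(tp, fp, fn, tn):
    """Derive basic metrics (in percent) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "total": total,
        "accuracy": 100.0 * (tp + tn) / total,
        # Precision and recall depend on the assumed row/column reading.
        "precision": 100.0 * tp / (tp + fp),
        "recall": 100.0 * tp / (tp + fn),
    }


if __name__ == "__main__":
    # Counts copied from the table; the TP/FP/FN/TN assignment is an assumption.
    metrics = summarize(tp=2021, fp=776, fn=1191, tn=1016)
    print(metrics["total"])               # 5004
    print(round(metrics["accuracy"], 1))  # 60.7
    print(round(metrics["precision"], 1), round(metrics["recall"], 1))
```

The accuracy only uses the diagonal cells, so it does not change if the roles of FP and FN are swapped.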
Figure 4. The top-1 and top-10 accuracies of the seq2seq/seq2seq-transfer-learning and Transformer/Transformer-transfer-learning models by reaction class. 1. Heteroatom alkylation and arylation. 2. Acylation and related processes. 3. C−C bond formation. 4. Heterocycle formation. 5. Protections. 6. Deprotections. 7. Reductions. 8. Oxidations. 9. Functional group interconversion (FGI). 10. Functional group addition (FGA).
Comparisons and representative examples (selected from the test set) of the Transformer and Transformer-transfer-learning models in the retrosynthetic prediction of heterocycle formation reactions.
[Table of chemical structure drawings, entries 1-5; columns: Target Compound, Retrosynthetic Analysis (Transformer Model), Retrosynthetic Analysis (Transformer-Transfer-Learning Model). The structures are not reproduced here.]
Comparisons and representative examples (selected from the test set) of the Transformer and Transformer-transfer-learning models in retrosynthetic prediction with nonaromatic heterocycle structures.
[Table of chemical structure drawings, entries 1-7; columns: Target Compound, Retrosynthetic Analysis (Transformer Model), Retrosynthetic Analysis (Transformer-Transfer-Learning Model). The structures are not reproduced here.]
Comparisons and representative examples (selected from the test set) of the Transformer and Transformer-transfer-learning models in retrosynthetic prediction with chiral carbon atoms.
[Table of chemical structure drawings, entries 1-7; columns: Target Compound, Retrosynthetic Analysis (Transformer Model), Retrosynthetic Analysis (Transformer-Transfer-Learning Model). The structures are not reproduced here.]
Comparisons and representative examples (selected from the test set) of the Transformer and Transformer-transfer-learning models in retrosynthetic prediction with tert-butyl moieties.
[Table of chemical structure drawings, entries 1-6; columns: Target Compound, Retrosynthetic Analysis (Transformer Model), Retrosynthetic Analysis (Transformer-Transfer-Learning Model). The structures are not reproduced here; the only cell text that survives is "SMILES code error", recorded for entries 1 and 2.]