Xinyi Wu, Yun Zhang, Jiahui Yu, Chengyun Zhang, Haoran Qiao, Yejian Wu, Xinqiao Wang, Zhipeng Wu, Hongliang Duan.
Abstract
To improve the performance of data-driven reaction prediction models, we propose a strategy that predicts reaction products from the available data while increasing the sample size through fake (virtual) data augmentation. Fake reaction datasets were created by replacing some functional groups in the compounds of known reactions, and these fake data were combined with the raw data to train the models. This approach was tested on five different coupling reactions, and the results show improved model predictivity over other relevant techniques. Furthermore, we evaluated the method with different models, confirming the generality of virtual data augmentation. In summary, virtual data augmentation can be used as an effective measure to address insufficient data and can significantly improve the performance of reaction prediction.
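The functional-group replacement described in the abstract can be illustrated with a minimal, string-level sketch. It is not the authors' implementation: it assumes the replaced group is a single halogen (Cl, Br, or I) on one reactant, that this halogen does not appear in the product, and that swapping it for another halogen yields a plausible virtual reaction. The function names (`tokenize`, `swap_halogen`, `augment_reaction`) and the `halide_index` parameter are hypothetical, and the tokenization regex is the pattern commonly used for SMILES-based sequence models.

```python
import re

# Regex commonly used to split SMILES into tokens for sequence models.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

# Assumption: only these halogens are treated as interchangeable.
HALOGENS = ("Cl", "Br", "I")


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is dropped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"cannot tokenize SMILES: {smiles!r}")
    return tokens


def swap_halogen(reactant):
    """Return copies of `reactant` with its single halogen swapped.

    Returns an empty list unless exactly one Cl/Br/I atom is present,
    so ambiguous molecules (e.g. dichlorosilanes) are left untouched.
    """
    tokens = tokenize(reactant)
    present = [h for h in HALOGENS if h in tokens]
    if len(present) != 1 or tokens.count(present[0]) != 1:
        return []
    old = present[0]
    return [
        "".join(new if t == old else t for t in tokens)
        for new in HALOGENS
        if new != old
    ]


def augment_reaction(rxn, halide_index):
    """Build virtual reactions by swapping the halogen of one reactant.

    `rxn` has the form 'reactants>>product'; the product is kept as-is
    because the halogen is assumed to be the leaving group.
    """
    lhs, product = rxn.split(">>")
    parts = lhs.split(".")
    virtual = []
    for variant in swap_halogen(parts[halide_index]):
        new_parts = parts[:halide_index] + [variant] + parts[halide_index + 1:]
        virtual.append(".".join(new_parts) + ">>" + product)
    return virtual


if __name__ == "__main__":
    # Hiyama example taken from the Figure 4(b) caption of this record.
    rxn = ("CC[Si](Cl)(Cl)c1ccc(C)cc1.N#Cc1ccc(Br)cc1.[F-].[K+]"
           ">>Cc1ccc(-c2ccc(C#N)cc2)cc1")
    for line in augment_reaction(rxn, halide_index=1):
        print(line)
```

A production pipeline would more likely use a cheminformatics toolkit to match and replace substructures and to validate the resulting molecules; the token-level swap here only shows the idea of deriving extra training reactions from one recorded reaction.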
Year: 2022 PMID: 36224331 PMCID: PMC9556613 DOI: 10.1038/s41598-022-21524-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Schematic illustration of the virtual data augmentation method.
Figure 2. Schematic diagram of virtual data augmentation. (a) The single augmentation method for the Buchwald–Hartwig and Chan–Lam coupling reactions. (b) A representative example of the simultaneous virtual data augmentation method for the Hiyama coupling reaction.
Statistical summary of the five coupling reactions before and after virtual data augmentation.
| Name | Raw dataset | Virtual dataset |
|---|---|---|
| Hiyama | 2067 | 19011 |
| Buchwald–Hartwig | 4419 | 7640 |
| Chan–Lam | 5276 | 9170 |
| Kumada | 9657 | 54062 |
| Suzuki | 92399 | 424194 |
Figure 3. UMAP plots of molecules and TMAP plot of reaction fingerprints (rxnfp) from raw data and virtual augmented data. (a) UMAP plot of the Hiyama coupling reaction before and after virtual data augmentation. (b) UMAP plot of the Chan–Lam coupling reaction before and after virtual data augmentation. (c) TMAP of the five classic coupling reactions before and after virtual data augmentation.
Average accuracy (%) of several coupling reactions on raw data and augmented data, based on the transformer-baseline model.
| Dataset | Hiyama | Buchwald–Hartwig | Chan–Lam | Kumada | Suzuki |
|---|---|---|---|---|---|
| Raw data | 24.96 | 30.09 | 60.99 | 78.52 | 94.33 |
| Augmented data | 46.56 | 46.82 | 66.07 | 83.66 | 96.48 |
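To make the comparison in the table above easier to read, the following short sketch computes the gain in average accuracy (augmented minus raw, in percentage points) for each reaction. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# Average accuracies (%) copied from the table above (transformer-baseline model).
REACTIONS = ["Hiyama", "Buchwald-Hartwig", "Chan-Lam", "Kumada", "Suzuki"]
RAW = [24.96, 30.09, 60.99, 78.52, 94.33]
AUGMENTED = [46.56, 46.82, 66.07, 83.66, 96.48]


def accuracy_gains(raw, augmented):
    """Return augmented-minus-raw differences in percentage points."""
    return [round(a - r, 2) for r, a in zip(raw, augmented)]


gains = dict(zip(REACTIONS, accuracy_gains(RAW, AUGMENTED)))
largest = max(gains, key=gains.get)

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain:.2f} percentage points")
    print(f"Largest gain: {largest}")
```

In these figures the gain is largest for the reactions with the lowest raw-data accuracy (Hiyama and Buchwald–Hartwig) and smallest for Suzuki, which already exceeds 94% on raw data.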
Accuracy (%) of several coupling reactions on raw data and augmented data, based on the transformer-baseline model and the transformer-transfer model.
| Model | Dataset | Hiyama | Buchwald–Hartwig | Chan–Lam | Kumada | Suzuki |
|---|---|---|---|---|---|---|
| Transformer-baseline model | Raw data | 23.67 | 41.63 | 64.71 | 78.99 | 95.05 |
| Transformer-baseline model | Augmented data | 49.47 | 49.32 | 68.50 | 85.40 | 97.79 |
| Transformer-transfer model | Raw data | 60.87 | 94.57 | 96.39 | 96.48 | 97.84 |
| Transformer-transfer model | Augmented data | 69.57 | 95.93 | 96.77 | 97.00 | 98.63 |
Reaction prediction accuracy (%) for the Chan–Lam reaction before and after augmentation under different models.
| Dataset | RNN | Molecular transformer | Baseline transformer |
|---|---|---|---|
| Raw data | 54.08 | 68.88 | 64.71 |
| Augmented data | 59.38 | 71.92 | 68.50 |
Figure 4. Visualization of attention weights before and after augmentation of the Hiyama reaction. The horizontal axis contains the two reactants and the reagents, and the vertical axis is the product. (a) SMILES: CC(=O)c1ccc(I)cc1.F[Si](c1ccccc1)(c1ccccc1)c1ccccc1.[F].[K+]>>CC(=O)c1ccc(c2ccccc2)cc1. (b) SMILES: CC[Si](Cl)(Cl)c1ccc(C)cc1.N#Cc1ccc(Br)cc1.[F-].[K+]>>Cc1ccc(-c2ccc(C#N)cc2)cc1.
Comparison of accuracy (%) for different augmented reactants.
| Reaction type | Augmented halogen | Augmented silicon (or boron) | Simultaneously augmented |
|---|---|---|---|
| Hiyama | 44.44 | 48.31 | 49.47 |
| Kumada | 80.85 | 84.68 | 85.40 |
| Suzuki | 96.82 | 95.26 | 97.79 |
The number of error types in reaction prediction for the five coupling reactions.
| Error type | Hiyama lift rate (%) | Suzuki lift rate (%) | Buchwald–Hartwig lift rate (%) | Chan–Lam lift rate (%) | Kumada lift rate (%) |
|---|---|---|---|---|---|
| Chirality error | 1.00 | 5.50 | 1.63 | 1.71 | 4.05 |
| SMILES error | 11.65 | 33.50 | 33.06 | 28.57 | 28.38 |
| Group isomerism error | 10.67 | 9.50 | 15.51 | 10.86 | 13.51 |
| Number of carbon error | 16.50 | 11.00 | 11.43 | 10.29 | 16.22 |
| Other errors | 60.19 | 40.50 | 38.37 | 48.57 | 37.84 |
Figure 5. Typical error analysis of Hiyama coupling reactions. (a) Chirality errors, (b) SMILES errors, (c) number-of-atom errors, (d) functional group isomerism errors.
Figure 6. Typical error analysis of Suzuki coupling reactions. (a) Chirality errors, (b) SMILES errors, (c) number-of-atom errors, (d) functional group isomerism errors.