Lev Krasnov, Ivan Khokhlov, Maxim V Fedorov, Sergey Sosnin.
Abstract
We developed a Transformer-based neural approach to translate between SMILES and IUPAC chemical notations: Struct2IUPAC and IUPAC2Struct. The overall performance of our models is comparable to that of rule-based solutions. We showed that the accuracy and speed of computation, as well as the robustness of the model, allow it to be used in production. Our showcase demonstrates that a neural-based solution can facilitate rapid development while maintaining the required level of accuracy. We believe that our findings will inspire other developers to reduce development costs by replacing complex rule-based solutions with neural-based ones.
Year: 2021 PMID: 34285269 PMCID: PMC8292511 DOI: 10.1038/s41598-021-94082-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Demonstration of SMILES tokenization (top) and IUPAC name tokenization (bottom).
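The SMILES half of the tokenization in Figure 1 can be approximated with a short regular expression. The sketch below is an assumption for illustration, not the authors' tokenizer: the pattern and the function name `tokenize_smiles` are hypothetical. It treats bracket atoms, two-letter halogens, and two-digit ring closures as single tokens and splits everything else character by character.

```python
import re
from typing import List

# Hypothetical pattern (an assumption, not taken from the paper):
# bracket atoms, Br/Cl, %NN ring closures, then single characters.
_SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"           # bracket atom, e.g. [nH]
    r"|Br|Cl"               # two-letter halogens outside brackets
    r"|%\d{2}"              # two-digit ring-closure label
    r"|[A-Za-z]"            # single-letter atom (aliphatic or aromatic)
    r"|\d"                  # single-digit ring-closure label
    r"|[=#\-+\\/().:~@*$]"  # bonds, branches, and other symbols
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; fail if any character is unrecognized."""
    tokens = _SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify full coverage.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # One of the guanine tautomers listed later in this record.
    print(tokenize_smiles("N=c1nc(O)c2[nH]cnc2[nH]1"))
```

Keeping `[nH]` as one token preserves the explicit hydrogen that distinguishes the tautomers discussed later. IUPAC names need a different, vocabulary-based split that this sketch does not attempt.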
Figure 2. A scheme of the Struct2IUPAC Transformer. Adapted from [6].
Figure 3. A scheme of the verification step.
Figure 5. The distribution of SMILES and IUPAC name lengths on the test set.
Figure 4. The dependence of model accuracy on SMILES length.
Accuracy (%) of models on the test set of 100k molecules with different beam sizes.
| Struct2IUPAC, beam 1 | Struct2IUPAC, beam 3 | Struct2IUPAC, beam 5 | IUPAC2Struct, beam 1 | IUPAC2Struct, beam 3 | IUPAC2Struct, beam 5 | OPSIN |
|---|---|---|---|---|---|---|
| 96.1 | 98.2 | 98.9 | 96.6 | 98.6 | 99.1 | 99.4 |
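One plausible way to read accuracy as a function of beam size is as a top-k measure: a molecule counts as correct when the reference appears among the first k beam candidates. The sketch below assumes that reading and uses exact string matching. Both are assumptions rather than the evaluation protocol of the paper, and the function name `beam_accuracy` and the toy data are hypothetical.

```python
from typing import Sequence


def beam_accuracy(
    references: Sequence[str],
    candidates: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Percentage of items whose reference is among the top-k beam candidates.

    Assumption: exact string match decides correctness. A real evaluation
    might instead compare canonical structures.
    """
    if len(references) != len(candidates):
        raise ValueError("references and candidates must have equal length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(ref in cands[:k] for ref, cands in zip(references, candidates))
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    # Toy placeholders, not data from the paper.
    refs = ["a", "b", "c", "d"]
    beams = [["a", "x"], ["y", "b"], ["z", "w"], ["d"]]
    for k in (1, 2):
        print(k, beam_accuracy(refs, beams, k))
```

Under this definition accuracy can only stay the same or rise as the beam grows, which matches the monotonic trend in the table. String matching would undercount cases where the model emits a different but valid name for the same structure.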
Figure 6. The correlation between mean time and output sequence length.
Figure 7. An example of a molecule with four correctly generated IUPAC names.
Figure 8. The distribution of the number of name variations generated by the Transformer's beam search.
Performance of the Struct2IUPAC model on different validation tasks.
| Task | Beam 1 (%) | Beam 3 (%) | Beam 5 (%) |
|---|---|---|---|
| Kekule representation | 95.6 | 97.4 | 97.7 |
| Augmented SMILES | 27.49 | 34.00 | 37.16 |
| Stereo-enriched | 44.11 | 61.24 | 66.52 |
The generated IUPAC names for various tautomeric forms of guanine and uracil.
| Molecule | SMILES | IUPAC names |
|---|---|---|
| Guanine | N=c1nc(O)c2[nH]cnc2[nH]1 | 2-Imino-3,7-dihydropurin-6-ol; 2-imino-1,7-dihydropurin-6-ol |
| | Nc1nc(=O)c2nc[nH]c2[nH]1 | 2-Amino-3,9-dihydropurin-6-one |
| | Nc1nc(=O)c2[nH]cnc2[nH]1 | 2-Amino-3,7-dihydropurin-6-one; 2-amino-6,7-dihydro-3H-purin-6-one; 2-amino-3,6-dihydropurin-6-one; 2-amino-7H-purin-6-one |
| | Nc1[nH]c(=O)c2[nH]cnc2n1 | 2-Amino-1,7-dihydropurin-6-one; 2-amino-1,6-dihydropurin-6-one |
| | Nc1nc(O)c2[nH]cnc2n1 | 2-Amino-7H-purin-6-ol; 2-aminopurin-6-ol |
| Uracil | O=c1cc[nH]c(=O)[nH]1 | 1H-Pyrimidine-2,4-dione |
| | Oc1ccnc(O)n1 | Pyrimidine-2,4-diol |
| | O=c1ccnc(O)[nH]1 | 2-Hydroxy-1H-pyrimidin-6-one; 2-hydroxypyrimidin-6-one |
| | Oc1cc[nH]c(=O)n1 | 4-Hydroxy-1H-pyrimidin-2-one; 4-hydroxypyrimidin-2-one |
Figure 9. Two examples of challenging molecules for which the Transformer generates correct names.