Kohulan Rajan, Achim Zielesny, Christoph Steinbeck.
Abstract
Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. The International Union of Pure and Applied Chemistry (IUPAC) established a universally accepted, rule-based naming scheme for chemistry. Because this ruleset is complex, assigning a correct chemical name remains challenging for humans, and only a few rule-based cheminformatics toolkits support the task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach that generates the IUPAC name for a given molecule from its SMILES string and also performs the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both directions, the system achieves an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Even for incorrect predictions, the true and predicted compounds are remarkably similar.
Keywords: Attention mechanism; Chemical language; Deep neural network; DeepSMILES; IUPAC names; Neural machine translation; Recurrent neural network; SELFIES; SMILES
Year: 2021 PMID: 33906675 PMCID: PMC8077691 DOI: 10.1186/s13321-021-00512-4
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 SMILES, DeepSMILES and SELFIES split into tokens, which are separated by a space character
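
The splitting described in the caption can be illustrated for SELFIES, where every symbol is enclosed in square brackets. The following is a minimal sketch, not the authors' preprocessing code: the function name `tokenize_selfies` and the regular expression are assumptions made for illustration, and the example string is the first original SELFIES entry from the SELFIES table further below.

```python
import re

# Each SELFIES symbol is a bracketed unit such as [C] or [Branch1_2].
SELFIES_TOKEN = re.compile(r"\[[^\[\]]*\]")


def tokenize_selfies(selfies: str) -> str:
    """Return the SELFIES symbols joined by single space characters."""
    tokens = SELFIES_TOKEN.findall(selfies)
    # Reject input that is not made up purely of bracketed symbols.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a plain SELFIES string: {selfies!r}")
    return " ".join(tokens)


# Example taken from the SELFIES table in this record.
example = "[I][C][C][Branch1_2][Branch1_3][=C][N][C][Expl=Ring1][Branch1_1][C][C]"
spaced = tokenize_selfies(example)
print(spaced)
```

Splitting on whole bracketed symbols keeps multi-character units such as `[Branch1_2]` intact, which a character-level split would break apart.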
Fig. 2 STOUT architecture for SMILES-to-IUPAC-name translation
Number of unique SELFIES and IUPAC-name tokens for each dataset
| Dataset size | Number of SELFIES tokens | Number of IUPAC tokens |
|---|---|---|
| 30 Million | 27 | 1190 |
| 60 Million | 27 | 1190 |
Fig. 3 Average training time per epoch on different hardware (lower is better)
Fig. 4 Average training time per epoch for different datasets using TPU V3-8
BLEU scores analysis
| Training dataset size | 30 million | 60 million |
|---|---|---|
| Average BLEU score | 0.89 | 0.94 |
| Total number of strings with BLEU 1.0 | 52.48% | 66.65% |
| BLEU-1 | 0.92 | 0.95 |
| BLEU-2 | 0.90 | 0.94 |
| BLEU-3 | 0.88 | 0.93 |
| BLEU-4 | 0.86 | 0.92 |
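
As background for the BLEU values above, the following is a minimal sketch of a sentence-level BLEU score built from clipped n-gram precisions and a brevity penalty. It assumes pre-tokenized input, a single reference, uniform weights over 1- to 4-grams, and no smoothing. None of these settings is stated in this record, so the authors' scoring setup may differ, and the token lists are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU: one reference, uniform weights, no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        total = sum(cand.values())
        # Clipping: a candidate n-gram counts at most as often as it
        # occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical token sequences that differ in the last token only.
reference = ["a", "b", "c", "d", "e"]
candidate = ["a", "b", "c", "d", "x"]
print(bleu(reference, reference))
print(bleu(reference, candidate))
```

Because the score is a geometric mean of n-gram precisions, a single wrong token lowers every precision whose n-grams overlap it, so short strings are penalized more heavily than long ones for the same mistake.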
Tanimoto similarities
| Training dataset size | 30 million | 60 million |
|---|---|---|
| Invalid IUPAC names | 21.41% | 14.50% |
| Valid IUPAC names | 78.59% | 85.50% |
| Tanimoto 1.0 count on the total test dataset | 58.36% | 72.33% |
| Tanimoto 1.0 count on valid IUPAC names | 74.26% | 84.59% |
| Average Tanimoto (measured for total test dataset) | 0.75 | 0.83 |
| Average Tanimoto (measured for valid IUPAC names) | 0.96 | 0.98 |
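
The Tanimoto index compares two fingerprints as the ratio of shared bits to all bits set in either one. The sketch below represents a fingerprint as a set of on-bit positions; the fingerprint type used by the authors is not stated in this record, and the two example fingerprints are hypothetical. The second part checks that the "Tanimoto 1.0" rows of the table are mutually consistent, assuming that invalid names never count as exact matches.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) index of two fingerprints given as sets of on-bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(bits_a | bits_b)


# Hypothetical fingerprints: 4 shared bits out of 6 distinct bits.
fp_true = {1, 4, 7, 9, 12}
fp_pred = {1, 4, 7, 9, 15}
print(tanimoto(fp_true, fp_pred))

# Values copied from the table above (as fractions).
valid_share = {"30M": 0.7859, "60M": 0.8550}
exact_among_valid = {"30M": 0.7426, "60M": 0.8459}
reported_exact_on_total = {"30M": 0.5836, "60M": 0.7233}

# If invalid names never score 1.0, the share on the total test set is
# the valid share multiplied by the share of exact matches among valid names.
derived_exact_on_total = {
    key: valid_share[key] * exact_among_valid[key] for key in valid_share
}
for key, value in derived_exact_on_total.items():
    print(key, round(value, 4), reported_exact_on_total[key])
```

The derived shares agree with the reported ones to within rounding, which suggests the two "Tanimoto 1.0" rows differ only in their denominator.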
Failed IUPAC-name-to-SMILES translations
| IUPAC names | Reason for failure (OPSIN error messages) |
|---|---|
| 1. | Atoms are in an unphysical valency state. Element: C valency: 5 |
| 2. 2-[({[(3-ethoxypropyl)amino]({[2-(2-fluorophenyl)ethyl]amino})methylidene}amino)- | Unmatched opening bracket found |
| 3. 3'-(propan-2-yl)-2',3',4',5',6',7',8',8'a-octahydro-2'H-spiro[imidazole-4,1'-indolizin]-2-amine | The following being uninterpretable: 2',3',4',5',6',7',8',8' |
| 4. ({2',6'-difluoro-2',6'-dimethyl-[1,1'-biphenyl]-4-yl}methyl)(propyl)amine | Failed to assign all double bonds |
| 5. 1,4,5-trimethyl-1-[1,2-dimethylpropyl)-2-methyl-1-propylbicyclo[12.2.1]tetradeca-1,5-diene | Disagreement between lengths of bridges and alkyl chain length |
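
Some of these failures can be spotted before a name is passed to a parser. The following is a minimal sketch of a bracket-balance check applied to names 2, 4 and 5 from the table. It is not OPSIN's parser and does not reproduce its error messages; the function name `bracket_problem` and the message wording are assumptions made for illustration.

```python
CLOSERS = {")": "(", "]": "[", "}": "{"}


def bracket_problem(name: str):
    """Describe the first bracket problem in a name, or return None if balanced."""
    stack = []  # (opening bracket, position) pairs still waiting for a close
    for pos, ch in enumerate(name):
        if ch in "([{":
            stack.append((ch, pos))
        elif ch in CLOSERS:
            if not stack:
                return f"unmatched closing {ch!r} at {pos}"
            opener, _ = stack.pop()
            if opener != CLOSERS[ch]:
                return f"mismatched {opener!r} and {ch!r} at {pos}"
    if stack:
        opener, pos = stack[-1]
        return f"unmatched opening {opener!r} at {pos}"
    return None


# Names 2, 4 and 5 copied from the table above.
name_2 = "2-[({[(3-ethoxypropyl)amino]({[2-(2-fluorophenyl)ethyl]amino})methylidene}amino)-"
name_4 = "({2',6'-difluoro-2',6'-dimethyl-[1,1'-biphenyl]-4-yl}methyl)(propyl)amine"
name_5 = "1,4,5-trimethyl-1-[1,2-dimethylpropyl)-2-methyl-1-propylbicyclo[12.2.1]tetradeca-1,5-diene"

for name in (name_2, name_4, name_5):
    print(bracket_problem(name))
```

Name 4 passes this check even though OPSIN rejects it, which shows that bracket balance covers only the purely syntactic failures and not chemically inconsistent names.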
Predicted IUPAC name strings with a Tanimoto similarity index of 1.0 but a low BLEU score
| No. | Original IUPAC name | Predicted IUPAC name | BLEU score | Original name translated into SMILES using OPSIN | Predicted name translated into SMILES using OPSIN | Tanimoto similarity index |
|---|---|---|---|---|---|---|
| 1 | butyl3-methyl-12-methylidene-2,4,7,10-tetraoxatridecan-13-oate | butyl2-({2-[2-(1-methoxyethoxy)ethoxy]ethoxy}methyl)prop-2-enoate | 0.00 | O=C(OCCCC)C(=C)COCCOCCOC(OC)C | O=C(OCCCC)C(=C)COCCOCCOC(OC)C | 1.0 |
| 2 | ethyl3-[1,10-diiodo-9-(iodosulfanyl)-1,10-dithia-2,9-diazadecan-2-yl]propanoate | ethyl3-({6-[bis(iodosulfanyl)amino]hexyl}(iodosulfanyl)amino}propanoate | 0.10 | O=C(OCC)CCN(SI)CCCCCCN(SI)SI | O=C(OCC)CCN(SI)CCCCCCN(SI)SI | 1.0 |
| 3 | | | 0.24 | O=C(C=C)NC1(C(=O)N(C)C)CCCC1 | O=C(C=C)NC1(C(=O)N(C)C)CCCC1 | 1.0 |
| 4 | 6-[4-(2-cyanoethyl)phenyl]- | 2-{6-[4-(3-cyanopropyl)phenyl]hexanamido}- | 0.32 | N#CCCC1=CC=C(C=C1)CCCCCC(=O)NC(C(=O)NO)C | N#CCCCC1=CC=C(C=C1)CCCCCC(=O)NC(C(=O)NO)C | 1.0 |
| 5 | 12-aminochrysene-6-carboxylicacid | 6-aminotetraphene-11-carboxylicacid | 0.41 | O=C(O)C1=CC=2C=3C=CC=CC3C(N)=CC2C=4C=CC=CC41 | O=C(O)C1=CC=CC2=CC=3C(N)=CC=4C=CC=CC4C3C=C21 | 1.0 |
| 6 | 1,3-bis[(6-bromopyridin-2-yl)methyl]-1,3-diazinane | 2-bromo-6-({3-[(6-bromopyridin-2-yl)methyl]-1,3-diazinan-1-yl}methyl)pyridine | 0.50 | BrC=1N=C(C=CC1)CN2CN(CC3=NC(Br)=CC=C3)CCC2 | BrC=1N=C(C=CC1)CN2CN(CC3=NC(Br)=CC=C3)CCC2 | 1.0 |
| 7 | 2,3,7-trifluoro-5-methylocta-1,3,5-triene | 2,5-difluoro-3-methylocta-1,3,5-triene | 0.61 | FC(=C)C(F)=CC(=CC(F)C)C | FC(=C)C(=CC(F)=CCC)C | 1.0 |
| 8 | tert-butyl4-acetyl-2-[(acetyloxy)methyl]piperazine-1-carboxylate | tert-butyl2-[(acetyloxy)methyl]-4-acetylpiperazine-1-carboxylate | 0.72 | O=C(OC(C)(C)C)N1CCN(C(=O)C)CC1COC(=O)C | O=C(OC(C)(C)C)N1CCN(C(=O)C)CC1COC(=O)C | 1.0 |
| 9 | | | 0.83 | N#CC1=CC=CC(=C1)CNC(=O)C=2C=3N=CN=CC3N4C=CC=CC24 | N#CC1=CC=CC(=C1)CNC(=O)C=2C=3N=CN=CC3N4C=CC=CC24 | 1.0 |
| 10 | (5-benzylhexa-3,5-dien-2-ylidene)aminomethanesulfonate | (6-benzylhexa-3,5-dien-2-ylidene)aminomethanesulfonate | 0.92 | O=S(=O)([O-])CN=C(C=CC(=C)CC=1C=CC=CC1)C | O=S(=O)([O-])CN=C(C=CC=CCC=1C=CC=CC1)C | 1.0 |
Fig. 5 Chemical structures depicted with the CDK depiction generator for predictions with Tanimoto similarity 1.0 but low BLEU score
Predicted IUPAC name strings with a BLEU score of 1.0 but a low Tanimoto similarity index
| No. | Original IUPAC name | Predicted IUPAC name | BLEU score | Original name translated into SMILES using OPSIN | Predicted name translated into SMILES using OPSIN | Tanimoto similarity index |
|---|---|---|---|---|---|---|
| 1 | 4-[(4-amino-2,3,6-trimethylphenyl)methyl]-2,3,5-trimethylaniline | 4-[(4-amino-2,3,5-trimethylphenyl)methyl]-2,3,6-trimethylaniline | 1.0 | NC=1C=C(C(=C(C1C)C)CC=2C(=CC(N)=C(C2C)C)C)C | NC1=C(C=C(C(=C1C)C)CC2=CC(=C(N)C(=C2C)C)C)C | 0.97 |
| 2 | 3-[(3-amino-2,6-diethylphenyl)methyl]-2,4-diethylaniline | 3-[(3-amino-2,4-diethylphenyl)methyl]-2,6-diethylaniline | 1.0 | NC1=CC=C(C(=C1CC)CC=2C(=CC=C(N)C2CC)CC)CC | NC=1C(=CC=C(C1CC)CC2=CC=C(C(N)=C2CC)CC)CC | 0.92 |
| 3 | 2-{4-[(dimethylamino)methyl]-6-[(2,6-dimethylphenoxy)methyl]-6-hydroxycyclohexa-2,4-dien-1-yl}acetonitrile | 2-{4-[(2,6-dimethylphenoxy)methyl]-6-[(dimethylamino)methyl]-6-hydroxycyclohexa-2,4-dien-1-yl}acetonitrile | 1.0 | N#CCC1C=CC(=CC1(O)COC=2C(=CC=CC2C)C)CN(C)C | N#CCC1C=CC(=CC1(O)CN(C)C)COC=2C(=CC=CC2C)C | 0.93 |
| 4 | 4-[4-(3-hydroxycyclohepta-1,3,6-trien-1-yl)phenyl]- | 4-[4-(3-hydroxycyclohepta-1,4,6-trien-1-yl)phenyl]- | 1.0 | O=C(NC1=CCC=CC=C1C)CCCC=2C=CC(=CC2)C=3C=CCC=C(O)C3 | O=C(NC1=CC=CCC=C1C)CCCC=2C=CC(=CC2)C=3C=CC=CC(O)C3 | 0.95 |
| 5 | (but-1-en-2-yl)(prop-1-en-1-yl)amine | (but-1-en-1-yl)(prop-1-en-2-yl)amine | 1.0 | C=C(NC=CC)CC | C=C(NC=CCC)C | 0.97 |
Fig. 6 Chemical structures depicted with the CDK depiction generator for predictions with BLEU score 1.0 but Tanimoto similarity less than 1.0
Analysis on test set using OPSIN
| OPSIN analysis on test set | Values |
|---|---|
| Invalid IUPAC names | 1.69% |
| Valid IUPAC names | 98.31% |
| Tanimoto 1.0 count on the total test dataset | 97.89% |
| Tanimoto 1.0 count on valid IUPAC names | 96.24% |
| Average Tanimoto (measured for total test dataset) | 0.99 |
| Average Tanimoto (measured for valid IUPAC names) | 0.98 |
Average BLEU scores, BLEU-n scores, and Tanimoto similarity calculations
| | 30 million | 60 million |
|---|---|---|
| Average BLEU score | 0.90 | 0.94 |
| Total number of predicted strings with BLEU 1.0 | 46.78% | 68.47% |
| BLEU-1 | 0.94 | 0.97 |
| BLEU-2 | 0.91 | 0.95 |
| BLEU-3 | 0.89 | 0.94 |
| BLEU-4 | 0.85 | 0.92 |
| Tanimoto calculations | | |
| Average Tanimoto similarity index | 0.89 | 0.94 |
| Number of predicted strings with Tanimoto 1.0 | 52.27% | 73.26% |
Predicted SELFIES with low BLEU scores and Tanimoto similarity 1.0
| No. | Original SELFIES | Predicted SELFIES | BLEU score | Original SELFIES decoded back into SMILES | Predicted SELFIES decoded back into SMILES | Tanimoto similarity index |
|---|---|---|---|---|---|---|
| 1. | [I][C][C][Branch1_2][Branch1_3][=C][N][C][Expl=Ring1][Branch1_1][C][C] | [I][C][=C][Branch1_1][Branch1_3][N][C][=C][Ring1][Branch1_1][C][C]: | 0.00 | IC=1C(=CNC1C)C | IC=1C(=CNC1C)C | 1.0 |
| 2. | [O][C][C][=C][C][=C][Branch1_1][Ring2][C][Expl=Ring1][Branch1_2][C][N][=N][C][=C][Branch1_1][Ring2][C][Expl=Ring1][Branch1_2][C] | [O][C][=C][C][=C][C][Branch1_2][Ring2][=C][Ring1][Branch1_2][C][=N][N][=C][C][Branch1_2][Ring2][=C][Ring1][Branch1_2][C]: | 0.18 | OC=1C=CC=C(C1)C=2N=NC=C(C2)C | OC=1C=CC=C(C1)C=2N=NC=C(C2)C | 1.0 |
| 3. | [C][Branch1_2][=C][=C][C][C][Branch1_2][Branch2_1][=C][C][Branch1_2][Ring1][=C][C][C][C][C] | [C][Branch1_1][=N][C][=C][Branch1_1][Branch1_3][C][=C][Branch1_1][C][C][C][C][C][=C][C]: | 0.21 | C(=CCC(=CC(=CC)C)C)C | C(C=C(C=C(C)C)CC)=CC | 1.0 |
| 4. | [N][=C][C][Branch1_2][N][=C][C][=C][Ring1][Branch1_2][O][C][Branch1_1][C][C][C][C][=N][N][C][C][=C][C][Branch1_1][Ring2][N][=C][N][=C][C][Ring1][N][Expl=Ring1][Branch2_2] | [N][Branch1_2][Ring1][=C][N][C][C][=C][C][N][N][=C][Branch1_1][P][C][C][=N][C][Branch1_1][Branch1_3][O][C][Branch1_1][C][C][C][=C][C][Expl=Ring1][Branch2_3][C][Expl=Ring1][#C][C][Expl=Ring2][Ring1][Ring1]: | 0.32 | N1=CC(=CC=C1OC(C)C)C2=NNC=3C=CC(N=CN)=CC23 | N1=CC(=CC=C1OC(C)C)C2=NNC=3C=CC(N=CN)=CC23 | 1.0 |
| 5. | [O][=C][N][C][=C][Branch1_1][Branch1_2][N][=C][Ring1][Branch1_2][C][C][=C][C][=C][C][Ring1][Branch2_3] | [O][=C][N][C][C][=C][C][=C][C][C][Expl=Ring1][Branch1_3][N][=C][Ring1][O][C]: | 0.45 | O=C1NC2=C(N=C1C)C=CC=CC2 | O=C1NC=2C=CC=CCC2N=C1C | 1.0 |
| 6. | [O][=N][C][Branch1_2][C][=O][C][C][=C][C][=C][C][Branch1_2][Branch2_2][=C][C][=C][Ring1][Branch1_2][C][Expl=Ring1][Branch2_3][C] | [O][=N][C][Branch1_2][C][=O][C][=C][C][=C][C][=C][Branch1_1][Branch2_2][C][=C][C][Ring1][Branch1_2][=C][Ring1][Branch2_3][C]: | 0.53 | O=NC(=O)C=1C=CC2=CC(=CC=C2C1)C | O=NC(=O)C=1C=CC2=CC(=CC=C2C1)C | 1.0 |
| 7. | [O][B][Branch1_1][C][O][C][=C][C][Branch1_2][=C][=C][C][=C][Ring1][Branch1_2][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=C][N][=C][C][=C][Ring1][Branch1_2] | [O][B][Branch1_1][C][O][C][C][=C][Branch1_1][=C][C][=C][C][Expl=Ring1][Branch1_2][C][C][=C][C][=C][C][Expl=Ring1][Branch1_2][C][=C][N][=C][C][=C][Ring1][Branch1_2]: | 0.60 | OB(O)C1=CC(=CC=C1C2=CC=CC=C2)C3=CN=CC=C3 | OB(O)C1=CC(=CC=C1C2=CC=CC=C2)C3=CN=CC=C3 | 1.0 |
| 8. | [O][=C][N][C][Branch2_1][Ring1][C][C][C][=C][C][Branch1_1][Ring1][O][C][=C][C][Expl=Ring1][Branch2_1][N][Branch1_1][C][C][C][=C][Branch1_1][Branch1_2][C][=C][Ring1][P][C][C][C] | [O][=C][N][C][Branch2_1][Ring1][C][C][=C][C][=C][Branch1_1][Ring1][O][C][C][=C][Ring1][Branch2_1][N][Branch1_1][C][C][C][=C][Branch1_1][Branch1_3][C][=C][Ring1][P][C][C][C]: | 0.71 | O=C1NC(C=2C=CC(OC)=CC2N(C)C)=C(C=C1C)CC | O=C1NC(C=2C=CC(OC)=CC2N(C)C)=C(C=C1CC)C | 1.0 |
| 9. | [O][=P][Branch2_1][Ring1][Branch1_2][C][=N][N][C][Branch1_2][Ring2][=C][Ring1][Branch1_1][C][Branch1_1][C][F][Branch1_1][C][F][C][Branch1_1][C][F][F][Branch1_1][Branch2_2][C][C][=C][C][=C][C][Expl=Ring1][Branch1_2][C][C][=C][C][=C][C][Expl=Ring1][Branch1_2] | [O][=P][Branch1_1][Branch2_2][C][C][=C][C][=C][C][Expl=Ring1][Branch1_2][Branch1_1][Branch2_2][C][C][=C][C][=C][C][Expl=Ring1][Branch1_2][C][=N][N][C][Branch1_2][Ring2][=C][Ring1][Branch1_1][C][Branch1_1][C][F][Branch1_1][C][F][C][Branch1_1][C][F][F]: | 0.86 | O=P(C1=NNC(=C1)C(F)(F)C(F)F)(C=2C=CC=CC2)C=3C=CC=CC3 | O=P(C1=NNC(=C1)C(F)(F)C(F)F)(C=2C=CC=CC2)C=3C=CC=CC3 | 1.0 |
| 10. | [O][=C][Branch2_1][Ring1][=N][O][C][=C][C][=C][C][Branch1_2][N][=C][Ring1][Branch1_2][O][C][Branch1_2][C][=O][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C] | [O][=C][Branch2_1][Ring1][=C][O][C][=C][C][=C][C][Branch1_2][N][=C][Ring1][Branch1_2][O][C][Branch1_2][C][=O][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C]: | 0.93 | O=C(OC1=CC=CC(=C1OC(=O)CCC)CCCCCCCCC)CCC | O=C(OC1=CC=CC(=C1OC(=O)CCC)CCCCCCCCCC)CC | 1.0 |
Fig. 7 Chemical structures depicted with the CDK depiction generator for predictions with Tanimoto similarity 1.0 and low BLEU score