Alain C Vaucher, Philippe Schwaller, Joppe Geluykens, Vishnu H Nair, Anna Iuliano, Teodoro Laino.
Abstract
The experimental execution of chemical reactions is a context-dependent and time-consuming process, often solved using the experience collected over multiple decades of laboratory work or by searching for similar, already executed, experimental protocols. Although data-driven schemes, such as retrosynthetic models, are becoming established technologies in synthetic organic chemistry, the conversion of proposed synthetic routes to experimental procedures remains a burden on the shoulders of domain experts. In this work, we present data-driven models for predicting the entire sequence of synthesis steps starting from a textual representation of a chemical equation, for application in batch organic chemistry. We generated a data set of 693,517 chemical equations and associated action sequences by extracting and processing experimental procedure text from patents, using state-of-the-art natural language models. We used the attained data set to train three different models: a nearest-neighbor model based on recently-introduced reaction fingerprints, and two deep-learning sequence-to-sequence models based on the Transformer and BART architectures. An analysis by a trained chemist revealed that the predicted action sequences are adequate for execution without human intervention in more than 50% of the cases.
Year: 2021 PMID: 33958589 PMCID: PMC8102565 DOI: 10.1038/s41467-021-22951-1
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of the data set generation and Smiles2Actions model.
The data set is generated in a sequence of processing and filtering steps, starting from information available in patent reaction records (on the left). The Smiles2Actions model is trained on this data set, after which it can predict the action sequences to execute arbitrary chemical equations (on the right).
Fig. 2 Illustration of a chemical equation.
To the left of the arrow, one can identify all the precursor molecules, while the product molecule is shown to its right. On the left-hand side of the reaction, we also include molecules that play a role as reagents or solvents only, such as the first two entities: N,N′-dicyclohexylcarbodiimide and dichloromethane.
Possible action sequence for the chemical equation of Fig. 2.
| Step | Action sequence | Equivalent human-readable sequence |
|---|---|---|
| 1 | ADD $1$ | ADD N,N′-dicyclohexylcarbodiimide |
| 2 | ADD $4$ | ADD aniline |
| 3 | ADD $2$ | ADD dichloromethane |
| 4 | ADD $3$ | ADD 4,4-dimethyl-1,2,3,4-tetrahydro-2-oxo-7-quinolinecarboxylic acid |
| 5 | STIR for @3@ at #4# | STIR for 8 h at 25 °C |
| 6 | FILTER keep precipitate | FILTER keep precipitate |
| 7 | RECRYSTALLIZE from ethanol | RECRYSTALLIZE from ethanol |
| 8 | YIELD $-1$ | YIELD 4,4-Dimethyl-1,2,3,4-tetrahydro-N-phenyl-2-oxo-7-quinolinecarboxamide |
The tokens $1$, $2$, $3$, and $4$ refer to the compounds present in the chemical equation. Since ethanol is not part of the chemical equation, it is not replaced by a token. The token @3@ refers to the third duration range and corresponds to durations between 3 and 10 h. The token #4# refers to the fourth temperature range and corresponds to temperatures between 10 °C and 40 °C. More details about the token substitution can be found in the Methods section.
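The token substitution described above can be illustrated with a minimal sketch. It assumes the token syntax shown in the table (`$n$` for compounds, `@n@` for duration ranges, `#n#` for temperature ranges); the names `COMPOUNDS`, `DURATION_RANGES`, `TEMPERATURE_RANGES`, and `render_action` are hypothetical, and only the two ranges stated in the text are filled in.

```python
import re

# Compound tokens for the reaction of Fig. 2, as listed in the table above.
COMPOUNDS = {
    1: "N,N′-dicyclohexylcarbodiimide",
    2: "dichloromethane",
    3: "4,4-dimethyl-1,2,3,4-tetrahydro-2-oxo-7-quinolinecarboxylic acid",
    4: "aniline",
    -1: "4,4-Dimethyl-1,2,3,4-tetrahydro-N-phenyl-2-oxo-7-quinolinecarboxamide",
}
# Only the ranges stated in the text; the remaining bins are not given here.
DURATION_RANGES = {3: "3 to 10 h"}
TEMPERATURE_RANGES = {4: "10 to 40 °C"}


def render_action(action: str) -> str:
    """Replace compound, duration, and temperature tokens by readable text."""
    action = re.sub(r"\$(-?\d+)\$", lambda m: COMPOUNDS[int(m.group(1))], action)
    action = re.sub(r"@(\d+)@", lambda m: DURATION_RANGES[int(m.group(1))], action)
    action = re.sub(r"#(\d+)#", lambda m: TEMPERATURE_RANGES[int(m.group(1))], action)
    return action


for step in ["ADD $4$", "STIR for @3@ at #4#", "RECRYSTALLIZE from ethanol"]:
    print(render_action(step))
```

Because the tokens encode ranges, this sketch recovers "3 to 10 h" and "10 to 40 °C" rather than the specific values of 8 h and 25 °C shown in the human-readable column.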
Reaction records ignored during the generation of the data set.
| Category | Number of reaction records |
|---|---|
| Incomplete mapping of molecules | 995,674 |
| Refers to other procedure | 690,484 |
| Contains InvalidAction | 131,461 |
| Error in duration extraction | 127,598 |
| Likely to contain multiple reaction steps | 120,073 |
| Too short action sequence | 67,631 |
| Error in action sequence extraction | 38,516 |
| Error in temperature extraction | 37,485 |
| Molecule present both in the precursors and the products | 6612 |
| Invalid molecule SMILES | 6280 |
| Invalid reaction SMILES | 1606 |
| Other errors | 3544 |
| Removed due to duplicate reaction SMILES | 544,183 |
| Final data set | 693,517 |
| Total | 3,464,664 |
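As a quick consistency check on the table above, the following sketch sums the ignored records, the duplicates, and the final data set, and compares the result with the reported total. The variable names are illustrative; the numbers are copied from the table.

```python
# Reaction records ignored during data set generation (values from the table).
ignored = {
    "Incomplete mapping of molecules": 995_674,
    "Refers to other procedure": 690_484,
    "Contains InvalidAction": 131_461,
    "Error in duration extraction": 127_598,
    "Likely to contain multiple reaction steps": 120_073,
    "Too short action sequence": 67_631,
    "Error in action sequence extraction": 38_516,
    "Error in temperature extraction": 37_485,
    "Molecule present both in the precursors and the products": 6_612,
    "Invalid molecule SMILES": 6_280,
    "Invalid reaction SMILES": 1_606,
    "Other errors": 3_544,
}
duplicates = 544_183
final_data_set = 693_517
reported_total = 3_464_664

ignored_total = sum(ignored.values())
computed_total = ignored_total + duplicates + final_data_set
retained_fraction = final_data_set / reported_total

print(f"Ignored records: {ignored_total:,}")
print(f"Computed total:  {computed_total:,}")
print(f"Retained share:  {retained_fraction:.1%}")
```

The categories add up to the reported total, and roughly one in five of the original reaction records ends up in the final data set.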
Fig. 3 Differences in class prevalence between the original data and the generated data set.
As an example, this figure can be read in the following manner: a total of 60 reaction classes, each occurring between 100 and 999 times in the original reaction data set, are represented between 25% and 50% more frequently in the data set of 693,517 reactions.
Metrics for the prediction of synthesis actions.
| Model | Validity | BLEU | 100% acc. | 90% acc. | 75% acc. | 50% acc. |
|---|---|---|---|---|---|---|
| Random (among all reactions) | 61.6 | 35.1 | 0.00 | 0.04 | 0.76 | 24.07 |
| Random (compatible pattern) | | 38.5 | 0.01 | 0.18 | 1.51 | 30.01 |
| Nearest neighbor | 99.6 | 53.2 | | | 20.30 | 55.46 |
| Transformer | 99.7 | | 3.60 | 10.10 | | |
| BART | 99.6 | 54.5 | 0.98 | 5.00 | 17.57 | 66.04 |
All values are given in percentage, and the best values are indicated in bold. The ground truth is considered to be the only correct solution during the evaluation of the different metrics.
Action sequences predicted for a reaction from the test set.
| Ground truth | Transformer model | BART model | Nearest-neighbor model |
|---|---|---|---|
| ADD $2$ | ADD $2$ | ADD $2$ | ADD $4$ |
| ADD $4$ | ADD $4$ | ADD $4$ | ADD $3$ |
| ADD $3$ | ADD $3$ | ADD $3$ | ADD $5$ at #4# |
| ADD $1$ | ADD $1$ | ADD $1$ | ADD $1$ at #4# |
| ADD $5$ | STIR for @2@ at #4# | STIR for @1@ at #4# | STIR for @1@ at #4# |
| STIR for @4@ at #4# | ADD $5$ | ADD $5$ | ADD $2$ |
| CONCENTRATE | STIR for @4@ at #4# | STIR for @4@ at #4# | STIR for @4@ |
| PURIFY | CONCENTRATE | CONCENTRATE | QUENCH with water |
| YIELD $-1$ | PURIFY | PURIFY | CONCENTRATE |
| – | YIELD $-1$ | YIELD $-1$ | EXTRACT with ethyl acetate/THF |
| – | – | – | WASH with brine |
| – | – | – | DRYSOLUTION over Na2SO4 |
| – | – | – | FILTER keep filtrate |
| – | – | – | CONCENTRATE |
| – | – | – | ADD THF |
| – | – | – | PURIFY |
| – | – | – | YIELD $-1$ |
The considered reaction is a reductive amination of compound $2$ with the amine $4$. The other precursors are acetic acid ($1$), ethanol ($3$), and sodium cyanoborohydride ($5$). The remaining tokens refer to the product ($-1$), to a temperature of 25 °C (#4#), and to durations of 10 min (@1@), 1 h (@2@), and 1 day (@4@).
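One way to quantify how close these predictions are to the ground truth is an edit-distance similarity. The sketch below is an assumption for illustration, not necessarily the metric used in the evaluation: it computes a Levenshtein distance over whole actions (rather than characters or tokens) and normalizes it by the length of the longer sequence. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


ground_truth = [
    "ADD $2$", "ADD $4$", "ADD $3$", "ADD $1$", "ADD $5$",
    "STIR for @4@ at #4#", "CONCENTRATE", "PURIFY", "YIELD $-1$",
]
# The Transformer and BART predictions insert one extra stirring step.
transformer = ground_truth[:4] + ["STIR for @2@ at #4#"] + ground_truth[4:]
bart = ground_truth[:4] + ["STIR for @1@ at #4#"] + ground_truth[4:]

print(similarity(ground_truth, transformer))
print(similarity(ground_truth, bart))
```

Under this action-level assumption, both sequence-to-sequence predictions differ from the ground truth by a single inserted action out of ten.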
Fig. 4 Distributions of the lengths of predicted action sequences.
(a) Comparison of the lengths of predicted action sequences for the different models. (b) Comparison of the lengths of predicted action sequences for different accuracy levels.
Categorization of differences of predictions and ground truth.
| Category | All properties | Some properties |
|---|---|---|
| Exact match | 2498 | – |
| Actions in different order | 620 | – |
| Properties of one action are different: Stir | 1935 | – |
| Properties of one action are different: Add | 262 | – |
| Properties of one action are different: Reflux | 163 | – |
| Properties of another action type are different | 234 | – |
| Properties of multiple actions are different | 1123 | – |
| Actions are swapped: Stir and Reflux | 319 | 113 |
| Actions are swapped: Stir and Microwave | 99 | 11 |
| Actions are swapped: Stir and Wait | 70 | 64 |
| Actions are swapped: Add and MakeSolution | 118 | 277 |
| Other swap of a single action | 442 | 2207 |
| Action without counterpart: Purify | 286 | 747 |
| Action without counterpart: Wash | 154 | 737 |
| Action without counterpart: Set Temperature | 175 | 579 |
| Action without counterpart: Filter | 121 | 462 |
| Action without counterpart: Concentrate | 173 | 404 |
| Action without counterpart: Stir | 154 | 313 |
| Action without counterpart: Add | 117 | 308 |
| Action without counterpart: Collect Layer | 48 | 207 |
| Another action type without counterpart | 191 | 550 |
| Multiple actions only in the ground truth | 2027 | 8475 |
| Multiple actions only in the prediction | 401 | 2161 |
| Remaining cases | 7280 | 32,727 |
The differences are computed for the test set containing 69,352 reaction records.
Result of the chemist’s assessment of action sequences.
| Decision | Number of reactions |
|---|---|
| Both sequences are adequate | 191 |
| Predicted action sequence is adequate, ground truth is inadequate | 122 |
| Predicted action sequence is inadequate, ground truth is adequate | 108 |
| Both sequences are inadequate | 79 |
Out of the 500 analyzed reactions, 19 had an identical action sequence in the ground truth and in the prediction; 16 of these were considered adequate and 3 inadequate.
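The share of adequate sequences follows directly from the table. The short sketch below, with illustrative variable names, adds up the four decision categories and derives the fraction of reactions for which the predicted and the ground-truth sequences were judged adequate.

```python
# Counts from the chemist's assessment table.
both_adequate = 191
only_prediction_adequate = 122
only_ground_truth_adequate = 108
both_inadequate = 79

total = (both_adequate + only_prediction_adequate
         + only_ground_truth_adequate + both_inadequate)

prediction_adequate = (both_adequate + only_prediction_adequate) / total
ground_truth_adequate = (both_adequate + only_ground_truth_adequate) / total

print(f"Reactions assessed:    {total}")
print(f"Prediction adequate:   {prediction_adequate:.1%}")
print(f"Ground truth adequate: {ground_truth_adequate:.1%}")
```

The predicted sequences are judged adequate for 313 of the 500 reactions, which is consistent with the "more than 50%" stated in the abstract.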