Alain C Vaucher, Federico Zipoli, Joppe Geluykens, Vishnu H Nair, Philippe Schwaller, Teodoro Laino.
Abstract
Experimental procedures for chemical synthesis are commonly reported in prose in patents or in the scientific literature. The extraction of the details necessary to reproduce and validate a synthesis in a chemical laboratory is often a tedious task requiring extensive human intervention. We present a method to convert unstructured experimental procedures written in English to structured synthetic steps (action sequences) reflecting all the operations needed to successfully conduct the corresponding chemical reactions. To achieve this, we design a set of synthesis actions with predefined properties and a deep-learning sequence-to-sequence model based on the transformer architecture to convert experimental procedures to action sequences. The model is pretrained on vast amounts of data generated automatically with a custom rule-based natural language processing approach and refined on manually annotated samples. Predictions on our test set result in a perfect (100%) match of the action sequence for 60.8% of sentences, a 90% match for 71.3% of sentences, and a 75% match for 82.4% of sentences.
Year: 2020 PMID: 32681088 PMCID: PMC7367864 DOI: 10.1038/s41467-020-17266-6
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Action sequence extracted from an experimental procedure.
| Step | Action |
|---|---|
| 1 | MakeSolution with methyl 3-7-amino-2-[(2,4-dichlorophenyl)(hydroxy)methyl]-1H-benzimidazol-1-ylpropanoate (6.00 g, 14.7 mmol) and acetic acid (7.4 mL) and methanol (147 mL); |
| 2 | Add SLN; |
| 3 | Add acetaldehyde (4.95 mL, 88.2 mmol) at 0 °C; |
| 4 | Wait 30 min; |
| 5 | Add sodium acetoxyborohydride (18.7 g, 88.2 mmol); |
| 6 | Wait 2 h; |
| 7 | Quench with water; |
| 8 | Concentrate; |
| 9 | Add ethyl acetate; |
| 10 | Wash with aqueous sodium hydroxide (1 M); |
| 11 | Wash with brine; |
| 12 | DrySolution over sodium sulfate; |
| 13 | Filter keep filtrate; |
| 14 | Concentrate; |
| 15 | Purify; |
| 16 | Yield title compound (6.30 g, 13.6 mmol, 92%). |
The sequence corresponds to the example experimental procedure given above.
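One way to picture the structured output is as a list of typed actions that serialize back to the semicolon-separated text shown in the table. The following is a minimal sketch in Python; the names `Chemical`, `Action`, and `serialize` are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chemical:
    """A substance with optional quantities (hypothetical helper class)."""
    name: str
    quantities: List[str] = field(default_factory=list)

    def __str__(self) -> str:
        if self.quantities:
            return f"{self.name} ({', '.join(self.quantities)})"
        return self.name


@dataclass
class Action:
    """One synthesis step: an action type plus a free-text argument."""
    type: str
    argument: Optional[str] = None

    def __str__(self) -> str:
        return f"{self.type} {self.argument}" if self.argument else self.type


def serialize(actions: List[Action]) -> str:
    """Join actions with '; ' and close with '.', mirroring the table layout."""
    return "; ".join(str(a) for a in actions) + "."


# Steps 3 to 8 of the table, rebuilt by hand for illustration.
acetaldehyde = Chemical("acetaldehyde", ["4.95 mL", "88.2 mmol"])
borohydride = Chemical("sodium acetoxyborohydride", ["18.7 g", "88.2 mmol"])
steps = [
    Action("Add", f"{acetaldehyde} at 0 °C"),
    Action("Wait", "30 min"),
    Action("Add", str(borohydride)),
    Action("Wait", "2 h"),
    Action("Quench", "with water"),
    Action("Concentrate"),
]
print(serialize(steps))
```

Keeping the argument as free text is a simplification; a fuller model would give each action type its own predefined properties (temperature, duration, and so on), as the abstract describes.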
Action types for information extraction from experimental procedures.
| Action name | Description |
|---|---|
| Add | Add a substance to the reactor |
| CollectLayer | Select aqueous or organic fraction(s) |
| Concentrate | Evaporate the solvent (rotavap) |
| Degas | Purge the reaction mixture with a gas |
| DrySolid | Dry a solid |
| DrySolution | Dry an organic solution with a desiccant |
| Extract | Transfer compound into a different solvent |
| Filter | Separate solid and liquid phases |
| MakeSolution | Mix several substances to generate a mixture or solution |
| Microwave | Heat the reaction mixture in a microwave apparatus |
| Partition | Add two immiscible solvents for subsequent phase separation |
| PH | Change the pH of the reaction mixture |
| PhaseSeparation | Separate the aqueous and organic phases |
| Purify | Purification (chromatography) |
| Quench | Stop reaction by adding a substance |
| Recrystallize | Recrystallize a solid from a solvent or mixture of solvents |
| Reflux | Reflux the reaction mixture |
| SetTemperature | Change the temperature of the reaction mixture |
| Sonicate | Agitate the solution with sound waves |
| Stir | Stir the reaction mixture for a specified duration |
| Triturate | Triturate the residue |
| Wait | Leave the reaction mixture to stand for a specified duration |
| Wash | Wash (after filtration, or with immiscible solvent) |
| Yield | Phony action, indicates the product of a reaction |
| FollowOtherProcedure | The text refers to a procedure described elsewhere |
| InvalidAction | Unknown or unsupported action |
| OtherLanguage | The text is not written in English |
| NoAction | The text does not correspond to an actual action |
Metrics for the extraction of synthesis actions.
| Model | Validity | BLEU score | Levenshtein similarity | 100% accuracy | 90% accuracy | 75% accuracy |
|---|---|---|---|---|---|---|
| Combined rule-based model | | 51.5 | 60.1 | 21.9 | 29.0 | 42.6 |
| Pretrained translation model | | 58.6 | 68.7 | 24.7 | 33.2 | 48.3 |
| Model without pretraining | 98.9 | 64.7 | 76.4 | 37.8 | 47.7 | 62.8 |
| Refined translation model | 99.4 | | | **60.8** | **71.3** | **82.4** |
The metrics are evaluated on the annotation test set for the approaches introduced in this work. All values are given in %, and the best values are indicated in bold. An extended table showing the metrics for all the refinement experiments can be found in the Supplementary Note 2.
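The similarity-based columns can be illustrated with a short sketch. It assumes that the Levenshtein similarity is the edit distance normalized by the longer string and that the "N% accuracy" columns count sentences whose similarity reaches N%. Both the character-level comparison and the function names are assumptions made for illustration, not the authors' exact procedure.

```python
from typing import List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(pred: str, truth: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / longest


def accuracy_at(preds: List[str], truths: List[str], threshold: float) -> float:
    """Fraction of sentences whose similarity reaches the threshold."""
    hits = sum(similarity(p, t) >= threshold for p, t in zip(preds, truths))
    return hits / len(truths)


# Hypothetical predictions: the first is exact, the second drops a clause.
truths = ["STIR for 12 h at room temperature.", "ADD DCM (100 mL) under nitrogen."]
preds = ["STIR for 12 h at room temperature.", "ADD DCM (100 mL)."]
print(accuracy_at(preds, truths, 1.0))   # share of perfect matches
print(accuracy_at(preds, truths, 0.75))  # share with at least 75% similarity
```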
Example of extracted action sequences.
| (1) PH with 10% hydrochloric acid to pH 1.5; PHASESEPARATION; COLLECTLAYER organic; WASH with saturated aqueous sodium chloride; DRYSOLUTION over anhydrous magnesium sulfate. |
| (2) PH with 10% hydrochloric acid to pH 1.5; PHASESEPARATION; COLLECTLAYER organic; WASH with saturated aqueous sodium chloride; DRYSOLUTION over anhydrous magnesium sulfate. |
| (1) MAKESOLUTION with sodium metal (450 mg, 19.75 mmol) and EtOH; ADD SLN; ADD ethyl acetoacetate (103 g, 790 mmol) at 30 °C. |
| (2) MAKESOLUTION with |
| (1) ADD 3-Bromo-2-fluoroaniline (10 g, 52.63 mmol); ADD DCM (100 mL) under nitrogen. |
| (2) ADD 3-Bromo-2-fluoroaniline (10 g, 52.63 mmol) |
| (1) STIR for 12 h at room temperature. |
| (2) |
| (1) INVALIDACTION. |
| (2) |
For sentences picked from experimental procedures, the action sequences predicted by the refined translation model (2) are compared to the annotated sequences (1). The errors in the prediction are highlighted in bold. The action sequences predicted by the other models, as well as predictions on other sentences, can be found in the Supplementary Data 1.
Prediction accuracy by action type.
| Action type | Type match | Full match | Only in prediction | Only in ground truth |
|---|---|---|---|---|
| Add | 246 | 185 | 21 | 9 |
| Stir | 112 | 100 | 2 | 6 |
| MakeSolution | 57 | 46 | 5 | 5 |
| SetTemperature | 55 | 52 | 6 | 5 |
| Concentrate | 48 | 48 | 3 | 6 |
| Wash | 44 | 43 | 3 | 1 |
| PH | 41 | 34 | 2 | 2 |
| CollectLayer | 35 | 35 | 4 | 2 |
| Extract | 32 | 31 | 0 | 2 |
| Filter | 32 | 29 | 4 | 2 |
| Yield | 31 | 25 | 5 | 6 |
| NoAction | 22 | 22 | 3 | 3 |
| DrySolution | 22 | 21 | 2 | 0 |
| Purify | 19 | 19 | 2 | 5 |
| Wait | 16 | 15 | 3 | 3 |
| FollowOtherProcedure | 14 | 14 | 3 | 1 |
| DrySolid | 10 | 9 | 2 | 2 |
| Quench | 7 | 7 | 0 | 1 |
| Reflux | 7 | 5 | 0 | 0 |
| Partition | 5 | 4 | 0 | 0 |
| PhaseSeparation | 4 | 4 | 0 | 0 |
| Triturate | 3 | 2 | 2 | 0 |
| OtherLanguage | 2 | 2 | 0 | 0 |
| Recrystallize | 2 | 0 | 2 | 0 |
| Degas | 1 | 1 | 1 | 0 |
| InvalidAction | 0 | 0 | 5 | 11 |
The table indicates the number of actions for which the type was predicted correctly (type match), the number of actions for which not only the type, but also the associated properties, were predicted correctly (full match), the number of actions of a given type that were present only in the prediction, and the number of actions of a given type that were present only in the ground truth.
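The four columns can be tallied per sentence with multiset arithmetic, as in the sketch below. The matching rule used here (per-type minimum for type matches, exact multiset intersection for full matches) is an assumption made for illustration; the authors' alignment of predicted and annotated actions may differ. The function name `tally` and the example prediction are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

ActionT = Tuple[str, str]  # (action type, serialized properties)


def tally(pred: List[ActionT], truth: List[ActionT]) -> Dict[str, Dict[str, int]]:
    """Count type matches, full matches, and unmatched actions per type."""
    pred_types = Counter(t for t, _ in pred)
    truth_types = Counter(t for t, _ in truth)

    # Actions identical in type and properties (multiset intersection).
    full_by_type: Counter = Counter()
    for (t, _), n in (Counter(pred) & Counter(truth)).items():
        full_by_type[t] += n

    stats = {}
    for t in sorted(set(pred_types) | set(truth_types)):
        matched = min(pred_types[t], truth_types[t])
        stats[t] = {
            "type_match": matched,
            "full_match": full_by_type[t],
            "only_in_prediction": pred_types[t] - matched,
            "only_in_ground_truth": truth_types[t] - matched,
        }
    return stats


# Hypothetical example: one Add is missed and a spurious Stir is predicted.
truth = [
    ("Add", "3-Bromo-2-fluoroaniline (10 g, 52.63 mmol)"),
    ("Add", "DCM (100 mL) under nitrogen"),
]
pred = [
    ("Add", "3-Bromo-2-fluoroaniline (10 g, 52.63 mmol)"),
    ("Stir", "for 12 h"),
]
for action_type, counts in tally(pred, truth).items():
    print(action_type, counts)
```

Summing these per-sentence counts over a test set gives totals of the same kind as in the table, which is why a single action type can have non-zero entries in both of the last two columns.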
Fig. 1 Visualization of the correctness of predicted action types.
The action types predicted by the transformer model (labels on the x-axis) are compared to the actual action types of the ground truth (labels on the y-axis). This figure is generated by first counting all the correctly predicted action types (values on the diagonal); these values correspond to the column "Type match" of Table 5. Then, the off-diagonal elements are determined from the remaining (incorrectly predicted) actions. Thereby, the last row and column gather actions that are present only in the predicted set or ground truth, respectively. For clarity, the color scale stops at 10, although many elements (especially on the diagonal) exceed this value.
Fig. 2 Statistics of the Pistachio and annotation datasets.
(a) Distribution of the number of characters for sentences from Pistachio and from the annotation dataset. (b) Distribution of the number of actions per sentence. For the Pistachio dataset, this number is computed from the actions extracted by the rule-based model. For sentences from the annotation dataset, this number is determined from the ground truth (hand annotations). (c) Distribution of action types extracted by the rule-based model on the Pistachio dataset and on the annotated dataset. The action types are ordered by decreasing frequency for the Pistachio dataset. (d) Distribution of action types determined from hand annotations for the full annotation dataset and its test split. The action types are ordered by decreasing frequency for the full annotation dataset.
Fig. 3 Distribution of action types of the annotation test set.
The action types are ordered by decreasing frequency for the hand annotations.
Fig. 4 Screenshots for adding and editing actions with the annotation framework.
The sentence to annotate is displayed on the left-hand side, with the corresponding pre-annotations on the right-hand side. A Wash action is missing and can be added by clicking on the corresponding button at the top. Also, when clicking on the appropriate button, a new page opens to edit the selected action.
Illustration of the data augmentation approach.
| Diisopropylazodicarboxylate (0.05 ml, 0.302 mmol) was added to the reaction mixture followed by stirring for 3 h at room temperature. |
|---|
| (1) |
| (2) Diisopropylazodicarboxylate (0.05 ml, |
| (3) |
| (4) |
A reference sentence (at the top) is augmented to produce four additional sentences. The substituted elements are written in italic. For data augmentation of the annotation dataset, the actions associated with the reference sentence are also subjected to substitution.
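The substitution idea can be sketched as follows. This is a minimal illustration that assumes plain string replacement applied to both the sentence and its serialized actions; the function `augment`, the replacement pools, and the annotated action string are hypothetical and do not reproduce the authors' augmentation code.

```python
import random
from typing import Dict, List, Tuple


def augment(sentence: str, actions: str, pools: Dict[str, List[str]],
            rng: random.Random) -> Tuple[str, str]:
    """Swap each listed element for a random alternative in both strings."""
    for original, alternatives in pools.items():
        if original not in sentence:
            continue
        replacement = rng.choice(alternatives)
        # Apply the same substitution to the text and to its annotation
        # so that the pair stays consistent.
        sentence = sentence.replace(original, replacement)
        actions = actions.replace(original, replacement)
    return sentence, actions


reference = ("Diisopropylazodicarboxylate (0.05 ml, 0.302 mmol) was added to the "
             "reaction mixture followed by stirring for 3 h at room temperature.")
# Hypothetical annotation of the reference sentence.
annotation = ("ADD Diisopropylazodicarboxylate (0.05 ml, 0.302 mmol); "
              "STIR for 3 h at room temperature.")
# Hypothetical pools of alternatives for each substitutable element.
pools = {
    "Diisopropylazodicarboxylate": ["triethylamine", "acetic anhydride"],
    "0.05 ml": ["1.2 ml", "25 ml"],
    "3 h": ["30 min", "12 h"],
    "room temperature": ["0 °C", "60 °C"],
}

rng = random.Random(0)  # seeded so that runs are repeatable
for _ in range(4):
    new_sentence, new_actions = augment(reference, annotation, pools, rng)
    print(new_sentence)
    print(new_actions)
```

Independent substitution can leave quantities chemically inconsistent (a changed volume next to an unchanged molar amount), which may be acceptable when the goal is only to vary the surface form of the training text.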