Jingqi Wang, Yuankai Ren, Zhi Zhang, Hua Xu, Yaoyun Zhang.
Abstract
Chemical reactions and experimental conditions are fundamental information for chemical research and pharmaceutical applications. However, the latest information on chemical reactions is usually embedded in the free text of patents. The rapid accumulation of chemical patents calls for automatic tools based on natural language processing (NLP) techniques for efficient and accurate information extraction. This work describes the participation of the Melax Tech team in the CLEF 2020-ChEMU Task of Chemical Reaction Extraction from Patent. The task consisted of two subtasks: (1) named entity recognition to identify compounds and different semantic roles in the chemical reaction and (2) event extraction to identify event triggers of chemical reactions and their relations with the semantic roles recognized in subtask 1. To build a high-performance end-to-end system, multiple strategies tailored to chemical patents were applied and evaluated, ranging from optimizing tokenization and pre-training patent language models with self-supervision to applying domain knowledge-based rules. Our hybrid approaches combining different strategies achieved state-of-the-art results in both subtasks, with the top-ranked F1 of 0.957 for entity recognition and the top-ranked F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.
Keywords: chemical patent; chemical reaction; event extraction; named entity recognition; relation extraction; self-supervision; tokenization for chemical patent
Year: 2021 PMID: 35005421 PMCID: PMC8727901 DOI: 10.3389/frma.2021.691105
Source DB: PubMed Journal: Front Res Metr Anal ISSN: 2504-0537
FIGURE 1. Examples of chemical reaction elements and relations annotated in patents.
FIGURE 2. Workflow of building information extraction systems for chemical reactions in patents.
FIGURE 3. An example of a text snippet with hierarchical steps of chemical reactions.
Performances of semantic role extraction for chemical reaction. Both exact and relaxed matching results are reported.

| Method | Exact precision | Exact recall | Exact F1 | Relaxed precision | Relaxed recall | Relaxed F1 |
|---|---|---|---|---|---|---|
| Fine-tuning | 0.9571 | 0.957 | 0.957 | 0.969 | 0.9687 | 0.9688 |
| Ensemble | 0.9587 | 0.9529 | 0.9558 | 0.9697 | 0.9637 | 0.9667 |
| Merge-data | 0.9572 | 0.951 | 0.9541 | 0.9688 | 0.9624 | 0.9656 |

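
The distinction between exact and relaxed matching in these tables can be illustrated with a minimal sketch. It assumes that exact matching requires identical span boundaries and label, and that relaxed matching accepts any overlapping span with the same label. The official ChEMU scorer may define relaxed matching differently, so the `Span` type and the `exact_match`, `relaxed_match`, and `prf` helpers below are hypothetical names used for illustration only.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "SOLVENT"


def exact_match(pred: Span, gold: Span) -> bool:
    # Assumption: exact matching needs identical boundaries and label.
    return pred == gold


def relaxed_match(pred: Span, gold: Span) -> bool:
    # Assumption: relaxed matching needs the same label and any character overlap.
    return (pred.label == gold.label
            and pred.start < gold.end
            and gold.start < pred.end)


def prf(preds, golds, match):
    # Precision counts predictions that match some gold span;
    # recall counts gold spans that are matched by some prediction.
    matched_preds = sum(any(match(p, g) for g in golds) for p in preds)
    matched_golds = sum(any(match(p, g) for p in preds) for g in golds)
    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the second prediction is off by one character at the start.
gold = [Span(0, 7, "SOLVENT"), Span(20, 25, "TIME")]
pred = [Span(0, 7, "SOLVENT"), Span(21, 25, "TIME")]

exact = prf(pred, gold, exact_match)
relaxed = prf(pred, gold, relaxed_match)
```

On this toy input the boundary error is penalized under exact matching but not under relaxed matching, which is the same direction as the gap between the Exact and Relaxed columns above.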
Performances of event extraction for chemical reaction. Both exact and relaxed matching results are reported.

| Method | Exact precision | Exact recall | Exact F1 | Relaxed precision | Relaxed recall | Relaxed F1 |
|---|---|---|---|---|---|---|
| Fine-tuning | 0.9568 | 0.9504 | 0.9536 | 0.958 | 0.9516 | 0.9548 |
| Ensemble | 0.9619 | 0.9402 | 0.9509 | 0.9632 | 0.9414 | 0.9522 |
| Merge-data | 0.9522 | 0.9437 | 0.9479 | 0.9534 | 0.9449 | 0.9491 |

Performances of end-to-end systems for chemical reaction extraction. Both exact and relaxed matching results are reported.

| Method | Exact precision | Exact recall | Exact F1 | Relaxed precision | Relaxed recall | Relaxed F1 |
|---|---|---|---|---|---|---|
| Fine-tuning | 0.9201 | 0.9147 | 0.9174 | 0.9319 | 0.9261 | 0.9290 |

Performances on NER of semantic roles and event triggers on the development set are reported. The fine-tuning method was used in the experiment. Event triggers are shown in italics.

| Entity type | Exact precision | Exact recall | Exact F1 |
|---|---|---|---|
| EXAMPLE_LABEL | 0.979 | 0.986 | 0.982 |
| REACTION_PRODUCT | 0.899 | 0.904 | 0.902 |
| STARTING_MATERIAL | 0.896 | 0.926 | 0.911 |
| YIELD_OTHER | 0.99 | 0.965 | 0.977 |
| YIELD_PERCENT | 0.972 | 1 | 0.986 |
| REAGENT_CATALYST | 0.938 | 0.905 | 0.921 |
| SOLVENT | 0.963 | 0.93 | 0.946 |
| TEMPERATURE | 0.935 | 0.96 | 0.947 |
| OTHER_COMPOUND | 0.947 | 0.939 | 0.943 |
| TIME | 0.983 | 0.991 | 0.987 |
| *REACTION_STEP* | 0.952 | 0.944 | 0.948 |
| *WORKUP* | 0.931 | 0.93 | 0.931 |
| Overall_Semantic_Role | 0.949 | 0.937 | 0.943 |
| Overall | 0.943 | 0.941 | 0.942 |

Performances on each relation type and the overall performance on the development set are reported. The fine-tuning method was used in the experiment.

| Relation type | Exact precision | Exact recall | Exact F1 |
|---|---|---|---|
| ARG1\|REACTION_STEP\|OTHER_COMPOUND | 0.733 | 0.805 | 0.767 |
| ARG1\|REACTION_STEP\|REACTION_PRODUCT | 0.985 | 0.948 | 0.966 |
| ARG1\|REACTION_STEP\|REAGENT_CATALYST | 0.979 | 0.965 | 0.972 |
| ARG1\|REACTION_STEP\|SOLVENT | 0.975 | 0.9522 | 0.968 |
| ARG1\|REACTION_STEP\|STARTING_MATERIAL | 0.957 | 0.916 | 0.936 |
| ARG1\|WORKUP\|OTHER_COMPOUND | 0.965 | 0.961 | 0.963 |
| ARG1\|WORKUP\|REACTION_PRODUCT | 0 | 0 | 0 |
| ARG1\|WORKUP\|SOLVENT | 0.2 | 1 | 0.333 |
| ARG1\|WORKUP\|STARTING_MATERIAL | 0 | 0 | 0 |
| ARGM\|REACTION_STEP\|TEMPERATURE | 0.957 | 0.928 | 0.942 |
| ARGM\|REACTION_STEP\|TIME | 0.978 | 0.926 | 0.952 |
| ARGM\|REACTION_STEP\|YIELD_OTHER | 0.984 | 0.942 | 0.962 |
| ARGM\|REACTION_STEP\|YIELD_PERCENT | 0.982 | 0.943 | 0.962 |
| ARGM\|WORKUP\|TEMPERATURE | 0.893 | 0.909 | 0.901 |
| ARGM\|WORKUP\|TIME | 0.7 | 1 | 0.824 |
| ARGM\|WORKUP\|YIELD_OTHER | 0 | 0 | 0 |
| ARGM\|WORKUP\|YIELD_PERCENT | 0 | 0 | 0 |
| Overall | 0.963 | 0.944 | 0.953 |

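
The relation type labels in this table appear to pack three fields into one pipe-separated string: an argument role (`ARG1` or `ARGM`), an event trigger type (`REACTION_STEP` or `WORKUP`), and the entity type of the argument. The sketch below splits a label into those fields under that reading. The field order is inferred from the labels themselves, and the `RelationType` class and `parse_relation_label` function are hypothetical helpers, not part of any ChEMU tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationType:
    role: str          # "ARG1" or "ARGM" in the table above
    trigger_type: str  # "REACTION_STEP" or "WORKUP"
    entity_type: str   # e.g. "SOLVENT", "TEMPERATURE"


def parse_relation_label(label: str) -> RelationType:
    # Assumed layout: role|trigger_type|entity_type
    parts = label.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected three pipe-separated fields, got: {label!r}")
    return RelationType(*parts)


# A few labels copied from the table.
labels = [
    "ARG1|REACTION_STEP|SOLVENT",
    "ARGM|REACTION_STEP|TEMPERATURE",
    "ARG1|WORKUP|OTHER_COMPOUND",
    "ARGM|WORKUP|TIME",
]

# Group argument types by the trigger they attach to.
by_trigger = defaultdict(list)
for label in labels:
    rel = parse_relation_label(label)
    by_trigger[rel.trigger_type].append((rel.role, rel.entity_type))
```

Grouping by trigger type makes it easier to see that, in this table, `ARG1` pairs triggers with compound-like entities while `ARGM` pairs them with temperature, time, and yield.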
Performances of semantic role recognition by adding different strategies incrementally. Exact matching results of the development and test data using the fine-tuning method are reported.

| Model | Development precision | Development recall | Development F1 | Test precision | Test recall | Test F1 |
|---|---|---|---|---|---|---|
| BioBERT | 0.8402 | 0.9556 | 0.8942 | 0.8587 | 0.9760 | 0.9136 |
| +2Step_Tokenization | 0.9364 | 0.9330 | 0.9347 (+4.05%) | 0.9514 | 0.9530 | 0.9522 (+3.86%) |
| +Rule | 0.9394 | 0.9351 | 0.9373 (+0.26%) | 0.9539 | 0.9554 | 0.9546 (+0.24%) |
| +Word2Vec | 0.9373 | 0.9394 | 0.9383 (+0.10%) | 0.95984 | 0.9609 | 0.9604 (+0.58%) |
| +Patent_BioBERT-ChEMU | 0.9491 | 0.9370 | 0.9430 (+0.47%) | 0.9645 | 0.9517 | 0.9574 (−0.30%) |
| +Patent_BioBERT_External | 0.9413 | 0.9383 | 0.9398 (−0.32%) | 0.9616 | 0.9605 | 0.9611 (+0.37%) |

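
The F1 values and the parenthesized gains in this table can be checked with a short calculation. The sketch below recomputes F1 as the harmonic mean of precision and recall for the development columns, and computes each gain as the difference from the previous row's reported F1, expressed in percentage points. The `f1_score` helper and the `dev_rows` list are illustrative names; the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, precision, recall, reported F1) on the development set.
dev_rows = [
    ("BioBERT", 0.8402, 0.9556, 0.8942),
    ("+2Step_Tokenization", 0.9364, 0.9330, 0.9347),
    ("+Rule", 0.9394, 0.9351, 0.9373),
    ("+Word2Vec", 0.9373, 0.9394, 0.9383),
    ("+Patent_BioBERT-ChEMU", 0.9491, 0.9370, 0.9430),
    ("+Patent_BioBERT_External", 0.9413, 0.9383, 0.9398),
]

# F1 recomputed from the rounded precision and recall in the table.
recomputed = [round(f1_score(p, r), 4) for _, p, r, _ in dev_rows]

# Gain of each row over the previous one, in percentage points of F1.
deltas = [
    round((cur[3] - prev[3]) * 100, 2)
    for prev, cur in zip(dev_rows, dev_rows[1:])
]

for (name, *_), delta in zip(dev_rows[1:], deltas):
    print(f"{name}: {delta:+.2f} points")
```

Because the table's precision and recall are already rounded, a recomputed F1 can differ from the reported one in the fourth decimal place. The gains read as absolute differences in F1 (percentage points) rather than relative improvements.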
Tokenization outputs of four tokenizers: CLAMP, chemtok, oscar4, and umlsgenechem.

| Tokenizer | Chemical | Numeric values |
|---|---|---|
| (input text) | Preparation of 5-formyl-2-trifluoromethylbenzonitrile | 14.9 mg (58% yield); J = 8.42 Hz |
| CLAMP | Preparation of 5-formyl-2-trifluoromethylbenzonitrile | 14.9 mg (58% yield); J = 8.42 Hz |
| chemtok | Preparation of 5-formyl-2-trifluoromethylbenzonitrile | 14.9 mg (58% yield); J = 8.42 Hz |
| oscar4 | Preparation of 5-formyl-2-trifluoromethylbenzonitrile | 14.9 mg (58% yield); J = 8.42 Hz |
| umlsgenechem | Preparation of 5-formyl-2-trifluoromethylbenzonitrile | 14.9 mg 58 yield; J = 8.42 Hz |

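
To make the tokenization problem in this table concrete, here is a minimal sketch of a two-pass tokenizer for the two example strings. It is not CLAMP, chemtok, oscar4, or umlsgenechem, and it is not the authors' two-step tokenization, whose details are not given in this record. The regular expressions and the `tokenize` function are assumptions chosen only to show the goal: keep a systematic chemical name as one token while separating numbers, units, and punctuation.

```python
import re

# Splits a chunk into its core and any trailing sentence punctuation.
TRAILING = re.compile(r"^(.*?)([.,;:]*)$")

# Hypothetical heuristic for "looks like a systematic chemical name":
# only name-like characters, at least one run of three letters, and a digit.
CHEM_LIKE = re.compile(r"^(?=.*[A-Za-z]{3})(?=.*\d)[A-Za-z0-9()\[\],'-]+$")

# Fine-grained pass: decimal numbers, letter runs, or single symbols.
FINE = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|\S")


def tokenize(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():          # pass 1: split on whitespace
        core, trail = TRAILING.match(chunk).groups()
        if CHEM_LIKE.match(core):       # keep chemical-like names whole
            tokens.append(core)
        else:                           # pass 2: split numbers and symbols
            tokens.extend(FINE.findall(core))
        tokens.extend(trail)            # one token per trailing punctuation mark
    return tokens


chemical = tokenize("Preparation of 5-formyl-2-trifluoromethylbenzonitrile")
numeric = tokenize("14.9 mg (58% yield); J = 8.42 Hz")
```

With these rules the chemical name stays a single token, while `(58%` and `yield);` are split so that the number, the percent sign, and the brackets each become their own token. A heuristic this simple will misclassify many real patent strings, so it should be read as an illustration of the design trade-off rather than a usable tokenizer.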