Hyejin Cho1, Baeksoo Kim1, Wonjun Choi2, Doheon Lee3, Hyunju Lee4.
Abstract
Medicinal plants have demonstrated therapeutic potential across a wide range of observable characteristics of the human body, known as "phenotypes," and have been considered favorably in clinical treatment. With ever-increasing interest in plants, many researchers have attempted to extract meaningful information by identifying relationships between plants and phenotypes in the existing literature. Although natural language processing (NLP) aims to extract useful information from unstructured textual data, no appropriate corpus has been available to train and evaluate NLP models for plants and phenotypes. In the present study, we therefore present the plant-phenotype relationship (PPR) corpus, a high-quality resource that supports the development of various NLP fields; it includes information derived from 600 PubMed abstracts corresponding to 5,668 plant and 11,282 phenotype entities, and demonstrates a total of 9,709 relationships. We also describe benchmark results from named entity recognition and relation extraction systems to verify the quality of our data and to demonstrate the performance of NLP tasks on the PPR test set.
Year: 2022 PMID: 35618736 PMCID: PMC9135735 DOI: 10.1038/s41597-022-01350-1
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 The pipeline of the PPR corpus construction.
Overall corpus statistics.
| Set | Abstracts | Plant entities | POS phenotype entities | NEG phenotype entities | NEU phenotype entities | Total entities | Increase relations | Decrease relations | Association relations | Negative (no relation) | Total relations |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 400 | 3752 (1031) | 1702 (410) | 5605 (1974) | 361 (190) | 11420 (3605) | 971 (855) | 2808 (2424) | 52 (49) | 2746 (2211) | 6577 (5539) |
| Development | 100 | 934 (281) | 466 (166) | 1283 (614) | 54 (42) | 2737 (1103) | 265 (239) | 608 (519) | 13 (13) | 677 (511) | 1563 (1282) |
| Test | 100 | 968 (322) | 437 (152) | 1339 (650) | 36 (26) | 2780 (1150) | 242 (210) | 663 (577) | 2 (2) | 662 (525) | 1569 (1314) |
| Total | 600 | 5654 (1634) | 2605 (728) | 8227 (3238) | 451 (258) | 16937 (5858) | 1478 (1304) | 4079 (3520) | 67 (64) | 4085 (3247) | 9709 (8135) |
The PPR corpus consists of training, development, and test sets for plant and phenotype name recognition and relation extraction tasks.
Fig. 2 Example of the PPR corpus. The first line is a sentence obtained from the first sentence of an abstract (PubMed ID: 10072339), followed by annotated named entities and their relationships. The named entity information includes PubMed ID, start and end positions, annotated mention, and entity type. The relationship information consists of PubMed ID, relation type, and information related to the two entities.
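As a rough illustration of the annotation layout described in this caption, the sketch below models an entity record and a relation record in Python. The class names, field names, label strings, placeholder PubMed ID, and toy sentence are all assumptions made for illustration; they are not taken from the corpus files, whose exact format is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    pmid: str         # PubMed ID of the source abstract
    start: int        # start character offset (assumed 0-based)
    end: int          # end character offset (assumed exclusive)
    mention: str      # annotated surface text
    entity_type: str  # hypothetical label string, e.g. "Plant"


@dataclass(frozen=True)
class Relation:
    pmid: str
    relation_type: str  # e.g. Increase, Decrease, Association, Negative
    arg1: Entity
    arg2: Entity


def offsets_match(text: str, entity: Entity) -> bool:
    """Check that the stored offsets really select the annotated mention."""
    return text[entity.start:entity.end] == entity.mention


# Toy data invented for this sketch; NOT an actual corpus record.
text = "Ginger extract reduced nausea in patients."
plant = Entity("00000000", 0, 6, "Ginger", "Plant")
phenotype = Entity("00000000", 23, 29, "nausea", "Negative_phenotype")
relation = Relation("00000000", "Decrease", plant, phenotype)
```

A check such as `offsets_match` is a cheap way to catch off-by-one errors when loading span annotations, whatever the real offset convention turns out to be.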
Overall inter-annotator agreement (IAA) results of each phase.
| Annotation Phase | Plant (Strict) | Plant (Partial) | Phenotype (Strict) | Phenotype (Partial) | Relations (Simple index) | Relations (G-index) | Relations (Cohen’s κ) |
|---|---|---|---|---|---|---|---|
| Phase 1 | 93.6 | 96.4 | 51.5 | 70.0 | 90.8 | 89.7 | 86.5 |
| Phase 2 | 93.0 | 97.0 | 64.0 | 93.2 | 92.5 | 90.2 | 85.6 |
| Phase 3 | 91.7 | 94.6 | 64.8 | 77.6 | 93.1 | 91.8 | 87.2 |
| Phase 4 | 93.5 | 95.8 | 72.4 | 82.0 | 92.8 | 90.3 | 86.4 |
| Phase 5 | 92.8 | 95.2 | 78.0 | 86.0 | 91.9 | 90.6 | 87.5 |
| Phase 6 | 84.4 | 89.5 | 67.5 | 76.3 | 92.2 | 91.5 | 88.3 |
| Average | 91.5 | 94.8 | 66.4 | 80.9 | 92.2 | 90.7 | 86.9 |
The PPR corpus was annotated by two annotators at the mention and relation levels. IAA was therefore calculated independently at each level: strict and partial matching for plant and phenotype mentions, and three different measures (simple index, G-index, and Cohen’s κ) for relations, to assess the consistency of the corpus.
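The three relation-level measures can be sketched as follows. Cohen's κ is the standard chance-corrected statistic; the simple index is taken here to be the raw proportion of agreement, and the G-index is taken to be agreement corrected by a uniform chance level of 1/k for k categories. The latter two definitions are assumptions, since the text does not spell out the formulas, and the label lists are invented toy data.

```python
from collections import Counter


def simple_index(labels_a, labels_b):
    """Observed agreement: fraction of items given the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def g_index(labels_a, labels_b, n_categories):
    """Agreement corrected by a uniform chance level 1/k (assumed form)."""
    p_o = simple_index(labels_a, labels_b)
    p_c = 1.0 / n_categories
    return (p_o - p_c) / (1.0 - p_c)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected by chance estimated from each annotator's label rates."""
    p_o = simple_index(labels_a, labels_b)
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels from two hypothetical annotators (not corpus data).
annotator_a = ["Increase", "Decrease", "Decrease", "Negative"]
annotator_b = ["Increase", "Decrease", "Negative", "Negative"]
```

Because κ estimates chance agreement from the observed label distribution, it is usually lower than the simple index, which matches the ordering of the three relation columns in the table.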
Statistics of the biomedical NER corpora for the annotated entities.
| Corpus | Entity type | Text type | # Documents | # Mentions | IAA (%) | IAA metric |
|---|---|---|---|---|---|---|
| 2010 i2b2/VA | Clinical concept | Reports | 826 | 72,846 | — | — |
| ShARe/CLEF | Clinical concept | Reports | 298 | 11,167 | — | — |
| NCBI disease | Disease | Abstracts | 793 | 6,892 | 87.5 | Strict |
| BC5CDR | Disease | Abstracts | 1,500 | 12,850 | 87.5 | Strict |
| BC5CDR | Chemical | Abstracts | 1,500 | 15,935 | 96.1 | Strict |
| CHEMDNER | Drug/Chemical | Abstracts | 10,000 | 84,355 | 91.0 | Strict |
| BC2GM | Gene/Protein | Sentences | 20,000 | 24,596 | — | — |
| JNLPBA | Gene/Protein | Abstracts | 2,404 | 59,963 | — | — |
| LINNAEUS | Species | Full-text | 100 | 4,259 | 89.0 | Cohen’s kappa |
| Species-800 | Species | Abstracts | 800 | 3,708 | 80.0 | Cohen’s kappa |
| Plant | Plant | Abstracts | 208 | 3,985 | 98.5 | Strict |
| PPR | Plant | Abstracts | 600 | 5,654 | 91.5/94.8 | Strict/Partial |
| PPR | Phenotype | Abstracts | 600 | 11,283 | 66.4/80.9 | Strict/Partial |
Statistics of the biomedical RE corpora for the annotated relationships.
| Corpus | Relation type | Text type | # Documents | # Relations | IAA (%) | IAA metric |
|---|---|---|---|---|---|---|
| AIMed | Protein-Protein | Sentences | 1,955 | 5,834 | — | — |
| BioInfer | Protein-Protein | Sentences | 1,100 | 9,666 | — | — |
| BC5CDR | Chemical-Disease | Abstracts | 1,500 | 3,116 | — | — |
| CHEMPROT | Chemical-Protein | Abstracts | 5,031 | 10,031 | — | — |
| DDI | Drug-Drug (Drugbank) | Documents | 792 | 4,701 | 83.9 | Cohen’s kappa |
| DDI | Drug-Drug (PubMed) | Abstracts | 233 | 327 | 62.1 | Cohen’s kappa |
| EU-ADR | Drug-Disorder | Abstracts | 100 | 668 | 73.3 | (relaxed) Simple index |
| EU-ADR | Target-Disorder | Abstracts | 100 | 941 | 74.0 | (relaxed) Simple index |
| EU-ADR | Target-Drug | Abstracts | 100 | 827 | 75.7 | (relaxed) Simple index |
| GAD | Gene-Disease | Sentences | 5,330 | 5,330 | — | — |
| CoMAGC | Gene-Cancer | Abstracts | 408 | 821 | 75.7/56.8 | Cohen’s kappa |
| Plant-Disease | Plant-Disease | Abstracts | 199 | 1,309 | 86.9 | Cohen’s kappa |
| Plant-Chemical | Plant-Chemical | Sentences | 382 | 1,043 | 79.8 | Cohen’s kappa |
| PPR | Plant-Phenotype | Abstracts | 600 | 9,709 | 86.9 | Cohen’s kappa |
Fig. 3 BERT-based fine-tuning model architectures. The input sentence is “The tumor specific cytotoxicity of dihydronitidine from Toddalia asiatica Lam” (PubMed ID: 16465544). In this case, “tumor” is annotated as a negative phenotype, and “Toddalia asiatica Lam” is the plant mention. Panel (A) represents the BERT-based NER model, and panel (B) shows the BERT-based RE model.
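A common way to feed span annotations to a token-classification model such as the NER model in panel (A) is to convert them to BIO tags. The sketch below does this for the example sentence in the caption. The whitespace tokenization, the BIO scheme, and the label strings `Plant` and `Negative_phenotype` are assumptions for illustration; the source does not state which tagging scheme or tokenizer the authors used.

```python
import re


def bio_tags(sentence, spans):
    """Tag whitespace tokens with BIO labels.

    spans: list of (start, end, label) character offsets, end exclusive.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", sentence):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if it lies fully inside it.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


sentence = ("The tumor specific cytotoxicity of dihydronitidine "
            "from Toddalia asiatica Lam")


def span_of(mention, label):
    """Locate a mention in the sentence and return its character span."""
    start = sentence.index(mention)
    return (start, start + len(mention), label)


spans = [span_of("tumor", "Negative_phenotype"),
         span_of("Toddalia asiatica Lam", "Plant")]
tokens, tags = bio_tags(sentence, spans)
```

In practice a BERT tokenizer splits words into subword pieces, so the word-level tags would still need to be aligned to those pieces before fine-tuning.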
Evaluation of the fine-tuned BERT models for recognizing plant and phenotype mentions under two evaluation settings: the held-out test set (100 abstracts) and 5-fold cross-validation.
| Model | Test micro P | Test micro R | Test micro F1 | Test macro P | Test macro R | Test macro F1 | Test weighted P | Test weighted R | Test weighted F1 | 5-fold micro P | 5-fold micro R | 5-fold micro F1 | 5-fold macro P | 5-fold macro R | 5-fold macro F1 | 5-fold weighted P | 5-fold weighted R | 5-fold weighted F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 81.25 | 85.58 | 83.36 | 74.82 | 83.61 | 78.55 | 81.46 | 85.58 | 83.43 | 82.03 | 84.57 | 83.28 | 77.78 | 80.33 | 78.99 | 82.06 | 84.57 | 83.28 |
| BioBERT | 86.80 | |||||||||||||||||
| BlueBERT | 82.54 | 86.87 | 84.65 | 73.80 | 83.70 | 77.72 | 82.90 | 86.87 | 84.79 | 83.44 | 86.46 | 84.92 | 72.57 | 75.71 | 74.07 | 83.51 | 86.46 | 84.95 |
| SciBERT | 84.13 | 88.13 | 86.09 | 74.38 | 83.01 | 77.92 | 84.48 | 88.13 | 86.23 | 85.26 | 87.64 | 86.43 | 77.71 | 79.91 | 78.75 | 85.28 | 87.64 | 86.43 |
| PubMedBERT | 82.84 | 87.70 | 85.20 | 74.89 | 79.55 | 83.23 | 87.70 | 85.33 | 83.90 | 87.08 | 85.46 | 79.40 | 82.79 | 81.00 | 83.96 | 87.08 | 85.48 | |
Performance comparison of BERT-based models by target entity types: all four types, that is, plant plus the three phenotype subtypes (All); plant only (PLT); phenotype only (PHE); two types, plant and phenotype, regardless of phenotype subtype (PLT/PHE); and only the three phenotype subtypes, positive, negative, and neutral (POS/NEG/NEU).
| Model | Target | micro P | micro R | micro F1 | macro P | macro R | macro F1 | weighted P | weighted R | weighted F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | All | 81.25 | 85.58 | 83.36 | 74.82 | 83.61 | 78.55 | 81.46 | 85.58 | 83.43 |
| BERT | PLT | 82.36 | 84.40 | 83.37 | 82.36 | 84.40 | 83.37 | 82.36 | 84.40 | 83.37 |
| BERT | PHE | 81.89 | 84.60 | 83.22 | 81.89 | 84.60 | 83.22 | 81.89 | 84.60 | 83.22 |
| BERT | PLT/PHE | 81.59 | 85.90 | 83.69 | 81.70 | 86.44 | 84.00 | 81.58 | 85.90 | 83.68 |
| BERT | POS/NEG/NEU | 81.14 | 84.27 | 82.67 | 71.15 | 81.33 | 75.23 | 81.54 | 84.27 | 82.83 |
| BioBERT | All | 86.80 | | | | | | | | |
| BioBERT | PLT | | | | | | | | | |
| BioBERT | PHE | | | | | | | | | |
| BioBERT | PLT/PHE | | | | | | | | | |
| BioBERT | POS/NEG/NEU | | | | | | | | | |
| BlueBERT | All | 82.54 | 86.87 | 84.65 | 73.80 | 83.70 | 77.72 | 82.90 | 86.87 | 84.79 |
| BlueBERT | PLT | 83.98 | 89.36 | 86.59 | 83.98 | 89.36 | 86.59 | 83.98 | 89.36 | 86.59 |
| BlueBERT | PHE | 84.24 | 87.64 | 85.91 | 84.24 | 87.64 | 85.91 | 84.24 | 87.64 | 85.91 |
| BlueBERT | PLT/PHE | 83.55 | 87.30 | 85.38 | 83.82 | 87.83 | 85.78 | 83.54 | 87.30 | 85.38 |
| BlueBERT | POS/NEG/NEU | 83.02 | 86.09 | 84.53 | 71.84 | 80.60 | 75.41 | 83.44 | 86.09 | 84.70 |
| SciBERT | All | 84.13 | 88.13 | 86.09 | 74.38 | 83.01 | 77.92 | 84.48 | 88.13 | 86.23 |
| SciBERT | PLT | 85.91 | 89.46 | 87.65 | 85.91 | 89.46 | 87.65 | 85.91 | 89.46 | 87.65 |
| SciBERT | PHE | 84.51 | 88.25 | 86.34 | 84.51 | 88.25 | 86.34 | 84.51 | 88.25 | 86.34 |
| SciBERT | PLT/PHE | 84.79 | 89.21 | 86.94 | 84.78 | 89.72 | 87.18 | 84.79 | 89.21 | 86.93 |
| SciBERT | POS/NEG/NEU | 84.86 | 87.86 | 86.33 | 73.95 | 83.46 | 77.89 | 85.20 | 87.86 | 86.47 |
| PubMedBERT | All | 82.84 | 87.70 | 85.20 | 74.89 | 79.55 | 83.23 | 87.70 | 85.33 | |
| PubMedBERT | PLT | 83.14 | 88.64 | 85.80 | 83.14 | 88.64 | 85.80 | 83.14 | 88.64 | 85.80 |
| PubMedBERT | PHE | 83.45 | 86.53 | 84.96 | 83.45 | 86.53 | 84.96 | 83.45 | 86.53 | 84.96 |
| PubMedBERT | PLT/PHE | 83.12 | 87.70 | 85.35 | 83.12 | 87.99 | 85.48 | 83.12 | 87.70 | 85.35 |
| PubMedBERT | POS/NEG/NEU | 83.27 | 84.88 | 84.07 | 74.78 | 82.50 | 78.01 | 83.53 | 84.88 | 84.16 |
Evaluation of the fine-tuned BERT models for extracting relationships between plant and phenotype mentions under two evaluation settings: the held-out test set (100 abstracts) and 5-fold cross-validation.
| Model | Test micro P | Test micro R | Test micro F1 | Test macro P | Test macro R | Test macro F1 | Test weighted P | Test weighted R | Test weighted F1 | 5-fold micro P | 5-fold micro R | 5-fold micro F1 | 5-fold macro P | 5-fold macro R | 5-fold macro F1 | 5-fold weighted P | 5-fold weighted R | 5-fold weighted F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 83.88 | 83.88 | 83.88 | 62.70 | 61.46 | 61.97 | 84.16 | 83.88 | 83.88 | 85.97 | 85.97 | 85.97 | 67.60 | 65.93 | 66.20 | 86.08 | 85.97 | 85.84 |
| BioBERT | 87.83 | 87.83 | 87.83 | 65.41 | 65.57 | 65.44 | 87.83 | 87.83 | 87.75 | 87.16 | 87.16 | 87.16 | 68.38 | 66.69 | 67.05 | 87.24 | 87.16 | 86.99 |
| BlueBERT | 86.55 | 86.55 | 86.55 | 64.69 | 64.29 | 64.42 | 86.66 | 86.55 | 86.49 | 67.06 | 66.41 | 66.52 | 87.52 | 87.33 | ||||
| SciBERT | 83.88 | 83.88 | 83.88 | 62.50 | 61.59 | 61.97 | 84.07 | 83.88 | 83.86 | 86.19 | 86.19 | 86.19 | 67.56 | 65.87 | 66.26 | 86.34 | 86.19 | 86.05 |
| PubMedBERT | 87.48 | 87.48 | 87.48 | 87.48 | ||||||||||||||
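In the relation extraction results above, micro-averaged precision, recall, and F1 coincide wherever they are reported, and weighted recall equals micro recall. This is the expected pattern when every candidate entity pair receives exactly one label, as the following sketch shows on invented toy labels. The helper names and the toy data are assumptions; macro F1 is computed as the mean of per-class F1, which is one common convention and not necessarily the one the authors used.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def averaged_scores(gold, pred):
    """Return (micro, macro, weighted) triples of (P, R, F1)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in labels}
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = tuple(sum(per_class[c][i] for c in labels) / len(labels)
                  for i in range(3))
    support = Counter(gold)
    weighted = tuple(sum(per_class[c][i] * support[c] for c in labels) / len(gold)
                     for i in range(3))
    return micro, macro, weighted


# Toy single-label data (not corpus data): an imbalanced three-class problem.
gold = ["Decrease"] * 6 + ["Increase"] * 3 + ["Association"]
pred = ["Decrease"] * 5 + ["Increase"] * 3 + ["Decrease"] * 2
micro, macro, weighted = averaged_scores(gold, pred)
```

On this toy data the rare `Association` class is never predicted, which pulls the macro average well below the micro average; the same effect would explain why the macro scores in the table sit far below the micro and weighted ones.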
| Measurement(s) | Report from Literature |
| Technology Type(s) | manual curation |
| Sample Characteristic - Organism | Homo sapiens |