Yeawon Lee, Jinseok Son, Min Song.
Abstract
The relationships between biomedical entities are complex, and many of them have not yet been identified. For many areas of biomedical research, including drug discovery, it is of paramount importance to identify already-established relationships through a comprehensive literature survey. However, manually searching the literature is difficult as the volume of biomedical publications continues to grow. The relation classification task, which automatically mines meaningful relations from the literature, has therefore drawn attention in biomedical text mining. By applying relation classification techniques to the accumulated biomedical literature, known semantic relations between biomedical entities can be captured efficiently and used to infer previously unknown relationships. Because semantic relation classification is a form of supervised machine learning, developing such models requires a training dataset in which biomedical experts have manually annotated the semantic relations among biomedical entities. To be deployed in the real world and assist biologists in their research, any advanced model must be trained on a dataset of reliable quality and meaningful scale. In addition, as the number of such public datasets grows, they can serve as benchmarks against which the performance of machine learning algorithms is accurately measured and compared during model development and improvement. In this paper, we aim to build such a dataset. To validate its usability as training data for relation classification models and to improve relation extraction performance, we also built a relation classification model based on Bidirectional Encoder Representations from Transformers (BERT), trained on our dataset with our newly proposed fine-tuning methodology.
In experiments comparing several models based on different deep learning algorithms, our model with the proposed fine-tuning methodology achieved the best performance. The experimental results show that the constructed training dataset is an important information resource for the development and evaluation of semantic relation extraction models, and that relation extraction performance can be further improved by integrating our proposed fine-tuning methodology. This can therefore promote future text mining research in the biomedical field.
Keywords: Annotation method; BERT; Corpus construction; Deep learning; Fine-tuning; Relation extraction; Semantic relation classification
Year: 2022 PMID: 36068535 PMCID: PMC9446816 DOI: 10.1186/s12911-022-01977-5
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 3.298
Fig. 1 A Sentence within the corpus that was subject to annotation. B Sentence annotation result visualized with TextAE. Annotation tool based on TextAE: TextAE is a text annotation tool that can annotate named entities and relations in text. Each term (entity) can be dragged or double-clicked in Term Edit Mode, and the corresponding type can be selected to annotate it. If the corresponding entity type does not yet exist, a new type can be defined. Likewise, in Relation Edit Mode, a relation type can be selected or created and visualized. It is also possible to annotate multiple relations at once, where one entity is associated with several other entities. The annotation results can be downloaded locally as a JSON file
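TextAE reads and writes PubAnnotation-style JSON (entity spans as "denotations", relations linking them by ID). A minimal sketch of what one annotated sentence from such a corpus might look like; the sentence, spans, and IDs below are invented for illustration, and the exact schema should be checked against the downloaded files:

```python
import json

# Hypothetical annotation for one sentence, in the PubAnnotation-style
# JSON that TextAE exports. The sentence, offsets, and IDs are made up;
# the entity and relation type names follow this corpus's scheme.
annotation = {
    "text": "TNF-alpha increases IL-6 production in fibroblasts.",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 9}, "obj": "GENE"},
        {"id": "T2", "span": {"begin": 20, "end": 24}, "obj": "GENE"},
    ],
    "relations": [
        {"id": "R1", "pred": "Positive Increase", "subj": "T1", "obj": "T2"},
    ],
}

# Round-trip through JSON, as annotations are downloaded as .json files,
# then resolve each denotation ID back to its surface text via the spans.
restored = json.loads(json.dumps(annotation, indent=2))
entity_text = {d["id"]: restored["text"][d["span"]["begin"]:d["span"]["end"]]
               for d in restored["denotations"]}
```

Resolving spans back to surface strings like this is also a quick sanity check that the annotated offsets still match the stored text.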
Twelve entity types
| Entity type | Explanation |
|---|---|
| Genes, DNA, RNA, and Proteins | A gene is the functional unit of heredity and the nucleotide sequence of DNA or RNA that holds instructions for synthesizing either RNA or protein |
| Enzymes | Proteins that act as biological catalysts |
| Hormones | Signaling molecules that act distant from their site of production |
| Compounds | Additives such as drugs or chemicals |
| Molecular functions | Proteins with the role of binding such as hormones and antigen-antibodies |
| Phenotypes | Limited to human diseases |
| Biological processes | Processes/activities that occur within cells |
| Cells | Smallest functional units of an organism with which biological experiments are conducted to observe their mechanisms or to grow targeted compounds |
| Viruses | Used to indicate when experiments are conducted on a virus |
Eight types of relations
| Causality | Direction of causality | Expressions of quantity |
|---|---|---|
| Directed Link | Positive Cause | Positive Increase |
| | | Negative Decrease |
| | Negative Cause | Positive Decrease |
| | | Negative Increase |
| Undirected Link | – | – |
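The eight relation classes combine a causality link, a direction of causality, and a quantity expression. A sketch of the label hierarchy as read from the table above (the grouping of quantity expressions under each direction follows that reading and should be verified against the paper):

```python
# The eight relation classes, grouped by the hierarchy in the table above.
RELATION_SCHEME = {
    "Directed Link": {
        "Positive Cause": ["Positive Increase", "Negative Decrease"],
        "Negative Cause": ["Positive Decrease", "Negative Increase"],
    },
    "Undirected Link": {},
}

def all_classes(scheme):
    """Flatten the hierarchy into the eight class labels used for training."""
    labels = []
    for causality, directions in scheme.items():
        labels.append(causality)
        for direction, quantities in directions.items():
            labels.append(direction)
            labels.extend(quantities)
    return labels
```

Flattening yields exactly the eight labels that appear later in the per-class performance table.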
Fig. 2Annotation workflow
Evaluation metrics
| Metric | Description |
|---|---|
| Accuracy | Useful when the target classes are well balanced |
| Recall | The ability of a model to find all relevant cases within a dataset |
| Precision | The ability of a model to identify only the relevant data points |
| F1-score | Combination of Precision and Recall; used to penalize extreme values |
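These metrics can be computed from parallel lists of gold and predicted labels. A minimal per-class sketch:

```python
def per_class_prf(gold, pred, label):
    """Precision, recall, and F1 for one class, from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean: a single very low precision or recall drags F1 down,
    # which is the "punish extreme values" behaviour described above.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(gold, pred):
    """Fraction of instances labeled correctly, over all classes."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
```

Averaging the per-class values (macro or weighted by support) gives the aggregate scores reported in the comparison tables below.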
Fig. 3 Entity marker–entity start: input sentence with additional marker tokens
Fig. 4 Masked input: input sentence with entities replaced by [MASK]
Fig. 5 Two masked sentence input
Fig. 6 Two-sentence entity token input
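The input variants in Figs. 3–6 can be sketched as plain string transformations. The version below is a hypothetical simplification: the marker tokens [E1]/[/E1]/[E2]/[/E2] and the exact way the two segments are paired are assumptions rather than the paper's confirmed implementation, and naive `str.replace` only handles a single occurrence of each mention:

```python
def entity_marker_input(sentence, e1, e2):
    """Fig. 3: wrap both entity mentions in marker tokens."""
    return (sentence.replace(e1, f"[E1] {e1} [/E1]")
                    .replace(e2, f"[E2] {e2} [/E2]"))

def masked_input(sentence, e1, e2):
    """Fig. 4: replace both entity mentions with [MASK]."""
    return sentence.replace(e1, "[MASK]").replace(e2, "[MASK]")

def two_masked_sentence_input(sentence, e1, e2):
    """Fig. 5: pair the original sentence with its masked version."""
    return f"{sentence} [SEP] {masked_input(sentence, e1, e2)}"

def two_sentence_entity_token_input(sentence, e1, e2):
    """Fig. 6: pair the sentence with the two entity mentions themselves."""
    return f"{sentence} [SEP] {e1} [SEP] {e2}"
```

In practice the marker and [MASK] tokens would be added to the tokenizer vocabulary so the encoder treats them as single special tokens rather than subword sequences.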
Hyperparameters
| Hyperparameter | Value |
|---|---|
| num_train_epochs | 10 |
| learning_rate | 5e-5 |
| per_device_train_batch_size | 16 |
| per_device_eval_batch_size | 64 |
| warmup_ratio | 0.1 |
| weight_decay | 0.01 |
| adam_beta1 | 0.9 |
| adam_beta2 | 0.999 |
| adam_epsilon | 1e-8 |
| max_grad_norm | 1 |
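The names in the table map one-to-one onto field names of Hugging Face's `TrainingArguments`. A plain-dict sketch with the values copied from the table (`transformers` itself is not needed to inspect it):

```python
# Fine-tuning hyperparameters from the table above, keyed by the
# Hugging Face TrainingArguments field names they correspond to.
TRAINING_ARGS = {
    "num_train_epochs": 10,
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 64,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "max_grad_norm": 1.0,
}

# e.g. args = transformers.TrainingArguments(output_dir="out", **TRAINING_ARGS)
```

These are the standard BERT fine-tuning defaults (AdamW betas/epsilon, linear warmup over 10% of steps, gradient clipping at 1.0) apart from the task-specific epoch count and batch sizes.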
Fig. 7 Overview of the study
Dataset overview (A entity types; B relation types)
| Entity type | Train (Left) | Train (Right) | Validate (Left) | Validate (Right) | Test (Left) | Test (Right) | Total |
|---|---|---|---|---|---|---|---|
| A | | | | | | | |
| BIOLOGICAL PROCESS | 181 | 825 | 61 | 301 | 70 | 270 | 1,708 |
| CELL | 60 | 25 | 15 | 12 | 16 | 12 | 140 |
| COMPOUND | 769 | 225 | 232 | 75 | 266 | 72 | 1,639 |
| DNA | 4 | 1 | 4 | 0 | 0 | 0 | 9 |
| ENZYME | 67 | 34 | 28 | 8 | 20 | 17 | 174 |
| GENE | 1,017 | 598 | 347 | 180 | 328 | 185 | 2,655 |
| HORMONE | 25 | 10 | 6 | 3 | 7 | 4 | 55 |
| MOLECULAR FUNCTION | 59 | 71 | 24 | 26 | 21 | 20 | 221 |
| PHENOTYPE | 494 | 1,053 | 168 | 355 | 160 | 368 | 2,598 |
| PROTEIN | 241 | 142 | 84 | 33 | 84 | 46 | 630 |
| RNA | 91 | 33 | 34 | 13 | 33 | 13 | 217 |
| VIRUS | 10 | 1 | 3 | 0 | 2 | 0 | 16 |
| Total | 3,018 | 3,018 | 1,006 | 1,006 | 1,007 | 1,007 | 10,062 |
Performance comparison of pre-trained language models
| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| BERT [ | *0.849 **(0.003) | 0.817 (0.010) | 0.822 (0.019) | 0.818 (0.011) |
| BioBERT [ | 0.861 (0.008) | 0.835 (0.017) | 0.846 (0.015) | 0.833 (0.020) |
| PubMedBERT [ | | | | |
| RoBERTa [ | 0.862 (0.009) | 0.835 (0.018) | 0.837 (0.009) | 0.835 (0.010) |
| SciBERT [ | 0.862 (0.010) | 0.843 (0.013) | 0.838 (0.013) | |
The best scores are in bold
*Mean
**Standard deviation
a bert-base-uncased. Accessed July 20, 2022. Available from: https://huggingface.co/bert-base-uncased
b biobert-base-cased-v1.2. Accessed July 20, 2022. Available from: https://huggingface.co/dmis-lab/biobert-base-cased-v1.2
c BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext. Accessed July 20, 2022. Available from: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
d roberta-base. Accessed July 20, 2022. Available from: https://huggingface.co/roberta-base
e scibert_scivocab_uncased. Accessed July 20, 2022. Available from: https://huggingface.co/allenai/scibert_scivocab_uncased
Performance comparison of masking input methods
| Method | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Default | *0.700 **(0.018) | 0.646 (0.019) | 0.647 (0.016) | 0.642 (0.013) |
| Entity Marker–Entity Start | 0.857 (0.012) | 0.825 (0.029) | 0.836 (0.007) | 0.828 (0.015) |
| Masked Input | 0.844 (0.013) | 0.811 (0.023) | 0.826 (0.011) | 0.817 (0.012) |
| Two Masked Sentence Input | 0.837 (0.018) | | | |
| Two Sentence Entity Token Input | 0.865 (0.012) | 0.845 (0.014) | | |
The best scores are in bold
*Mean
**Standard deviation
Performance comparison of downstream layers
| Layer architecture | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| CLS token layer | *0.858 **(0.016) | 0.823 (0.026) | 0.835 (0.015) | 0.827 (0.017) |
| Three-token layer | 0.861 (0.013) | 0.835 (0.027) | 0.842 (0.009) | 0.836 (0.016) |
The best scores are in bold
*Mean
**Standard deviation
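A framework-free sketch of the pooling step behind the two head variants compared above, assuming the "three tokens" are [CLS] plus the start-marker token of each entity (an assumption; the paper's exact token choice may differ). Each hidden state is represented as a plain list of floats:

```python
def pool_cls_only(hidden_states, cls_pos=0):
    """CLS-token head input: the [CLS] vector alone."""
    return list(hidden_states[cls_pos])

def pool_three_tokens(hidden_states, cls_pos, e1_start, e2_start):
    """Three-token head input: concatenate [CLS] with the two
    entity-start token vectors, tripling the feature size that the
    downstream classification layer receives."""
    return (list(hidden_states[cls_pos])
            + list(hidden_states[e1_start])
            + list(hidden_states[e2_start]))

# Toy "hidden states": 5 tokens, hidden size 4, with recognizable values.
H = [[float(t * 10 + d) for d in range(4)] for t in range(5)]
features = pool_three_tokens(H, cls_pos=0, e1_start=1, e2_start=3)
```

The downstream classifier is then just a linear layer over these concatenated features, so switching heads only changes the pooled input width, not the encoder.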
Performance comparison against models from related works
| Model | F1 |
|---|---|
| Word2vec + CNN [ | 0.708 |
| Entity Attention Bi-LSTM [ | 0.787 |
| Matching the Blanks [ | 0.799 |
The best scores are in bold
Fig. 8 Best model: PubMedBERT-based model with the two masked sentence input method and the two-token layer applied
Per-class performance
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Directed Link | 0.949 | 0.851 | 0.898 | 175 |
| Negative Cause | 0.902 | 0.888 | 0.895 | 156 |
| Negative Decrease | 0.909 | 0.918 | 0.914 | 74 |
| Negative Increase | 0.837 | 0.891 | 0.863 | 48 |
| Positive Cause | 0.907 | 0.895 | 0.901 | 206 |
| Positive Decrease | 0.714 | 0.741 | 0.727 | 26 |
| Positive Increase | 0.759 | 0.732 | 0.746 | 47 |
| Undirected Link | 0.840 | 0.906 | 0.872 | 275 |
| Overall | | | 0.852 | 1007 |
Support: number of instances per class in the test data, which represents 20% of the full dataset with class proportions preserved