| Literature DB >> 34827589 |
Prashant Srivastava1, Saptarshi Bej1,2, Kristina Yordanova1, Olaf Wolkenhauer1,2.
Abstract
For any molecule, network, or process of interest, keeping up with new publications is becoming increasingly difficult. For many cellular processes, the number of molecules and interactions that need to be considered can be very large. Automated mining of publications can support large-scale molecular interaction maps and database curation. Text mining and Natural-Language-Processing (NLP)-based techniques are finding applications in mining the biological literature, handling problems such as Named Entity Recognition (NER) and Relationship Extraction (RE). Both rule-based and Machine-Learning (ML)-based NLP approaches have been popular in this context, with multiple research and review articles examining the scope of such models in Biological Literature Mining (BLM). In this review article, we explore self-attention-based models, a special type of Neural-Network (NN)-based architecture that has recently revitalized the field of NLP, applied to biological texts. We cover self-attention models operating at either the sentence level or the abstract level, in the context of molecular interaction extraction, published from 2019 onwards. We conducted a comparative study of the models in terms of their architecture. Moreover, we also discuss limitations in the field of BLM that identify opportunities for the extraction of molecular interactions from biological text.
Keywords: biological literature mining; natural language processing; relationship extraction; self-attention models; text mining
Year: 2021 PMID: 34827589 PMCID: PMC8615611 DOI: 10.3390/biom11111591
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Workflow for Biological Literature Mining (BLM): texts collected from different sources are processed into structured data for modeling, and an NLP model such as BioBERT is chosen from the many available options to perform BLM tasks such as NER and RE, extracting information from the text.
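As an illustration of the first step of this workflow, the following minimal sketch runs a token-classification (NER) pass over an example abstract using the Hugging Face `transformers` library. The checkpoint path is a placeholder for any BioBERT model fine-tuned for biomedical NER; it is not a model used by the reviewed studies.

```python
# Minimal BLM sketch: NER over an example abstract with a BioBERT-style checkpoint.
# Assumes the Hugging Face `transformers` library; the checkpoint path below is a
# placeholder and must point to a model fine-tuned for biomedical NER.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/biobert-ner-checkpoint",  # hypothetical checkpoint path
    aggregation_strategy="simple",           # merge word pieces into whole entities
)

abstract = (
    "TP53 interacts with MDM2, which promotes its ubiquitination and degradation."
)

# Each prediction contains the surface form, entity label, confidence, and offsets.
for entity in ner(abstract):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```

The extracted entity mentions would then feed the downstream RE step of the workflow shown in Figure 1.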
Key abbreviations for text mining. Rows with text mining problem names are marked in blue; model names are marked in red; biological interaction database names are marked in green.
| Abbreviation | Full Name | Description |
|---|---|---|
| BLM | Biological Literature Mining | Mining information from biological literature/publications |
| NLP | Natural Language Processing | Ability of a computer program to understand human language |
| RE | Relationship Extraction | Extracting related entities and the relationship type from biological texts |
| NER | Named Entity Recognition | NLP-based approaches to identify context-specific entity names from text |
| CNN | Convolutional Neural Network | A type of neural network popularly used in computer vision |
| RNN | Recurrent Neural Network | One of the neural network models designed to handle sequential data |
| LSTM | Long Short-Term Memory | A successor of RNN useful for handling sequential data |
| GRUs | Gated Recurrent Units | A successor of RNN useful for handling sequential data |
| BERT | Bidirectional Encoder Representations from Transformers | A pretrained neural network popularly used for NLP tasks |
| KAN | Knowledge-aware Attention Network | A self-attention-based network for RE problems |
| PPI | Protein–Protein Interaction | Interactions among proteins, a popular problem in RE |
| DDI | Drug–Drug Interaction | Interactions among drugs, a popular problem in RE |
| ChemProt | Chemical–Protein Interaction | Interactions among chemicals and proteins, a popular problem in RE |
Figure 2. The evolution of sequence-to-sequence models for relationship extraction. Some pros and cons of the models are marked in green and red, respectively.
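To make the starting point of this evolution concrete, here is a minimal sketch of a pre-transformer approach: a bidirectional LSTM sentence encoder followed by a linear relation classifier. This is written in PyTorch, and the layer sizes, vocabulary size, and number of relation classes are illustrative choices, not taken from any reviewed study.

```python
# Minimal sketch of a pre-transformer RE model: a bidirectional LSTM sentence
# encoder followed by a linear classifier. All sizes are illustrative only.
import torch
import torch.nn as nn

class LSTMRelationClassifier(nn.Module):
    def __init__(self, vocab_size: int, num_relations: int,
                 embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # The bidirectional LSTM reads the token sequence in both directions.
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.encoder(embedded)     # hidden: (2, batch, hidden_dim)
        # Concatenate the final forward and backward states as the sentence vector.
        sentence = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(sentence)            # relation logits

# Toy usage: a batch of 2 sentences with 10 token ids each, 5 relation classes.
model = LSTMRelationClassifier(vocab_size=30000, num_relations=5)
logits = model(torch.randint(1, 30000, (2, 10)))
print(logits.shape)  # torch.Size([2, 5])
```

The fixed-size sentence vector is the bottleneck that attention-based models, discussed next, were designed to relieve.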
Figure 3. Left: A schematic of the transformer model with its attention-based encoder–decoder architecture. The encoder’s output is passed into the decoder, where it serves as the keys and values for the decoder’s second (cross-)attention layer. The symbol next to the transformer blocks in the encoder and the decoder represents N stacked layers of the transformer block. Right: An example heat map of the attention mechanism. The heat map shows pairwise attention weights between tokens in a sentence for a trained model. A hotter hue for a block corresponds to higher attention between the token in that block’s row and the token in its column.
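The weights visualized in such a heat map come from scaled dot-product attention, the core operation of the transformer. A minimal NumPy sketch follows; the sequence length, head dimension, and random inputs are illustrative.

```python
# Minimal sketch of scaled dot-product attention, the operation behind the
# pairwise weights shown in the Figure 3 heat map. Shapes are illustrative.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    # Row-wise softmax gives the attention weights (the heat-map values).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                      # context vectors, weights

rng = np.random.default_rng(0)
tokens, d_k = 6, 8                                   # e.g. 6 tokens in a sentence
Q = rng.normal(size=(tokens, d_k))
K = rng.normal(size=(tokens, d_k))
V = rng.normal(size=(tokens, d_k))

context, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))  # each row sums to 1; plotting this matrix reproduces a heat map like Figure 3
```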
Table summarizing several aspects of the compared studies. Several publications investigated different variants of their proposed models; we report the performance of only the best model among them. A minimal code sketch of the sentence-classification framing shared by these models follows the table.
| Work | Datasets | Model | Tasks Performed | Performance |
|---|---|---|---|---|
| Elangovan et al. (2020) | Processed version of the IntAct dataset with seven types of interactions | Ensemble of fine-tuned BioBERT models; no external knowledge used | Typed and untyped RE with relationship types such as phosphorylation, acetylation, etc. | Typed PPI: 0.540; untyped PPI: 0.717; metric: F1 score |
| Giles et al. (2020) | Manually curated from the MEDLINE database | Fine-tuned BioBERT model; used STRING database knowledge during dataset curation | Classification problem with classes coincidental mention, positive, negative, incorrect entity recognition, and unclear | Curated data and BioBERT: 0.889; metric: F1 score |
| Su et al. (2020) | Processed versions of the BioGRID, DrugBank, and IntAct datasets | Fine-tuned BERT model integrated with LSTM and additive attention; no external knowledge used | Classification tasks on PPI (binary), DDI (multiclass), and ChemProt (multiclass) | PPI: 0.828; DDI: 0.807; ChemProt: 0.768; metric: F1 score |
| Su et al. (2021) | Processed versions of the BioGRID, DrugBank, and IntAct datasets | Contrastive learning model; no external knowledge used in dataset curation or as part of the model | Classification tasks on PPI (binary), DDI (multiclass), and ChemProt (multiclass) | PPI: 0.827; DDI: 0.829; ChemProt: 0.787; metric: F1 score |
| Wang et al. (2020) | Processed versions of the BioCreative VI PPI dataset | Multitask architecture based on BERT, BioBERT, BiLSTM, and text CNN; no external knowledge used | Document triage classification, NER (auxiliary tasks), and PPI RE (main task) | NER task: 0.936; PPI RE (exact match evaluation): 0.431; metric: F1 score |
| Zhou et al. (2019) | Processed versions of the BioCreative VI PPI dataset | KAN; TransE used to integrate prior knowledge on triplets from the BioGRID and IntAct datasets into the model | PPI RE classification task from BioCreative VI | PPI RE (exact match evaluation): 0.382; PPI RE (HomoloGene evaluation): 0.404; metric: F1 score |
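The BERT-based studies compared above largely frame sentence-level RE as sequence classification over sentences in which the candidate entity pair is marked. A minimal sketch of this framing, assuming the Hugging Face `transformers` library; the checkpoint path, entity markers, and label set are illustrative assumptions, not taken from any of the papers.

```python
# Minimal sketch of sentence-level RE as sequence classification, the framing
# shared by the BERT-based studies in the table. The checkpoint path, entity
# markers, and label set are illustrative assumptions, not from the papers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/biobert-re-checkpoint"   # hypothetical fine-tuned model
labels = ["no_interaction", "phosphorylation", "acetylation"]  # illustrative label set

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(labels)
)

# The candidate entity pair is marked inline so the model can attend to it.
sentence = "[E1] MAPK1 [/E1] phosphorylates [E2] ELK1 [/E2] in response to EGF."

inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[int(logits.argmax(dim=-1))])      # predicted relationship type
```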
Figure 4. Architecture of the Knowledge-aware Attention Network (KAN). The symbol next to the marked blocks represents N copies of the respective blocks, where N can be defined by the modeler. The architecture has two parallel input channels, taking information on the source and target entities as input along with the relevant text. External knowledge is integrated into the model in the form of entity-specific representations and a representation of their known relationship.
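The TransE component referenced in the table embeds knowledge-base triplets so that head + relation ≈ tail for plausible facts; these embeddings are what supplies the external knowledge to KAN. The sketch below shows only this scoring idea; the embedding dimension and the toy vectors are illustrative, and this is not the KAN implementation.

```python
# Minimal sketch of the TransE idea used to embed knowledge-base triplets:
# a triplet (head, relation, tail) is plausible when head + relation ≈ tail.
# Dimensions and the toy vectors are illustrative; this is not the KAN code.
import torch

dim = 50
head = torch.randn(dim)        # embedding of the source entity
relation = torch.randn(dim)    # embedding of the relationship type
tail = torch.randn(dim)        # embedding of the target entity

# TransE energy: L2 distance between (head + relation) and tail.
# Lower energy means the knowledge base supports the triplet; training pushes
# true triplets to lower energy than corrupted ones via a margin loss.
energy = torch.norm(head + relation - tail, p=2)
print(float(energy))
```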