Helena Balabin, Charles Tapley Hoyt, Colin Birkenbihl, Benjamin M Gyori, John Bachman, Alpha Tom Kodamullil, Paul G Plöger, Martin Hofmann-Apitius, Daniel Domingo-Fernández.
Abstract
MOTIVATION: The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited.
Year: 2022 PMID: 34986221 PMCID: PMC8896635 DOI: 10.1093/bioinformatics/btac001
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Methodology workflow. This figure illustrates the classification of the context annotation for a given text-triple pair. In this example, the models aim to predict the species in which a certain biological process was observed (e.g. mice). (A) The three models (i.e. the two baselines and the proposed STonKGs model) are trained and evaluated in a shared experimental setting. (B) For each text evidence and triple pair, the two baseline models exclusively use a single modality, whereas STonKGs leverages both modalities.
Fig. 2. Transforming KG embeddings into sequential inputs. For a given triple, we generate the final random walk-based embedding representation in the following steps: (i) obtain the random walks based on the pre-trained node2vec model; (ii) embed each node in those random walks, resulting in two random walk-based embedding sequences; (iii) generate the final embedding sequence.
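A rough sketch of the three steps, assuming a toy graph, uniform (rather than node2vec-biased) walks, and a random lookup table in place of the pre-trained node2vec vectors; the edges, walk length and embedding dimension are all illustrative:

```python
import random
import numpy as np

# Toy KG as an adjacency list; the edges here are made up for the sketch.
graph = {
    "HSP70": ["ENPP1", "CREB"],
    "ENPP1": ["HSP70"],
    "CREB": ["HSP70", "MSK1"],
    "MSK1": ["CREB"],
}

def random_walk(start, length, rng):
    """(i) A uniform random walk; node2vec uses biased walks instead."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

# Stand-in for the pre-trained node2vec model: one fixed random
# 8-dimensional vector per node.
dim = 8
embeddings = {node: np.random.default_rng(i).normal(size=dim)
              for i, node in enumerate(graph)}

# (ii) Embed every node visited on the subject and object walks.
rng = random.Random(0)
subj_walk = random_walk("HSP70", 4, rng)
obj_walk = random_walk("ENPP1", 4, rng)
subj_seq = np.stack([embeddings[n] for n in subj_walk])
obj_seq = np.stack([embeddings[n] for n in obj_walk])

# (iii) Concatenate the two walk embeddings into the final sequence.
triple_seq = np.concatenate([subj_seq, obj_seq])
print(triple_seq.shape)  # (8, 8): 4 + 4 nodes, each an 8-dim vector
```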
Fig. 3. Cross-modal attention between text data (token sequences) and KG data (triple sequences). The input is a concatenation of a token and a triple sequence. Each element in the initial input sequence consists of its respective BioBERT embedding. The resulting hidden states are processed by two different heads for text tokens and KG nodes, respectively. While the MLM head returns probabilities for each token of the NLP backbone, the MEM head maps the hidden states to probabilities for each node of the KG backbone.
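A minimal numpy sketch of one attention head over the concatenated sequence, followed by the two output heads; the dimensions, sequence lengths, vocabulary sizes and random weights are illustrative assumptions, not the actual STonKGs architecture or parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
d, n_text, n_kg = 16, 5, 4   # hidden size and sequence lengths (illustrative)

# Concatenate BioBERT-style text-token embeddings and KG-node embeddings
# into a single input sequence, as described in the caption.
x = rng.normal(size=(n_text + n_kg, d))

# One self-attention head over the joint sequence: every position can
# attend to both modalities, which is what makes the attention cross-modal.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
hidden = attn @ v

# Separate output heads: the MLM head scores text positions against a toy
# token vocabulary, and the MEM head scores KG positions against a toy
# node vocabulary; both vocabulary sizes are made up for the sketch.
mlm_logits = hidden[:n_text] @ rng.normal(size=(d, 100))
mem_logits = hidden[n_text:] @ rng.normal(size=(d, 50))
print(mlm_logits.shape, mem_logits.shape)  # (5, 100) (4, 50)
```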
Overview of the fine-tuning classification tasks
| Task | Description | Number of classes | Classes | Example |
|---|---|---|---|---|
| (1) Polarity | Directionality of the effect of the source node on the target node | Binary | Increase and decrease | ‘HSP70 […] increases ENPP1 transcript and protein levels’ (PMID: 19083193) |
| (2) Interaction type | Whether the interaction between the source and the target node is known to be physical | Binary | Direct and indirect interaction | ‘SHP repressed […] transcription of PEPCK through direct interaction with C/EBPalpha protein’ (PMID: 17094771) |
| (3) Cell line | Cell line in which the given relation has been described | 10 | HEK293, DMS114, HeLa, NIH-3T3, HepG2, MCF7, COS-1, THP-1, LNCAP and U-937 | ‘We show that upon stimulation of HeLa cells by CXCL12, CXCR4 becomes tyrosine phosphorylated’ (PMID: 15819887) |
| (4) Disease | Disease context in which the particular relation occurs | 10 | Neuroblastoma, breast cancer, lung cancer, atherosclerosis, multiple myeloma, leukemia, melanoma, osteosarcoma, lung non-small cell carcinoma | ‘[…] nicotine […] activates the MAPK signaling pathway in lung cancer’ (PMID: 14729617) |
| (5) Location | Cellular location in which the particular relation occurs | 5 | Cell nucleus, extracellular space, cell membrane, cytoplasm and extracellular matrix | ‘The activated MSK1 translocates to the nucleus and activates CREB […].’ (PMID: 9687510) |
| (6) Species | Species in which the particular relation has been described | 3 | Human, mouse and rat | ‘Mutation of putative GRK phosphorylation sites in the cannabinoid receptor 1 (CB1R) confers resistance to cannabinoid tolerance and hypersensitivity to cannabinoids in mice’ (PMID: 24719095) |
| (7) Binary | Whether the extracted triple correctly corresponds to the text | Binary | Correct and incorrect | Examples are available in INDRA’s curation guidelines |
| (8) Multiclass | Whether the extracted triple correctly corresponds to the text (including all error types) | 8 | Correct, no relation, wrong relation, grounding, polarity, act versus amt, entity boundaries, hypothesis | |
Note: While the two binary tasks (i.e. the polarity and interaction-type tasks) evaluate the models’ ability to classify the relation type of a triple, the next four tasks deal with the classification of the different types of context in which a given triple can appear. Finally, the last two tasks aim at predicting whether the triple has been correctly extracted from the text evidence.
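For concreteness, a single fine-tuning example for the location task could be represented as below; the field names and the relation label "activates" are hypothetical illustrations, while the evidence sentence, entities, PMID and class set come from the table above.

```python
# One fine-tuning example pairs a text evidence with its extracted triple
# and a task-specific context label; this schema is illustrative, not the
# actual STonKGs data format.
example = {
    "evidence": "The activated MSK1 translocates to the nucleus and "
                "activates CREB.",
    "triple": ("MSK1", "activates", "CREB"),  # relation label assumed
    "pmid": 9687510,
    "task": "location",        # one of the eight fine-tuning tasks
    "label": "cell nucleus",   # one of this task's five classes
}
print(example["task"], "->", example["label"])
```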
Benchmark comparison of the baseline models and ablation variants of STonKGs on the chosen classification tasks
| Model | (1) Polarity | (2) Interaction type | (3) Cell line | (4) Disease | (5) Location | (6) Species | (7) Binary | (8) Multiclass |
|---|---|---|---|---|---|---|---|---|
| NLP baseline | — | 0.991 | 0.238 | 0.214 | 0.397 | — | 0.911 | 0.881 |
| KG baseline | 0.448 | 0.945 | 0.020 | 0.030 | 0.295 | 0.670 | 0.708 | 0.446 |
| — | N/A | N/A | 0.046 | 0.081 | 0.320 | 0.534 | 0.485 | 0.195 |
| — | 0.930 | 0.995 | 0.252 | — | 0.405 | 0.860 | 0.977 | 0.964 |
| — | 0.931 | 0.995 | 0.256 | 0.240 | 0.404 | 0.860 | — | 0.963 |
| — | 0.918 | 0.992 | — | 0.236 | 0.401 | 0.857 | 0.977 | 0.960 |
| — | N/A | N/A | 0.238 | 0.216 | — | 0.857 | — | — |
| Absolute gain | −0.009 | +0.004 | +0.023 | +0.034 | +0.009 | −0.005 | +0.067 | +0.084 |
| Relative gain | −0.96% | +0.40% | +8.81% | +15.89% | +2.27% | −0.58% | +7.35% | +9.53% |
Note: Performance is measured as the average F1-score across the five cross-validation splits. Tasks (1)–(2) are relation-type classification, (3)–(6) context-annotation classification and (7)–(8) correct/incorrect classification. The absolute performance gains are calculated as the difference between the best STonKGs variant and the best baseline (i.e. the NLP baseline), while the relative performance gains are obtained by dividing that difference by the F1-score of the best baseline and expressing the value as a percentage: relative gain = (F1 of best STonKGs variant − F1 of best baseline) / F1 of best baseline × 100.
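The gain computation described in the note, worked through for the binary correct/incorrect task: the table reports a baseline F1 of 0.911 and an absolute gain of +0.067, which together imply a best STonKGs F1 of 0.978.

```python
# Gain computation from the table note: the absolute gain is the
# difference between the best STonKGs variant and the best (NLP)
# baseline; the relative gain divides that difference by the baseline F1.
def gains(f1_stonkgs, f1_baseline):
    absolute = f1_stonkgs - f1_baseline
    relative = 100 * absolute / f1_baseline
    return absolute, relative

# Binary correct/incorrect task: baseline F1 = 0.911 (reported); the
# STonKGs F1 of 0.978 is implied by the reported +0.067 absolute gain.
absolute, relative = gains(0.978, 0.911)
print(f"{absolute:+.3f}, {relative:+.2f}%")  # +0.067, +7.35%
```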