Leon Weber1,2, Kirsten Thobe2, Oscar Arturo Migueles Lozano2, Jana Wolf2, Ulf Leser1.
Abstract
MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data, which severely limits their performance.
Year: 2020 PMID: 32657389 PMCID: PMC7355289 DOI: 10.1093/bioinformatics/btaa430
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (a) Overview of PEDL for the two tasks of relation prediction and evidence prediction. In this example, the model predicts relations for the protein pair BTC and ErbB4 given three text spans containing both proteins. First, the BERT component produces a score matrix containing a prediction for each text span and relation type. The relation predictions are then generated by applying log-sum-exp (LSE) column-wise to approximate the maximum score for a given PPA type across all spans. The evidence predictions are obtained by taking the row-wise maximum, which is the highest score assigned to a text span regardless of PPA type. (b) The generation of one row of the score matrix s. In each of BERT’s 12 transformer layers, each token receives a 768-dimensional embedding (u for the first and z for the last layer). The embedding of the prepended [CLS] token is used to summarize the text span in the single vector h, which is then transformed to one row of the score matrix by the output layer (W, b).
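The two aggregation steps described in the caption of Fig. 1 can be illustrated with a minimal sketch. The score values below are invented for illustration, and the function names `logsumexp` and `aggregate` are hypothetical; the caption does not specify any further processing of the aggregated scores, so none is shown.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))): a smooth upper bound on max(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def aggregate(score_matrix):
    """score_matrix[i][j] is the score of text span i for PPA type j."""
    n_types = len(score_matrix[0])
    # Relation prediction: LSE over each column, i.e. over all spans of the pair.
    relation_scores = [
        logsumexp([row[j] for row in score_matrix]) for j in range(n_types)
    ]
    # Evidence prediction: maximum over each row, i.e. over all PPA types.
    evidence_scores = [max(row) for row in score_matrix]
    return relation_scores, evidence_scores


# Illustrative scores for three text spans and five PPA types (made-up numbers).
scores = [
    [-3.0, -2.0, 2.0, -4.0, -1.0],
    [-2.5, -3.0, 0.5, -4.0, -2.0],
    [-4.0, -3.5, -2.5, -5.0, -3.0],
]
relation_scores, evidence_scores = aggregate(scores)
print(relation_scores)
print(evidence_scores)
```

LSE never falls below the column maximum and exceeds it by at most the logarithm of the number of spans, which is why it serves as a differentiable stand-in for the maximum.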
Table 1. Statistics of the datasets BioNLP 2011, BioNLP 2013 and PID
| Dataset | Relations: expr. | Relations: phosph. | Relations: state | Relations: transport | Relations: complex | Relations: total | Pairs: pos. | Pairs: neg. | Texts (avg.): pos. | Texts (avg.): neg. |
|---|---|---|---|---|---|---|---|---|---|---|
| BioNLP 2011 | 245 | 44 | 136 | 38 | 278 | 741 | 615 | 1845 | 19.69 | 4.97 |
| BioNLP 2013 | 179 | 104 | 160 | 43 | 441 | 927 | 730 | 2190 | 17.44 | 4.85 |
| PID | 2376 | 2714 | 8425 | 1020 | 5799 | 20 622 | 16 369 | 54 261 | 53.60 | 16.32 |
Note: Relations gives the total number of protein pairs for the five considered relations controls-expression-of (expr.), controls-phosphorylation-of (phosph.), controls-state-change-of (state), controls-transport-of (transport) and in-complex-with (complex). Pairs denote the total number of protein pairs with at least one relation (pos.) and without any relation (neg.). Texts states the average number of text spans per protein pair for pairs with at least one relation (pos.) and without any relation (neg.).
Table 2. Results on the two BioNLP datasets (E2 and E3)
| | BioNLP ’11: r-AP | BioNLP ’11: e-mAP | BioNLP ’13: r-AP | BioNLP ’13: e-mAP |
|---|---|---|---|---|
| comb-dist | | | | |
| − direct | | | | |
| PEDL | | | | |
| − direct | | | | |
Note: r-AP is the AP for relation prediction and e-mAP the mAP for evidence prediction. All results are averages of five runs with different random seeds, with standard deviations given in brackets. ‘- direct’ shows scores without directly supervised data. The best scores are displayed in bold.
Table 3. APs for relation prediction on the PID data (E1) for the PPA types controls-expression-of (expr.), controls-phosphorylation-of (phosph.), controls-state-change-of (state), controls-transport-of (transport) and in-complex-with (complex)
| | expr. | phosph. | state | transport | complex | Total |
|---|---|---|---|---|---|---|
| comb-dist | 42.77 | 38.38 | 49.14 | 5.87 | 47.86 | 44.78 |
| PEDL | | | | | | |
| count | 694 | 817 | 2532 | 288 | 1668 | 5999 |
Note: Total gives the AP for all PPA types as a micro-average. The best score per relation type is displayed in bold. Count denotes the number of protein pairs with this type of PPA in the test set. Note that Total is computed on a single ranking of predictions that includes all PPA types, so the difference between the two models in Total is smaller than their difference for any individual PPA type. EVEX cannot be compared in this setting, because it does not consider texts published after 2013.
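To make the distinction between per-type AP and the pooled total concrete, the following sketch computes average precision for separate rankings and for one ranking that mixes all PPA types. The prediction scores and labels are invented, `average_precision` is a hypothetical helper using the common definition (mean of precision at each positive), and the exact AP variant used in the evaluation is not stated in the text.

```python
def average_precision(scored_labels):
    """AP of a ranking given as (score, is_positive) pairs."""
    ranked = sorted(scored_labels, key=lambda item: item[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    return precision_sum / hits if hits else 0.0


# Made-up predictions for two PPA types: (model score, pair truly has this PPA).
predictions = {
    "expr.": [(0.9, True), (0.4, False), (0.3, True)],
    "complex": [(0.8, False), (0.7, True), (0.2, False)],
}

# Per-type AP: each PPA type is ranked on its own.
per_type_ap = {ppa: average_precision(p) for ppa, p in predictions.items()}

# Pooled (micro-averaged) AP: all predictions compete in one ranking.
pooled = [item for p in predictions.values() for item in p]
total_ap = average_precision(pooled)

print(per_type_ap)
print(total_ap)
```

Because scores of different PPA types are interleaved in the pooled ranking, the total is not a simple average of the per-type values.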
Fig. 2. (a) PR curve for the PID data. The left plot shows results for all available abstracts and full texts. The right plot displays the results using only abstracts and full texts published prior to 2013, which allows a fair comparison with EVEX. These results are based on a ranking that includes all types of PPA. The improvement of PEDL over comb-dist is larger for rankings of only one type of PPA (see Table 3 for numbers and explanation). (b) Results from the manual evaluation of evidence prediction on PID.
Table 4. Evaluation results for the top-10 predictions that cannot be found either in Reactome or in PID
| k | PPA | Text span (source PMID) | t | Evidence |
|---|---|---|---|---|
| 1 | IGF-II | ‘We have previously reported that IGF-II binds the extracellular matrix protein vitronectin (VN) […]’ (12746303) | 1 | |
| 2 | hnRNP-A1 | ‘These results suggest that hnRNP-A1 promotes transcription of human IL10.’ (19349988) | 1 | |
| 4 | NCOR1 | ‘ChIP-reChIP assays revealed that NCOR and […] p300 are present in distinct AR complexes on the promoter of PSA gene […]’ (23518348) | 4 | |
| 5 | ets-2 | ‘Conditional overproduction of ets-2 in MCF-7 cells resulted in repression of endogenous BRCA1 mRNA expression.’ (12637547) | 1 | |
| 6 | c-Rel | ‘We further demonstrate […] that introduction of two downstream c-Rel target genes, Bcl-X […]’ (15922711) | 1 | |
| 8 | C/EBP-beta | ‘C/EBP-beta is a transcription factor […] capable of inducing COX-2 expression […]’ (19124115) | 1 | |
Note: The rank of the prediction is given by k. We provide the highest ranking evidence text span that actually expresses the relation and its rank in PEDL (t), as well as manually sourced literature evidence that provides strong biological evidence for the existence of the PPA. Note that this evidence need not be identical to the evidence span predicted by the model.
Fig. 3. Maximum possible recall for a given maximum character distance between the protein mentions. ‘Positive’ refers to protein pairs with at least one PPA in PID and ‘Negative’ to pairs without any. The dashed lines indicate the maximum recall that is possible for sentence-level approaches. The red vertical line indicates our choice for the maximum distance between pairs.
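The quantity plotted in Fig. 3 can be sketched as follows: a protein pair counts as reachable if at least one co-occurrence of its two mentions lies within the chosen character distance. The helper name `max_recall`, the input format and the distances are assumptions for illustration; how mention distance is measured exactly (for example between mention starts or ends) is not specified in the caption.

```python
def max_recall(min_distances, max_distance):
    """Fraction of pairs with at least one co-occurrence within max_distance.

    min_distances holds, per protein pair, the smallest character distance
    between its two mentions in any text, or None if they never co-occur.
    """
    if not min_distances:
        return 0.0
    reachable = sum(
        1 for d in min_distances if d is not None and d <= max_distance
    )
    return reachable / len(min_distances)


# Made-up minimum mention distances (in characters) for five protein pairs.
positive_pairs = [12, 85, 240, 310, None]

for limit in (100, 300, 500):
    print(limit, max_recall(positive_pairs, limit))
```

Raising the distance limit can only increase this upper bound on recall, at the cost of admitting longer text spans into the model.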