PEDL: extracting protein-protein associations using deep language models and distant supervision.

Leon Weber¹,², Kirsten Thobe², Oscar Arturo Migueles Lozano², Jana Wolf², Ulf Leser¹.

Abstract

MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance.
RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA.
AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances: Proteins

Year: 2020 PMID: 32657389 PMCID: PMC7355289 DOI: 10.1093/bioinformatics/btaa430

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Molecular biology explores chemical and physical interactions between key intermediates, mostly proteins, in cells. The biological function rarely depends on single interactions but on the complex interplay of many, for example in cellular signalling, metabolism or gene regulation. Techniques from network analysis are widely used to connect interactions between proteins to the functional organization of cells (Barabasi and Oltvai, 2004). A major challenge for building these networks is to gather all relevant information from the literature, since the quality of the model and the model predictions rely on completeness and correctness of the individual proteins and their interactions. It is important to not only have the knowledge that two proteins interact but also to know the exact type of interaction, such as kinase–substrate relation or gene–gene regulation. These functional protein–protein associations (PPAs) (Junge and Jensen, 2019) can be found in manually curated databases such as Reactome (Jassal ) or the Protein Interaction Database (PID) (Schaefer ). However, these databases are notoriously incomplete despite extensive curation efforts (Köksal ). For instance, we found that for a state-of-the-art model of p53 signalling (Hat ) 25% of the contained PPAs cannot be found, neither in Reactome nor in PID (see Supplementary Material S1 for details). Extracting PPAs from the biomedical literature has been a long-standing research goal. Early approaches focused on matching sentences to manually defined templates, usually leading to high-precision but low-recall results (Friedman ). Later methods used supervised machine learning to classify whether a sentence expresses a relation between a given pair of proteins, frequently relying on support-vector-machines (SVMs) with graph kernels (Miwa ; Tikk ). 
Similar techniques have been applied to biomedical event extraction, which aims to extract not only pairwise relations between two proteins but also complex biochemical reactions between proteins (Björne ; Miwa ). More recently, approaches based on neural networks have also been applied to sentence-wise supervised classification of protein–protein interactions (Peng and Lu, 2017) and to biomedical event extraction (Björne and Salakoski, 2018). None of these methods can detect relations between proteins mentioned in different sentences, and none makes use of pre-trained language models, which have recently led to large gains in other Natural Language Processing (NLP) tasks (Devlin ). Additionally, these models rely on manually annotated training data, which for PPA extraction requires expert knowledge and is thus very costly. Consequently, the available manually labelled PPA datasets are rather small, typically containing at most a few thousand sentences (Pyysalo ). This data sparsity led to the introduction of distantly supervised approaches (Mintz ) for PPA prediction (Junge and Jensen, 2019; Poon ; Thomas ). However, both Thomas and Poon are based on non-neural models with manually defined features, and Junge and Jensen (2019) use averaged word embeddings without leveraging multi-instance learning. Distantly supervised relation extraction methods generate noisy training data by aligning a knowledge base to a large collection of texts. To achieve this, a large knowledge base of relations (in our case PPAs) in the form (e1, r, e2) is connected to a text by first linking the entities e1, e2 from the knowledge base to the entities in the text. Initially, the core assumption of distant supervision was that every sentence that contains the entities e1, e2 expresses the relation r. This assumption can be relaxed through the use of multi-instance learning (Hoffmann ; Riedel ; Surdeanu ). 
Multi-instance learning explicitly models the assumption that at least one sentence expresses the relation between the entity pair in question by selecting only a subset of the sentences to generate the prediction. Originally, probabilistic graphical models were used to achieve this, but recently deep learning-based models in the form of piece-wise convolutional neural networks (Zeng ) with selective attention (Lin ) were successfully applied. An orthogonal line of work also uses auxiliary directly supervised training examples, achieving significant improvements for graphical models (Angeli ; Pershina ) and for neural networks (Beltagy ; Liu ). However, all of these approaches only consider entity pairs that occur together in the same sentence, which severely limits recall (Quirk and Poon, 2017). Accordingly, there is growing interest in using text that spans multiple sentences for distantly supervised biomedical relation extraction. Verga used transformer-based models to predict all relations between chemicals, diseases and genes contained in one abstract but do not consider multiple abstracts simultaneously. Quirk and Poon (2017) used multi-instance learning to predict relations between drugs and genes that can be up to three sentences apart with an SVM-classifier on manually defined dependency graph features. Recently, deep language models have seen widespread success in NLP, including the biomedical domain (Beltagy ). The often used two-step process of training these models can be regarded as a type of transfer learning (Pratt ): The first step is pre-training, in which a large model, typically with hundreds of million parameters, is trained on a huge corpus of texts with a language modelling task. In the second step, the pre-trained model is applied to the target task, either by fine-tuning the model parameters or using the model to generate contextualized embeddings (Peters ). 
BERT (Devlin ) is a highly successful deep language model based on the transformer architecture (Vaswani ), which allows very large models to be trained efficiently by leveraging GPUs. Originally, BERT was trained on a large collection of books and English Wikipedia, but recently two BERT models trained on biomedical abstracts and full texts have been released, BioBERT (Lee ) and SciBERT (Beltagy ). As BERT uses WordPiece tokenization (Wu ), it learns a domain-dependent vocabulary that allows it to use sub-word information to relate similar words such as TRAF2 and TRAF3. PPA Extraction with Deep Language (PEDL) uses SciBERT as its pre-trained language model, because SciBERT's WordPiece vocabulary, unlike that of BioBERT, is optimized for scientific literature. In this work, we propose PEDL, a model that predicts functional PPAs from biomedical publications. We approach this problem by combining pre-trained language models with distant supervision. Specifically, we source a large number of protein pairs together with their PPAs from the PID database and find texts mentioning these pairs in a collection of roughly 24 million abstracts of biomedical publications and 3 million full texts. The resulting PPA extraction dataset is distantly supervised, i.e. it only contains annotations for relations between the proteins, but it is not known whether a given text span actually confirms the relation. Given a protein pair, PEDL takes the text spans mentioning both proteins as input and predicts which PPAs hold for this pair, if any. Importantly, in what we call evidence prediction, PEDL predicts not only the PPAs but also which text spans express them. We augment the training data of PEDL with data that additionally contains annotations for evidence prediction, which we generate from those gold standard datasets (Kim ) that include annotations for all PPA types considered by us. Following Beltagy, we call this type of data directly supervised. 
We compare the performance of PEDL to state-of-the-art approaches on three different datasets and find that, on average, it performs much better for both PPA and evidence prediction. In a manual evaluation of the top 10 predicted PPAs, conducted by three experts in Systems Biology, we find that PEDL can be used to predict PPAs that cannot be found in major pathway databases. Furthermore, the predicted evidence text spans actually express the relation and thus can be used for easy verification of the predicted PPAs, which is important for expert curation.

2 Materials and methods

In this work, we model PPA extraction following a multi-instance learning framework for relation extraction (Hoffmann ; Riedel ; Surdeanu ). Given two proteins p1 and p2, we aim to predict all PPAs relating p1 to p2 by leveraging a corpus of biomedical literature. We focus on a set R of five PPAs which is a subset of the Simple Interaction Format relations available in Pathway Commons:

in-complex-with is true for a protein pair (A, B), if A and B occur together in at least one protein complex.
controls-state-change-of means that A regulates some change of B. This can be a post-translational modification such as phosphorylation or ubiquitination or a transfer between cellular compartments.
controls-phosphorylation-of is a subset of controls-state-change-of and means that A phosphorylates B.
controls-transport-of is a subset of controls-state-change-of and denotes that A controls the transfer of B to a cellular compartment.
controls-expression-of implies that A modulates the expression of B.

Additionally, in what we term evidence prediction, we want the model to find the strongest possible evidence for these PPAs in the form of text expressing the relation between the proteins. This section describes how PEDL combines deep language models, distant supervision and auxiliary directly supervised data to approach these two tasks. A detailed graphical description of PEDL can be found in Figure 1.

Fig. 1.

(a) Overview of PEDL for the two tasks of relation prediction and evidence prediction. In this example, the model predicts relations for the protein pair BTC and ErbB4 given three text spans containing both proteins. First, the BERT component produces a score matrix containing a prediction for each text and relation type. The relation predictions are then generated by applying LSE column-wise to approximate the maximum score for a given PPA type across all spans. The evidence predictions are obtained by taking the row-wise maximum, which is the highest score assigned to this text span regardless of PPA type. (b) The generation of one row of the score matrix s. In each of BERT’s 12 transformer layers, each token receives a 768 dimensional embedding (u for the first and z for the last layer). The embedding of the prepended [CLS] token is used to summarize the text span in the single vector h, which is then transformed to one row of the score matrix by the output layer (W, b)
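
The two aggregations named in the caption of Fig. 1 can be illustrated with a small example. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names (`logsumexp`, `aggregate`) and the toy score values are assumptions made for illustration, and the BERT component that would produce the score matrix is left out.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def aggregate(scores):
    """Aggregate a score matrix with one row per text span and one column per PPA type.

    Returns (relation, evidence):
      relation[j] = sigmoid(LogSumExp of column j over all spans)
      evidence[i] = sigmoid(maximum of row i over all PPA types)
    """
    n_types = len(scores[0])
    relation = [sigmoid(logsumexp([row[j] for row in scores])) for j in range(n_types)]
    evidence = [sigmoid(max(row)) for row in scores]
    return relation, evidence


# Toy score matrix (made-up logits): 3 text spans x 2 PPA types.
S = [[2.0, -3.0],
     [-1.0, -4.0],
     [0.5, -2.5]]
relation, evidence = aggregate(S)
```

Because LogSumExp is an upper bound on the maximum, each relation score is at least as large as the score that a plain column-wise maximum would give, while every span still contributes to the gradient.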

2.1 PPA prediction as multi-instance learning

To predict relations between proteins p1 and p2, the first step of PEDL is to collect all text spans T, up to a given length, mentioning p1 and p2 together. This requires the use of named entity recognition (NER) (Weber ) and named entity normalization (NEN) (Wei ) as a pre-processing step. For the two sub-tasks of relation prediction and evidence prediction, the model has to produce two vectors $r \in \mathbb{R}^{|R|}$ and $e \in \mathbb{R}^{|T|}$, where R is the set of considered PPAs and T is the set of spans for the pair. The vector r contains scores reflecting the confidence of PEDL in each type of PPA. The vector e contains scores $e_i$, each modelling PEDL's confidence that the corresponding text span $t_i$ expresses a relation between p1 and p2. For this, PEDL predicts a score matrix $S \in \mathbb{R}^{|T| \times |R|}$ with one row per text span, where each entry represents the confidence of the model that the text span supports a given PPA. To achieve this, we first mark the entity pair in each text span by surrounding the first entity with the entity markers <e1> and </e1> and the second entity with <e2> and </e2>. Then, each text span $t_i \in T$ is fed through BERT individually, to obtain the 768-dimensional [CLS] embedding $h_i$ of the final layer, which can be regarded as a summary of the whole text span. Finally, we use a single output layer (W, b) to transform $h_i$ to one row of the score matrix S, containing logits reflecting the confidence of PEDL that the text span expresses a given PPA. See Figure 1 for a graphical description of this process. The relation prediction r for each PPA type is generated by aggregating the scores for the PPA over all spans, i.e. column-wise. Correspondingly, the evidence prediction e for an individual text span is produced by aggregating the scores of all PPA predictions for this span, i.e. row-wise. Finally, both vectors are normalized by applying the sigmoid function. In preliminary experiments, we used maximum for both aggregations, but found that the resulting sparse gradient flow hampered optimization. 
Thus, we use LogSumExp, a smooth approximation of the maximum, as aggregation function for PPA predictions, because it allows for gradient flow through all sentences and empirically works well in end-to-end training of transformer models (Verga ). Putting everything together, the formulae for predicting PPAs and evidence are the following:

$$S_i = W h_i + b, \qquad r = \sigma\Big(\log \sum_{i=1}^{|T|} \exp(S_i)\Big), \qquad e_i = \sigma\Big(\max_{j} S_{ij}\Big),$$

where $h_i$ is the [CLS] embedding of the i-th text span and $S_i$ the corresponding row of the score matrix S, log and exp denote element-wise application of logarithm and exponentiation, $W \in \mathbb{R}^{|R| \times 768}$ and $b \in \mathbb{R}^{|R|}$ are trainable parameters, and σ is the element-wise sigmoid function. Alternatively, S can be directly used as an evidence score per relation. For the training of PEDL, we assume that two types of data are available: distantly supervised data, which only has labels for relation prediction, and directly supervised data, which has labels for both relation and evidence prediction. Furthermore, we assume that both types of data share the same label space. The directly supervised data is used to give the model additional guidance on what text spans expressing PPAs look like. To achieve this, we combine both types of data using a multi-task learning framework. We introduce one loss term for each type of data: $\mathcal{L}_{distant}$ for the distantly supervised and $\mathcal{L}_{direct}$ for the directly supervised data. The loss for the directly supervised data is composed of a loss term for relation prediction and another term for evidence prediction: $\mathcal{L}_{direct} = \mathcal{L}^{r}_{direct} + \mathcal{L}^{e}_{direct}$. The loss for the distantly supervised data only consists of the loss term for the relation prediction task, because labels for evidence predictions are not available for this type of data: $\mathcal{L}_{distant} = \mathcal{L}^{r}_{distant}$. The total loss for the batch is then a weighted average of the direct and distant losses, in which a weighting hyperparameter controls the relative importance of the direct loss; this hyperparameter is tuned on the development set of each considered dataset separately. At each optimization step, we sample one batch from the distantly and one from the directly supervised data. Since we model PPA prediction as a multi-label task, all losses are computed with binary cross entropy. 
Note that the only parameters of PEDL are those of BERT and one output layer (W, b). We optimize these parameters with Adam (Kingma and Ba, 2015). The detailed hyperparameter settings can be found in Supplementary Material S2. One training step on one batch (16 protein pairs with up to 100 text spans each) takes ∼9.5 s on four RTX 2080 Ti GPUs.
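
To make the multi-task objective of this section concrete, the sketch below computes the distant and direct losses with binary cross entropy and combines them. It is an illustration under stated assumptions rather than the training code of PEDL: the function names are hypothetical, and the convex combination in `total_loss` is only one possible reading of "weighted average", since the exact weighting is not recoverable from the text.

```python
import math


def bce(probs, labels, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def distant_loss(rel_probs, rel_labels):
    """Distantly supervised data: relation labels only."""
    return bce(rel_probs, rel_labels)


def direct_loss(rel_probs, rel_labels, ev_probs, ev_labels):
    """Directly supervised data: relation term plus evidence term."""
    return bce(rel_probs, rel_labels) + bce(ev_probs, ev_labels)


def total_loss(l_distant, l_direct, weight):
    """ASSUMPTION: a convex combination, with `weight` in [0, 1] giving the
    relative importance of the direct loss."""
    return (1.0 - weight) * l_distant + weight * l_direct


# One optimization step uses one distant and one direct batch (toy values).
l_distant = distant_loss([0.9, 0.2, 0.1, 0.7, 0.3], [1, 0, 0, 1, 0])
l_direct = direct_loss([0.8, 0.1, 0.2, 0.1, 0.6], [1, 0, 0, 0, 1],
                       [0.9, 0.3, 0.2], [1, 0, 0])
loss = total_loss(l_distant, l_direct, weight=0.5)
```

With `weight=0` the model is trained on distant supervision alone, which corresponds to the "− direct" ablation reported in the results.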

2.2 Data

The training of PEDL requires distantly and directly supervised data. To obtain the distantly supervised data, we follow the standard approach for creating a multi-instance learning dataset (Riedel ). First, we collect all protein pairs and the relations between each pair from a large knowledge base, where we opt for the PID database (Schaefer ), due to its very high curation standards. We gather our data from the Simple Interaction Format version of PID provided by Pathway Commons (https://www.pathwaycommons.org/archives/PC2/v11/PathwayCommons11.pid.hgnc.txt.gz) (Cerami ). Then, for each protein pair p1 and p2, we collect all text spans of up to 300 characters that mention p1 and p2 together. To estimate the probability that a protein pair is related by none of the considered PPAs, we also require negative pairs which are not related by any PPA. We generate such negative examples by randomly sampling protein pairs, with the number of sampled pairs set relative to the number of pairs obtained from PID. As a text corpus, we use all 24 377 760 PubMed abstracts available through PubTator Central (ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz, Version of 2019/08/19) (Wei ) and 2 986 273 full texts available in the PubMed Central BioC text mining collection (ftp://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/, Version of 2019/05/24) (Comeau ). We use the NER and NEN annotations from PubTator Central for both abstracts and full texts. We transform the Entrez IDs provided by PubTator Central to UniProt identifiers with the mapping provided by UniProt (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping.dat.gz) to relate them to the UniProt identifiers from PID. 
Additionally, we expand the identified proteins with all homologous proteins obtained from the HomoloGene database (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/build68/homologene.data), to increase the number of text spans per protein pair, considering only the taxa Homo sapiens, Rattus norvegicus, Mus musculus, Oryctolagus cuniculus and Cricetulus longicaudatus. For protein pairs which occur together in more than 100 texts, we randomly sample 100 texts and discard the rest. Finally, we discard all (positive and negative) protein pairs which do not co-occur at least once. Detailed statistics of the resulting dataset can be found in Table 1.
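
The construction steps above (length filter, cap of 100 texts per pair, removal of pairs without any co-occurrence) can be sketched as follows. This is a simplified illustration, not the dataset generation script from the PEDL repository: the function name, the in-memory data layout and the placeholder protein identifiers are assumptions, and NER, NEN, identifier mapping and homologue expansion are taken as already done.

```python
import random
from collections import defaultdict

MAX_SPAN_CHARS = 300      # maximum span length stated in the text
MAX_TEXTS_PER_PAIR = 100  # cap on texts per protein pair stated in the text


def build_distant_dataset(kb, spans, negative_pairs, seed=0):
    """Align knowledge-base pairs with co-mentioning text spans.

    kb:             {(p1, p2): set of relation names}
    spans:          iterable of (p1, p2, text) co-mentions
    negative_pairs: iterable of sampled (p1, p2) pairs without any relation
    """
    rng = random.Random(seed)
    texts = defaultdict(list)
    for p1, p2, text in spans:
        if len(text) <= MAX_SPAN_CHARS:
            texts[(p1, p2)].append(text)

    candidates = list(kb.items())
    candidates += [(pair, set()) for pair in negative_pairs if pair not in kb]

    dataset = {}
    for pair, relations in candidates:
        found = texts.get(pair, [])
        if not found:
            continue  # discard pairs that never co-occur
        if len(found) > MAX_TEXTS_PER_PAIR:
            found = rng.sample(found, MAX_TEXTS_PER_PAIR)
        dataset[pair] = {"relations": sorted(relations), "texts": found}
    return dataset


# Toy input with placeholder identifiers (not real annotations).
kb = {("P1", "P2"): {"in-complex-with"}, ("P3", "P4"): {"controls-expression-of"}}
spans = [("P1", "P2", "P1 forms a complex with P2."),
         ("P5", "P6", "P5 and P6 were both measured.")]
negatives = [("P5", "P6"), ("P7", "P8")]
dataset = build_distant_dataset(kb, spans, negatives)
```

In this toy run the pairs ("P3", "P4") and ("P7", "P8") are dropped because no text mentions them together, and the negative pair ("P5", "P6") is kept with an empty relation list.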

Table 1.

Statistics of the datasets BioNLP 2011, BioNLP 2013 and PID

	Relations						Pairs		Texts (Avg.)
	expr.	phosph.	State	Transport	Complex	Total	pos.	neg.	pos.	neg.
BioNLP 2011	245	44	136	38	278	741	615	1845	19.69	4.97
BioNLP 2013	179	104	160	43	441	927	730	2190	17.44	4.85
PID	2376	2714	8425	1020	5799	20 622	16 369	54 261	53.60	16.32

Note: Relations gives the total number of protein pairs for the five considered relations controls-expression-of (expr.), controls-phosphorylation-of (phosph.), controls-state-change-of (state), controls-transport-of (transport) and in-complex-with (complex). Pairs denote the total number of protein pairs with at least one relation (pos.) and without any relation (neg.). Texts states the average number of text spans per protein pair for pairs with at least one relation (pos.) and without any relation (neg.).

Next, we describe the generation of the directly supervised data, which we need for two different purposes. First, we use it as additional training data as explained above and second, it allows us to perform experiments with known relations for text spans, which then lets us evaluate the performance for evidence prediction without manual inspection of the predictions. To perform these experiments, we actually need two distinct directly supervised datasets, one for evaluation and one as additional training data for PEDL. To generate the directly supervised data, we transform sentence-level event extraction data from the BioNLP-shared tasks (Kim ; Nédellec ) into multi-instance learning data. We transform the BioNLP event structures into pairwise relations between proteins with the same five relation types as for the distant data. The details of this transformation can be found in Supplementary Material S3. Then, akin to the generation of the distant data, we normalize all protein mentions, collect all pairs of co-occurring proteins and sample non-interacting proteins as negative examples. We normalize protein mentions by querying MyGeneInfo (Xin ) for the human uniprot id. Tokenization and sentence splitting are performed with the en_core_sci_sm model of SciSpacy (Neumann ). 
We perform this transformation for the Genia (Kim ) and epigenetics (Ohta ) datasets from BioNLP 2011 as well as the Genia (Kim ) and Pathway Curation (Ohta ) tasks from BioNLP 2013. These BioNLP datasets were specifically selected since they were the only ones containing annotations for all considered PPA types. Finally, we aggregate the protein pairs of both 2011 and 2013 tasks, respectively. This yields two multi-instance learning datasets with the additional information of which text spans express relations between the proteins. Detailed statistics of both datasets can be found in Table 1. In preliminary experiments on the PID dataset, we found that the predictions of PEDL seemed to almost exclusively rely on the protein names appearing in the text span. While this led to good performance for relation prediction, this is most likely an artefact of the PID database, because if two proteins are related by a given PPA, then frequently, all members of the respective protein families are related by the same PPA. Ultimately, we are interested in predicting PPAs that are not contained in PID, and thus, we performed all further experiments on entity blinded data, which prevents PEDL from inferring family membership. To achieve this, we replaced all protein names recognized by the en_ner_jnlpba_md model of SciSpacy with dummy identifiers.
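
Entity blinding as described above amounts to replacing each recognized protein mention with a placeholder. The sketch below shows the idea on character offsets; it is a minimal illustration in which the mention offsets are given by hand instead of coming from the SciSpacy NER model, and the placeholder format (`PROTEIN1`, `PROTEIN2`, ...) as well as the example sentence are assumptions.

```python
def blind_entities(text, mentions):
    """Replace protein mentions with dummy identifiers.

    mentions: list of (start, end, entity_id) character offsets, non-overlapping.
    The same entity_id always maps to the same placeholder within one text.
    """
    placeholders = {}
    pieces = []
    cursor = 0
    for start, end, entity_id in sorted(mentions):
        if start < cursor:
            raise ValueError("overlapping mentions are not supported")
        if entity_id not in placeholders:
            placeholders[entity_id] = f"PROTEIN{len(placeholders) + 1}"
        pieces.append(text[cursor:start])       # text before the mention
        pieces.append(placeholders[entity_id])  # dummy identifier
        cursor = end
    pieces.append(text[cursor:])                # remaining text
    return "".join(pieces)


# Made-up sentence with hypothetical protein names AAA1 and BBB2.
text = "AAA1 phosphorylates BBB2; AAA1 is then exported."
mentions = [(0, 4, "id_a"), (20, 24, "id_b"), (26, 30, "id_a")]
blinded = blind_entities(text, mentions)
```

After blinding, the model can no longer tell from the surface form that two proteins belong to the same family, so it has to rely on the surrounding context.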

2.3 Baselines

We compare PEDL to the two competitor methods comb-dist (Beltagy ) and EVEX (Van Landeghem ), representing the state-of-the-art for distantly supervised relation extraction (comb-dist) and for sentence-level relation extraction applied on whole PubMed (EVEX). comb-dist is a recently published multi-instance learning method for distantly supervised relation extraction. It set a new state-of-the-art on a standard benchmark for distantly supervised relation extraction (Riedel ) strongly outperforming competitor methods by additionally integrating directly supervised data. As a base model, comb-dist uses a piece-wise convolutional neural network with selective attention and pre-trained word embeddings. comb-dist was not developed for biomedical applications and has never been applied to such data as far as we know. In all experiments with comb-dist, we use the (selective) attention distribution over the text spans as evidence predictions. A detailed discussion of the differences between PEDL and comb-dist is provided in Supplementary Material S6. For word embeddings, we equip comb-dist with wikipedia-pubmed-PMC (http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin) embeddings of Pyysalo , because they performed well in our earlier work (Habibi ). The hyperparameter settings of comb-dist for each task are provided in Supplementary Material S2. EVEX is a database of text-mined biological events, accompanied by inferred pairwise PPAs and has annotations for whether an event was deemed speculative or negated. The database was created by applying a state-of-the-art biomedical event extraction tool (Björne, 2014) to a large collection of PubMed abstracts and PMC full texts. Since the EVEX database was last updated in 2013, we compare PEDL with EVEX on a modified test data of PID in which we only use texts published prior to 2013 to ensure a fair comparison. 
We apply a straightforward mapping of EVEX’s types of PPAs to the five considered in our work (see Supplementary Material S4) and remove all relations with a detected negation, but retain speculative relations.

2.4 Evaluation details

We use the three datasets PID, BioNLP 2011 and BioNLP 2013 in three different experimental settings E1, E2 and E3.

E1: PID is the distantly supervised data and the union of both BioNLP datasets is the directly supervised auxiliary training data.
E2: BioNLP 2011 is the distantly supervised data (disregarding evidence annotations during training) and BioNLP 2013 is the directly supervised auxiliary training data.
E3: BioNLP 2013 is the distantly supervised data and BioNLP 2011 is the directly supervised auxiliary training data.

In both E2 and E3, we report the average of five runs with different seeds to compensate for the small dataset sizes. Note that results from the BioNLP shared tasks are not comparable to E2 and E3 because we perform multi-instance learning (and not sentential prediction) and the label spaces are different. We use the directly supervised data only during training and remove all protein pairs occurring in the development and test sets from the directly supervised data to prevent knowledge leaks. We split each dataset into train, development and test sets by randomly dividing protein pairs with their associated texts in a 60:10:30 ratio. For relation prediction, we compare models by plotting their precision–recall (PR) curves. These curves are computed by ranking all PPAs by the predicted confidence score of the model and computing the resulting (micro-averaged) precision and recall for all possible threshold values. We also report the average precision (AP), which is an approximation of the area under the PR curve. We use mean average precision (mAP) and precision at ten (P@10) to evaluate evidence predictions, both for the automated evaluation in E2 and E3 and for the manual evaluation by domain experts in E1. mAP averages the individual APs of evidence predictions for each protein pair, and P@10 is defined as the mean precision of the top ten predictions.
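
The ranking metrics used in this evaluation can be written down compactly. The following plain-Python sketch uses a common definition of AP (the mean of the precision values at the ranks of the positive items); the function names are hypothetical and the exact implementation used for the reported numbers is not specified in the text, so small differences, for example in tie handling, are possible.

```python
def average_precision(scores, labels):
    """AP of a ranking: mean of precision@k over the ranks k of the positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_pair):
    """mAP: average of the APs computed separately for each protein pair.

    per_pair: list of (scores, labels) tuples, one per protein pair.
    """
    aps = [average_precision(scores, labels) for scores, labels in per_pair]
    return sum(aps) / len(aps)


def precision_at_k(scores, labels, k=10):
    """Fraction of correct items among the k highest-scored predictions."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])[:k]
    if not ranked:
        return 0.0
    return sum(1 for _, label in ranked if label) / len(ranked)


# Toy ranking: positives at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
p_at_2 = precision_at_k([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], k=2)
```

For relation prediction, a single AP is computed over the ranking of all predicted PPAs, whereas for evidence prediction the AP is computed per protein pair and then averaged.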

3 Results

We evaluate PEDL, a method for predicting PPA-relations between proteins and the evidence for these relations, on three different datasets. The results are compared to two competitor methods: comb-dist, a recently published state-of-the-art multi-instance relation extraction method, and EVEX, a large database of PPAs that was generated by applying biomedical event extraction to a large collection of abstracts and full texts.

3.1 Prediction of PPAs

First, we investigate the results of PEDL for predicting PPAs between pairs of proteins. The results for the BioNLP datasets (E2 and E3) can be found in Table 2 and the results for PID (E1) in Table 3. In terms of AP, PEDL performs better than the competitor methods on two of the three considered datasets and comparably on the third. On BioNLP 2013 (E3), PEDL achieves an AP score that is 6.07 pp higher than that of comb-dist, while on PID (E1, mixing predictions for all PPA types) it is 1.24 pp higher. If one considers predictions for each type of PPA on PID individually, the difference between both models is considerably larger. PEDL performs better than comb-dist on all five types, with differences ranging from 1.84 pp for in-complex-with to 12.34 pp for controls-transport-of, and an average of 4.66 pp. On BioNLP 2011 (E2), the difference in AP between both models is marginal.

Table 2.

Results on the two BioNLP datasets (E2 and E3)

	BioNLP ’11		BioNLP ’13
	r-AP	e-mAP	r-AP	e-mAP
comb-dist	65.4(2.6)	75.86(1.6)	70.68(2.6)	79.35(0.9)
− direct	62.33(1.8)	54.38(26.9)	70.06(2.1)	54.64(27.2)
PEDL	65.59(4.9)	82.36(1.2)	76.75(2.0)	84.67(1.6)
− direct	60.65(4.1)	64.64(4.1)	71.03(3.0)	75.14(2.1)

Note: r-AP is the AP for relation prediction and e-mAP the mAP for evidence prediction. All results are averages of five runs with different random seeds, with standard deviations given in brackets. ‘- direct’ shows scores without directly supervised data. The best scores are displayed in bold.

Table 3.

APs for relation prediction on the PID data (E1) for the PPA types controls-expression-of (expr.), controls-phosphorylation-of (phosph.), controls-state-change-of (state), controls-transport-of (transport) and in-complex-with (complex)

	expr.	phosph.	State	Transport	Complex	Total
comb-dist	42.77	38.38	49.14	5.87	47.86	44.78
PEDL	46.45	40.26	52.70	18.21	49.70	46.02
count	694	817	2532	288	1668	5999

Note: Total gives the AP for all PPA types as a micro-average. The best score per relation-type is displayed in bold. Count denotes the number of protein pairs with this type of PPA in the test set. Note that total is computed on a ranking of predictions including all PPA types, which leads to the fact that the difference between both models is smaller than every distance of the individual PPAs. EVEX cannot be compared in this setting, because it does not consider texts published after 2013.
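
The per-type differences quoted in Section 3.1 follow directly from the AP values in Table 3. The short script below only redoes this arithmetic on the numbers copied from the table; the variable names are chosen for illustration.

```python
# AP values per PPA type, copied from Table 3 (PID data, setting E1).
comb_dist = {"expr.": 42.77, "phosph.": 38.38, "state": 49.14,
             "transport": 5.87, "complex": 47.86}
pedl = {"expr.": 46.45, "phosph.": 40.26, "state": 52.70,
        "transport": 18.21, "complex": 49.70}

# Difference in percentage points (pp) for each PPA type.
diffs = {ppa: round(pedl[ppa] - comb_dist[ppa], 2) for ppa in pedl}

smallest = min(diffs, key=diffs.get)  # type with the smallest gain
largest = max(diffs, key=diffs.get)   # type with the largest gain
mean_diff = round(sum(diffs.values()) / len(diffs), 2)
```

The smallest gain is obtained for in-complex-with (1.84 pp), the largest for controls-transport-of (12.34 pp), and the unweighted mean over the five types is 4.66 pp.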

It is instructive to compare the PR-curves of PEDL, comb-dist and EVEX for relation prediction on the PID data (E1, see Fig. 2). We compare with the results of EVEX only on abstracts and full texts published prior to 2013 to account for the fact that EVEX was last updated in 2013. Both models strongly outperform EVEX on the before 2013 data, both in terms of recall and precision. The difference in recall is especially pronounced, because EVEX only generates positive predictions for fewer than 37% of the PPAs in the PID test set. PEDL performs better than comb-dist in the mid-precision regime but a little worse for low precisions when provided all articles and full texts. On the before 2013 subset, PEDL performs equal to comb-dist in the high-precision regime but worse for mid-to-low precision values, leading to 40.54% AP for PEDL and 44.24% AP for comb-dist (see Section 4.2 for a discussion of this).

Fig. 2.

(a) PR curve for the PID data. The left plot shows results for all available abstracts and full texts. The right plot displays the results using only abstracts and full texts published prior to 2013, which allows a fair comparison with EVEX. These results are based on a ranking that includes all types of PPA. The improvement of PEDL over comb-dist is larger for rankings of only one type of PPA (see Table 3 for numbers and explanation). (b) Results from the manual evaluation of evidence prediction on PID

3.2 Evidence prediction

In most biomedical applications, extracted PPAs are not accepted as they are but are first confirmed by experts. The reasons are the far from perfect performance of state-of-the-art approaches and the fact that even a correctly extracted statement need not express biological truth, for instance because of weak experimental evidence. It is therefore important that methods predict not only the correct PPA but also the text spans on which the model’s PPA prediction is based (which we call evidence prediction).

Table 2 gives results for evidence prediction on the BioNLP datasets, where both PEDL and comb-dist achieve high mAP scores. PEDL outperforms comb-dist on both datasets, by 6.38 pp on BioNLP’11 and by 5.32 pp on BioNLP’13. In contrast, PID is a distantly supervised dataset and has no annotations against which evidence predictions could be evaluated. For the predictions of comb-dist and PEDL, two domain experts therefore evaluated the top 10 evidence predictions for the top 10 predictions of each PPA type, amounting to 500 evaluated evidence predictions (the annotation guidelines can be found in Supplementary Material S5). Note that for this evaluation, we directly use the rows of the score matrix as the evidence scores per relation for PEDL. This refinement is not possible for comb-dist, because its attention distribution is computed independently of the relation type. PEDL can thus rank the evidence specifically for one PPA type, whereas comb-dist only predicts whether there is any relation between the proteins at all.

The results of this analysis show that PEDL performs better than comb-dist at predicting evidence for the three PPA types controls-transport-of, in-complex-with and controls-expression-of (see Fig. 2b). The results for controls-state-change-of are comparable, and those for controls-phosphorylation-of are worse. The improvement over comb-dist is especially striking for controls-transport-of, for which comb-dist produces almost no correct evidence predictions whereas PEDL achieves an mAP of 46%. The results in terms of P@10 are similar, with PEDL additionally achieving better results for controls-state-change-of. Moreover, the variability in performance across PPA types is much larger for comb-dist than for PEDL. On average, PEDL achieves a 7.66 pp higher mAP and an 8.14 pp higher P@10 than comb-dist.
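
The two metrics reported for the manual evaluation can be sketched as follows. This is a minimal illustration, assuming binary expert judgements for each ranked evidence span and an AP that is normalized by the number of correct spans found within the judged list; the exact definition used in the evaluation may differ, and all function names are hypothetical.

```python
from typing import List, Sequence


def precision_at_k(judgements: Sequence[bool], k: int) -> float:
    """Fraction of the k highest-ranked evidence spans that were judged correct."""
    if k <= 0:
        return 0.0
    return sum(judgements[:k]) / k


def average_precision(judgements: Sequence[bool]) -> float:
    """Mean precision over the ranks at which a correct span occurs.

    Assumption: normalized by the number of correct spans in the judged list.
    """
    hits, total = 0, 0.0
    for rank, is_correct in enumerate(judgements, start=1):
        if is_correct:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings: List[Sequence[bool]]) -> float:
    """mAP: the AP averaged over all evaluated predictions (one ranking each)."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

In this setting, each ranking would hold the expert judgements for the top 10 evidence spans of one predicted PPA, ordered by the model's evidence score.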

3.3 Analysis of new predictions

We also evaluated PEDL in a realistic application scenario, in which three experts in systems biology manually analyzed the top 10 predictions that are not contained in the aforementioned PathwayCommons versions of either Reactome or PID. The results are summarized in Table 4, where we list all predictions considered biologically justified by all experts, together with the highest ranking true evidence text span. Of the 10 predictions, 6 are correct, one is wrong because of errors in the protein normalization pre-processing step, and the remaining three are errors of PEDL. Furthermore, for all correct predictions but one, the highest ranking text span (columns Text span and t) actually expresses the PPA and either states the finding of the PPA or refers to an earlier publication reporting it.

Table 4.

Evaluation results for the top-10 predictions that cannot be found either in Reactome or in PID

k	PPA	Text span (source PMID)	t	Evidence
1	IGF-II in-complex-with VN	‘We have previously reported that IGF-II binds the extracellular matrix protein vitronectin (VN) […] ’ (12746303)	1	Upton et al. (1999)
2	hnRNP-A1 controls-expression-of IL10	‘These results suggest that hnRNP-A1 promotes transcription of human IL10.’ (19349988)	1	Noguchi et al. (2009)
4	NCOR1 controls-expression-of PSA	‘ChIP-reChIP assays revealed that NCOR and […] p300 are present in distinct AR complexes on the promoter of PSA gene […]’ (23518348)	4	Qi et al. (2013)
5	ets-2 controls-expression-of BRCA1	‘Conditional overproduction of ets-2 in MCF-7 cells resulted in repression of endogenous BRCA1 mRNA expression.’ (12637547)	1	Baker et al. (2003)
6	c-Rel controls-expression-of Bcl-X	‘We further demonstrate […] that introduction of two downstream c-Rel target genes, Bcl-X […]’ (15922711)	1	Chen et al. (2000)/ Lee et al. (1999)
8	C/EBP-beta controls-expression-of COX-2	‘C/EBP-beta is a transcription factor […] capable of inducing COX-2 expression […]’ (19124115)	1	Kim and Fischer (1998)/ Zhu et al. (2002)

Note: The rank of the prediction is given by k. We provide the highest ranking evidence text span that actually expresses the relation and its rank in PEDL (t), as well as manually sourced literature evidence that provides strong biological evidence for the existence of the PPA. Note that this evidence need not be identical to the evidence span predicted by the model.

4 Discussion

4.1 Importance of directly supervised data

The results given in Table 2 allow for interesting observations regarding the importance of directly supervised data. On the BioNLP datasets, incorporating directly supervised data improves results for both relation and evidence prediction. The improvement is much more pronounced for evidence prediction than for relation prediction. This supports our hypothesis that evidence prediction in particular can be improved by including directly supervised data. Compared to comb-dist, PEDL gains much more from directly supervised data in the relation prediction task (5.33 pp versus 1.85 pp). For BioNLP 2011, comb-dist even outperforms PEDL in relation prediction when no directly supervised data is available. This might partly be because the inclusion of directly supervised data stabilizes PEDL’s training process. In preliminary experiments on the PID dataset, we observed that without access to directly supervised data the model failed to converge and instead set the whole score matrix to zero. We attribute this to the fact that usually only a few of the (max.) 100 text spans actually express the annotated relation, and we think that the directly supervised data compensates for the resulting label imbalance for evidence prediction.

Notably, PEDL achieves strong results for evidence prediction even without access to directly supervised data. This suggests that the constraint of aggregating only (logit) scores, and not high-dimensional embeddings as in comb-dist’s selective attention, is more appropriate for evidence prediction in the absence of directly supervised data. These scores also have a clear interpretation as the confidence of PEDL that the given text span supports a given PPA. The lower (average) performance of comb-dist in this setting can be attributed to strong performance drops for some random seeds (min. 24.03 versus max. 76.61), indicating a notable instability of the model. We furthermore found that running comb-dist with the most recent versions of PyTorch (1.4.0) and AllenNLP (0.9.0) leads to a performance drop of 1 to 5 pp for both relation prediction and evidence prediction.
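
The idea of aggregating per-span scores rather than embeddings can be sketched as follows. The aggregation function is not specified in this section, so log-sum-exp is used here purely as an illustrative smooth maximum; it is an assumption and not necessarily what PEDL implements. The score matrix layout (one row of logit scores per PPA type, one column per text span) and all names are likewise hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical layout: PPA type -> logit scores, one per text span of a protein pair
ScoreMatrix = Dict[str, List[float]]


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))); acts as a smooth maximum (assumed aggregator)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def rank_evidence(scores: ScoreMatrix, ppa_type: str) -> List[int]:
    """Indices of the text spans, ordered by descending score in the row of one PPA type."""
    row = scores[ppa_type]
    return sorted(range(len(row)), key=lambda i: row[i], reverse=True)


def pair_level_scores(scores: ScoreMatrix) -> Dict[str, float]:
    """One aggregated logit per PPA type for the whole protein pair."""
    return {ppa_type: logsumexp(row) for ppa_type, row in scores.items()}


# Toy scores for three text spans, invented for illustration only
toy_scores = {
    "in-complex-with": [-2.0, 3.0, 0.5],
    "controls-transport-of": [1.0, -4.0, -1.0],
}
```

Because each row belongs to one PPA type, the same text spans can be ranked differently for different types, which is the type-specific evidence ranking described in Section 3.2.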

4.2 Comparison to EVEX

The comparison of the two distantly supervised methods to EVEX (cf. Fig. 2) is instructive, because it allows us to compare methods trained only on directly supervised data with models that have access to both types of data. Especially striking is the difference in recall: EVEX contains predictions for only 36.15% of the positive protein pairs, whereas PEDL and comb-dist produce predictions for 95.1% and 95.33% of the protein pairs, respectively. This might be partially attributed to the advances in NER and normalization achieved since 2013 (which we implicitly incorporate by using PubTator Central), but it also stresses the importance of predicting relations for proteins that occur in different sentences. Recall that EVEX only considers single sentences. The importance of using multiple sentences is discussed further in the next section. Notably, the increased recall does not come at the price of reduced precision, as both PEDL and comb-dist strongly outperform EVEX in all precision regimes. Together with the encouraging results for evidence prediction, this indicates that distant supervision is a promising paradigm for training accurate classifiers for PPA prediction.

A related observation is that PEDL performs markedly worse on the before 2013 subset of the data, whereas comb-dist almost retains its performance. We hypothesized that this is because PEDL does not model the semantic interactions between text spans via attention, which makes it more susceptible to violations of the at-least-once assumption. To validate this, we inspected the top 10 predictions of PEDL for true PPAs with the largest drop in ranking between the full and the before 2013 data. We found that for nine of the ten PPAs, none of the texts published prior to 2013 contains any mention of the PPA. Additionally, for 5% of all true PPAs, no text published prior to 2013 contains any mention of the associated protein pair, which limits PEDL’s maximum recall to 95% on the before 2013 data.

4.3 Importance of using multiple sentences

We investigate the effect of considering protein mentions across sentences by measuring, for different values of d, the fraction of protein pairs in PID whose mentions are at most d characters away from each other in at least one text. Additionally, we report this quantity when only single sentences are considered, again using the en_core_sci_sm model of SciSpacy to split the text into sentences. The results are depicted in Fig. 3. Considering only protein mentions that occur within the same sentence has a strong limiting effect on maximum recall. Using d = 300, PEDL can predict PPAs for 87.9% of the positive pairs in PID, which is a large gain over the 59.24% that would be achievable if we considered only single sentences. This, however, comes at the price of including more negative protein pairs. PEDL predicts PPAs for 50.01% of the considered negative pairs, whereas a sentence-level approach would predict PPAs for only 9.25%. This highlights the importance of using a strong machine learning model to rank the predicted PPAs instead of relying on simple co-occurrence statistics in the high-recall regime.
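
The distance-based measurement described above can be sketched as follows. The sketch assumes that each mention is represented by a single character offset and that the distance between two mentions is the absolute difference of their offsets; how the distance is measured in the actual analysis (for example, between span boundaries) is not stated here. The data layout and all names are hypothetical.

```python
from typing import Dict, List, Tuple

# For one text: (offsets of the first protein's mentions, offsets of the second protein's mentions)
Mentions = Tuple[List[int], List[int]]
PairTexts = Dict[Tuple[str, str], List[Mentions]]


def min_distance(first: List[int], second: List[int]) -> float:
    """Smallest character distance between any mention of the two proteins in one text."""
    if not first or not second:
        return float("inf")
    return min(abs(a - b) for a in first for b in second)


def reachable_fraction(pairs: PairTexts, d: int) -> float:
    """Fraction of pairs with mentions at most d characters apart in at least one text.

    For positive pairs this is the maximum recall achievable at distance d.
    """
    if not pairs:
        return 0.0
    reachable = sum(
        1
        for texts in pairs.values()
        if any(min_distance(first, second) <= d for first, second in texts)
    )
    return reachable / len(pairs)


# Toy offsets, invented for illustration only
toy_pairs = {
    ("P1", "P2"): [([0], [250])],
    ("P3", "P4"): [([10], [900]), ([5], [400])],
    ("P5", "P6"): [],  # the pair is never mentioned together
}
```

Evaluating `reachable_fraction` over a range of d values for the positive and the negative pairs separately would yield curves of the kind shown in Fig. 3.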

Fig. 3.

Maximum possible recall for a given maximum character distance between the protein mentions. ‘Positive’ refers to protein pairs with at least one PPA in PID and ‘Negative’ to pairs without any. The dashed lines indicate the maximum recall that is possible for sentence level approaches. The red vertical line indicates our choice for the maximum distance between pairs

5 Conclusion

We propose PEDL, a method for predicting PPAs and their textual evidence by integrating deep language models, distant supervision and auxiliary directly supervised data. We compare PEDL on three different datasets with two state-of-the-art methods and find that, on average, it outperforms them in most cases and performs comparably in the remaining ones. A manual evaluation of the predicted PPAs shows that PEDL can be used to identify PPAs that are missing from major pathway databases. Furthermore, we demonstrate that the predicted evidence text spans actually express the relation and thus can be used to quickly verify the predicted PPAs.

Owing to the incorporation of BERT, the method proposed in this article has very high runtime requirements, which make it unsuitable for predicting PPAs between all possible pairs. This problem could be solved using recently published model distillation techniques for BERT (Sanh ). We only address pairwise PPA prediction, in which a relation holds between exactly two proteins. Actual biochemical reactions are much more complex than that, as they can have multiple reactants, products and regulators, which can also be protein complexes or completely different molecules (Berg ). It would be worthwhile to study whether biomedical event extraction (Ohta ) can be combined with distant supervision to predict such complex biochemical reactions. Finally, PEDL could also be used to predict evidence for known PPAs, for instance those from the distantly supervised training data, which we did not investigate in this work.

27 in total

1. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles.

Authors: C Friedman; P Kra; H Yu; M Krauthammer; A Rzhetsky
Journal: Bioinformatics Date: 2001 Impact factor: 6.937

2. Dynamic regulation of cyclooxygenase-2 promoter activity by isoforms of CCAAT/enhancer-binding proteins.

Authors: Ying Zhu; Michael A Saunders; Howard Yeh; Wu-guo Deng; Kenneth K Wu
Journal: J Biol Chem Date: 2001-12-10 Impact factor: 5.157

Review 3. Network biology: understanding the cell's functional organization.

Authors: Albert-László Barabási; Zoltán N Oltvai
Journal: Nat Rev Genet Date: 2004-02 Impact factor: 53.242

4. The Rel/NF-kappaB family directly activates expression of the apoptosis inhibitor Bcl-x(L).

Authors: C Chen; L C Edelstein; C Gélinas
Journal: Mol Cell Biol Date: 2000-04 Impact factor: 4.272

5. Transcriptional regulation of cyclooxygenase-2 in mouse skin carcinoma cells. Regulatory role of CCAAT/enhancer-binding proteins in the differential expression of cyclooxygenase-2 in normal and neoplastic tissues.

Authors: Y Kim; S M Fischer
Journal: J Biol Chem Date: 1998-10-16 Impact factor: 5.157

6. The E3 ubiquitin ligase Siah2 contributes to castration-resistant prostate cancer by regulation of androgen receptor transcriptional activity.

Authors: Jianfei Qi; Manisha Tripathi; Rajeev Mishra; Natasha Sahgal; Ladan Fazli; Ladan Fazil; Susan Ettinger; William J Placzek; Giuseppina Claps; Leland W K Chung; David Bowtell; Martin Gleave; Neil Bhowmick; Ze'ev A Ronai
Journal: Cancer Cell Date: 2013-03-18 Impact factor: 31.743

7. High-performance web services for querying gene and variant annotation.

Authors: Jiwen Xin; Adam Mark; Cyrus Afrasiabi; Ginger Tsueng; Moritz Juchler; Nikhil Gopal; Gregory S Stupp; Timothy E Putman; Benjamin J Ainscough; Obi L Griffith; Ali Torkamani; Patricia L Whetzel; Christopher J Mungall; Sean D Mooney; Andrew I Su; Chunlei Wu
Journal: Genome Biol Date: 2016-05-06 Impact factor: 13.583

8. Feedbacks, Bifurcations, and Cell Fate Decision-Making in the p53 System.

Authors: Beata Hat; Marek Kochańczyk; Marta N Bogdał; Tomasz Lipniacki
Journal: PLoS Comput Biol Date: 2016-02-29 Impact factor: 4.475

9. Synthesizing Signaling Pathways from Temporal Phosphoproteomic Data.

Authors: Ali Sinan Köksal; Kirsten Beck; Dylan R Cronin; Aaron McKenna; Nathan D Camp; Saurabh Srivastava; Matthew E MacGilvray; Rastislav Bodík; Alejandro Wolf-Yadlin; Ernest Fraenkel; Jasmin Fisher; Anthony Gitter
Journal: Cell Rep Date: 2018-09-25 Impact factor: 9.423

10. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Authors: Jinhyuk Lee; Wonjin Yoon; Sungdong Kim; Donghyeon Kim; Sunkyu Kim; Chan Ho So; Jaewoo Kang
Journal: Bioinformatics Date: 2020-02-15 Impact factor: 6.937