Aparna Elangovan1, Yuan Li1, Douglas E V Pires1, Melissa J Davis2,3, Karin Verspoor4,5.
Abstract
MOTIVATION: Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct. Annotation is mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs, along with their pairwise PTMs, from the literature, using deep learning trained on distantly supervised data to aid human curation.
Keywords: BioBERT; Deep learning; Distant supervision; Natural language processing; Post-translational modifications; Protein-protein interaction
Year: 2022 PMID: 34983371 PMCID: PMC8729035 DOI: 10.1186/s12859-021-04504-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Illustration of data preparation
| IntAct | |
|---|---|
| Corresponding abstract | |
| Genes and normalised NCBI ids | |
| NCBI to Uniprot map | |
| Normalised abstract for pair | |
| Negative sample pairs | |
Bold highlights where the relationship is specified within the abstract
The IntAct database records the PubMed ID, the UniProt identifiers of the participating proteins, and the interaction type. More than one pair of interacting proteins can be annotated against a given PubMed ID, and not all of these interactions are described in the abstract, so the resulting training data is noisily labelled
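The labelling scheme described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `build_samples`, the input shapes, the rule that every co-mentioned but unannotated pair becomes an `other/Negative` sample, and the rule that annotated pairs are kept only when both proteins are recognised in the abstract are all assumptions made for the example.

```python
from itertools import combinations


def build_samples(intact_rows, abstract_proteins):
    """Create noisily labelled samples from IntAct-style annotations.

    intact_rows: iterable of (pubmed_id, uniprot_a, uniprot_b, interaction_type).
    abstract_proteins: dict mapping pubmed_id to the set of UniProt ids
        recognised in that abstract (hypothetical output of NER plus the
        NCBI-to-UniProt mapping).
    """
    # Group the annotated pairs by abstract; pairs are treated as unordered
    # (an assumption of this sketch).
    annotated = {}
    for pmid, a, b, interaction_type in intact_rows:
        annotated.setdefault(pmid, {})[frozenset((a, b))] = interaction_type

    samples = []
    for pmid, pairs in annotated.items():
        mentioned = sorted(abstract_proteins.get(pmid, set()))
        # Only pairs where both proteins appear in the abstract are emitted,
        # so annotations not visible in the abstract are dropped (assumption).
        for a, b in combinations(mentioned, 2):
            label = pairs.get(frozenset((a, b)), "other/Negative")
            samples.append((pmid, a, b, label))
    return samples


rows = [
    ("1", "P1", "P2", "phosphorylation"),
    ("1", "P1", "P9", "phosphorylation"),  # P9 is not mentioned in the abstract
]
proteins = {"1": {"P1", "P2", "P3"}}
samples = build_samples(rows, proteins)
```

The labels remain noisy even after this step: a pair annotated in IntAct may be mentioned in the abstract without the interaction itself being described there.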
Train/val/test negative (–) and positive (+) samples for each interaction type.
| Interaction type | Train – | Train + | Val – | Val + | Test – | Test + | Total – | Total + |
|---|---|---|---|---|---|---|---|---|
| Acetylation | 31 | 5 | 2 | 1 | 9 | 1 | 42 | 7 |
| Dephosphorylation | 167 | 28 | 36 | 10 | 33 | 6 | 236 | 44 |
| Deubiquitination | 4 | 2 | 0 | 0 | 0 | 0 | 4 | 2 |
| Methylation | 22 | 10 | 5 | 1 | 22 | 4 | 49 | 15 |
| Phosphorylation | 862 | 139 | 118 | 21 | 227 | 44 | 1207 | 204 |
| Ubiquitination | 30 | 5 | 5 | 1 | 5 | 1 | 40 | 7 |
| Total | 1116 | 189 | 166 | 34 | 296 | 56 | 1578 | 279 |
The negative sample count against an interaction type solely indicates how many negatives were derived from abstracts describing that interaction type. All negative samples belong to a single class “other/Negative”
The performance of ensemble PPI-BioBERT-x10 on the test and validation set
| Dataset | Interaction | P | R | F1 | ECE | SD | Support |
|---|---|---|---|---|---|---|---|
| Test | Acetylation | 100.00 | 100.00 | 100.00 | 0.49 | 0.25 | 1 |
| Test | Dephosphorylation | 50.00 | 16.67 | 25.00 | 0.67 | 0.40 | 6 |
| Test | Methylation | 25.00 | 25.00 | 25.00 | 0.60 | 0.28 | 4 |
| Test | Phosphorylation | 62.50 | 34.09 | 44.12 | 0.79 | 0.26 | 44 |
| Test | Ubiquitination | 0.00 | 0.00 | 0.00 | – | – | 1 |
| Test | ECE | – | – | – | 0.75 | – | 31 |
| Test | Average SD | – | – | – | – | 0.28 | 31 |
| Test | Macro avg | 47.50 | 35.15 | 38.82 | – | – | 56 |
| Test | Micro avg | 58.06 | 32.14 | 41.38 | – | – | 56 |
| Val | Acetylation | 100.00 | 100.00 | 100.00 | 0.53 | 0.16 | 1 |
| Val | Dephosphorylation | 66.67 | 20.00 | 30.77 | 0.61 | 0.37 | 10 |
| Val | Methylation | 0.00 | 0.00 | 0.00 | 0.53 | 0.29 | 1 |
| Val | Phosphorylation | 77.78 | 66.67 | 71.79 | 0.78 | 0.26 | 21 |
| Val | Ubiquitination | 0.00 | 0.00 | 0.00 | – | – | 1 |
| Val | ECE | – | – | – | 0.73 | – | 24 |
| Val | Average SD | – | – | – | – | 0.28 | 24 |
| Val | Macro avg | 48.89 | 37.33 | 40.51 | – | – | 34 |
| Val | Micro avg | 70.83 | 50.00 | 58.62 | – | – | 34 |
ECE is the expected calibration error. SD denotes the average standard deviation within the ensemble
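For reference, expected calibration error is commonly computed by grouping predictions into confidence bins and taking the weighted average gap between accuracy and mean confidence in each bin: ECE = Σ_m (|B_m| / n) · |acc(B_m) − conf(B_m)|. The sketch below follows that standard definition with equal-width bins. The function name and the exact bin-edge handling are assumptions, and the source does not state whether the reported values were computed in precisely this way.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE over equal-width confidence bins in [0, 1].

    confidences: predicted confidence per sample (for an ensemble, e.g. the
        average confidence across its members).
    correct: booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Two predictions at 0.95 confidence, only one of them correct.
example_ece = expected_calibration_error([0.95, 0.95], [True, False])
```

A lower value means the confidence scores track the observed accuracy more closely.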
Fig. 1 Confusion matrix on the test and validation sets for PPI-BioBERT-x10
Fig. 2 Confidence reliability and confidence histogram on the test and validation sets using PPI-BioBERT-x10. The average predicted confidence score across the ensemble, for a given input, falls into one of 10 equally spaced confidence bins
Fig. 3 Confidence scores predicted by each model within the ensemble for each sample in the test and validation sets using PPI-BioBERT-x10: (left) standard deviation for correct predictions and (right) standard deviation for incorrect predictions. Incorrect predictions do not show low variation in the confidence score, except for the 3 phosphorylation test samples, which on manual verification are in fact correct
Manually verified relationships in the validation and test sets whose ensemble prediction standard deviation is below the interaction-wise threshold and whose predicted confidence is above the interaction-wise threshold, taking the precision to 100.0
| PubMed | Phrases describing relationships in the abstract | Label | Prediction |
|---|---|---|---|
| Validation | | | |
| 12150926 | | phosphorylation | phosphorylation |
| 15733869 | | phosphorylation | phosphorylation |
| 15557335 | | phosphorylation | phosphorylation |
| 15527798 | Phosphorylation of | phosphorylation | phosphorylation |
| 10864201 | Radiation-induced phosphorylation of | Phosphorylation | phosphorylation |
| 24548923 | | phosphorylation | phosphorylation |
| 19407811 | | acetylation | acetylation |
| Test | | | |
| 21920476 | | phosphorylation | phosphorylation |
| 19424295 | Previous studies of | phosphorylation | phosphorylation |
| 22726438 | | phosphorylation | phosphorylation |
| 11154276 | | phosphorylation | phosphorylation |
| 25605758 | Phosphorylation of | phosphorylation | phosphorylation |
| 20856200 | Binding of | phosphorylation | phosphorylation |
| 21986944 | | Negative | phosphorylation |
| 15862297 | | Negative | phosphorylation |
| 21887822 | | Negative | phosphorylation |
Bold highlights the mentions of participating proteins
The results of large scale prediction
| PTM | All | All (U) | HQ | HQ (U) | HQ MA | HQ MA (U) |
|---|---|---|---|---|---|---|
| Acetylation | 7807 | 6113 | 1 | 1 | 0 | 0 |
| Dephosphorylation | 85965 | 50004 | 29 | 29 | 1 | 1 |
| Deubiquitination | 510 | 460 | 0 | 0 | 0 | 0 |
| Methylation | 52612 | 29914 | 20 | 18 | 4 | 2 |
| Phosphorylation | 1300930 | 381157 | 5654 | 4532 | 1659 | 537 |
| Ubiquitination | 152048 | 78859 | 4 | 4 | 0 | 0 |
| Total | 1599872 | 546507 | 5708 | 4584 | 1664 | 540 |
All is the full set of predictions from PPI-BioBERT-x10. (U) denotes unique PTM-PPI triplet predictions. HQ denotes high-quality PTM-PPI predictions after thresholding. HQ MA denotes high-quality PTM-PPI predictions found in multiple abstracts
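The relationship between the raw and unique columns can be illustrated with a short sketch. It assumes each prediction is a tuple of PubMed ID, two protein identifiers and a PTM type, and that the protein pair is unordered. The function name, the tuple layout and the unordered-pair choice are illustrative assumptions, not details taken from the source.

```python
from collections import defaultdict


def summarise_predictions(predictions):
    """Count raw, unique and multi-abstract PTM-PPI triplets.

    predictions: iterable of (pubmed_id, protein_a, protein_b, ptm).
    """
    abstracts = defaultdict(set)  # triplet -> PubMed ids supporting it
    counts = defaultdict(int)     # triplet -> number of raw predictions
    for pmid, a, b, ptm in predictions:
        # Treating the pair as unordered is an assumption of this sketch.
        triplet = (ptm, frozenset((a, b)))
        abstracts[triplet].add(pmid)
        counts[triplet] += 1

    multi = [t for t, pmids in abstracts.items() if len(pmids) > 1]
    return {
        "all": sum(counts.values()),
        "unique": len(counts),
        "multi_abstract": sum(counts[t] for t in multi),
        "multi_abstract_unique": len(multi),
    }


summary = summarise_predictions([
    ("1", "A", "B", "phosphorylation"),
    ("2", "B", "A", "phosphorylation"),  # same triplet, different abstract
    ("3", "A", "C", "methylation"),
])
```

Applied to only the high-quality predictions, the same counting would yield figures analogous to the HQ, HQ (U), HQ MA and HQ MA (U) columns.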
Fig. 4 Distribution of PPI-BioBERT-x10 model prediction confidence (C) and confidence standard deviation (V) during large scale prediction. T is the total number of predictions and HQ is the percentage of high quality predictions (the orange region). High quality predictions are selected by requiring both high confidence and low standard deviation, with thresholds based on the train set confidence distribution
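The selection rule in the caption above, high average confidence combined with low standard deviation across the ensemble, can be sketched as follows. The threshold values, the function name and the use of the population standard deviation are assumptions for illustration only. The source states that the thresholds are interaction-wise and derived from the train set confidence distribution, but does not give the derivation here.

```python
from statistics import mean, pstdev


def is_high_quality(member_confidences, ptm, min_confidence, max_sd):
    """Return True when an ensemble prediction passes both thresholds.

    member_confidences: one confidence score per ensemble member.
    min_confidence, max_sd: dicts of per-PTM thresholds (hypothetical values).
    """
    average = mean(member_confidences)
    spread = pstdev(member_confidences)  # population SD is an assumption
    return average > min_confidence[ptm] and spread < max_sd[ptm]


# Hypothetical thresholds, not taken from the paper.
min_confidence = {"phosphorylation": 0.7}
max_sd = {"phosphorylation": 0.1}

# Ten members that agree closely: kept.
agreeing = is_high_quality([0.9] * 10, "phosphorylation", min_confidence, max_sd)
# Ten members that disagree (SD of 0.2): rejected despite a mean of about 0.8.
disagreeing = is_high_quality([1.0, 0.6] * 5, "phosphorylation", min_confidence, max_sd)
```

Requiring both conditions filters out predictions where a high average hides disagreement between ensemble members.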
Human evaluation on a randomly sampled subset (30 interactions per PTM, unless there are fewer predictions) selected after thresholding on average confidence and standard deviation
| | Acety. | Dephosph. | Methy. | Phosph. | Ubiquit. | Total |
|---|---|---|---|---|---|---|
| Correct | 0 | 11 | 11 | 6 | 0 | 28 |
| Incorrect—DNA methylation | 0 | 0 | 2 | 0 | 0 | 2 |
| Incorrect—NER | 0 | 2 | 1 | 3 | 0 | 6 |
| Incorrect—no trigger word | 0 | 1 | 0 | 2 | 4 | 7 |
| Incorrect—opposite type | 0 | 1 | 0 | 0 | 0 | 1 |
| Incorrect—relationship not described | 0 | 14 | 4 | 19 | 0 | 37 |
| Not sure | 1 | 0 | 1 | 0 | 0 | 2 |
| Total | 1 | 29 | 19 | 30 | 4 | 83 |
With the multiple-abstracts filter: human evaluation on a randomly sampled subset selected after thresholding on average confidence and standard deviation, with the additional condition that these predictions are present in multiple abstracts
| | Methyl. | Phosphoryl. | Total |
|---|---|---|---|
| Correct | 4 | 16 | 20 |
| Incorrect—NER | 0 | 2 | 2 |
| Incorrect—Not related to PPI | 0 | 1 | 1 |
| Incorrect—relationship not described | 0 | 7 | 7 |
| Not sure | 0 | 4 | 4 |
| Total | 4 | 30 | 34 |
We select 30 interactions per PTM, unless there are fewer predictions
Comparison of PPI-BioBERT-x10 predictions with iPTMnet
| PTM | iP Total | iP Unique | iP Uniprots | Ours | Ours HQ | iP RLIMS |
|---|---|---|---|---|---|---|
| Acetylation | 141 | 73 | 12 | 0 | 0 | 0 |
| Methylation | 7 | 4 | 4 | 0 | 0 | 0 |
| Phosphorylation | 21050 | 8949 | 8805 | 3270 | 815 | 358 |
| Ubiquitination | 2 | 1 | 0 | 0 | 0 | 0 |
Of all PTM-PPI entries in iPTMnet (iP Total), iP Unique is the subset of unique entries. iP Uniprots is the subset of the unique PTM-PPIs that has associated UniProt identifiers. iP RLIMS is the number of unique PPI-PTMs sourced from RLIMS+. Ours is the number of all PPI-BioBERT-x10 predictions that can be found in iPTMnet. Ours HQ is the number of high quality PPI-BioBERT-x10 predictions, after confidence thresholding, that can be found in iPTMnet
Noise levels in the training data after noise reduction, assessed by verifying whether the PPI relationship is described in the abstract
| Interaction type | Correct | Not sure | Total |
|---|---|---|---|
| Acetylation | 4 | 1 | 5 |
| Dephosphorylation | 6 | 4 | 10 |
| Deubiquitination | 1 | 1 | 2 |
| Methylation | 4 | 6 | 10 |
| Phosphorylation | 6 | 4 | 10 |
| Ubiquitination | 2 | 3 | 5 |
| Total | 23 | 19 | 42 |
The training data is randomly sampled (10 samples per interaction type unless the number of available training samples is lower) and verified by a human annotator
Fig. 5 UI built using Amazon SageMaker Ground Truth for verifying PTM-PPI
Fig. 6 PTM-wise cosine similarity of test and large-scale abstracts with the train set. The blue region is the similarity of the test set with the train set. The orange region is the similarity of the abstracts from large scale predictions with the train set. The count vector representations of the abstracts from the high quality large scale predictions (top) are more similar to the test and train abstracts than those of the low quality predictions (bottom). Note that acetylation and ubiquitination have very few test samples
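The similarity measure named in this caption, cosine similarity between count vectors, can be sketched with the standard library alone. The tokenisation rule and the function names are assumptions, and the source does not specify how individual abstracts are aggregated against the train set, so the example compares two texts only.

```python
import math
import re
from collections import Counter


def count_vector(text):
    """Bag-of-words term counts using a simple lowercase tokeniser (assumed)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[term] * vec_b[term] for term in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


train_text = "kinase phosphorylation of the substrate"
candidate_text = "phosphorylation of the substrate by the kinase"
similarity = cosine_similarity(count_vector(train_text), count_vector(candidate_text))
```

Because count vectors ignore word order, two abstracts that share vocabulary score highly even when they describe different relationships, so the measure reflects topical overlap rather than agreement on the interaction.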
Fig. 7 PTM-wise common words in the train, test and large scale prediction abstracts. The high quality predictions from the large scale extraction have picked up “key terms” associated with the interaction type, e.g. phosphorylation predictions have common words such as phosphorylation and kinase. The low quality predictions from the large scale extraction, on the other hand, have generic words such as cell, activity and expression. For ubiquitination, the model does not seem to have picked up the “key terms” across test and large scale predictions, and even in the training set, the term “ubiquitination” has relatively low representation compared to terms such as “protein(s)” and “cell(s)”