Kersten Döring¹, Ammar Qaseem¹, Michael Becer¹, Jianyu Li¹, Pankaj Mishra¹, Mingjie Gao¹, Pascal Kirchner¹, Florian Sauter¹, Kiran K Telukunta¹, Aurélien F A Moumbock¹, Philippe Thomas², Stefan Günther¹.
Abstract
MOTIVATION: Much effort has been invested in the identification of protein-protein interactions using text mining and machine learning methods. The extraction of functional relationships between chemical compounds and proteins from literature has received much less attention, and no ready-to-use open-source software is so far available for this task.
Year: 2020 PMID: 32126064 PMCID: PMC7053725 DOI: 10.1371/journal.pone.0220925
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. a) Direct functional relation with interaction verb. The orange-coloured verb is enclosed by the compound 7-ketocholesterol, shown in blue, and the protein interleukin-6, shown in green. The pair was annotated as functional. b) Indirect functional relation with interaction verb. Diallyl sulfide is influencing cyclooxygenase 2 indirectly by inhibiting its expression. The pair was annotated as functional. The compound diallyl sulfide and the protein IL-1beta enclose an interaction verb, but do not describe a functional relation. c) Direct functional relation without interaction verb. The molecule cholesterol is metabolised to pregnenolone by CYP11A. This is indicated by the word conversion. The pair was annotated as functional.
Analysis of the CPI benchmark dataset.
| DS | #Sent. | #CPIs | #No-CPIs | Total | Rec. | Spec. | Prec. | F1 |
|---|---|---|---|---|---|---|---|---|
| CPI-DS | 2613 | 2931 | 2631 | 5562 | 100.0 | 0.0 | 52.7 | 69.0 |
Baseline results for precision, recall, and F1 score based on simple co-occurrences. Results are shown in percent (DS—dataset, Sent.—sentences, Rec.—recall, Spec.—specificity, Prec.—precision, F1—F1 score).
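The co-occurrence baseline treats every compound-protein pair found in a sentence as functional, so recall is 100% and specificity 0% by construction, and precision is simply the share of positive pairs. The following minimal Python sketch reproduces that arithmetic from the counts in the table; the function name `cooccurrence_baseline` is illustrative and does not come from the paper.

```python
def cooccurrence_baseline(n_pos: int, n_neg: int) -> dict:
    """Metrics (in percent) when every candidate pair is predicted as functional."""
    # Predicting "functional" for everything: all positives are hits,
    # all negatives are false alarms.
    tp, fp, fn, tn = n_pos, n_neg, 0, 0
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    metrics = {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }
    return {name: round(100 * value, 1) for name, value in metrics.items()}


# CPI-DS: 2931 functional pairs (#CPIs) and 2631 non-functional pairs (#No-CPIs).
print(cooccurrence_baseline(2931, 2631))
```

With the CPI-DS counts this gives the precision of 52.7 and F1 of 69.0 listed in the table.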
Shallow linguistic kernel results on the CPI-DS dataset.
| n | w | Rec. | Spec. | Prec. | F1 | AUC |
|---|---|---|---|---|---|---|
| 1 | 1 | 76.6 | 69.8 | 74.0 | 75.2 | 80.7 |
| 1 | 2 | 85.1 | 61.0 | 71.1 | 77.4 | 80.5 |
| 1 | 3 | 87.2 | 56.3 | 69.1 | 77.0 | 80.3 |
| 2 | 1 | 78.5 | 70.8 | 75.1 | 76.7 | 81.8 |
| 2 | 2 | 85.6 | 62.7 | 72.1 | 78.2 | 81.5 |
| 2 | 3 | 87.0 | 59.8 | 70.9 | 78.1 | 81.3 |
| 3 | 1 | 79.5 | 70.2 | 75.0 | 77.2 | 82.5 |
| 3 | 2 | 86.6 | 62.8 | 72.4 | 78.8 | 82.2 |
| 3 | 3 | 87.3 | 60.0 | 71.1 | 78.3 | 82.1 |
| nes. | cr. val. | 82.3 | 66.2 | 73.4 | 77.5 | 82.2 |
The first parameter shows the n-gram value, the second represents the window size. Values in percent: Rec.—recall, Spec.—specificity, Prec.—precision, F1—F1 score, AUC—area under the curve, nes. cr. val.—nested cross validation.
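To make the two parameters concrete, the sketch below shows what an n-gram value and a window size select from a tokenized sentence. It assumes that n bounds the length of the token n-grams and that w is the number of tokens taken on each side of an entity mention. It is an illustration only, not the kernel implementation used in the paper, and the helper names `ngrams` and `window` and the toy sentence are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous token sequences of length 1 up to n."""
    counts = Counter()
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            counts[tuple(tokens[i:i + size])] += 1
    return counts


def window(tokens, index, w):
    """Return up to w tokens to the left and right of tokens[index]."""
    left = tokens[max(0, index - w):index]
    right = tokens[index + 1:index + 1 + w]
    return left + right


# Toy sentence with entity mentions replaced by placeholders (not from the corpus).
tokens = "COMPOUND inhibits the expression of PROTEIN".split()
between = tokens[1:5]  # tokens located between the two entities

print(ngrams(between, 2))    # unigrams and bigrams of the in-between context
print(window(tokens, 0, 2))  # local context of the compound mention
```

Larger n and w values add more context features per candidate pair, which is one plausible reading of the recall and specificity shifts across the rows of the table.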
APG kernel results on the CPI-DS dataset.
| c | Rec. | Spec. | Prec. | F1 | AUC |
|---|---|---|---|---|---|
| 0.25 | 81.7 | 71.8 | 76.6 | 79.0 | 84.6 |
| 0.50 | 82.7 | 70.2 | 75.8 | 79.0 | 84.6 |
| 1.00 | 81.4 | 72.0 | 76.6 | 78.8 | 84.4 |
| 2.00 | 79.7 | 73.2 | 77.0 | 78.2 | 84.1 |
| nes. cr. val. | 81.7 | 71.7 | 76.4 | 78.9 | 84.5 |
c is the hyperplane optimization parameter. Values in percent: Rec.—recall, Spec.—specificity, Prec.—precision, F1—F1 score, AUC—area under the curve, nes. cr. val.—nested cross validation.
Fig 2. SL and APG kernel comparison.
Area under the curve (AUC) of SL kernel (n = 3, w = 1) and APG kernel (c = 0.25).
Comparison of the combined kernels to each individual kernel.
The precision of each kernel was set to the same level as in the combination by majority voting.
| Kernel | Rec. | Spec. | Prec. | Acc. | F1 |
|---|---|---|---|---|---|
| | 61.0 | 83.6 | | 71.7 | 69.4 |
| | 68.4 | 81.6 | | 74.6 | 73.9 |
| | 68.5 | 81.6 | | 74.7 | 74.0 |
Values in percent: Rec.—recall, Spec.—specificity, Prec.—precision, Acc.—accuracy, F1—F1 score.
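The rows of this comparison can be cross-checked arithmetically: given recall, specificity, and the class counts of CPI-DS, the values of precision, accuracy, and F1 follow directly. The sketch below does this under two assumptions that the source does not state: that the first two numbers in each row are recall and specificity, and that the metrics were computed on pooled predictions over all 2931 functional and 2631 non-functional pairs. The function name `derived_metrics` is illustrative.

```python
def derived_metrics(recall_pct, specificity_pct, n_pos, n_neg):
    """Derive precision, accuracy and F1 (in percent) from recall and specificity."""
    tp = recall_pct / 100 * n_pos        # implied true positives
    tn = specificity_pct / 100 * n_neg   # implied true negatives
    fp = n_neg - tn                      # implied false positives
    recall = recall_pct / 100
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (n_pos + n_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": 100 * precision,
        "accuracy": 100 * accuracy,
        "f1": 100 * f1,
    }


# Recall and specificity of the three table rows; class counts from CPI-DS.
for recall_pct, specificity_pct in [(61.0, 83.6), (68.4, 81.6), (68.5, 81.6)]:
    print(recall_pct, specificity_pct,
          derived_metrics(recall_pct, specificity_pct, 2931, 2631))
```

Under these assumptions the derived accuracy and F1 land within rounding distance of the remaining two values in each row, and the derived precision is nearly identical across the three rows, which is consistent with the statement that precision was set to a common level.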
Fig 3. Ratios of CPI-pairs in sentences with and without interaction verbs.
Basic statistics of the two compound-protein interaction corpora.
| DS | #Sent. | #CPIs | #No-CPIs | Total | Rec. | Spec. | Prec. | F1 |
|---|---|---|---|---|---|---|---|---|
| CPI-DS_IV | 1209 | 1598 | 1269 | 2867 | 100.0 | 0.0 | 55.7 | 71.6 |
| CPI-DS_NIV | 1404 | 1333 | 1362 | 2695 | 100.0 | 0.0 | 49.5 | 66.2 |
Baseline results for precision, recall, and F1 score are derived by using co-occurrences. Values in percent (DS—dataset, Sent.—sentences, Rec.—recall, Spec.—specificity, Prec.—precision, F1—F1 score).
SL kernel results on the datasets CPI-DS_IV and CPI-DS_NIV.
| n | w | Rec. (IV) | Spec. (IV) | Prec. (IV) | F1 (IV) | AUC (IV) | Rec. (NIV) | Spec. (NIV) | Prec. (NIV) | F1 (NIV) | AUC (NIV) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 77.5 | 67.4 | 75.7 | 76.5 | 79.4 | 78.7 | 70.7 | 73.1 | 75.6 | 80.6 |
| 1 | 2 | 81.3 | 65.0 | 75.3 | 78.1 | 79.9 | 82.2 | 64.7 | 70.2 | 75.6 | 79.7 |
| 1 | 3 | 80.6 | 64.3 | 74.6 | 77.4 | 79.6 | 84.0 | 63.0 | 69.5 | 75.9 | 79.2 |
| 2 | 1 | 78.1 | 70.0 | 77.3 | 77.6 | 80.8 | 78.4 | 71.7 | 73.9 | 75.9 | 81.7 |
| 2 | 2 | 80.5 | 66.3 | 75.6 | 77.9 | 80.8 | 84.0 | 63.9 | 70.3 | 76.4 | 80.9 |
| 2 | 3 | 80.2 | 65.8 | 75.1 | 77.5 | 80.2 | 85.0 | 63.7 | 70.3 | 76.8 | 80.4 |
| 3 | 1 | 77.9 | 71.1 | 78.0 | 77.8 | 81.3 | 80.3 | 70.5 | 73.4 | 76.6 | 82.5 |
| 3 | 2 | 81.2 | 66.5 | 76.0 | 78.4 | 81.1 | 85.7 | 63.4 | 70.4 | 77.2 | 81.8 |
| 3 | 3 | 80.1 | 66.9 | 75.9 | 77.8 | 80.8 | 86.7 | 62.0 | 69.7 | 77.2 | 81.4 |
The first parameter shows the n-gram value, and the second number represents the window size. Values in percent: Rec.—recall, Spec.—specificity, Prec.—precision, F1—F1 score, AUC—area under the curve.
CPI-DS_IV and CPI-DS_NIV results for the APG kernel pipeline.
| c | Rec. (IV) | Spec. (IV) | Prec. (IV) | F1 (IV) | AUC (IV) | Rec. (NIV) | Spec. (NIV) | Prec. (NIV) | F1 (NIV) | AUC (NIV) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.25 | 82.8 | 70.0 | 77.9 | 80.1 | 84.4 | 80.4 | 69.4 | 73.4 | 76.4 | 82.2 |
| 0.50 | 82.0 | 70.0 | 78.0 | 79.8 | 84.5 | 77.9 | 70.6 | 73.4 | 75.1 | 82.4 |
| 1.00 | 81.2 | 72.0 | 78.9 | 79.8 | 84.3 | 75.1 | 74.3 | 75.1 | 74.9 | 82.6 |
| 2.00 | 80.5 | 71.3 | 78.7 | 79.4 | 83.8 | 74.9 | 74.2 | 74.7 | 74.6 | 82.8 |
Values in percent: Rec.—recall, Spec.—specificity, Prec.—precision, F1—F1 score, AUC—area under the curve.
Application of the CPI-Pipeline to the PubMed dataset.
| Kernel | APG | SL |
|---|---|---|
| PubMed articles | 29M | |
| Number of sentences | 125M | |
| Number of sentences with candidate pairs | 6.1M | |
| Number of candidate pairs | 14.5M | |
| Functional relations | 7.9M = 54.9% | 8.1M = 56.2% |
| Non-functional relations | 6.5M = 45.1% | 6.3M = 43.8% |
| Number of identical predictions | 11M = 76.6% | |
| Number of predicted distinct functional relations | 2.1M | |
| Pre-processing elapsed time | 21.0 days | 14.4 days |
| Kernel elapsed time | 4.3 days | 1.6 days |
| Total elapsed time | 25.3 days | 16.0 days |
PubMed dataset statistics for the selected APG (c = 0.25) and SL (n = 3, w = 1) models.
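The shares in this table (functional versus non-functional predictions per kernel, and identical predictions between the two kernels) are simple ratios over the candidate pairs. The sketch below shows how such a summary could be computed from two lists of per-pair predictions. The function name `summarize` and the toy input are hypothetical, and the per-pair predictions behind the table are not part of this record.

```python
def summarize(apg, sl):
    """Compare two equally long lists of boolean predictions (True = functional)."""
    if len(apg) != len(sl):
        raise ValueError("prediction lists must have the same length")
    total = len(apg)
    # Count candidate pairs on which both kernels give the same label.
    identical = sum(a == s for a, s in zip(apg, sl))
    return {
        "apg_functional_pct": 100 * sum(apg) / total,
        "sl_functional_pct": 100 * sum(sl) / total,
        "identical_pct": 100 * identical / total,
    }


# Toy input for illustration only, not data from the paper.
apg = [True, True, False, False, True]
sl = [True, False, False, True, True]
print(summarize(apg, sl))
```

Because the table reports counts rounded to the nearest 0.1M, recomputing its percentages from the rounded counts can differ slightly from the printed values.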