Sumit Madan, Justyna Szostak, Ravikumar Komandur Elayavilli, Richard Tzong-Han Tsai, Mehdi Ali, Longhua Qian, Majid Rastegar-Mojarad, Julia Hoeng, Juliane Fluck.
Abstract
Knowledge of the molecular interactions of biological and chemical entities and their involvement in biological processes or clinical phenotypes is important for data interpretation. Unfortunately, this knowledge is mostly embedded in the literature in a form that is unavailable to automated data analysis procedures. Biological Expression Language (BEL) is a syntax that allows the structured representation of a broad range of biological relationships. It is used in various settings to extract such knowledge and transform it into BEL networks. To support the tedious and time-intensive extraction work of curators with automated methods, we developed the BEL track within the framework of the BioCreative Challenges. Within the BEL track, we provide training data and an evaluation environment to encourage the text mining community to tackle the automatic extraction of complex BEL relationships. In BioCreative VI (2017), the 2015 BEL track was repeated with new test data. Although only minor improvements in text snippet retrieval for given statements were achieved during this second BEL task iteration, a significant increase in BEL statement extraction performance from provided sentences could be seen. The best performing system reached a 32% F-score for the extraction of complete BEL statements; with the named entities given, this increased to 49%. This time, besides rule-based systems, new methods involving hierarchical sequence labeling and neural networks were applied for BEL statement extraction.
Year: 2019 PMID: 31603193 PMCID: PMC6787548 DOI: 10.1093/database/baz084
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Example of a BEL statement extracted from the sentence ‘IL-1β caused a time-dependent increase in Caco-2 ATF-2 phosphorylation, starting at 10 min of treatment (Fig. 3A).’ (46) (PMID: 23656735) (BEL:201720027; identifier in BEL corpus). The BEL statement consists of two protein abundances [p(HGNC:IL1B) and p(HGNC:ATF2)], one protein modification function representing a phosphorylation event [pmod(P)] and a relationship type (increases).
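The components listed in the Figure 1 caption can be assembled into a single statement string. The following is a minimal sketch, assuming the phosphorylation modifier is attached to the ATF2 term and that the statement is rendered in BEL 1.0 style; the class names `ProteinAbundance` and `Statement` are hypothetical helpers, not part of any BEL library or of the participating systems.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProteinAbundance:
    """A p() term: namespace, entity name and an optional modification."""
    namespace: str
    name: str
    pmod: Optional[str] = None  # e.g. "P" for phosphorylation

    def to_bel(self) -> str:
        inner = f"{self.namespace}:{self.name}"
        if self.pmod is not None:
            # Assumption: the modification is nested inside the term.
            inner += f", pmod({self.pmod})"
        return f"p({inner})"


@dataclass(frozen=True)
class Statement:
    """Subject term, relationship type, object term."""
    subject: ProteinAbundance
    relation: str
    obj: ProteinAbundance

    def to_bel(self) -> str:
        return f"{self.subject.to_bel()} {self.relation} {self.obj.to_bel()}"


statement = Statement(
    subject=ProteinAbundance("HGNC", "IL1B"),
    relation="increases",
    obj=ProteinAbundance("HGNC", "ATF2", pmod="P"),
)
bel = statement.to_bel()
print(bel)
```

Keeping terms, functions and the relationship as separate fields mirrors the structural levels at which extracted statements are scored, so each part can be compared on its own.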
Figure 3. BELMiner 2.0 architecture.
Figure 2. An example of a candidate evaluation. The example shows the candidate sentence, the gold standard and the predicted statements. The scores are provided for all primary and secondary levels (8). Abbreviations: PubMed identifier (PMID), true positive (TP), false positive (FP), false negative (FN), recall (R), precision (P). Adapted and reprinted with permission from Fluck et al. (7).
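The TP, FP and FN counts named in the Figure 2 caption determine precision, recall and F-score through the standard definitions. The following is a minimal sketch, assuming the reported F-score is the balanced one (harmonic mean of precision and recall); the function names and the example counts are illustrative. The last lines reproduce the first F value of the first r1 row in the stage 1 evaluation table from that row's precision and recall.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted items that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of gold standard items that were predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts (not taken from the evaluation): 3 TP, 1 FP, 2 FN.
p = precision(tp=3, fp=1)
r = recall(tp=3, fn=2)
f = f_score(p, r)

# Cross-check against reported percentages: P = 84.62, R = 50.49.
f_reported_row = round(f_score(84.62, 50.49), 2)
print(p, r, round(f, 4), f_reported_row)
```

Because the F-score is a harmonic mean, it stays close to the lower of the two values, which is why rows with high precision but low recall still score modestly.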
Distribution of term, function and relationship types in the training and test corpora
| Type | | | |
|---|---|---|---|
| Terms | | | |
| p() | 19,918 | 346 | 328 |
| a() | 1,927 | 37 | 52 |
| bp() | 877 | 31 | 23 |
| path() | 244 | 15 | 2 |
| Functions | | | |
| act() | 6,332 | 36 | 79 |
| pmod() | 1,411 | 9 | 36 |
| complex() | 750 | 15 | 5 |
| tloc() | 406 | 13 | 10 |
| deg() | 205 | 6 | 4 |
| sub() | 23 | 0 | 0 |
| trunc() | 6 | 0 | 0 |
| Relationships | | | |
| increases | 8,112 | 155 | 130 |
| decreases | 2,956 | 53 | 68 |
Figure 4. BelSmile workflow.
Figure 5. Hierarchical sequence labeling system pipeline.
Figure 6. Architecture of the neural network-based system.
Figure 7. Best results of each system of BioCreative VI (2017) and BioCreative V (2015) in each structured level of task 1.
Evaluation of stage 1 of task 1 (prediction of BEL statements without gold standard entities). F, P and R stand for F-score, precision and recall, respectively
| Run | F | P | R | F | P | R | F | P | R | F | P | R | F | P | R | F | P | R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r1 | 63.24 | 84.62 | 50.49 | 33.99 | 44.83 | 27.37 | 51.24 | 67.39 | 41.33 | 40.22 | 55.38 | 31.58 | 62.92 | 88.19 | 48.91 | 22.99 | 33.33 | 17.54 |
| r2 | 57.75 | 81.93 | 44.59 | 31.08 | 43.4 | 24.21 | 38.67 | 69.05 | 38.67 | 36.78 | 53.33 | 28.07 | 57.73 | 86.84 | 43.23 | 20.71 | 31.82 | 15.35 |
| r3 | 61.24 | 88.27 | 46.89 | 32.88 | 47.06 | 25.26 | 46.15 | 64.29 | 36 | 37.43 | 56.14 | 28.07 | 62.03 | 92.24 | 46.72 | 21.15 | 33.98 | 15.35 |
| | | | | | | | | | | | | | | | | | | |
| r1 | 50.88 | 76.82 | 38.03 | 6 | 60 | 3.16 | 7.5 | 60 | 4 | 16.77 | 31.71 | 11.4 | 45.14 | 80 | 31.44 | 7.38 | 15.71 | 4.82 |
| r2 | 55.29 | 81.01 | 41.97 | 6.06 | 75 | 3.16 | 7.59 | 75 | 4 | 21.52 | 38.64 | 14.91 | 51.06 | 84 | 36.68 | 10.67 | 22.22 | 7.02 |
| r3 | 67.83 | 72.22 | 63.93 | 20.17 | 50 | 12.63 | 31.25 | 71.43 | 20 | 24.69 | 28.25 | 21.93 | 62.25 | 70.95 | 55.46 | 10.44 | 12.9 | 8.77 |
| | | | | | | | | | | | | | | | | | | |
| r1 | 74.14 | 78.18 | 70.49 | | 56.6 | 31.58 | | 70.83 | 45.33 | 43.65 | 51.81 | 37.72 | 86.17 | 89.62 | 82.97 | 32.28 | 40.67 | 26.75 |
| | 72.89 | 78.71 | 67.87 | 40.29 | 63.64 | 29.47 | 54.39 | 79.49 | 41.33 | | 52.12 | 37.72 | | 93 | 81.22 | | 41.22 | 26.75 |
| | | | | | | | | | | | | | | | | | | |
| r1 | | 81.18 | 72.13 | 0 | 0 | 0 | 0 | 0 | 0 | 29.87 | 25.55 | 35.96 | 65.19 | 60.45 | 70.74 | 18.08 | 16.1 | 20.61 |
| r2 | | 81.18 | 72.13 | 0 | 0 | 0 | 0 | 0 | 0 | 28.92 | 24.19 | 35.96 | 65.23 | 59.29 | 72.49 | 17.88 | 15.53 | 21.05 |
Evaluation of stage 2 of task 1 (prediction of BEL statements with gold standard entities)
| Run | F | P | R | F | P | R | F | P | R | F | P | R | F | P | R | F | P | R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r1 | 83.93 | 99.11 | 72.79 | 36.36 | 47.46 | 29.47 | 46.77 | 59.18 | 38.67 | 57.22 | 73.29 | 46.93 | 83.33 | 98.8 | 72.05 | 31.30 | 46.15 | 23.68 |
| r2 | 86.09 | 99.15 | 76.07 | 40.51 | 50.79 | 33.68 | 51.16 | 61.11 | 44 | 56.08 | 70.67 | 46.49 | 83.92 | 98.82 | 72.93 | 30.95 | 44.63 | 23.68 |
| r3 | 85.45 | 99.13 | 75.08 | 39.24 | 49.21 | 32.63 | 50 | 60.38 | 42.67 | 57.6 | 73.47 | 47.37 | 83.63 | 98.81 | 72.49 | 31.79 | 46.61 | 24.12 |
| | | | | | | | | | | | | | | | | | | |
| r1 | 90.20 | 98.83 | 82.95 | 12.39 | 38.89 | 7.37 | 21.74 | 58.82 | 13.33 | 42.52 | 52.94 | 35.53 | 84.24 | 96.61 | 74.67 | 22.66 | 32 | 17.54 |
| | | | | | | | | | | | | | | | | | | |
| r1 | 87.65 | 90.56 | 84.92 | 51.75 | 77.08 | 38.95 | 57.63 | 79.07 | 45.33 | | 76.7 | 59.21 | | 95.75 | 88.65 | 49.2 | 63.01 | 40.35 |
| r2 | 86.4 | 90.94 | 82.3 | | 85.71 | 37.89 | | 89.19 | 44 | | 76.7 | 59.21 | 91.92 | 97.55 | 86.9 | | 64.34 | 40.35 |
| r3 | 87.41 | 90.81 | 84.26 | 51.75 | 77.08 | 38.95 | 57.63 | 79.07 | 45.33 | | 76.7 | 59.21 | 91.99 | 96.63 | 87.77 | 49.2 | 63.01 | 40.35 |
| | | | | | | | | | | | | | | | | | | |
| r1 | | 99.23 | 84.59 | 0 | 0 | 0 | 0 | 0 | 0 | 43.51 | 41.6 | 45.61 | 86.36 | 90.05 | 82.97 | 23.61 | 25 | 22.37 |
| r2 | 88.36 | 99.18 | 79.67 | 0 | 0 | 0 | 0 | 0 | 0 | 42.47 | 44.29 | 40.79 | 83.41 | 91.19 | 76.86 | 24.06 | 28.07 | 21.05 |
| r3 | 76.71 | 98.96 | 62.62 | 0 | 0 | 0 | 0 | 0 | 0 | 25.68 | 29.38 | 22.81 | 71.76 | 85.98 | 61.57 | 15.06 | 18.47 | 12.72 |
Figure 8. Number of sentences for each structured level on each stage for which no correct prediction was produced by any run of any participant system.
Evaluation results of task 2 including MAP
| Run | | TP | FP | Precision | MAP |
|---|---|---|---|---|---|
| Run 1 | Full | 117 | 265 | 30.6% | 59.6% |
| | Partial | 175 | 207 | 45.8% | 77.5% |
| Run 2 | Full | 121 | 261 | 31.7% | 50.2% |
| | Partial | 192 | 190 | 50.3% | 76.7% |
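The percentages in the task 2 table can be checked against its counts, assuming the two count columns are true positives and false positives and the first percentage is precision, TP / (TP + FP). The sketch below also shows one common definition of mean average precision (MAP) over ranked lists. The exact MAP variant used in the task is not stated here, so the normalization by the number of retrieved relevant items is an assumption, and the function names and the ranked lists are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

# (run, criterion) -> (TP, FP), copied from the task 2 table.
counts: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("Run 1", "Full"): (117, 265),
    ("Run 1", "Partial"): (175, 207),
    ("Run 2", "Full"): (121, 261),
    ("Run 2", "Partial"): (192, 190),
}

# Precision in percent, rounded to one decimal as in the table.
precision_pct = {
    key: round(100 * tp / (tp + fp), 1) for key, (tp, fp) in counts.items()
}


def average_precision(relevance: Sequence[bool]) -> float:
    """AP for one ranked list (best rank first).

    Averages precision@k over the ranks k that hold a relevant item.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(ranked_lists: List[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Two illustrative ranked lists, one per query statement.
example_map = mean_average_precision([[True, False, True], [False, True]])
print(precision_pct, round(example_map, 4))
```

Unlike plain precision, MAP rewards placing correct text snippets near the top of each ranked list, which is why the two measures can differ widely for the same run.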