Majid Rastegar-Mojarad, Ravikumar Komandur Elayavilli, Hongfang Liu.
Abstract
Biological expression language (BEL) is one of the main formal representation models of biological networks. The primary source of information for curating biological networks in BEL has been the literature. Identifying relevant articles and the corresponding evidence statements for curating and validating BEL statements remains a challenge. In this paper, we describe BELTracker, a tool that retrieves and ranks evidence sentences from PubMed abstracts and full-text articles for a given BEL statement (per the 2015 task requirements of the BioCreative V BEL Task). The system comprises three main components: (i) translation of a given BEL statement into an information retrieval (IR) query, (ii) retrieval of relevant PubMed citations and (iii) finding and ranking the evidence sentences in those citations. BELTracker combines approaches based on traditional IR, machine learning and heuristics to accomplish the task. The system identified and ranked at least one fully relevant evidence sentence in the top 10 retrieved sentences for 72 out of 97 BEL statements in the test set. When evaluated by the task organizers under three criteria (full, relaxed and context), BELTracker achieved a precision of 0.392, 0.532 and 0.615, respectively. Our team at Mayo Clinic was the only participant in this task. BELTracker is publicly available as a RESTful API.
Database URL: http://www.openbionlp.org:8080/BelTracker/finder/Given_BEL_Statement
Year: 2016 PMID: 27173525 PMCID: PMC4865361 DOI: 10.1093/database/baw079
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
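The abstract gives the service endpoint with a `Given_BEL_Statement` placeholder at the end of the path. A minimal sketch of how a client might build that URL is shown below. The function name `build_finder_url` is hypothetical, and percent-encoding the statement as a single path segment is an assumption: the source does not document the encoding the service expects, and no request is sent here.

```python
from urllib.parse import quote

# Base endpoint as given in the abstract; the BEL statement replaces
# the "Given_BEL_Statement" placeholder at the end of the path.
BASE_URL = "http://www.openbionlp.org:8080/BelTracker/finder/"


def build_finder_url(bel_statement: str) -> str:
    """Return the finder URL for one BEL statement.

    Assumption: the statement is percent-encoded as one path segment.
    The source does not say which encoding the service expects.
    """
    return BASE_URL + quote(bel_statement, safe="")


if __name__ == "__main__":
    print(build_finder_url("p(MGI:TNF) increases p(MGI:CREB1, pmod(P, S, 133))"))
```

Encoding with `safe=""` also escapes parentheses, colons and commas, which keeps the whole statement inside one path segment; a real client may need a looser encoding.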
Sample BEL curation from evidence sentences
| Sentences | BEL statement |
|---|---|
| We showed that HSF1 is phosphorylated by the protein kinase RSK2 in vitro; we demonstrate that RSK2 slightly represses activation of HSF1 in vivo | 1: kin(p(HGNC:RPS6KA3)) increases p(HGNC:HSF1, pmod(P)); 2: kin(p(HGNC:RPS6KA3)) decreases tscript(p(HGNC:HSF1)) |
| Whereas exposure of neutrophils to LPS or TNF-α resulted in increased levels of the transcriptionally active serine 133-phosphorylated form of CREB | p(MGI:TNF) increases p(MGI:CREB1, pmod(P, S, 133)) |
| BEL Elements: Relationship, Function, Entity, Namespace, Sequence position | |
Table 1 shows example BEL statements curated from evidence sentences. Components of a BEL statement are highlighted using different colors.
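The BEL elements listed in Table 1 (relationship, function, entity, namespace) can be pulled out of a simple statement with a few regular expressions. The sketch below is illustrative only and is not BELTracker's query translation component. The names `parse_bel` and `to_keyword_query` are hypothetical, only the two relationships seen in the sample statements are handled, and the keyword query format is an assumption.

```python
import re

# Relationship keywords seen in the sample statements; real BEL defines more.
RELATIONSHIPS = ("increases", "decreases")

# A namespace:entity pair such as "HGNC:HSF1" (optional space after the colon).
ENTITY_RE = re.compile(r"([A-Za-z]+):\s*([A-Za-z0-9_\-]+)")
# A function name followed by "(" such as "kin(" or "pmod (".
FUNCTION_RE = re.compile(r"([A-Za-z]+)\s*\(")


def parse_bel(statement: str) -> dict:
    """Split a simple 'subject relationship object' BEL statement into parts."""
    for rel in RELATIONSHIPS:
        subject, sep, obj = statement.partition(f" {rel} ")
        if sep:
            break
    else:
        raise ValueError(f"no supported relationship in: {statement!r}")
    return {
        "subject": subject.strip(),
        "relationship": rel,
        "object": obj.strip(),
        "functions": FUNCTION_RE.findall(statement),
        "entities": ENTITY_RE.findall(statement),
    }


def to_keyword_query(parsed: dict) -> str:
    """Join entity names and the relationship into a naive keyword query."""
    names = [name for _, name in parsed["entities"]]
    return " AND ".join(names + [parsed["relationship"]])


if __name__ == "__main__":
    parsed = parse_bel("kin(p(HGNC:RPS6KA3)) increases p(HGNC:HSF1, pmod(P))")
    print(parsed)
    print(to_keyword_query(parsed))
```

A practical translator would also map namespace identifiers to synonyms and relationship keywords to trigger words; this sketch keeps the raw symbols.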
Figure 1. Overall workflow of BELTracker.
Figure 2. The query translation component.
Figure 3. The ranking component.
Performance of the binary relation classifier against training data set
| Model | Features | F-Measure |
|---|---|---|
| Naïve Bayes | Unigram | 0.682 |
| Naïve Bayes | Unigram + POS | 0.711 |
| Naïve Bayes | Unigram + POS + Bi-gram | 0.714 |
| Random Forest | Unigram | 0.810 |
| Random Forest | Unigram + POS | 0.813 |
| Random Forest | Unigram + POS + Bi-gram | 0.822 |
| SVM | Unigram | 0.623 |
| SVM | Unigram + POS | 0.646 |
| SVM | Unigram + POS + Bi-gram | 0.651 |
This table shows the performance of our relation classifier with different feature sets and learning models. The relation classifier is used in the ranking component to classify evidence sentences into the two main BEL relations, ‘increase’ and ‘decrease’. For all three learning models, the combination of unigrams, part-of-speech (POS) tags and bi-grams obtained the highest F-measure. Random Forest outperformed the other two models on every feature set; the best-performing classifier is Random Forest with unigram + POS + bi-gram features (F-measure 0.822).
BELTracker’s performance
| Criteria | True positive | False positive | Precision (%) |
|---|---|---|---|
| Full | 316 | 490 | 39.20 |
| Relaxed | 429 | 377 | 53.22 |
| Context | 496 | 310 | 61.53 |
BELTracker performance evaluation under three criteria: full, relaxed and context.
Full: The identified sentence contains the complete BEL statement.
Relaxed: The retrieved sentence may contain the necessary context and/or biological background to enable extraction of the full BEL statement.
Context: Although the complete or partial BEL statement cannot be extracted from the sentence, it provides the necessary context for the BEL statement.
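The precision values in the table follow from the true positive and false positive counts as TP / (TP + FP). The short sketch below recomputes them. The helper names are hypothetical, and truncating rather than rounding to two decimals is an assumption made because it reproduces the tabulated figures (rounding would give 39.21, 53.23 and 61.54).

```python
import math

# (true positives, false positives) per criterion, copied from the table above.
COUNTS = {
    "Full": (316, 490),
    "Relaxed": (429, 377),
    "Context": (496, 310),
}


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def truncated_percent(value: float, digits: int = 2) -> float:
    """Express a fraction as a percentage, truncated (not rounded) to `digits` decimals."""
    scale = 10 ** digits
    return math.floor(value * 100 * scale) / scale


if __name__ == "__main__":
    for criterion, (tp, fp) in COUNTS.items():
        print(criterion, truncated_percent(precision(tp, fp)))
```

All three criteria share the same denominator, 806 retrieved sentences, so the criteria differ only in how many of those sentences count as true positives.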
Mean average precision comparison of BELTracker against baseline [1]
| Criteria | BELTracker (%) | Worst (%) | Random (%) | Best (%) |
|---|---|---|---|---|
| Full | 49.0 | 31.7 | 46.5 | 74.2 |
| Relaxed | 62.1 | 45.9 | 58.4 | 80.4 |
| Context | 68.9 | 55.2 | 65.7 | 83.5 |
Comparison of the MAP of BELTracker’s ranking against three baseline ranking scenarios: Worst, Random and Best.
Worst: All TPs are ranked after all FPs.
Random: The results are randomly reordered 2000 times and the MAP is averaged over all these variants.
Best: All TPs are ranked before all FPs.
TP, true positives; FP, false positives; MAP, mean average precision.
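The three baseline scenarios can be reproduced for any set of ranked result lists in which each retrieved sentence is labeled as TP or FP. The sketch below is an illustration, not the organizers' evaluation script. The function names are hypothetical, and the exact average precision definition is an assumption (precision averaged over the ranks of the true positives, with 0 for a list that has none), since the source does not spell it out.

```python
import random
from statistics import mean


def average_precision(ranked: list[bool]) -> float:
    """AP of one ranked list; True marks a true positive.

    Assumption: AP is averaged over the true positives present in the list,
    and a list with no true positive scores 0.
    """
    hits, precisions = 0, []
    for rank, is_tp in enumerate(ranked, start=1):
        if is_tp:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0


def baseline_map(queries: list[list[bool]], runs: int = 2000, seed: int = 0) -> dict:
    """MAP of the given ranking and of the Worst, Random and Best reorderings."""
    rng = random.Random(seed)

    def map_of(lists):
        return mean(average_precision(q) for q in lists)

    random_maps = []
    for _ in range(runs):
        # Shuffle every query's results independently for this run.
        shuffled = [rng.sample(q, len(q)) for q in queries]
        random_maps.append(map_of(shuffled))

    return {
        "system": map_of(queries),
        "worst": map_of([sorted(q) for q in queries]),  # False (FP) first
        "random": mean(random_maps),
        "best": map_of([sorted(q, reverse=True) for q in queries]),  # True (TP) first
    }


if __name__ == "__main__":
    toy = [[False, True, True], [True, False]]  # made-up labels, not task data
    print(baseline_map(toy, runs=200))
```

Because reordering never changes which sentences are true positives, the Worst and Best values bound what any ranking of the same retrieved set can achieve.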
BELTracker’s performance for finding at most K evidence sentences
| K | Full TP | Full FP | Relaxed TP | Relaxed FP | Context TP | Context FP |
|---|---|---|---|---|---|---|
| 1 | 38 | 59 | 53 | 44 | 63 | 34 |
| 2 | 72 | 120 | 97 | 95 | 118 | 74 |
| 3 | 111 | 170 | 149 | 132 | 176 | 105 |
| 4 | 145 | 218 | 197 | 166 | 230 | 133 |
| 5 | 179 | 262 | 244 | 197 | 279 | 162 |
| 6 | 212 | 306 | 290 | 228 | 330 | 188 |
| 7 | 244 | 348 | 328 | 264 | 374 | 218 |
| 8 | 267 | 397 | 363 | 301 | 414 | 250 |
| 9 | 298 | 437 | 401 | 334 | 456 | 279 |
| 10 | 316 | 490 | 429 | 377 | 496 | 310 |
K, maximum number of returned sentences for each query; TP, true positives; FP, false positives.
Figure 4. The system precision for at most K evidence sentences (K = 1 to 10).
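The precision curve of Figure 4 can be recomputed from the cumulative TP and FP counts in the table above. The following sketch does that for each K and criterion. The names `ROWS`, `CRITERIA` and `precision_at_k` are hypothetical, and the assumption is that the figure plots TP / (TP + FP) at each cutoff.

```python
# Cumulative counts for K = 1..10, copied from the table above.
# Each tuple is (full_tp, full_fp, relaxed_tp, relaxed_fp, context_tp, context_fp).
ROWS = {
    1: (38, 59, 53, 44, 63, 34),
    2: (72, 120, 97, 95, 118, 74),
    3: (111, 170, 149, 132, 176, 105),
    4: (145, 218, 197, 166, 230, 133),
    5: (179, 262, 244, 197, 279, 162),
    6: (212, 306, 290, 228, 330, 188),
    7: (244, 348, 328, 264, 374, 218),
    8: (267, 397, 363, 301, 414, 250),
    9: (298, 437, 401, 334, 456, 279),
    10: (316, 490, 429, 377, 496, 310),
}
CRITERIA = ("Full", "Relaxed", "Context")


def precision_at_k(k: int) -> dict:
    """Precision per criterion when at most k sentences are returned per query."""
    row = ROWS[k]
    return {
        name: row[2 * i] / (row[2 * i] + row[2 * i + 1])
        for i, name in enumerate(CRITERIA)
    }


if __name__ == "__main__":
    for k in ROWS:
        values = precision_at_k(k)
        print(k, {name: round(p, 3) for name, p in values.items()})
```

At K = 1 the counts sum to 97 for every criterion, one sentence per test statement, and at K = 10 they sum to the 806 sentences used for the overall precision figures.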
Percentage frequency of entities from different namespaces in the statements with and without retrieved evidence sentences
| Namespace | Statements with evidence sentence (Group A) | Statements without evidence sentence (Group B) |
|---|---|---|
| HGNC (Gene) | 47 (58%) | 34 (42%) |
| MGI (Gene) | 71 (81%) | 16 (19%) |
| Gene Ontology (biological processes) | 13 (81%) | 3 (19%) |
| CHEBI | 9 (81%) | 2 (19%) |
| MESHD | 8 (100%) | 0 (0%) |