Jia Ren1, Gang Li2, Karen Ross3, Cecilia Arighi1,2, Peter McGarvey3,4, Shruti Rao4, Julie Cowart1, Subha Madhavan4,5, K Vijay-Shanker2, Cathy H Wu1,2,3.
Abstract
Numerous text-mining tools have been developed to extract information from biomedical text automatically, and they have supported many biological tasks, such as database curation and hypothesis generation. These tools usually differ from one another in programming language, system dependencies and input/output format. Few previous works address the integration of different text-mining tools and of their results from large-scale text processing. In this paper, we describe the iTextMine system, which uses an automated workflow to run multiple text-mining tools on large-scale text for knowledge extraction. We employ parallel processing of dockerized text-mining tools with a standardized JSON output format and implement a text alignment algorithm to resolve text discrepancies during result integration. iTextMine presently integrates four relation extraction tools, which have been used to process all Medline abstracts and PMC open access full-length articles. The website allows users to browse the text evidence and view integrated results for knowledge discovery through a network view. We demonstrate the utility of iTextMine with two use cases: one involving the gene PTEN and breast cancer, and one involving the gene SATB1.
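The text discrepancy mentioned in the abstract can arise when different tools report character offsets against slightly different copies of the same passage. The abstract does not specify the alignment algorithm, so the following is only a minimal sketch of the general idea, not the iTextMine method: it uses Python's standard-library `difflib.SequenceMatcher` to map a character span from a tool's copy of the text onto a reference copy. The function name `map_span` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def map_span(tool_text, ref_text, start, end):
    """Map the half-open span [start, end) in tool_text onto ref_text.

    Returns (ref_start, ref_end), or None if either endpoint is not
    covered by a matching block. Illustrative helper, not from the paper.
    """
    matcher = SequenceMatcher(None, tool_text, ref_text, autojunk=False)
    blocks = matcher.get_matching_blocks()

    def map_pos(pos):
        # Find the matching block that contains pos and shift the offset.
        for a, b, size in blocks:
            if a <= pos < a + size:
                return b + (pos - a)
        return None

    ref_start = map_pos(start)
    ref_last = map_pos(end - 1)  # map the last character, then reopen the span
    if ref_start is None or ref_last is None:
        return None
    return ref_start, ref_last + 1


# Hypothetical sample: the tool's copy contains an extra space.
tool_text = "PTEN  is phosphorylated by CK2."
ref_text = "PTEN is phosphorylated by CK2."
start = tool_text.index("CK2")
span = map_span(tool_text, ref_text, start, start + 3)
print(ref_text[span[0]:span[1]])
```

Mapping both endpoints through matching blocks keeps the sketch simple; a production system would also need a policy for spans that straddle an edited region, which this version reports as `None`.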
Year: 2018 PMID: 30576489 PMCID: PMC6301332 DOI: 10.1093/database/bay128
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. iTextMine system overview.
Figure 2. Standardized JSON format.
iTextMine full-scale processing results of Medline abstracts

| Tool | Entities/triggers | Entities + relations | Relations (type: count) | Processing time |
|---|---|---|---|---|
| RLIMS-P | 322 955 | 202 579 | phosphorylation (kinase substrate site): 383 413 | 10.7 h |
| eFIP | 294 915 | 23 584 | phosphorylation-dependent PPI: 35 368 | 1.9 h |
| miRTex | 158 127 | 27 462 | miRNA target: 33 636 | 8.1 h |
| | | | miRNA–gene regulation: 46 115 | |
| | | | gene–miRNA regulation: 8328 | |
| eGARD | 629 696 | 40 225 | gene–disease drug response: 82 402 | 47.9 h |
iTextMine full-scale processing results of PMC open access full-length articles

| Tool | Entities/triggers | Entities + relations | Relations (type: count) | Processing time |
|---|---|---|---|---|
| RLIMS-P | 645 080 | 588 693 (in 112 003 articles) | phosphorylation (kinase substrate site): 510 764 | 22 h |
| eFIP | 588 693 | 70 250 (in 29 236 articles) | phosphorylation-dependent PPI: 70 250 | 3.5 h |
| miRTex | 718 927 | 84 607 (in 20 110 articles) | miRNA target: 96 155 | 37 h |
| | | | miRNA–gene regulation: 129 820 | |
| | | | gene–miRNA regulation: 21 427 | |
Medline abstracts and PMC articles with extracted relations

| Tools with extracted relations | Medline abstracts | % | PMC articles | % |
|---|---|---|---|---|
| 1 | 236 683 | 89.29% | 94 076 | 74.08% |
| 2 | 27 998 | 10.56% | 31 499 | 24.80% |
| 3 | 393 | 0.15% | 1425 | 1.12% |
| 4 | 1 | 0.00% | 0ᵃ | 0.00% |
| Total | 265 075 | 100.00% | 127 000 | 100.00% |

ᵃ eGARD has not been applied to PMC articles.
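
The percentage columns follow directly from the counts. As a quick arithmetic check, the short Python sketch below recomputes the totals and shares from the figures in the table above; the variable and function names are illustrative and not part of the original record.

```python
# Counts copied from the table above:
# number of tools with extracted relations -> number of documents.
medline = {1: 236_683, 2: 27_998, 3: 393, 4: 1}
pmc = {1: 94_076, 2: 31_499, 3: 1_425, 4: 0}


def shares(counts):
    """Return (total, {tools: percentage rounded to two decimals})."""
    total = sum(counts.values())
    return total, {k: round(100 * v / total, 2) for k, v in counts.items()}


medline_total, medline_pct = shares(medline)
pmc_total, pmc_pct = shares(pmc)
print(medline_total, medline_pct)
print(pmc_total, pmc_pct)
```

The recomputed totals are 265 075 abstracts and 127 000 articles, matching the Total row.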
Figure 3. Tabular view of text-mining results. (A) Summarized results for query: SATB1 OR ‘Special AT-rich sequence-binding protein 1’. (B) Search result for RLIMS-P.
Figure 4. (A) iTextMine network for human SATB1. (B) Sub-network of the human SATB1 network focusing on therapeutic response. (C) Sub-network highlighting the SATB1 regulation by miRNAs.
Figure 5. Example of the integration of multiple text-mining tools (PMID: 22547075).