Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu.
Abstract
The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial effort and time owing to the heterogeneity of data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and it also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC-wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: our experimental results show that using BioC reduces the lines of code needed for text-mining tool integration by more than 60%.
The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
Year: 2014 PMID: 25062914 PMCID: PMC4110697 DOI: 10.1093/database/bau073
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Visual summary of our text-mining toolkit.
Input/output formats supported by our text-mining toolkit
| tmTools | Formats supported (I = Input, O = Output) | ||||
|---|---|---|---|---|---|
| PMC XML | Free text | Tool-specific format 1 | Tool-specific format 2 | Tool-nonspecific format (BioC) | |
| I | I | I/O | |||
| I | I | I/O | |||
| I | I | I/O | I/O | ||
| I | I | O | I/O | ||
| I | I | O | I/O | ||
| I | I | I/O | I/O | ||
a: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/import.example.html.
b: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/Format.html#GenNorm.
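Because the tools accept several input formats, a caller usually has to decide which format a file is in before choosing a reader. The sketch below is one possible heuristic and is not the tmBioC format detection tool: the function name `detect_format` and the returned labels are hypothetical. It assumes that BioC files have a `collection` root element, that PMC XML has an `article` or `pmc-articleset` root, and that a PubTator-style file begins with a `PMID|t|title` line.

```python
import re
import xml.etree.ElementTree as ET


def detect_format(text: str) -> str:
    """Guess a document's format (hypothetical heuristic, not the tmBioC tool)."""
    stripped = text.lstrip()
    if stripped.startswith("<"):
        try:
            root = ET.fromstring(stripped)
        except ET.ParseError:
            # Looks like markup but is not well-formed XML.
            return "free text"
        if root.tag == "collection":
            # Assumption: BioC documents use <collection> as the root element.
            return "BioC"
        if root.tag in ("article", "pmc-articleset"):
            # Assumption: PMC full-text XML uses one of these root elements.
            return "PMC XML"
        return "unknown XML"
    # Assumption: PubTator-style input starts with "<PMID>|t|<title>".
    if re.match(r"\d+\|t\|", stripped):
        return "PubTator"
    return "free text"


if __name__ == "__main__":
    print(detect_format("<collection><source>PubMed</source></collection>"))
    print(detect_format("12345|t|A hypothetical title"))
```

A production converter would need stricter checks (for example, validating against the BioC DTD) than this root-element test.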
Figure 2. A snippet of the BioC article file for PMID 20085714.
Figure 3. A snippet from the BioC annotation file for PMID 20085714 (integrated result of applying our five concept recognition tools on the abstract).
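Since the figure images are not reproduced here, the following minimal sketch illustrates the general shape of a BioC annotation file using only Python's standard library. It is not the content of the file for PMID 20085714: the sentence, the document id, the `source` and `key` values, and the helper names `build_collection` and `read_annotations` are all hypothetical. It assumes the usual BioC nesting of `collection`, `document`, `passage` and `annotation`, with each annotation carrying a typed `infon`, a `location` (offset and length) and the mention `text`.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, passage_text, mentions):
    """Build a BioC-style collection; mentions are (surface form, concept type)."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"   # placeholder value
    ET.SubElement(collection, "date").text = ""             # left empty here
    ET.SubElement(collection, "key").text = "example.key"  # placeholder value
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "infon", key="type").text = "abstract"
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = passage_text
    for i, (surface, concept_type) in enumerate(mentions):
        start = passage_text.index(surface)  # first occurrence only
        ann = ET.SubElement(passage, "annotation", id=str(i))
        ET.SubElement(ann, "infon", key="type").text = concept_type
        ET.SubElement(ann, "location", offset=str(start), length=str(len(surface)))
        ET.SubElement(ann, "text").text = surface
    return collection


def read_annotations(collection):
    """Return (text, type, offset, length) for every annotation element."""
    results = []
    for ann in collection.iter("annotation"):
        loc = ann.find("location")
        results.append((ann.findtext("text"), ann.find("infon").text,
                        int(loc.get("offset")), int(loc.get("length"))))
    return results


if __name__ == "__main__":
    sentence = "Mutations in BRCA1 are associated with breast cancer."  # made up
    coll = build_collection("0000000", sentence,
                            [("BRCA1", "Gene"), ("breast cancer", "Disease")])
    xml_bytes = ET.tostring(coll, encoding="utf-8")
    for row in read_annotations(ET.fromstring(xml_bytes)):
        print(row)
```

Because every tool reads and writes the same element structure, annotations from several tools can be appended to the same passage without a per-tool parser, which is the interoperability benefit the abstract describes.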
Figure 4. A snippet from the file annotation_23840682.xml from the BC4GO corpus.
Figure 5. The intra-toolkit interoperability experiment. (a) Integrating tools in their native formats. (b) Integrating BioC-compatible tools.