Donald C Comeau, Haibin Liu, Rezarta Islamaj Doğan, W John Wilbur.
Abstract
BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated, along with other BioC programs, into any BioC-compliant text mining system. As an application, we converted the NCBI disease corpus to BioC format, and the pipelines have successfully run on this corpus to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net.
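The first two pipeline stages named above, sentence segmentation and tokenization, can be illustrated with a minimal sketch. It is not the BioC C++ or Java library, and it does not call MedPost or the Stanford tools; the `Annotation` class, the function names, and the regular expressions are assumptions made for illustration. The only idea borrowed from BioC is that each annotation is stand-off: it records a character offset and a length into the passage text rather than modifying the text.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Annotation:
    """Hypothetical stand-off annotation: a typed span over the passage text."""
    kind: str    # "sentence" or "token"
    offset: int  # character offset into the passage text
    length: int  # span length in characters
    text: str    # covered text, kept for convenience


# Naive boundary rule (an assumption): whitespace that follows ., ! or ?
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")
# Naive token rule (an assumption): runs of word characters, or one punctuation mark
TOKEN = re.compile(r"\w+|[^\w\s]")


def segment_sentences(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of sentences in text."""
    spans = []
    start = 0
    for match in SENTENCE_BREAK.finditer(text):
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def annotate(text: str) -> List[Annotation]:
    """Produce sentence annotations, each followed by its token annotations."""
    annotations = []
    for s_start, s_end in segment_sentences(text):
        annotations.append(
            Annotation("sentence", s_start, s_end - s_start, text[s_start:s_end])
        )
        # Searching within [s_start, s_end) keeps token offsets absolute.
        for match in TOKEN.finditer(text, s_start, s_end):
            annotations.append(
                Annotation("token", match.start(), len(match.group()), match.group())
            )
    return annotations
```

Because every span is expressed against the original passage text, later stages such as part-of-speech tagging or parsing could attach further annotations to the same offsets without re-tokenizing. A real pipeline would replace the two regular expressions with a trained segmenter and tokenizer.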
Year: 2014 PMID: 24935050 PMCID: PMC4058794 DOI: 10.1093/database/bau056
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. BioC text-preprocessing pipeline. All components are available in both C++ and Java, except for lemmatization, which is only available in Java.
Figure 2. Illustration of annotations in the enriched NCBI disease corpus: manual annotations of disease mentions and concepts, and BioC-tool-produced annotations from text preprocessing.
Figure 3. Dependency graph of ‘Familial deficiency of the seventh component of complement associated with recurrent bacteremic infections due to Neisseria’ using the C&C parser.
Figure 4. Dependency graph of ‘Familial deficiency of the seventh component of complement associated with recurrent bacteremic infections due to Neisseria’ using the Stanford parser.