A reappraisal of sentence and token splitting for life sciences documents.

Katrin Tomanek, Joachim Wermter, Udo Hahn.

Abstract

Natural language processing of real-world documents requires several low-level preprocessing tasks, such as splitting a text into its constituent sentences and splitting each sentence into its constituent tokens, to be performed before linguistic analysis. While these tasks are often considered unsophisticated clerical work, in the life sciences domain they pose enormous problems due to complex naming conventions. In this paper, we first introduce an annotation framework for sentence and token splitting underlying a newly constructed sentence- and token-tagged biomedical text corpus. This corpus serves as a training environment and test bed for machine-learning-based sentence and token splitters using Conditional Random Fields (CRFs). Our evaluation experiments reveal that CRFs with a rich feature set substantially increase sentence and token detection performance.
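
To illustrate why naming conventions make this task hard, here is a minimal Python sketch. It is not the authors' system and does not implement a CRF. It only shows how a naive punctuation-based splitter breaks on an abbreviated species name, and what kind of per-boundary features a learned sequence model might use instead. The function names, the feature names, and the example sentence are all hypothetical and chosen for illustration.

```python
import re


def naive_sentence_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def boundary_features(text, i):
    """Describe the candidate sentence boundary at text[i].

    Hypothetical feature set: a trained classifier or sequence model
    could use such features to decide whether the punctuation mark at
    position i really ends a sentence.
    """
    before = text[:i].split()
    after = text[i + 1:].split()
    prev_tok = before[-1] if before else ""
    next_tok = after[0] if after else ""
    return {
        "prev_token": prev_tok.lower(),
        "prev_len": len(prev_tok),
        # Single capital letters often abbreviate a genus ("E. coli").
        "prev_is_single_capital": len(prev_tok) == 1 and prev_tok.isupper(),
        "next_is_capitalized": next_tok[:1].isupper(),
        "followed_by_space": text[i + 1:i + 2].isspace(),
        "at_end": i == len(text) - 1,
    }


example = "The bacterium E. coli expresses lacZ. It was grown at 37 degrees."

# The naive splitter cuts the species name "E. coli" in two.
for piece in naive_sentence_split(example):
    print(repr(piece))

# Features at the period inside "E. coli" versus the one after "lacZ".
print(boundary_features(example, example.index("E.") + 1))
print(boundary_features(example, example.index("lacZ.") + 4))
```

On this example the naive splitter returns three pieces for a two-sentence text, because it treats the period in "E. coli" as a sentence end. The feature dictionaries differ at exactly that point: the period after "E" follows a single capital letter and precedes a lowercase word, while the period after "lacZ" precedes a capitalized word.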

Year: 2007 PMID: 17911772

Source DB: PubMed Journal: Stud Health Technol Inform ISSN: 0926-9630

Cited

7 in total

1. Mining the pharmacogenomics literature--a survey of the state of the art.

Authors: Udo Hahn; K Bretonnel Cohen; Yael Garten; Nigam H Shah
Journal: Brief Bioinform Date: 2012-07 Impact factor: 11.622

2. NCBI disease corpus: a resource for disease name recognition and concept normalization.

Authors: Rezarta Islamaj Doğan; Robert Leaman; Zhiyong Lu
Journal: J Biomed Inform Date: 2014-01-03 Impact factor: 6.317

3. Detection of protein catalytic sites in the biomedical literature.

Authors: Karin Verspoor; Andrew Mackinlay; Judith D Cohn; Michael E Wall
Journal: Pac Symp Biocomput Date: 2013

4. Detection of IUPAC and IUPAC-like chemical names.

Authors: Roman Klinger; Corinna Kolárik; Juliane Fluck; Martin Hofmann-Apitius; Christoph M Friedrich
Journal: Bioinformatics Date: 2008-07-01 Impact factor: 6.937

5. Concept annotation in the CRAFT corpus.

Authors: Michael Bada; Miriam Eckert; Donald Evans; Kristin Garcia; Krista Shipley; Dmitry Sitnikov; William A Baumgartner; K Bretonnel Cohen; Karin Verspoor; Judith A Blake; Lawrence E Hunter
Journal: BMC Bioinformatics Date: 2012-07-09 Impact factor: 3.169

6. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. (Review)

Authors: Martin Krallinger; Alfonso Valencia; Lynette Hirschman
Journal: Genome Biol Date: 2008-09-01 Impact factor: 13.583

7. P-Hacking Lexical Richness Through Definitions of "Type" and "Token".

Authors: K Bretonnel Cohen; Lawrence E Hunter; Peter S Pressman
Journal: Stud Health Technol Inform Date: 2019-08-21
