A reappraisal of sentence and token splitting for life sciences documents.

Katrin Tomanek, Joachim Wermter, Udo Hahn.

Abstract

Natural language processing of real-world documents requires several low-level tasks, such as splitting a piece of text into its constituent sentences and splitting each sentence into its constituent tokens, to be performed by some preprocessor (prior to linguistic analysis). While these tasks are often considered unsophisticated clerical work, in the life sciences domain they pose enormous problems due to complex naming conventions. In this paper, we first introduce an annotation framework for sentence and token splitting underlying a newly constructed sentence- and token-tagged biomedical text corpus. This corpus serves as a training environment and test bed for machine-learning-based sentence and token splitters using Conditional Random Fields (CRFs). Our evaluation experiments reveal that CRFs with a rich feature set substantially increase sentence and token detection performance.
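
As an illustration of why this preprocessing step is harder than it looks, the following is a minimal Python sketch, not the authors' implementation or feature set. It contrasts a naive punctuation-based sentence splitter with the kind of per-boundary feature dictionary that a sequence model such as a CRF could be trained on. The function names, the feature names, and the example sentence are all hypothetical and chosen only for illustration.

```python
import re


def naive_sentence_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def boundary_features(text, i):
    """Describe the candidate boundary character at text[i].

    The features are illustrative assumptions, not those of the paper.
    """
    before = text[:i].split()
    after = text[i + 1:].split()
    prev_tok = before[-1] if before else ""
    next_tok = after[0] if after else ""
    return {
        "prev_token": prev_tok.lower(),
        "prev_is_short": len(prev_tok) <= 3,
        "next_is_capitalized": next_tok[:1].isupper(),
        "next_is_digit": next_tok[:1].isdigit(),
        "followed_by_space": text[i + 1:i + 2].isspace(),
    }


def candidate_boundaries(text):
    """Return (position, features) for every '.', '!' or '?' in text."""
    return [
        (m.start(), boundary_features(text, m.start()))
        for m in re.finditer(r"[.!?]", text)
    ]


# Hypothetical example: two sentences, one organism name, one decimal number.
example = "E. coli expresses the protein. Levels rose 2.5 fold."
print(naive_sentence_split(example))
for position, features in candidate_boundaries(example):
    print(position, features)
```

The naive splitter breaks the organism name "E. coli" in two, while the feature dictionaries keep the information (a short preceding token, a lowercase or numeric following token) that a trained classifier would need to reject that boundary and the one inside "2.5".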

Year:  2007        PMID: 17911772

Source DB:  PubMed          Journal:  Stud Health Technol Inform        ISSN: 0926-9630


  7 in total

1.  Mining the pharmacogenomics literature--a survey of the state of the art.

Authors:  Udo Hahn; K Bretonnel Cohen; Yael Garten; Nigam H Shah
Journal:  Brief Bioinform       Date:  2012-07       Impact factor: 11.622

2.  NCBI disease corpus: a resource for disease name recognition and concept normalization.

Authors:  Rezarta Islamaj Doğan; Robert Leaman; Zhiyong Lu
Journal:  J Biomed Inform       Date:  2014-01-03       Impact factor: 6.317

3.  Detection of protein catalytic sites in the biomedical literature.

Authors:  Karin Verspoor; Andrew Mackinlay; Judith D Cohn; Michael E Wall
Journal:  Pac Symp Biocomput       Date:  2013

4.  Detection of IUPAC and IUPAC-like chemical names.

Authors:  Roman Klinger; Corinna Kolárik; Juliane Fluck; Martin Hofmann-Apitius; Christoph M Friedrich
Journal:  Bioinformatics       Date:  2008-07-01       Impact factor: 6.937

5.  Concept annotation in the CRAFT corpus.

Authors:  Michael Bada; Miriam Eckert; Donald Evans; Kristin Garcia; Krista Shipley; Dmitry Sitnikov; William A Baumgartner; K Bretonnel Cohen; Karin Verspoor; Judith A Blake; Lawrence E Hunter
Journal:  BMC Bioinformatics       Date:  2012-07-09       Impact factor: 3.169

6.  Linking genes to literature: text mining, information extraction, and retrieval applications for biology. (Review)

Authors:  Martin Krallinger; Alfonso Valencia; Lynette Hirschman
Journal:  Genome Biol       Date:  2008-09-01       Impact factor: 13.583

7.  P-Hacking Lexical Richness Through Definitions of "Type" and "Token".

Authors:  K Bretonnel Cohen; Lawrence E Hunter; Peter S Pressman
Journal:  Stud Health Technol Inform       Date:  2019-08-21