
An unsupervised machine learning approach to segmentation of clinician-entered free text.

Jesse O Wrenn, Peter D Stetson, Stephen B Johnson.

Abstract

Natural language processing, an important tool in biomedicine, fails without successful segmentation of words and sentences. Tokenization is a form of segmentation that identifies boundaries separating semantic units, for example words, dates, numbers and symbols, within a text. We sought to construct a highly generalizable tokenization algorithm with no prior knowledge of characters or their function, based solely on the inherent statistical properties of token and sentence boundaries. Tokenizing clinician-entered free text, we achieved precision and recall of 92% and 93%, respectively, compared to a whitespace token boundary detection algorithm. We classified over 80% of punctuation characters correctly, based on manual disambiguation with high inter-rater agreement (kappa=0.916). Our algorithm effectively discovered properties of whitespace and punctuation in the corpus without prior knowledge of either. Given the dynamic nature of biomedical language, and the variety of distinct sublanguages used, the effectiveness and generalizability of our novel tokenization algorithm make it a valuable tool.
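The comparison against a whitespace baseline can be illustrated with a minimal sketch. The abstract does not say how boundaries were represented or scored, so everything here is an assumption made for illustration: boundaries are treated as character offsets, the reference set is every whitespace/non-whitespace transition, and the function names and the sample predicted set are hypothetical rather than taken from the paper.

```python
def whitespace_boundaries(text):
    """Reference boundaries (assumed representation): character offsets
    where the text switches between whitespace and non-whitespace."""
    return {
        i
        for i in range(1, len(text))
        if text[i - 1].isspace() != text[i].isspace()
    }


def precision_recall(predicted, reference):
    """Score a set of predicted boundary offsets against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


text = "BP 120/80 today"
reference = whitespace_boundaries(text)   # offsets 2, 3, 9 and 10

# Hypothetical tokenizer output: it finds every whitespace boundary and
# also proposes one extra boundary before the "/" at offset 6.
predicted = {2, 3, 6, 9, 10}

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sketch the extra boundary before "/" counts against precision even though it may be a sensible split, which is one reason a whitespace baseline gives only a partial view of tokenizer quality.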

Year:  2007        PMID: 18693949      PMCID: PMC2655800     

Source DB:  PubMed          Journal:  AMIA Annu Symp Proc        ISSN: 1559-4076


  1 in total

1.  Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm.

Authors:  Neil Barrett; Jens Weber-Jahnke
Journal:  BMC Bioinformatics       Date:  2011-06-09       Impact factor: 3.169

