
Heuristic sample selection to minimize reference standard training set for a part-of-speech tagger.

Kaihong Liu, Wendy Chapman, Rebecca Hwa, Rebecca S Crowley.

Abstract

Part-of-speech (POS) tagging represents an important first step for most medical natural language processing (NLP) systems. The majority of current statistically based POS taggers are trained using a general English corpus. Consequently, these systems perform poorly on medical text. Annotated medical corpora are difficult to develop because of the time and labor required. We investigated a heuristic-based sample selection method to minimize annotated corpus size for retraining a Maximum Entropy (ME) POS tagger. We developed a manually annotated domain-specific corpus (DSC) of surgical pathology reports and a domain-specific lexicon (DL). We sampled the DSC using two heuristics to produce smaller training sets and compared the performance of the retrained tagger against (1) the original ME tagger trained on general English, (2) the ME tagger retrained on the DL, and (3) the MedPost tagger trained on MEDLINE abstracts. Results showed that the ME tagger retrained with the DSC was superior to the tagger retrained with the DL, and also superior to MedPost. Heuristic methods for sample selection produced performance equivalent to use of the entire training set, but with many fewer sentences. Learning curve analysis showed that sample selection would enable an 84% decrease in the size of the training set without a decrement in performance. We conclude that heuristic sample selection can be used to markedly reduce human annotation requirements for training of medical NLP systems.
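The abstract does not say which two heuristics were used, so the following is only an illustrative sketch of what heuristic sample selection can look like, not the authors' method. It assumes a hypothetical heuristic: sentences containing the most words absent from a general-English vocabulary are the most worth annotating. The function names, the toy corpus, and the vocabulary are all invented for illustration.

```python
from typing import List, Set


def unknown_word_count(sentence: List[str], known_vocab: Set[str]) -> int:
    """Count distinct lowercased tokens that are not in the known vocabulary."""
    return len({tok.lower() for tok in sentence} - known_vocab)


def select_sentences(corpus: List[List[str]], known_vocab: Set[str], budget: int) -> List[int]:
    """Greedily pick up to `budget` sentence indices for manual annotation.

    Assumed heuristic (hypothetical, not taken from the paper): at each step
    choose the sentence with the most words not yet covered, then treat its
    words as covered. Stop early when no sentence adds a new word.
    """
    seen = set(known_vocab)
    remaining = list(range(len(corpus)))
    chosen: List[int] = []
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda i: unknown_word_count(corpus[i], seen))
        if unknown_word_count(corpus[best], seen) == 0:
            break  # nothing new left to learn under this heuristic
        chosen.append(best)
        seen |= {tok.lower() for tok in corpus[best]}
        remaining.remove(best)
    return chosen


# Toy data, invented for illustration only.
corpus = [
    ["the", "specimen", "is", "received", "in", "formalin"],
    ["the", "patient", "is", "here"],
    ["sections", "show", "invasive", "ductal", "carcinoma"],
    ["the", "specimen", "is", "received", "fresh"],
]
known_vocab = {"the", "is", "in", "patient", "here", "received", "show", "fresh", "sections"}

selected = select_sentences(corpus, known_vocab, budget=3)
print(selected)
```

On this toy input the selection stops after two sentences even though the budget allows three, because the remaining sentences add no uncovered words. That mirrors the idea in the abstract that a small, well-chosen subset can stand in for the full annotated corpus.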

Year:  2007        PMID: 17600099      PMCID: PMC1975798          DOI: 10.1197/jamia.M2392

Source DB:  PubMed          Journal:  J Am Med Inform Assoc        ISSN: 1067-5027            Impact factor:   4.497


  References (10 in total)

1.  Automatic extraction of linguistic knowledge from an international classification.

Authors:  R Baud; C Lovis; A M Rassinoux; P A Michel; J R Scherrer
Journal:  Stud Health Technol Inform       Date:  1998

2.  Automatic structuring of radiology free-text reports.

Authors:  R K Taira; S G Soderland; R M Jakobovits
Journal:  Radiographics       Date:  2001 Jan-Feb       Impact factor: 5.333

3.  Comparing syntactic complexity in medical and non-medical corpora.

Authors:  D A Campbell; S B Johnson
Journal:  Proc AMIA Symp       Date:  2001

4.  Evaluation of negation phrases in narrative clinical reports.

Authors:  W W Chapman; W Bridewell; P Hanbury; G F Cooper; B G Buchanan
Journal:  Proc AMIA Symp       Date:  2001

5.  The sublanguage of cross-coverage.

Authors:  Peter D Stetson; Stephen B Johnson; Matthew Scotch; George Hripcsak
Journal:  Proc AMIA Symp       Date:  2002

6.  Extracting structured information from free text pathology reports.

Authors:  Gunther Schadow; Clement J McDonald
Journal:  AMIA Annu Symp Proc       Date:  2003

7.  MedPost: a part-of-speech tagger for bioMedical text.

Authors:  L Smith; T Rindflesch; W J Wilbur
Journal:  Bioinformatics       Date:  2004-04-08       Impact factor: 6.937

8.  Domain-specific language models and lexicons for tagging.

Authors:  Anni R Coden; Serguei V Pakhomov; Rie K Ando; Patrick H Duffy; Christopher G Chute
Journal:  J Biomed Inform       Date:  2005-04-02       Impact factor: 6.317

9.  dTagger: a POS tagger.

Authors:  Guy Divita; Allen C Browne; Russell Loane
Journal:  AMIA Annu Symp Proc       Date:  2006

10.  Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.

Authors:  Dilip Gupta; Melissa Saul; John Gilbertson
Journal:  Am J Clin Pathol       Date:  2004-02       Impact factor: 2.493

  Cited by (8 in total)

1.  Part-of-speech tagging for clinical text: wall or bridge between institutions?

Authors:  Jung-wei Fan; Rashmi Prasad; Rommel M Yabut; Richard M Loomis; Daniel S Zisook; John E Mattison; Yang Huang
Journal:  AMIA Annu Symp Proc       Date:  2011-10-22

2.  Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications.

Authors:  Guergana K Savova; James J Masanz; Philip V Ogren; Jiaping Zheng; Sunghwan Sohn; Karin C Kipper-Schuler; Christopher G Chute
Journal:  J Am Med Inform Assoc       Date:  2010 Sep-Oct       Impact factor: 4.497

3.  Effectiveness of lexico-syntactic pattern matching for ontology enrichment with clinical documents.

Authors:  K Liu; W W Chapman; G Savova; C G Chute; N Sioutos; R S Crowley
Journal:  Methods Inf Med       Date:  2010-11-08       Impact factor: 2.176

4.  Active learning for clinical text classification: is it better than random sampling?

Authors:  Rosa L Figueroa; Qing Zeng-Treitler; Long H Ngo; Sergey Goryachev; Eduardo P Wiechmann
Journal:  J Am Med Inform Assoc       Date:  2012-06-15       Impact factor: 4.497

5.  Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation.

Authors:  Jeffrey P Ferraro; Hal Daumé; Scott L Duvall; Wendy W Chapman; Henk Harkema; Peter J Haug
Journal:  J Am Med Inform Assoc       Date:  2013-03-13       Impact factor: 4.497

Review 6.  What can natural language processing do for clinical decision support?

Authors:  Dina Demner-Fushman; Wendy W Chapman; Clement J McDonald
Journal:  J Biomed Inform       Date:  2009-08-13       Impact factor: 6.317

7.  Automated extraction of precise protein expression patterns in lymphoma by text mining abstracts of immunohistochemical studies.

Authors:  Jia-Fu Chang; Mihail Popescu; Gerald L Arthur
Journal:  J Pathol Inform       Date:  2013-07-31

8.  Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus.

Authors:  Donald C Comeau; Haibin Liu; Rezarta Islamaj Doğan; W John Wilbur
Journal:  Database (Oxford)       Date:  2014-06-16       Impact factor: 3.451

