
Part-of-speech tagging for clinical text: wall or bridge between institutions?

Jung-wei Fan, Rashmi Prasad, Rommel M Yabut, Richard M Loomis, Daniel S Zisook, John E Mattison, Yang Huang.

Abstract

Part-of-speech (POS) tagging is a fundamental step required by many NLP systems. Training a POS tagger requires enough high-quality annotations, but in the clinical domain the annotation process is both knowledge-intensive and time-consuming. A promising solution is for institutions to share their annotation efforts, yet there is little research on the associated issues. We performed experiments to understand how POS tagging performance is affected by using a pre-trained tagger versus raw training data across institutions. We manually annotated a set of clinical notes from Kaiser Permanente Southern California (KPSC) and a set from the University of Pittsburgh Medical Center (UPMC), and trained and tested POS taggers in intra- and inter-institution settings. The cTAKES POS tagger was also included in the comparison to represent a tagger partially trained on the notes of a third institution, Mayo Clinic in Rochester. Intra-institution 5-fold cross-validation estimated an accuracy of 0.953 on the KPSC notes and 0.945 on the UPMC notes. A tagger trained purely on KPSC notes reached an accuracy of 0.897 when tested on UPMC notes; a tagger trained purely on UPMC notes reached 0.904 when tested on KPSC notes. The cTAKES tagger pre-trained with Mayo Clinic's notes reached an accuracy of 0.881 on KPSC notes and 0.883 on UPMC notes. After adding UPMC annotations to the KPSC training data, the average accuracy on the tested KPSC notes increased to 0.965. After adding KPSC annotations to the UPMC training data, the average accuracy on the tested UPMC notes increased to 0.953. The results indicated, first, that the accuracy of pre-trained POS taggers dropped by about 5 percentage points when they were applied directly across institutions; and second, that mixing in annotations from another institution that followed the same guideline increased tagging accuracy by about 1 percentage point. Our findings suggest that, for the POS tagging task, institutions benefit more from sharing raw annotations than from sharing pre-trained models. We believe the study could also provide general insights on cross-institution data sharing for other types of NLP tasks.
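
The two summary figures in the abstract follow from simple arithmetic on the reported accuracies. The sketch below is only an illustration of that arithmetic, not the authors' evaluation code; the dictionary names and the `points` helper are hypothetical, and the only data used are the accuracies quoted above.

```python
# Accuracies as reported in the abstract, keyed by the institution whose notes were tested.
intra = {"KPSC": 0.953, "UPMC": 0.945}   # 5-fold cross-validation within the institution
cross = {"KPSC": 0.904, "UPMC": 0.897}   # tagger trained purely on the other institution's notes
ctakes = {"KPSC": 0.881, "UPMC": 0.883}  # cTAKES tagger pre-trained with Mayo Clinic notes
pooled = {"KPSC": 0.965, "UPMC": 0.953}  # own training data plus the other institution's annotations


def points(a, b):
    """Return a - b in percentage points, rounded to one decimal (hypothetical helper)."""
    return round((a - b) * 100, 1)


summary = {
    site: {
        "cross_drop": points(intra[site], cross[site]),
        "ctakes_drop": points(intra[site], ctakes[site]),
        "pooled_gain": points(pooled[site], intra[site]),
    }
    for site in intra
}

for site, row in summary.items():
    print(site, row)
```

The cross-institution drops come out just under 5 points and the pooling gains close to 1 point at both sites, which matches the rounded figures in the abstract. The drop for the cTAKES tagger is larger, at roughly 6 to 7 points.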

Year: 2011   PMID: 22195091   PMCID: PMC3243258

Source DB: PubMed   Journal: AMIA Annu Symp Proc   ISSN: 1559-4076


