Part-of-speech tagging for clinical text: wall or bridge between institutions?

Jung-wei Fan, Rashmi Prasad, Rommel M Yabut, Richard M Loomis, Daniel S Zisook, John E Mattison, Yang Huang.

Abstract

Part-of-speech (POS) tagging is a fundamental step required by various NLP systems. Training a POS tagger relies on a sufficient quantity of high-quality annotations. However, the annotation process is both knowledge-intensive and time-consuming in the clinical domain. A promising solution appears to be for institutions to share their annotation efforts, and yet there is little research on the associated issues. We performed experiments to understand how POS tagging performance would be affected by using a pre-trained tagger versus raw training data across different institutions. We manually annotated a set of clinical notes at Kaiser Permanente Southern California (KPSC) and a set from the University of Pittsburgh Medical Center (UPMC), and trained and tested POS taggers in intra- and inter-institution settings. The cTAKES POS tagger was also included in the comparison to represent a tagger partially trained on the notes of a third institution, Mayo Clinic at Rochester. Intra-institution 5-fold cross-validation estimated an accuracy of 0.953 and 0.945 on the KPSC and UPMC notes, respectively. A tagger trained purely on KPSC notes reached an accuracy of 0.897 when tested on UPMC notes. A tagger trained purely on UPMC notes reached an accuracy of 0.904 when tested on KPSC notes. The cTAKES tagger, pre-trained with Mayo Clinic's notes, reached an accuracy of 0.881 on KPSC notes and 0.883 on UPMC notes. After adding UPMC annotations to the KPSC training data, the average accuracy on tested KPSC notes increased to 0.965. After adding KPSC annotations to the UPMC training data, the average accuracy on tested UPMC notes increased to 0.953. The results indicated: first, the accuracy of pre-trained POS taggers dropped by about 5 percentage points when applied directly across the institutions; second, mixing in annotations from another institution that followed the same guideline increased tagging accuracy by about 1 percentage point.
Our findings suggest that institutions can benefit more from sharing raw annotations but less from sharing pre-trained models for the POS tagging task. We believe the study could also provide general insights on cross-institution data sharing for other types of NLP tasks.
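
The three evaluation settings described in the abstract (intra-institution cross-validation, direct cross-institution testing, and pooling another institution's annotations into the training folds) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the tagging algorithm that was trained, so a most-frequent-tag baseline stands in for it; the function names `train`, `accuracy`, and `cross_validate` are hypothetical; and the toy sentences are invented for the example and do not come from the KPSC or UPMC corpora.

```python
from collections import Counter, defaultdict

# A tagged sentence is a list of (token, tag) pairs.

def train(sentences):
    """Stand-in tagger (assumption): remember each token's most frequent tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in sentences:
        for token, tag in sent:
            counts[token.lower()][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # fallback tag for unseen tokens
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
    return lexicon, default

def accuracy(model, sentences):
    """Token-level accuracy of `model` on gold-tagged `sentences`."""
    lexicon, default = model
    correct = total = 0
    for sent in sentences:
        for token, tag in sent:
            correct += lexicon.get(token.lower(), default) == tag
            total += 1
    return correct / total

def cross_validate(sentences, k=5, extra=()):
    """Mean k-fold accuracy on `sentences` (needs at least k sentences).

    `extra` holds annotations from another institution; they are added to
    every training fold but never to a test fold.
    """
    scores = []
    for i in range(k):
        test = sentences[i::k]
        train_part = [s for j, s in enumerate(sentences) if j % k != i]
        scores.append(accuracy(train(train_part + list(extra)), test))
    return sum(scores) / k

# Invented toy data standing in for two institutions' annotated notes.
site_a = [
    [("pt", "NN"), ("denies", "VBZ"), ("pain", "NN")],
    [("pt", "NN"), ("reports", "VBZ"), ("cough", "NN")],
    [("no", "DT"), ("fever", "NN")],
    [("pt", "NN"), ("denies", "VBZ"), ("fever", "NN")],
    [("no", "DT"), ("cough", "NN")],
]
site_b = [
    [("patient", "NN"), ("denies", "VBZ"), ("nausea", "NN")],
    [("patient", "NN"), ("reports", "VBZ"), ("pain", "NN")],
    [("no", "DT"), ("nausea", "NN")],
    [("patient", "NN"), ("has", "VBZ"), ("cough", "NN")],
    [("no", "DT"), ("pain", "NN")],
]

print("intra-institution CV (A):", cross_validate(site_a))
print("train on A, test on B:   ", accuracy(train(site_a), site_b))
print("CV on A with B pooled in:", cross_validate(site_a, extra=site_b))
```

Sharing a pre-trained model corresponds to the second call, where the receiving institution contributes no training data; sharing raw annotations corresponds to the third, where the other institution's sentences are merged into each training fold. The numbers printed for the toy data carry no meaning beyond showing the mechanics.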

Year: 2011 PMID: 22195091 PMCID: PMC3243258

Source DB: PubMed Journal: AMIA Annu Symp Proc ISSN: 1559-4076
