
Feasibility of pooling annotated corpora for clinical concept extraction.

Kavishwar Wagholikar¹, Manabu Torii, Siddhartha Jonnalagadda, Hongfang Liu.
1. Mayo Clinic, Rochester, MN

Abstract

The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, preparing annotated corpora at individual institutions is expensive, and pooling annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling corpora from two different sources can improve the performance and portability of the resulting machine learning taggers for medical problem detection. Specifically, we pool corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. Contrary to our expectations, pooling the corpora decreased the F1-score. We examine the annotation guidelines to identify the factors behind the incompatibility of the corpora, and we suggest that the clinical NLP community develop a standard annotation guideline to make annotated corpora compatible.


Year: 2012 PMID: 22779047 PMCID: PMC3392069

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction and Background

Many medical institutions are interested in using Natural Language Processing (NLP) to exploit the unstructured text in their electronic medical record (EMR) systems. Individual institutions could benefit from shared, publicly available resources, and similar efforts to pool datasets have been made in the biomedical domain. In this paper we investigate whether pooling similar datasets from two different sources can improve the performance and portability of the resulting machine learning taggers for medical problem detection.

Methods

We trained and tested taggers on a dataset from Mayo Clinic, Rochester (MCR) and a dataset from the 2010 i2b2/VA NLP challenge. The taggers were trained to recognize medical problems, including signs/symptoms and disorders. First, we trained a tagger on the i2b2 dataset and tested it on the MCR dataset, and vice versa. We then performed 5-fold cross-validation on each dataset. Next, we repeated the cross-validation on the MCR dataset after supplementing the training fraction with the i2b2 dataset, and repeated this design for the i2b2 dataset, using the MCR data to supplement the training. Precision, recall and F1-score were computed for each experiment.
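The supplemented cross-validation design described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`precision_recall_f1`, `k_fold_indices`, `pooled_splits`), the interleaved fold assignment, and the representation of annotations as hashable items in a set are all assumptions made for the example. The key property it shows is that the external corpus is added only to the training side of each fold, while testing stays on held-out documents from the target corpus.

```python
from typing import Iterator, List, Sequence, Set, Tuple


def precision_recall_f1(gold: Set, predicted: Set) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over two sets of annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def k_fold_indices(n_items: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs for k interleaved folds."""
    # Fold i holds indices i, i + k, i + 2k, ... (an assumed assignment scheme).
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def pooled_splits(
    target_docs: Sequence, external_docs: Sequence, k: int = 5
) -> Iterator[Tuple[list, list]]:
    """Yield (train, test) document lists for supplemented cross-validation.

    The training side is the target corpus's training folds plus the whole
    external corpus; the test side contains target-corpus documents only.
    """
    for train_idx, test_idx in k_fold_indices(len(target_docs), k):
        train = [target_docs[i] for i in train_idx] + list(external_docs)
        test = [target_docs[i] for i in test_idx]
        yield train, test


if __name__ == "__main__":
    # Placeholder document identifiers; real corpora would hold annotated notes.
    mcr_docs = [f"mcr_{i}" for i in range(10)]
    i2b2_docs = [f"i2b2_{i}" for i in range(3)]
    for fold, (train, test) in enumerate(pooled_splits(mcr_docs, i2b2_docs)):
        print(fold, len(train), len(test))
```

Swapping the two arguments of `pooled_splits` gives the mirrored experiment, in which the i2b2 dataset is the target and the MCR data supplements the training.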

Results and Discussion

Taggers trained on an annotated corpus from the same institution as the test data performed best. Pooling the corpora decreased the F1-score. We examined the annotation guidelines to identify the factors that led to the incompatibility of the datasets. These included differences in concept definition and in whether articles, possessive pronouns, prepositional phrases and conjunctions were included in the concept spans. We suggest that the clinical NLP community develop a standard annotation guideline so that annotated corpora are compatible.
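To see why span conventions matter, consider a small sketch of how a boundary difference turns into a mismatch under exact-match scoring. Everything here is hypothetical: the example sentence, the two conventions "A" and "B" (the source does not say which corpus follows which), the `LEADING_WORDS` list, and the `strip_leading_modifiers` helper. The normalization step is shown only to illustrate the effect; it is not a procedure described by the authors.

```python
from typing import List, Tuple

# Hypothetical list of articles and possessive pronouns that one guideline
# might include in a concept span and another might exclude.
LEADING_WORDS = {"a", "an", "the", "his", "her", "its", "their", "my", "your", "our"}


def strip_leading_modifiers(tokens: List[str], span: Tuple[int, int]) -> Tuple[int, int]:
    """Drop leading articles/possessives from a half-open token span (start, end)."""
    start, end = span
    # Always keep at least one token in the span.
    while start < end - 1 and tokens[start].lower() in LEADING_WORDS:
        start += 1
    return (start, end)


tokens = "The patient reports her chronic back pain".split()

span_a = (3, 7)  # convention A: "her chronic back pain"
span_b = (4, 7)  # convention B: "chronic back pain"

# Same medical problem, but the spans differ, so exact matching counts
# one false positive and one false negative.
exact_match = span_a == span_b

# After removing the leading possessive, the two spans coincide.
normalized_match = (
    strip_leading_modifiers(tokens, span_a) == strip_leading_modifiers(tokens, span_b)
)

print(exact_match, normalized_match)
```

A tagger trained on pooled data sees both conventions for the same kind of phrase, which is one plausible way such guideline differences could lower exact-match scores.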
