OBJECTIVE: To develop algorithms to improve the efficiency of patient phenotyping using natural language processing (NLP) on text data. Of the large number of note titles available in our database, we sought to determine those with the highest yield and precision for psychosocial concepts. MATERIALS AND METHODS: From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles was chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. RESULTS: A total of 58 707 psychosocial concepts were identified from 316 355 documents, for an overall hit rate of 0.2 concepts per document (median 0.1, range 0-1.6). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was 49% (95% CI 43% to 55%). CONCLUSIONS: Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free-text corpora with complex linguistic annotations in order to identify patients of a certain phenotype.
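The two screening metrics named in the abstract, hit rate (concepts per document) and precision (the share of reviewed concepts judged correct), can be illustrated with a minimal sketch. This is not the v3NLP pipeline or the authors' code: the `TitleStats` class, the `high_yield` helper, the note-title names, the per-title counts, and the thresholds are all hypothetical and chosen only for illustration. Only the overall totals (58 707 concepts, 316 355 documents) come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TitleStats:
    """Per-note-title counts (hypothetical structure, not from the paper)."""
    title: str
    documents: int        # documents sampled for this note title
    concepts: int         # concepts the NLP pipeline extracted from them
    reviewed: int         # extracted concepts checked by human reviewers
    true_positives: int   # reviewed concepts judged correct

    @property
    def hit_rate(self) -> float:
        # Concepts per document; 0.0 when no documents were sampled.
        return self.concepts / self.documents if self.documents else 0.0

    @property
    def precision(self) -> Optional[float]:
        # Undefined (None) when nothing was reviewed.
        return self.true_positives / self.reviewed if self.reviewed else None


def high_yield(stats: List[TitleStats],
               min_hit_rate: float,
               min_precision: float) -> List[TitleStats]:
    """Keep titles meeting both thresholds, sorted by descending hit rate.

    The thresholds are assumptions; the abstract does not state cut-offs.
    """
    selected = [
        s for s in stats
        if s.hit_rate >= min_hit_rate
        and s.precision is not None
        and s.precision >= min_precision
    ]
    return sorted(selected, key=lambda s: s.hit_rate, reverse=True)


# Overall hit rate from the totals reported in the abstract.
overall_hit_rate = 58707 / 316355

# Made-up per-title counts, for illustration only.
example = [
    TitleStats("TITLE A", documents=1500, concepts=2400, reviewed=100, true_positives=85),
    TitleStats("TITLE B", documents=1500, concepts=150, reviewed=50, true_positives=20),
    TitleStats("TITLE C", documents=1500, concepts=600, reviewed=80, true_positives=72),
]
selected_titles = [s.title for s in high_yield(example, min_hit_rate=0.3, min_precision=0.8)]
```

With these made-up counts, "TITLE B" is dropped for both low hit rate and low precision, which mirrors the abstract's point that phenotyping need not process every note title in the corpus.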
Keywords: clinical informatics; high-throughput; natural language processing; patient phenotype; psychosocial concepts