You Chen1, Joydeep Ghosh2, Cosmin Adrian Bejan3, Carl A Gunter4, Siddharth Gupta4, Abel Kho5, David Liebovitz5, Jimeng Sun6, Joshua Denny7, Bradley Malin8. 1. Dept. of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, USA. Electronic address: you.chen@vanderbilt.edu. 2. Dept. of Electrical & Computer Engineering, University of Texas, Austin, TX, USA. 3. Dept. of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, USA. 4. Dept. of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA. 5. School of Medicine, Northwestern University, Chicago, IL, USA. 6. School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, GA, USA. 7. Dept. of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, USA; Department of Medicine, Vanderbilt University, Nashville, TN, USA. 8. Dept. of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, USA; Dept. of Electrical Engineering & Computer Science, School of Engineering, Vanderbilt University, Nashville, TN, USA.
Abstract
OBJECTIVE: Data in electronic health records (EHRs) are increasingly leveraged for secondary uses, ranging from biomedical association studies to comparative effectiveness research. To perform studies at scale and transfer knowledge from one institution to another in a meaningful way, we need to harmonize the phenotypes in such systems. Traditionally, this has been accomplished through expert specification of phenotypes via standardized terminologies, such as billing codes. However, this approach may be biased by the experience and expectations of the experts, as well as by the vocabulary used to describe such patients. The goal of this work is to develop a data-driven strategy to (1) infer phenotypic topics within patient populations and (2) assess the degree to which such topics facilitate a mapping across populations in disparate healthcare systems.
METHODS: We adapted a generative topic modeling strategy, based on latent Dirichlet allocation, to infer phenotypic topics. We used a variance analysis to assess the projection of a patient population from one healthcare system onto the topics learned from another system. The consistency of the learned phenotypic topics was evaluated using (1) the similarity of topics, (2) the stability of a patient population across topics, and (3) the transferability of a topic across sites. We evaluated our approach using four months of inpatient data from two geographically distinct healthcare systems: (1) Northwestern Memorial Hospital (NMH) and (2) Vanderbilt University Medical Center (VUMC).
RESULTS: The method learned 25 phenotypic topics from each healthcare system. The average cosine similarity between matched topics across the two sites was 0.39, a remarkably high value given the very high dimensionality of the feature space. The average stability of VUMC and NMH patients across the topics of the two sites was 0.988 and 0.812, respectively, as measured by the Pearson correlation coefficient. In addition, the VUMC and NMH topics showed smaller variance in characterizing the patient populations of the two sites than standard clinical terminologies (e.g., ICD-9), suggesting that they may be more reliably transferred across hospital systems.
CONCLUSIONS: Phenotypic topics learned from EHR data can be more stable and transferable than billing codes for characterizing the general status of a patient population. This suggests that EHR-based research may be able to leverage such phenotypic topics as variables when pooling patient populations in predictive models.
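The two summary statistics named in the abstract, cosine similarity between matched topics and Pearson correlation as a stability measure, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how topics were matched or exactly which vectors were correlated, so the sketch assumes that each topic is a vector of weights over a shared clinical vocabulary, that every topic from one site is paired with its most similar topic from the other site, and that stability compares a population's proportions over the matched topic pairs. The function names and the toy numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero topic as having no similarity
    return dot / (norm_u * norm_v)


def match_topics(topics_a, topics_b):
    """Pair each topic in topics_a with its most similar topic in topics_b.

    Assumption: best-match pairing; the abstract does not state the
    matching rule. Returns a list of (index_a, index_b, similarity).
    """
    matches = []
    for i, topic_a in enumerate(topics_a):
        sims = [cosine(topic_a, topic_b) for topic_b in topics_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        matches.append((i, j, sims[j]))
    return matches


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / (dev_x * dev_y)


# Toy topic-term weights over a 4-term vocabulary (hypothetical values).
topics_site_a = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
topics_site_b = [[0.1, 0.1, 0.1, 0.7], [0.6, 0.3, 0.1, 0.0]]

matches = match_topics(topics_site_a, topics_site_b)
mean_similarity = sum(sim for _, _, sim in matches) / len(matches)

# Hypothetical proportions of one population over the matched topic pairs:
# first under site A's topics, then under the paired site B topics.
proportions_under_a = [0.65, 0.35]
proportions_under_b = [0.60, 0.40]
stability = pearson(proportions_under_a, proportions_under_b)

print(matches, mean_similarity, stability)
```

With real data, the topic vectors would have one entry per clinical concept, which is why a cosine similarity well below 1 can still indicate a meaningful match in a very high-dimensional feature space.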
Keywords:
Clinical phenotype modeling; Computers and information processing; Data mining; Electronic medical records; Medical information systems; Pattern recognition