Katherine P Liao (1,2,3), Jiehuan Sun (3,4), Tianrun A Cai (1,2,3), Nicholas Link (3), Chuan Hong (2,3,4), Jie Huang (2), Jennifer E Huffman (3), Jessica Gronsbell (5), Yichi Zhang (4,6), Yuk-Lam Ho (3), Victor Castro (7), Vivian Gainer (7), Shawn N Murphy (2,7,8), Christopher J O'Donnell (1,3), J Michael Gaziano (1,2,3), Kelly Cho (1,2,3), Peter Szolovits (9), Isaac S Kohane (2), Sheng Yu (10,11,12), Tianxi Cai (2,3,4).
1. Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.
2. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
3. Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.
4. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
5. Verily Life Sciences, Cambridge, MA, USA.
6. University of Rhode Island, Kingston, RI, USA.
7. Partners Healthcare Systems, Summerville, MA, USA.
8. Massachusetts General Hospital, Boston, MA, USA.
9. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
10. Center for Statistical Science, Tsinghua University, Beijing, China.
11. Department of Industrial Engineering, Tsinghua University, Beijing, China.
12. Institute for Data Science, Tsinghua University, Beijing, China.
Abstract
OBJECTIVE: Electronic health records linked with biorepositories are a powerful platform for translational studies. A major bottleneck is the ability to phenotype patients accurately and efficiently. The objective of this study was to develop an automated, high-throughput phenotyping method that integrates International Classification of Diseases (ICD) codes and narrative data extracted using natural language processing (NLP).
MATERIALS AND METHODS: We developed a mapping method, leveraging the Unified Medical Language System, that automatically identifies the ICD and NLP concepts relevant to a specific phenotype. Aggregated ICD and NLP counts, along with health care utilization, were jointly analyzed by fitting an ensemble of latent mixture models. The multimodal automated phenotyping (MAP) algorithm yields a predicted probability of the phenotype for each patient and a threshold for classifying each participant as having the phenotype or not. The algorithm was validated using labeled data for 16 phenotypes from a biorepository and further tested in phenome-wide association studies (PheWAS) in an independent cohort, using 2 single-nucleotide polymorphisms with known associations.
RESULTS: The MAP algorithm achieved higher or similar AUC and F-scores compared with the ICD code across all 16 phenotypes. The features assembled via the automated approach had accuracy comparable to those assembled via manual curation (AUC_MAP 0.943, AUC_manual 0.941). The PheWAS results suggest that the MAP approach detected previously validated associations with higher power than the standard PheWAS method based on ICD codes.
CONCLUSION: The MAP approach increased the accuracy of phenotype definition while maintaining scalability, thereby facilitating use in studies requiring large-scale phenotyping, such as PheWAS.
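The idea of fitting latent mixture models to ICD and NLP counts and averaging their outputs can be illustrated with a small sketch. This is not the authors' implementation: it assumes a two-component Gaussian mixture on log(1 + count) for each feature, omits the health care utilization adjustment, and uses a fixed 0.5 cutoff where MAP derives its own threshold. The function names, the toy counts, and the cutoff are all hypothetical.

```python
import math


def normal_pdf(v, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((v - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_high(x, pi, mu, var):
    """P(component 1 | value) for every value in x."""
    out = []
    for v in x:
        d0 = pi[0] * normal_pdf(v, mu[0], var[0])
        d1 = pi[1] * normal_pdf(v, mu[1], var[1])
        s = d0 + d1
        out.append(d1 / s if s > 0 else 0.5)
    return out


def fit_two_component_mixture(x, n_iter=200, var_floor=1e-3):
    """EM for a 1-D two-component Gaussian mixture.

    Returns, for each observation, the posterior probability of belonging
    to the component with the larger mean (interpreted here as "has phenotype").
    """
    n = len(x)
    mean_all = sum(x) / n
    var_all = max(sum((v - mean_all) ** 2 for v in x) / n, var_floor)
    pi, mu, var = [0.5, 0.5], [min(x), max(x)], [var_all, var_all]
    for _ in range(n_iter):
        resp = posterior_high(x, pi, mu, var)           # E-step
        n1 = sum(resp)
        n0 = n - n1
        if n0 < 1e-9 or n1 < 1e-9:                      # one component vanished
            break
        mu = [sum((1 - r) * v for r, v in zip(resp, x)) / n0,
              sum(r * v for r, v in zip(resp, x)) / n1]  # M-step: means
        var = [max(sum((1 - r) * (v - mu[0]) ** 2 for r, v in zip(resp, x)) / n0, var_floor),
               max(sum(r * (v - mu[1]) ** 2 for r, v in zip(resp, x)) / n1, var_floor)]
        pi = [n0 / n, n1 / n]                           # M-step: mixing weights
    resp = posterior_high(x, pi, mu, var)
    if mu[1] < mu[0]:                                   # keep "high" = component 1
        resp = [1 - r for r in resp]
    return resp


def ensemble_probability(feature_counts):
    """Average the per-feature posteriors (a simple stand-in for an ensemble)."""
    per_feature = [fit_two_component_mixture([math.log1p(c) for c in counts])
                   for counts in feature_counts]
    n = len(feature_counts[0])
    return [sum(p[i] for p in per_feature) / len(per_feature) for i in range(n)]


# Toy data (hypothetical): per-patient counts of the phenotype's ICD codes
# and of NLP mentions of the phenotype's concepts.
icd_counts = [0, 0, 1, 0, 0, 1, 0, 12, 15, 9, 20, 11]
nlp_counts = [0, 1, 0, 0, 2, 0, 1, 18, 25, 14, 30, 16]

probs = ensemble_probability([icd_counts, nlp_counts])
labels = [1 if p >= 0.5 else 0 for p in probs]  # 0.5 is an assumed cutoff
for i, (p, y) in enumerate(zip(probs, labels)):
    print(f"patient {i}: probability={p:.3f} label={y}")
```

Because the mixture is fit without labels, the sketch needs no manually annotated training set, which is the property that makes this style of approach scalable; labeled data would be used only to check the resulting probabilities.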