Sheng Yu1,2, Yumeng Ma3, Jessica Gronsbell4, Tianrun Cai5, Ashwin N Ananthakrishnan6, Vivian S Gainer7, Susanne E Churchill8, Peter Szolovits9, Shawn N Murphy7,10, Isaac S Kohane8, Katherine P Liao11, Tianxi Cai4. 1. Center for Statistical Science, Tsinghua University, Beijing, China. 2. Department of Industrial Engineering, Tsinghua University, Beijing, China. 3. Department of Mathematical Sciences, Tsinghua University, Beijing, China. 4. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA. 5. Department of Radiology, Brigham and Women's Hospital, Boston, MA, USA. 6. Division of Gastroenterology, Massachusetts General Hospital, Boston, MA, USA. 7. Research Information Science and Computing, Partners HealthCare, Charlestown, MA, USA. 8. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. 9. Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, Cambridge, MA, USA. 10. Department of Neurology, Massachusetts General Hospital, Boston, MA, USA. 11. Department of Medicine, Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.
Abstract
Objective: Electronic health record (EHR)-based phenotyping infers whether a patient has a disease based on the information in his or her EHR. A human-annotated training set with gold-standard disease status labels is usually required to build an algorithm for phenotyping based on a set of predictive features. The time intensiveness of annotation and feature curation severely limits the ability to achieve high-throughput phenotyping. While previous studies have successfully automated feature curation, annotation remains a major bottleneck. In this paper, we present PheNorm, a phenotyping algorithm that does not require expert-labeled samples for training. Methods: The most predictive features, such as the number of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes or mentions of the target phenotype, are normalized to resemble a normal mixture distribution with high area under the receiver operating curve (AUC) for prediction. The transformed features are then denoised and combined into a score for accurate disease classification. Results: We validated the accuracy of PheNorm with 4 phenotypes: coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis. The AUCs of the PheNorm score reached 0.90, 0.94, 0.95, and 0.94 for the 4 phenotypes, respectively, which were comparable to the accuracy of supervised algorithms trained with sample sizes of 100-300, with no statistically significant difference. Conclusion: The accuracy of the PheNorm algorithms is on par with algorithms trained with annotated samples. PheNorm fully automates the generation of accurate phenotyping algorithms and demonstrates the capacity for EHR-driven annotations to scale to the next level - phenotypic big data.
Objective: Electronic health record (EHR)-based phenotyping infers whether a patient has a disease based on the information in his or her EHR. A human-annotated training set with gold-standard disease status labels is usually required to build an algorithm for phenotyping based on a set of predictive features. The time intensiveness of annotation and feature curation severely limits the ability to achieve high-throughput phenotyping. While previous studies have successfully automated feature curation, annotation remains a major bottleneck. In this paper, we present PheNorm, a phenotyping algorithm that does not require expert-labeled samples for training. Methods: The most predictive features, such as the number of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes or mentions of the target phenotype, are normalized to resemble a normal mixture distribution with high area under the receiver operating curve (AUC) for prediction. The transformed features are then denoised and combined into a score for accurate disease classification. Results: We validated the accuracy of PheNorm with 4 phenotypes: coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis. The AUCs of the PheNorm score reached 0.90, 0.94, 0.95, and 0.94 for the 4 phenotypes, respectively, which were comparable to the accuracy of supervised algorithms trained with sample sizes of 100-300, with no statistically significant difference. Conclusion: The accuracy of the PheNorm algorithms is on par with algorithms trained with annotated samples. PheNorm fully automates the generation of accurate phenotyping algorithms and demonstrates the capacity for EHR-driven annotations to scale to the next level - phenotypic big data.
Authors: Ashwin N Ananthakrishnan; Tianxi Cai; Guergana Savova; Su-Chun Cheng; Pei Chen; Raul Guzman Perez; Vivian S Gainer; Shawn N Murphy; Peter Szolovits; Zongqi Xia; Stanley Shaw; Susanne Churchill; Elizabeth W Karlson; Isaac Kohane; Robert M Plenge; Katherine P Liao Journal: Inflamm Bowel Dis Date: 2013-06 Impact factor: 5.325
Authors: Sheng Yu; Katherine P Liao; Stanley Y Shaw; Vivian S Gainer; Susanne E Churchill; Peter Szolovits; Shawn N Murphy; Isaac S Kohane; Tianxi Cai Journal: J Am Med Inform Assoc Date: 2015-04-29 Impact factor: 4.497
Authors: Sheng Yu; Kanako K Kumamaru; Elizabeth George; Ruth M Dunne; Arash Bedayat; Matey Neykov; Andetta R Hunsaker; Karin E Dill; Tianxi Cai; Frank J Rybicki Journal: J Biomed Inform Date: 2014-08-10 Impact factor: 6.317
Authors: Catherine A McCarty; Rex L Chisholm; Christopher G Chute; Iftikhar J Kullo; Gail P Jarvik; Eric B Larson; Rongling Li; Daniel R Masys; Marylyn D Ritchie; Dan M Roden; Jeffery P Struewing; Wendy A Wolf Journal: BMC Med Genomics Date: 2011-01-26 Impact factor: 3.063
Authors: Joshua C Denny; Marylyn D Ritchie; Melissa A Basford; Jill M Pulley; Lisa Bastarache; Kristin Brown-Gentry; Deede Wang; Dan R Masys; Dan M Roden; Dana C Crawford Journal: Bioinformatics Date: 2010-03-24 Impact factor: 6.937
Authors: Victor M Castro; Jessica Minnier; Shawn N Murphy; Isaac Kohane; Susanne E Churchill; Vivian Gainer; Tianxi Cai; Alison G Hoffnagle; Yael Dai; Stefanie Block; Sydney R Weill; Mireya Nadal-Vicens; Alisha R Pollastri; J Niels Rosenquist; Sergey Goryachev; Dost Ongur; Pamela Sklar; Roy H Perlis; Jordan W Smoller Journal: Am J Psychiatry Date: 2014-12-12 Impact factor: 18.112
Authors: Sheng Yu; Abhishek Chakrabortty; Katherine P Liao; Tianrun Cai; Ashwin N Ananthakrishnan; Vivian S Gainer; Susanne E Churchill; Peter Szolovits; Shawn N Murphy; Isaac S Kohane; Tianxi Cai Journal: J Am Med Inform Assoc Date: 2017-04-01 Impact factor: 4.497
Authors: Katherine P Liao; Jiehuan Sun; Tianrun A Cai; Nicholas Link; Chuan Hong; Jie Huang; Jennifer E Huffman; Jessica Gronsbell; Yichi Zhang; Yuk-Lam Ho; Victor Castro; Vivian Gainer; Shawn N Murphy; Christopher J O'Donnell; J Michael Gaziano; Kelly Cho; Peter Szolovits; Isaac S Kohane; Sheng Yu; Tianxi Cai Journal: J Am Med Inform Assoc Date: 2019-11-01 Impact factor: 4.497
Authors: Majid Afshar; Dmitriy Dligach; Brihat Sharma; Xiaoyuan Cai; Jason Boyda; Steven Birch; Daniel Valdez; Suzan Zelisko; Cara Joyce; François Modave; Ron Price Journal: J Am Med Inform Assoc Date: 2019-11-01 Impact factor: 4.497
Authors: Marzyeh Ghassemi; Tristan Naumann; Peter Schulam; Andrew L Beam; Irene Y Chen; Rajesh Ranganath Journal: AMIA Jt Summits Transl Sci Proc Date: 2020-05-30
Authors: Thomas H McCoy; Larry Han; Amelia M Pellegrini; Rudolph E Tanzi; Sabina Berretta; Roy H Perlis Journal: Alzheimers Dement Date: 2020-01-16 Impact factor: 21.566
Authors: George Hripcsak; Ning Shang; Peggy L Peissig; Luke V Rasmussen; Cong Liu; Barbara Benoit; Robert J Carroll; David S Carrell; Joshua C Denny; Ozan Dikilitas; Vivian S Gainer; Kayla Marie Howell; Jeffrey G Klann; Iftikhar J Kullo; Todd Lingren; Frank D Mentch; Shawn N Murphy; Karthik Natarajan; Jennifer A Pacheco; Wei-Qi Wei; Ken Wiley; Chunhua Weng Journal: J Biomed Inform Date: 2019-07-17 Impact factor: 6.317
Authors: David S Carrell; Bradley A Malin; David J Cronkite; John S Aberdeen; Cheryl Clark; Muqun Rachel Li; Dikshya Bastakoty; Steve Nyemba; Lynette Hirschman Journal: J Am Med Inform Assoc Date: 2020-07-01 Impact factor: 4.497