Vibhu Agarwal (1), Tanya Podchiyska (2), Juan M Banda (3), Veena Goel (4,5), Tiffany I Leung (6), Evan P Minty (2,7), Timothy E Sweeney (2,8), Elsie Gyang (9), Nigam H Shah (3).
1. Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA; vibhua@stanford.edu.
2. Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA.
3. Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA 94305-5479, USA.
4. Department of Pediatrics, Stanford University School of Medicine, Stanford CA 94305-5208, USA.
5. Department of Clinical Informatics, Stanford Children's Health, Stanford CA 94305-5474, USA.
6. Division of General Medical Disciplines, Stanford University, Stanford CA 94305, USA.
7. Faculty of Medicine, University of Calgary, Calgary Alberta, T2N 4N1, Canada.
8. Department of Surgery, Stanford Hospital & Clinics, Stanford CA 94305-2200, USA.
9. Division of Vascular Surgery, Stanford Hospital & Clinics, Stanford CA 94305-5642, USA.
Abstract
OBJECTIVE: Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of using semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record. METHODS: We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard. RESULTS: Our models achieve precision and accuracy of 0.90 and 0.89 for Type 2 diabetes mellitus, and 0.86 and 0.89 for myocardial infarction. Local implementations of the previously validated rule-based definitions achieve precision and accuracy of 0.96 and 0.92 for Type 2 diabetes mellitus, and 0.84 and 0.87 for myocardial infarction. DISCUSSION: We have demonstrated the feasibility of learning phenotype models using imperfectly labeled data for a chronic and an acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach. CONCLUSIONS: Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.
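The workflow described under METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword list, the toy notes, the three binary features, the gold-standard labels, and all function names are hypothetical. The optimizer (proximal gradient descent with soft-thresholding) is one standard way to fit an L1 penalized logistic regression and is assumed here only to keep the example dependency-free.

```python
import math

# Hypothetical keyword list for the phenotype of interest (illustration only).
KEYWORDS = ("type 2 diabetes", "t2dm")


def noisy_label(note, keywords=KEYWORDS):
    """Label a patient 1 if any keyword occurs in the note text, else 0."""
    text = note.lower()
    return int(any(k in text for k in keywords))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.01, lr=0.5, epochs=500):
    """Fit L1 penalized logistic regression by proximal gradient descent.

    The intercept b is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def predict(X, w, b):
    """Classify each row as 1 when the predicted probability is >= 0.5."""
    return [int(sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5)
            for xi in X]


def precision_accuracy(pred, gold):
    """Precision and accuracy of predictions against gold-standard labels."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    correct = sum(1 for p, g in zip(pred, gold) if p == g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return precision, correct / len(gold)


# Toy patients: (note text, [metformin_rx, hba1c_high, troponin_high]).
patients = [
    ("Follow-up for type 2 diabetes.", [1, 1, 0]),
    ("T2DM, poorly controlled.", [1, 1, 0]),
    ("Known type 2 diabetes mellitus.", [1, 1, 0]),
    ("Elevated glucose, started metformin.", [1, 1, 0]),  # keywords miss this one
    ("Chest pain, troponin elevated.", [0, 0, 1]),
    ("Annual physical, no complaints.", [0, 0, 0]),
    ("Ankle sprain.", [0, 0, 0]),
    ("Seasonal allergies.", [0, 0, 0]),
]
gold = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical manually reviewed labels

X = [features for _, features in patients]
noisy = [noisy_label(note) for note, _ in patients]  # noisy training labels
w, b = fit_l1_logistic(X, noisy)
preds = predict(X, w, b)
precision, accuracy = precision_accuracy(preds, gold)
print("noisy labels:", noisy)
print("predictions: ", preds)
print("precision, accuracy:", precision, accuracy)
```

In this toy run the fourth patient receives a noisy label of 0 because no keyword appears in the note, yet shares the same features as the keyword-positive patients. This is meant only to show how a model trained on imperfect labels can still be scored against a separate gold standard. Evaluating on the training rows, as done here for brevity, would not be appropriate in a real study.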