Sheng Yu1,2, Abhishek Chakrabortty3, Katherine P Liao4, Tianrun Cai5, Ashwin N Ananthakrishnan6, Vivian S Gainer7, Susanne E Churchill8, Peter Szolovits9, Shawn N Murphy7,10, Isaac S Kohane8, Tianxi Cai3. 1. Center for Statistical Science, Tsinghua University, Beijing, China. 2. Department of Industrial Engineering, Tsinghua University, Beijing, China. 3. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA. 4. Division of Rheumatology, Brigham and Women's Hospital, Boston, Massachusetts, USA. 5. Department of Radiology, Brigham and Women's Hospital, Boston, Massachusetts, USA. 6. Division of Gastroenterology, Massachusetts General Hospital, Boston, Massachusetts, USA. 7. Research IS and Computing, Partners HealthCare, Charlestown, Massachusetts, USA. 8. Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA. 9. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA. 10. Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, USA.
Abstract
OBJECTIVE: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy. METHODS: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype's International Classification of Diseases, Ninth Revision and natural language processing counts, acting as noisy surrogates to the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features. RESULTS: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis using various numbers of labels to compare the performance of features selected by SAFE, a previously published automated feature extraction for phenotyping procedure, and domain experts. The out-of-sample area under the receiver operating characteristic curve and F -score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes. CONCLUSION: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the needed number of gold-standard labels. SAFE also potentially identifies important features missed by automated feature extraction for phenotyping or experts.
OBJECTIVE: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy. METHODS: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype's International Classification of Diseases, Ninth Revision and natural language processing counts, acting as noisy surrogates to the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features. RESULTS: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis using various numbers of labels to compare the performance of features selected by SAFE, a previously published automated feature extraction for phenotyping procedure, and domain experts. The out-of-sample area under the receiver operating characteristic curve and F -score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes. CONCLUSION: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the needed number of gold-standard labels. SAFE also potentially identifies important features missed by automated feature extraction for phenotyping or experts.
Authors: Richard H White; Martina Garcia; Banafsheh Sadeghi; Daniel J Tancredi; Patricia Zrelak; Joanne Cuny; Pradeep Sama; Harriet Gammon; Stephen Schmaltz; Patrick S Romano Journal: Thromb Res Date: 2010-04-28 Impact factor: 3.944
Authors: Victor M Castro; Jessica Minnier; Shawn N Murphy; Isaac Kohane; Susanne E Churchill; Vivian Gainer; Tianxi Cai; Alison G Hoffnagle; Yael Dai; Stefanie Block; Sydney R Weill; Mireya Nadal-Vicens; Alisha R Pollastri; J Niels Rosenquist; Sergey Goryachev; Dost Ongur; Pamela Sklar; Roy H Perlis; Jordan W Smoller Journal: Am J Psychiatry Date: 2014-12-12 Impact factor: 18.112
Authors: Marylyn D Ritchie; Joshua C Denny; Rebecca L Zuvich; Dana C Crawford; Jonathan S Schildcrout; Lisa Bastarache; Andrea H Ramirez; Jonathan D Mosley; Jill M Pulley; Melissa A Basford; Yuki Bradford; Luke V Rasmussen; Jyotishman Pathak; Christopher G Chute; Iftikhar J Kullo; Catherine A McCarty; Rex L Chisholm; Abel N Kho; Christopher S Carlson; Eric B Larson; Gail P Jarvik; Nona Sotoodehnia; Teri A Manolio; Rongling Li; Daniel R Masys; Jonathan L Haines; Dan M Roden Journal: Circulation Date: 2013-03-05 Impact factor: 29.690
Authors: Victor M Castro; Caitlin C Clements; Shawn N Murphy; Vivian S Gainer; Maurizio Fava; Jeffrey B Weilburg; Jane L Erb; Susanne E Churchill; Isaac S Kohane; Dan V Iosifescu; Jordan W Smoller; Roy H Perlis Journal: BMJ Date: 2013-01-29
Authors: Katherine P Liao; Ashwin N Ananthakrishnan; Vishesh Kumar; Zongqi Xia; Andrew Cagan; Vivian S Gainer; Sergey Goryachev; Pei Chen; Guergana K Savova; Denis Agniel; Susanne Churchill; Jaeyoung Lee; Shawn N Murphy; Robert M Plenge; Peter Szolovits; Isaac Kohane; Stanley Y Shaw; Elizabeth W Karlson; Tianxi Cai Journal: PLoS One Date: 2015-08-24 Impact factor: 3.240
Authors: Katherine P Liao; Tianxi Cai; Guergana K Savova; Shawn N Murphy; Elizabeth W Karlson; Ashwin N Ananthakrishnan; Vivian S Gainer; Stanley Y Shaw; Zongqi Xia; Peter Szolovits; Susanne Churchill; Isaac Kohane Journal: BMJ Date: 2015-04-24
Authors: Katherine P Liao; Jiehuan Sun; Tianrun A Cai; Nicholas Link; Chuan Hong; Jie Huang; Jennifer E Huffman; Jessica Gronsbell; Yichi Zhang; Yuk-Lam Ho; Victor Castro; Vivian Gainer; Shawn N Murphy; Christopher J O'Donnell; J Michael Gaziano; Kelly Cho; Peter Szolovits; Isaac S Kohane; Sheng Yu; Tianxi Cai Journal: J Am Med Inform Assoc Date: 2019-11-01 Impact factor: 4.497
Authors: Majid Afshar; Dmitriy Dligach; Brihat Sharma; Xiaoyuan Cai; Jason Boyda; Steven Birch; Daniel Valdez; Suzan Zelisko; Cara Joyce; François Modave; Ron Price Journal: J Am Med Inform Assoc Date: 2019-11-01 Impact factor: 4.497
Authors: Sheng Yu; Yumeng Ma; Jessica Gronsbell; Tianrun Cai; Ashwin N Ananthakrishnan; Vivian S Gainer; Susanne E Churchill; Peter Szolovits; Shawn N Murphy; Isaac S Kohane; Katherine P Liao; Tianxi Cai Journal: J Am Med Inform Assoc Date: 2018-01-01 Impact factor: 4.497
Authors: Jejo D Koola; Sharon E Davis; Omar Al-Nimri; Sharidan K Parr; Daniel Fabbri; Bradley A Malin; Samuel B Ho; Michael E Matheny Journal: J Biomed Inform Date: 2018-03-09 Impact factor: 6.317
Authors: Lingjiao Zhang; Xiruo Ding; Yanyuan Ma; Naveen Muthu; Imran Ajmal; Jason H Moore; Daniel S Herman; Jinbo Chen Journal: J Am Med Inform Assoc Date: 2020-01-01 Impact factor: 4.497