Victor M Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Anil Can, Muhammad Abd-El-Barr, Vivian Gainer, Nancy A Shadick, Shawn Murphy, Tianxi Cai, Guergana Savova, Scott T Weiss, Rose Du.
From Research Information Systems and Computing (V.M.C., V.G., S.M.), Partners Healthcare; Boston Children's Hospital Informatics Program (D.D., S.F., G.S.); Harvard Medical School (D.D., S.Y., A.C., M.A.-E.-B., N.A.S., S.M., S.T.W., R.D.); Department of Medicine (S.Y., S.T.W.), Department of Neurosurgery (A.C., M.A.-E.-B., R.D.), Division of Rheumatology, Immunology and Allergy (N.A.S.), and Channing Division of Network Medicine (S.T.W., R.D.), Brigham and Women's Hospital, Boston, MA; Center for Statistical Science (S.Y.), Tsinghua University, Beijing, China; Department of Neurology (S.M.), Massachusetts General Hospital; and Biostatistics (T.C.), Harvard School of Public Health, Boston, MA. rdu@partners.org.
Abstract
OBJECTIVE: To use natural language processing (NLP) in conjunction with the electronic medical record (EMR) to accurately identify patients with cerebral aneurysms and their matched controls.
METHODS: ICD-9 and Current Procedural Terminology codes were used to obtain an initial data mart of potential aneurysm patients from the EMR. NLP was then used to train a classification algorithm, with .632 bootstrap cross-validation used to correct for overfitting bias. The classification rule was then applied to the full data mart. Additional validation was performed on 300 patients classified as having aneurysms. Controls were obtained by matching on age, sex, race, and healthcare use.
RESULTS: Of 4.2 million patients, we identified 55,675 with ICD-9 and Current Procedural Terminology codes consistent with cerebral aneurysms. Of those, 16,823 patients had the term "aneurysm" occur near relevant anatomic terms. After training, a final algorithm consisting of 8 coded and 14 NLP variables was selected, yielding an overall area under the receiver-operating characteristic curve of 0.95. After the final algorithm was applied, 5,589 patients were classified as having aneurysms, and 54,952 controls were matched to those patients. The positive predictive value based on a validation cohort of 300 patients was 0.86.
CONCLUSIONS: We harnessed the power of the EMR by applying NLP to obtain a large cohort of patients with intracranial aneurysms and their matched controls. Such algorithms can be generalized to other diseases for epidemiologic and genetic studies.
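For readers unfamiliar with the .632 bootstrap named in the methods, the following is a minimal sketch of how such a correction is commonly applied to an area under the ROC curve. It is not the authors' implementation: the abstract does not give those details. The weighting of 0.368 on the apparent (training-set) AUC and 0.632 on the mean out-of-bag AUC is the standard form of the estimator carried over to AUC; the toy centroid classifier, the synthetic data, and all function names (`auc`, `fit_centroid_scorer`, `bootstrap_632_auc`) are assumptions made purely for illustration.

```python
import numpy as np


def auc(y, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties between a positive and a negative count as half a win.
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def fit_centroid_scorer(X, y):
    """Hypothetical stand-in model: project onto the difference of class means."""
    w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return lambda X_new: X_new @ w


def bootstrap_632_auc(X, y, fit, n_boot=200, seed=0):
    """Return (.632 estimate, apparent AUC, mean out-of-bag AUC)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Apparent AUC: train and evaluate on the same data (optimistically biased).
    apparent = auc(y, fit(X, y)(X))
    oob_aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample n rows with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # rows never drawn in this replicate
        # Skip replicates where either split lacks one of the two classes.
        if len(np.unique(y[idx])) < 2 or len(np.unique(y[oob])) < 2:
            continue
        model = fit(X[idx], y[idx])
        oob_aucs.append(auc(y[oob], model(X[oob])))
    oob_mean = float(np.mean(oob_aucs))
    # .632 rule: blend the optimistic and pessimistic estimates.
    return 0.368 * apparent + 0.632 * oob_mean, apparent, oob_mean


if __name__ == "__main__":
    # Synthetic data only; unrelated to the study cohort.
    rng = np.random.default_rng(42)
    y = rng.integers(0, 2, size=200)
    X = rng.normal(size=(200, 5)) + 0.8 * y[:, None]
    est, apparent, oob_mean = bootstrap_632_auc(X, y, fit_centroid_scorer)
    print(f"apparent={apparent:.3f} oob={oob_mean:.3f} .632={est:.3f}")
```

The blend reflects that each bootstrap sample contains, on average, about 63.2% of the distinct original observations, so the out-of-bag estimate alone tends to be pessimistic while the apparent estimate is optimistic.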