Rashmee U Shah1, R Kannan Mutharasan2, Faraz S Ahmad2, Anna G Rosenblatt3, Hawkins C Gay2, Benjamin A Steinberg1, Mark Yandell4,5, Martin Tristani-Firouzi6,7, Jake Klewer8, Rebeka Mukherjee1, Donald M Lloyd-Jones9. 1. Division of Cardiovascular Medicine, Department of Internal Medicine (R.U.S., B.A.S., R.M.), University of Utah School of Medicine, Salt Lake City. 2. Division of Cardiology, Department of Medicine (R.K.M., F.S.A., H.C.G.), Northwestern University Feinberg School of Medicine, Chicago, IL. 3. Division of Cardiology (A.G.R.), The University of Texas Southwestern Medical Center, Dallas. 4. Eccles Institute of Human Genetics (M.Y.), University of Utah, Salt Lake City. 5. USTAR Center for Genetic Discovery (M.Y.), University of Utah, Salt Lake City. 6. Division of Pediatric Cardiology (M.T.-F.), University of Utah School of Medicine, Salt Lake City. 7. Nora Eccles Harrison Cardiovascular Research and Training Institute (M.T.-F.), University of Utah, Salt Lake City. 8. Department of Internal Medicine (J.K.), University of Utah School of Medicine, Salt Lake City. 9. Department of Preventive Medicine (D.M.L.-J.), Northwestern University Feinberg School of Medicine, Chicago, IL.
Abstract
BACKGROUND: The electronic medical record contains a wealth of information buried in free text. We created a natural language processing algorithm to identify patients with atrial fibrillation (AF) using text alone.

METHODS AND RESULTS: We created 3 data sets from patients with at least one AF billing code from 2010 to 2017: a training set (n=886), an internal validation set from site no. 1 (n=285), and an external validation set from site no. 2 (n=276). A team of clinicians reviewed and adjudicated patients as AF present or absent, which served as the reference standard. We trained 54 algorithms to classify each patient, varying the model, the number of features, the number of stop words, and the method used to create the feature set. The algorithm with the highest F-score (the harmonic mean of sensitivity and positive predictive value) in the training set was applied to the validation sets. F-scores and areas under the receiver operating characteristic curve (AUC) were compared between sites no. 1 and no. 2 using bootstrapping. Adjudicated AF prevalence was 75.1% at site no. 1 and 86.2% at site no. 2. Among the 54 algorithms, the best-performing model was logistic regression using 1000 features, 100 stop words, and the term frequency-inverse document frequency (TF-IDF) method to create the feature set, with sensitivity 92.8%, specificity 93.9%, and AUC 0.93 in the training set. Performance at site no. 1 was sensitivity 92.5% and specificity 88.7%, with AUC 0.91. Performance at site no. 2 was sensitivity 89.5% and specificity 71.1%, with AUC 0.80. The F-score was lower at site no. 2 than at site no. 1 (92.5% [SD, 1.1%] versus 94.2% [SD, 1.1%]; P<0.001).

CONCLUSIONS: We developed a natural language processing algorithm that identifies patients with AF using text alone, with >90% F-score at 2 separate sites. This approach makes better use of the clinical narrative and creates an opportunity for precise, high-throughput cohort identification.
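The winning configuration described above (TF-IDF features fed to a logistic regression classifier, evaluated by sensitivity, positive predictive value, and their harmonic mean, the F-score) can be sketched as follows. This is an illustrative sketch, not the authors' code: the toy notes, labels, and library choices are assumptions, and the paper's stop-word list (the 100 most frequent corpus terms) is approximated here by scikit-learn's built-in English list.

```python
# Illustrative sketch of a TF-IDF + logistic regression note classifier.
# The four clinical notes and their adjudicated labels below are invented
# solely for demonstration; they are not study data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

notes = [
    "patient with paroxysmal atrial fibrillation started on apixaban",
    "persistent atrial fibrillation rate controlled with metoprolol",
    "normal sinus rhythm on ecg with no arrhythmia",
    "sinus rhythm on telemetry without documented arrhythmia",
]
labels = np.array([1, 1, 0, 0])  # 1 = AF present, 0 = AF absent (adjudicated)

# max_features=1000 mirrors the best-performing configuration in the abstract;
# the paper removed the 100 most frequent terms as stop words instead of
# using a fixed English list.
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X = vectorizer.fit_transform(notes)

clf = LogisticRegression().fit(X, labels)
pred = clf.predict(X)

# Sensitivity, PPV, and their harmonic mean (the F-score used for model selection).
tn, fp, fn, tp = confusion_matrix(labels, pred).ravel()
sensitivity = tp / (tp + fn)
ppv = tp / (tp + fp)
f_score = 2 * sensitivity * ppv / (sensitivity + ppv)
print(f"sensitivity={sensitivity:.2f} ppv={ppv:.2f} f_score={f_score:.2f}")
```

In practice the model would be fit on the training set and the F-score reported on held-out validation sets, as the study did across the two sites.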
Keywords:
algorithm; artificial intelligence; atrial fibrillation; electronic medical record; natural language processing; prevalence