Susan Gruber (1), Douglas Krakower (2,3,4,5), John T Menchaca (5), Katherine Hsu (6,7), Rebecca Hawrusik (6), Judith C Maro (5), Noelle M Cocoros (5), Benjamin A Kruskal (8), Ira B Wilson (9), Kenneth H Mayer (2,3,4), Michael Klompas (5,10).
1. Putnam Data Sciences, LLC, Cambridge, Massachusetts, USA.
2. Division of Infectious Diseases, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA.
3. The Fenway Institute, Fenway Health, Boston, Massachusetts, USA.
4. Harvard Medical School, Boston, Massachusetts, USA.
5. Department of Population Medicine, Harvard Medical School, Boston, Massachusetts, USA.
6. Massachusetts Department of Public Health, Boston, Massachusetts, USA.
7. Department of Pediatrics, Boston Medical Center, Boston, Massachusetts, USA.
8. Atrius Health, Boston, Massachusetts, USA.
9. Department of Health Services, Policy and Practice, Brown University, Providence, Rhode Island, USA.
10. Division of Infectious Diseases, Brigham and Women's Hospital, Boston, Massachusetts, USA.
Abstract
BACKGROUND: Human immunodeficiency virus (HIV) pre-exposure prophylaxis (PrEP) protects high-risk patients from becoming infected with HIV. Clinicians need help identifying candidates for PrEP based on information routinely collected in electronic health records (EHRs). The greatest statistical challenge in developing a risk prediction model is that HIV acquisition is extremely rare. METHODS: Data consisted of 180 covariates (demographics, diagnoses, treatments, prescriptions) extracted from the records of 399 385 patients (150 cases) seen between 2007 and 2015 at Atrius Health, a clinical network in Massachusetts. Super learner is an ensemble machine learning algorithm that uses k-fold cross-validation to evaluate and combine predictions from a collection of algorithms. We trained 42 variants of sophisticated algorithms, using different sampling schemes that more evenly balanced the ratio of cases to controls. We compared super learner's cross-validated area under the receiver operating characteristic curve (cv-AUC) with that of each individual algorithm. RESULTS: The least absolute shrinkage and selection operator (lasso) using a 1:20 class ratio outperformed the super learner (cv-AUC = 0.86 vs 0.84). A traditional logistic regression model restricted to 23 clinician-selected main terms was slightly inferior (cv-AUC = 0.81). CONCLUSION: Machine learning was successful at developing a model to predict 1-year risk of acquiring HIV based on a physician-curated set of predictors extracted from EHRs.
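As a rough illustration of the two ideas the abstract combines, rebalancing the case:control ratio and scoring by cross-validated AUC, the following is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the choice to down-sample controls only in the training folds, and the averaging of per-fold AUCs are all assumptions made for illustration; `fit` is a hypothetical user-supplied function that takes training rows and labels and returns a scoring function.

```python
import random


def downsample_controls(labels, controls_per_case, seed=0):
    """Keep every case (label 1) and at most controls_per_case controls per case."""
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(controls), controls_per_case * len(cases))
    return sorted(cases + rng.sample(controls, n_keep))


def auc(scores, labels):
    """AUC as the chance a random case outscores a random control (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, spreading cases and controls evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def cv_auc(X, y, fit, k=5, controls_per_case=20, seed=0):
    """Mean held-out AUC; controls are down-sampled in training folds only (assumption)."""
    fold_aucs = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        keep = downsample_controls([y[i] for i in train], controls_per_case, seed)
        train_ds = [train[j] for j in keep]
        # fit is hypothetical: it returns a function mapping one row to a risk score.
        score = fit([X[i] for i in train_ds], [y[i] for i in train_ds])
        fold_aucs.append(
            auc([score(X[i]) for i in held_out], [y[i] for i in held_out])
        )
    return sum(fold_aucs) / len(fold_aucs)
```

With `controls_per_case=20` the training data approximate the 1:20 class ratio mentioned in the results, while each held-out fold keeps the original class mix, so the AUC is estimated on data that resemble the full cohort. A candidate learner such as a lasso-penalized logistic regression would be passed in through `fit`.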