Kenney Ng, Steven R Steinhubl, Christopher deFilippi, Sanjoy Dey, Walter F Stewart. From the Center for Computational Health, IBM Research, T.J. Watson Research Center, Cambridge, MA (K.N.); Cardiovascular Wellness, Geisinger Health System, Danville, PA (S.R.S.); Digital Medicine, Scripps Health, San Diego, CA (S.R.S.); Cardiology, Inova Heart and Vascular Institute, Fairfax, VA (C.d.); Center for Computational Health, IBM Research, T.J. Watson Research Center, Yorktown Heights, NY (S.D.); and Research, Sutter Health Research, Walnut Creek, CA (W.F.S.). kenney.ng@us.ibm.com.
Abstract
BACKGROUND: Using electronic health records data to predict events and onset of diseases is increasingly common. Relatively little is known, however, about the tradeoffs between data requirements and model utility. METHODS AND RESULTS: We examined the performance of machine learning models trained to detect prediagnostic heart failure in primary care patients using longitudinal electronic health records data. Model performance was assessed in relation to data requirements defined by the prediction window length (time before clinical diagnosis), the observation window length (duration of observation before the prediction window), the number of different data domains (data diversity), the number of patient records in the training data set (data quantity), and the density of patient encounters (data density). A total of 1684 incident heart failure cases and 13 525 sex-, age-category-, and clinic-matched controls were used for modeling. Model performance improved as (1) the prediction window length decreased, especially when <2 years; (2) the observation window length increased, leveling off after 2 years; (3) the training data set size increased, leveling off after 4000 patients; (4) more diverse data types were used, with diagnosis, medication order, and hospitalization data, in that order, forming the most important combination; and (5) data were confined to patients who had ≥10 phone or face-to-face encounters in 2 years. CONCLUSIONS: These empirical findings suggest possible guidelines for the minimum amount and type of data needed to train effective disease onset predictive models using longitudinal electronic health records data.
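The window definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 365-day year, the half-open date intervals, and the choice to count encounters inside the observation window are all assumptions made for illustration. The index date is the clinical diagnosis date for a case; how an index date is assigned to a matched control is not described in the abstract and is left to the caller.

```python
from datetime import date, timedelta

DAYS_PER_YEAR = 365  # assumption: fixed-length years, leap days ignored


def observation_window(index_date, prediction_years, observation_years):
    """Return (start, end) of the observation window.

    The prediction window is the gap immediately before index_date;
    the observation window is the period immediately before that gap.
    """
    end = index_date - timedelta(days=prediction_years * DAYS_PER_YEAR)
    start = end - timedelta(days=observation_years * DAYS_PER_YEAR)
    return start, end


def events_in_window(events, start, end):
    """Keep (event_date, code) pairs with start <= event_date < end."""
    return [(d, code) for d, code in events if start <= d < end]


def meets_density(encounter_dates, start, end, min_encounters=10):
    """True if at least min_encounters encounters fall inside [start, end)."""
    return sum(start <= d < end for d in encounter_dates) >= min_encounters


# Hypothetical patient: diagnosed 2010-01-01, 1-year prediction window,
# 2-year observation window.
start, end = observation_window(date(2010, 1, 1), 1, 2)
records = [
    (date(2006, 1, 1), "dx"),  # before the observation window: dropped
    (date(2008, 6, 1), "dx"),  # inside the observation window: kept
    (date(2009, 6, 1), "rx"),  # inside the prediction window: dropped
]
kept = events_in_window(records, start, end)
```

Only events inside the observation window would feed the model; anything in the prediction window is withheld so that the model is evaluated on data available that far ahead of diagnosis.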