BACKGROUND: The increasing availability of clinical data from electronic medical records (EMRs) has created opportunities for secondary uses of health information. When used in machine learning classification, many data features must first be transformed by discretization.

OBJECTIVE: To evaluate six discretization strategies, both supervised and unsupervised, using EMR data.

MATERIALS AND METHODS: We classified laboratory data (arterial blood gas (ABG) measurements) and physiologic data (cardiac output (CO) measurements) derived from adult patients in the intensive care unit using decision trees and naïve Bayes classifiers. Continuous features were partitioned using two supervised and four unsupervised discretization strategies. The resulting classification accuracy was compared with that obtained with the original, continuous data.

RESULTS: Supervised methods were more accurate and consistent than unsupervised methods, but tended to produce larger decision trees. Among the unsupervised methods, equal frequency and k-means performed well overall, while equal width was significantly less accurate.

DISCUSSION: This is, we believe, the first dedicated evaluation of discretization strategies using EMR data. It is unlikely that any one discretization method applies universally to EMR data. Performance was influenced by the choice of class labels and, in the case of unsupervised methods, the number of intervals. In selecting the number of intervals there is generally a trade-off between greater accuracy and greater consistency.

CONCLUSIONS: In general, supervised methods yield higher accuracy, but are constrained to a single specific application. Unsupervised methods do not require class labels and can produce discretized data that can be used for multiple purposes.
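To make the two simplest unsupervised strategies named in the abstract concrete, the sketch below shows one common way to compute equal-width and equal-frequency cut points. It is an illustration only, not the authors' implementation: the function names, the sample values, and the choice of four intervals are all assumptions introduced here, and the sample is not data from the study.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Interior cut points splitting [min, max] into k equally wide intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Interior cut points chosen so each interval holds about len(values)/k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def discretize(values, edges):
    """Map each value to the index of the interval it falls into."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical skewed sample with a single extreme value (not study data).
sample = [1, 2, 3, 4, 5, 6, 7, 40]

ew = discretize(sample, equal_width_edges(sample, 4))
ef = discretize(sample, equal_frequency_edges(sample, 4))

print(ew)  # [0, 0, 0, 0, 0, 0, 0, 3]
print(ef)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

In this toy sample, a single extreme value stretches the equal-width range so that seven of the eight points fall into one interval, whereas equal frequency spreads the points evenly. This sensitivity to outliers is a well-known property of equal-width binning in general; the abstract itself does not state why equal width was less accurate.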