Juan-Jose Beunza1, Enrique Puertas2, Ester García-Ovejero3, Gema Villalba4, Emilia Condes5, Gergana Koleva5, Cristian Hurtado6, Manuel F Landecho7.
1. Machine Learning Health Working Group, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain; Department of Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain. Electronic address: juanjo@juanjobeunza.com.
2. Machine Learning Health Working Group, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain; Department of Computer Science and Technology, School of Architecture, Engineering and Design, Universidad Europea de Madrid, Madrid, Spain.
3. Machine Learning Health Working Group, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain; Department of Nursing and Psychology, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain.
4. Machine Learning Health Working Group, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain; Indra, Madrid, Spain.
5. Machine Learning Health Working Group, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain.
6. Machine Learning Health Working Group, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain; Department of Pharmacy and Biotechnology, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain.
7. Machine Learning Health Working Group, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Madrid, Spain; Department of Internal Medicine, Clinica Universidad de Navarra, Pamplona, Spain.
Abstract
AIM: To compare several supervised machine learning (ML) algorithms for predicting clinical events in terms of their internal validity and accuracy, and to compare the results obtained on two statistical software platforms.
MATERIALS AND METHODS: The data come from the open database of the Framingham Heart Study, which began in 1948 in Framingham, Massachusetts, as a prospective study of risk factors for cardiovascular disease. Using data mining processes, three data models were built, and a comparative methodological study of five ML algorithms (decision tree, random forest, support vector machines, neural networks, and logistic regression) was carried out. The overall criterion for selecting the set of hyperparameters and the type of data manipulation was the area under the curve (AUC). The data were analyzed with R-Studio® and RapidMiner®.
RESULTS: The Framingham study open database contains 4240 observations. In R-Studio, the algorithm with the highest AUC was the neural network applied to a model that excluded all observations with at least one missing value (AUC = 0.71); in RapidMiner, with the same model, the best algorithm was support vector machines (AUC = 0.75).
CONCLUSIONS: ML algorithms can reinforce the diagnostic and prognostic capacity of traditional regression techniques. Both the applicability of the algorithms and the results obtained with them differed depending on the software platform used for the data analysis.
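The two mechanics the abstract relies on, excluding observations with missing values and ranking algorithms by AUC, can be illustrated with a minimal Python sketch. This is not the authors' code, which was run in R-Studio and RapidMiner. The toy rows, the model names `model_a` and `model_b`, and their scores are hypothetical values chosen only for illustration. AUC is computed here from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
def complete_cases(rows):
    """Keep only the rows that have no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r)]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_by_auc(labels, scores_by_model):
    """Return the name of the model with the highest AUC and all AUCs."""
    aucs = {name: auc(labels, s) for name, s in scores_by_model.items()}
    return max(aucs, key=aucs.get), aucs


# Hypothetical toy data: (age, cholesterol, event); None marks a missing value.
rows = [
    (50, 220.0, 1),
    (61, None, 0),
    (45, 180.0, 0),
    (70, 260.0, 1),
    (38, 190.0, 0),
    (None, 240.0, 1),
]
clean = complete_cases(rows)          # rows with any None are dropped
labels = [r[-1] for r in clean]

# Hypothetical predicted risks from two models, one score per complete row.
scores_by_model = {
    "model_a": [0.8, 0.3, 0.6, 0.4],
    "model_b": [0.5, 0.6, 0.7, 0.2],
}
best, aucs = best_by_auc(labels, scores_by_model)
print(best, aucs)
```

The pairwise loop is quadratic in the number of observations, which is acceptable for a sketch. For a dataset of a few thousand rows, a sort-based rank computation or an established library routine would be the more usual choice.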
Keywords:
Area under curve; Diagnostic techniques and procedures; Machine learning; Research techniques; Supervised machine learning; Support vector machines