Debbie Rankin (1), Michaela Black (1), Raymond Bond (2), Jonathan Wallace (2), Maurice Mulvenna (2), Gorka Epelde (3, 4)
(1) School of Computing, Engineering and Intelligent Systems, Ulster University, Northland Road, Derry~Londonderry, GB
(2) School of Computing, Ulster University, Jordanstown, GB
(3) Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, ES
(4) Biodonostia Health Research Institute, eHealth Group, Donostia-San Sebastián, ES
Abstract
BACKGROUND: The exploitation of synthetic data in healthcare is at an early stage. Synthetic data generation could unlock the vast potential within healthcare datasets that are too sensitive for release due to privacy concerns. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalisability are scarce.
OBJECTIVE: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.
METHODS: A total of 19 open healthcare datasets containing both categorical and numerical data were selected for experimental work. Synthetic data are generated using three popular synthetic data generators that apply Classification and Regression Trees (CART), parametric and Bayesian network approaches. Real and synthetic data are used (separately) to train five supervised machine learning (ML) models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest and support vector machine. Models are tested only on real data to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples. Evaluation metrics are computed and differentials in these scores are compared. The impact of statistical disclosure control on model performance is also assessed.
RESULTS: The accuracy of ML models trained on synthetic data is lower than that of models trained on real data in 92% of cases. Tree-based models trained on synthetic data deviate in accuracy from models trained on real data by 17.7-19.3%, whilst other models have lower deviations of 5.8-7.2%. The winning classifier for models trained and tested on real data is the same as the winning classifier for models trained on synthetic data and tested on real data in 26.3% of cases for CART and parametric synthetic data, and in 21.1% of cases for Bayesian network generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 94.7% of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 73.7%, 52.6% and 68.4% of cases for CART, parametric and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.
CONCLUSIONS: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared to models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of its robustness. Synthetic data must ensure that individual privacy and data utility are preserved in order to instil confidence in healthcare departments when utilising such data to inform policy decision-making.
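The comparison described in the abstract (accuracy differentials between real-trained and synthetic-trained models, and how often the winning classifier matches) could be tabulated roughly as in the sketch below. This is an illustration only, not the authors' code: the function names are hypothetical, and the accuracy values are made-up placeholders rather than results from the study. Each dictionary is assumed to hold one dataset's accuracies, measured on real test data.

```python
from typing import Dict, FrozenSet, List

Scores = Dict[str, float]  # classifier name -> accuracy on real test data

# Classifier names are illustrative labels for the five model families.
TREE_BASED: FrozenSet[str] = frozenset({"decision_tree", "random_forest"})


def winner(scores: Scores, exclude: FrozenSet[str] = frozenset()) -> str:
    """Return the classifier with the highest accuracy, ignoring `exclude`."""
    candidates = {name: acc for name, acc in scores.items() if name not in exclude}
    return max(candidates, key=candidates.get)


def accuracy_differentials(real: Scores, synthetic: Scores) -> Scores:
    """Per-classifier accuracy drop: real-trained minus synthetic-trained."""
    return {name: real[name] - synthetic[name] for name in real}


def winner_match_rate(
    real_runs: List[Scores],
    synth_runs: List[Scores],
    exclude: FrozenSet[str] = frozenset(),
) -> float:
    """Fraction of datasets where both training regimes pick the same winner."""
    matches = sum(
        winner(r, exclude) == winner(s, exclude)
        for r, s in zip(real_runs, synth_runs)
    )
    return matches / len(real_runs)


# Placeholder accuracies for two hypothetical datasets (NOT study results).
real_runs = [
    {"sgd": 0.80, "decision_tree": 0.90, "knn": 0.82, "random_forest": 0.92, "svm": 0.84},
    {"sgd": 0.70, "decision_tree": 0.88, "knn": 0.75, "random_forest": 0.86, "svm": 0.74},
]
synth_runs = [
    {"sgd": 0.75, "decision_tree": 0.72, "knn": 0.76, "random_forest": 0.74, "svm": 0.78},
    {"sgd": 0.66, "decision_tree": 0.70, "knn": 0.69, "random_forest": 0.68, "svm": 0.67},
]

diffs = accuracy_differentials(real_runs[0], synth_runs[0])
rate_all = winner_match_rate(real_runs, synth_runs)
rate_no_trees = winner_match_rate(real_runs, synth_runs, exclude=TREE_BASED)

print({name: round(d, 2) for name, d in diffs.items()})
print(rate_all, rate_no_trees)
```

In this toy input the tree-based models lose the most accuracy when trained on synthetic data, so the winners agree more often once they are excluded. This mirrors the kind of pattern the abstract reports, without reproducing its figures.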
Authors: Ravi Aggarwal; Viknesh Sounderajah; Guy Martin; Daniel S W Ting; Alan Karthikesalingam; Dominic King; Hutan Ashrafian; Ara Darzi Journal: NPJ Digit Med Date: 2021-04-07