Reliability of Supervised Machine Learning Using Synthetic Data in Healthcare: A Model to Preserve Privacy for Data Sharing.

Debbie Rankin, Michaela Black, Raymond Bond, Jonathan Wallace, Maurice Mulvenna, Gorka Epelde.

Abstract

BACKGROUND: The exploitation of synthetic data in healthcare is at an early stage. Synthetic data generation could unlock the vast potential within healthcare datasets that are too sensitive for release due to privacy concerns. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalisability are scarce.
OBJECTIVE: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.
METHODS: A total of 19 open healthcare datasets containing both categorical and numerical data were selected for experimental work. Synthetic data is generated using three popular synthetic data generators that apply Classification and Regression Trees (CART), parametric and Bayesian network approaches. Real and synthetic data are used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest and support vector machine. Models are tested only on real data to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples. Evaluation metrics are computed and differentials in these scores are compared. The impact of statistical disclosure control on model performance is also assessed.
RESULTS: The accuracy of ML models trained on synthetic data is lower than that of models trained on real data in 92% of cases. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 17.7-19.3%, whilst other models have lower deviations of 5.8-7.2%. The winning classifier among models trained and tested on real data matches the winning classifier among models trained on synthetic data and tested on real data in 26.3% of cases for CART and parametric synthetic data, and in 21.1% of cases for Bayesian network generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 94.7% of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 73.7%, 52.6% and 68.4% of cases for CART, parametric and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.
CONCLUSIONS: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared to models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of its robustness. Synthetic data must ensure that individual privacy and data utility are preserved in order to instil confidence in healthcare departments when utilising such data to inform policy decision-making.
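
The comparison described in METHODS and RESULTS can be illustrated with a minimal sketch. It assumes that, for each dataset, test-on-real accuracies are already available for every model trained on real data and on synthetic data. The function names, the model keys and every number below are hypothetical and are not taken from the study; treating "deviation" as an absolute difference in percentage points is also an assumption.

```python
# Illustrative only: all accuracies below are made-up values, not study results.

def accuracy_deviation(real_acc, synth_acc):
    """Absolute accuracy difference per model, in percentage points (assumed definition)."""
    return {m: abs(real_acc[m] - synth_acc[m]) * 100 for m in real_acc}

def winner(acc):
    """Model with the highest accuracy; ties go to the first model listed."""
    return max(acc, key=acc.get)

def winner_match_rate(results, exclude=()):
    """Percentage of datasets where the real-data and synthetic-data winners agree."""
    matches = 0
    for real_acc, synth_acc in results:
        real = {m: a for m, a in real_acc.items() if m not in exclude}
        synth = {m: a for m, a in synth_acc.items() if m not in exclude}
        matches += winner(real) == winner(synth)
    return 100 * matches / len(results)

# One (trained-on-real, trained-on-synthetic) pair per dataset; both tested on real data.
results = [
    ({"dt": 0.90, "rf": 0.92, "knn": 0.80, "svm": 0.82, "sgd": 0.78},
     {"dt": 0.72, "rf": 0.74, "knn": 0.76, "svm": 0.77, "sgd": 0.73}),
    ({"dt": 0.85, "rf": 0.88, "knn": 0.81, "svm": 0.79, "sgd": 0.75},
     {"dt": 0.70, "rf": 0.69, "knn": 0.74, "svm": 0.75, "sgd": 0.70}),
]

deviations = accuracy_deviation(*results[0])
match_all = winner_match_rate(results)
match_no_trees = winner_match_rate(results, exclude=("dt", "rf"))
```

In this toy input the tree-based models win on real data but not on synthetic data, so the winners never match until `dt` and `rf` are excluded, which mirrors the pattern the abstract reports without reproducing its figures.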

Year:  2020        PMID: 32501278     DOI: 10.2196/18910

Source DB:  PubMed          Journal:  JMIR Med Inform

