Hong-Jun Yoon (1), Christopher Stanley (1), J Blair Christian (1), Hilda B Klasky (1), Andrew E Blanchard (1), Eric B Durbin (2), Xiao-Cheng Wu (3), Antoinette Stroup (4), Jennifer Doherty (5), Stephen M Schwartz (6), Charles Wiggins (7), Mark Damesyn (8), Linda Coyle (9), Georgia D Tourassi (10).
1. Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
2. College of Medicine, University of Kentucky, Lexington, KY, USA.
3. Louisiana State University Health Sciences Center, School of Public Health, New Orleans, LA, USA.
4. Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA.
5. Huntsman Cancer Institute, University of Utah, Salt Lake City, UT, USA.
6. Fred Hutchinson Cancer Research Center, Epidemiology Program, Seattle, WA, USA.
7. University of New Mexico, Albuquerque, NM, USA.
8. California Department of Public Health, Sacramento, CA, USA.
9. Information Management Services Inc., Calverton, MD, USA.
10. National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
Abstract
BACKGROUND: As artificial intelligence and machine learning techniques are adopted in biomedical informatics, the security and privacy of the data and of subject identities have become an important issue and an essential research topic. Without intentional safeguards, machine learning models may improve task performance by learning patterns and features that are associated with private personal information.
OBJECTIVE: Deep learning models for information extraction from medical text are exposed to private health information and personally identifiable information, so their privacy vulnerability needs to be quantified. The objective of this study is to quantify the privacy vulnerability of deep learning models for natural language processing and to explore a proper way of securing patients' information to mitigate confidentiality breaches.
METHODS: The target model is a multitask convolutional neural network for information extraction from cancer pathology reports, trained on data from multiple state population-based cancer registries. This study proposes two schemes for collecting vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments.
RESULTS: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
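The two vocabulary selection schemes named in METHODS can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the `min_registries` threshold, the choice to score mutual information between word presence and a single report label, and the toy reports and labels are all assumptions made for illustration. The abstract does not state how words were tokenized, which task labels were used, or where the cutoffs were set.

```python
import math
from collections import Counter, defaultdict


def words_in_multiple_registries(docs_by_registry, min_registries=2):
    """Scheme (a), sketched: keep words seen in at least `min_registries` registries.

    `docs_by_registry` maps a registry name to a list of report texts.
    Whitespace tokenization and the threshold are assumptions.
    """
    registry_count = Counter()
    for docs in docs_by_registry.values():
        seen = set()
        for text in docs:
            seen.update(text.lower().split())
        registry_count.update(seen)  # each registry counts a word at most once
    return {w for w, n in registry_count.items() if n >= min_registries}


def mutual_information(docs, labels):
    """Scheme (b), sketched: MI in nats between word presence and the label."""
    n = len(docs)
    label_count = Counter(labels)
    word_label = defaultdict(Counter)  # word -> label -> docs containing the word
    for text, y in zip(docs, labels):
        for w in set(text.lower().split()):
            word_label[w][y] += 1

    scores = {}
    for w, per_label in word_label.items():
        n_w = sum(per_label.values())  # docs containing the word
        mi = 0.0
        for y, n_y in label_count.items():
            # Sum over the two presence states: word present, word absent.
            for n_joint, n_marg in ((per_label[y], n_w),
                                    (n_y - per_label[y], n - n_w)):
                if n_joint > 0:
                    mi += (n_joint / n) * math.log(n_joint * n / (n_marg * n_y))
        scores[w] = mi
    return scores


# Hypothetical toy reports; "smith" and "jones" stand in for identifying tokens.
docs_by_registry = {
    "registry_a": ["invasive ductal carcinoma smith", "benign tissue"],
    "registry_b": ["ductal carcinoma in situ", "benign lesion jones"],
}
shared = words_in_multiple_registries(docs_by_registry)

docs = [d for reports in docs_by_registry.values() for d in reports]
labels = ["carcinoma", "benign", "carcinoma", "benign"]  # hypothetical labels
scores = mutual_information(docs, labels)
```

In this toy data, a token that appears in only one registry (such as a surname) is dropped by scheme (a) and scores lower than a task-relevant term under scheme (b), which is the intuition behind using vocabulary selection to limit what the model can memorize.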
Keywords:
Privacy; artificial intelligence; cancer epidemiology; deep learning; natural language processing; privacy-preserving training