
Issues and Solutions of Healthcare Data De-identification: the Case of South Korea.

Soo Yong Shin


Year:  2018        PMID: 29349950      PMCID: PMC5773854          DOI: 10.3346/jkms.2018.33.e41

Source DB:  PubMed          Journal:  J Korean Med Sci        ISSN: 1011-8934            Impact factor:   2.153


Artificial intelligence (AI) has been highlighted as a mechanism to realize precision medicine because it contributes to analyzing healthcare big data.1,2 Among diverse AI methods, machine learning (ML) methods, including deep learning algorithms, are especially widely applied to analyzing healthcare data.2 By its nature, ML requires a vast amount of data, which means that collecting as much relevant data as possible is a critical task. The Precision Medicine Initiative3 and Observational Health Data Sciences and Informatics (OHDSI)4 might be representative cases of collecting healthcare big data. However, since healthcare data contain the most sensitive personal information, concerns about protecting patients' privacy are increasing. In Korea, the Personal Information Protection Act was passed to protect privacy, and the Bioethics and Safety Act was passed to prevent the unauthorized use of patients' health information. Under these two regulations, researchers should obtain informed consent from each research participant. However, it is almost impossible to obtain written consent when research requires a large number of participants. An alternative is de-identification, an effective method to protect privacy while complying with the regulations.5 Diverse de-identification methods have been developed for clinical texts and images,6,7 and the Korean government published a guideline for the de-identification of personal data in 2016.8 This guideline was developed to provide clear methods for de-identifying personal data and to define the scope of use of de-identified data.
The guideline proposes four steps for de-identifying personal data: 1) the preliminary review step verifies whether specific data are personally identifiable; 2) the de-identification step makes individuals unidentifiable using the necessary methods; 3) the adequacy assessment step assesses whether de-identified data can be re-identified; and 4) the follow-up management step monitors the possibility of re-identification. The guideline tries to cover all possible personal data, including financial, commerce, communication, and healthcare data. Unfortunately, researchers and companies in the biomedical field face major challenges in complying with it, since the characteristics of biomedical data differ from those of other industries. Financial or commerce data consist of repetitive patterns of transactions with a relatively small number of features. Biomedical data, by contrast, have far more diverse features, for example, extensive laboratory test results and treatments. In addition, biomedical data include structured code data as well as unstructured data such as text, images, and videos. This implies that the published guideline is not suitable for the biomedical area. Here, we criticize the current Korean regulation on de-identification and raise several issues regarding the de-identification of biomedical data.

First, the published guideline demands "k-anonymity" as the mandatory privacy protection method.8 K-anonymity is a well-established method to protect privacy9 and easily provides a quantitative measure of privacy protection; notably, the US Family Educational Rights and Privacy Act adopts it.10 K-anonymity requires that every record in a dataset be indistinguishable from at least k − 1 other records. However, k-anonymity is difficult to achieve in healthcare datasets because the raw data must be distorted.
For example, if we decide to de-identify 1,000 patients' clinical data consisting of 5 different laboratory test results while keeping 5-anonymity, every combination of the 5 test results must be shared by at least 5 patients. Therefore, all laboratory results should be generalized by replacing individual attributes with broader categories, as in Tables 1 and 2. As shown in Table 2, the modified data lost all the detailed information that is essential for analysis. Moreover, clinical images such as computed tomography (CT) or magnetic resonance imaging (MRI) scans can never be identical across patients; therefore, image data cannot satisfy k-anonymity. Alternative rules need to be suggested in a revised or new guideline.
Table 1

Example of k-anonymity: original dataset

Patients | WBC, × 10³/µL | Hb, g/dL | AST (SGOT), IU/L | ALT (SGPT), IU/L | Cholesterol, mg/dL
Patient 1 | 5.6 | 17.0 | 39 | 64 | 199
Patient 2 | 5.4 | 17.5 | 44 | 67 | 173
Patient 3 | 4.2 | 16.4 | 28 | 58 | 179
Patient 4 | 4.7 | 16.1 | 36 | 64 | 180
Patient 5 | 6.1 | 18.4 | 101 | 151 | 231
Patient 6 | 7.5 | 15.6 | 33 | 42 | 195
Patient 7 | 8.2 | 17.1 | 35 | 54 | 175

WBC = white blood cell, Hb = hemoglobin, AST = aspartate aminotransferase, SGOT = serum glutamic oxaloacetic transaminase, ALT = alanine aminotransferase, SGPT = serum glutamic pyruvic transaminase.

Table 2

Example of k-anonymity: modified dataset for 5-anonymity

Patients | WBC, × 10³/µL | Hb, g/dL | AST (SGOT), IU/L | ALT (SGPT), IU/L | Cholesterol, mg/dL
Patient 1 | Normal | Normal | Normal | > 40 | Normal
Patient 2 | Normal | > 17.0 | > 40 | > 40 | Normal
Patient 3 | Normal | Normal | Normal | > 40 | Normal
Patient 4 | Normal | Normal | Normal | > 40 | Normal
Patient 5 | Normal | > 17.0 | > 40 | > 40 | > 200
Patient 6 | Normal | Normal | Normal | > 40 | Normal
Patient 7 | Normal | Normal | Normal | > 40 | Normal

Patients 1, 3, 4, 6, and 7 share identical records so as to satisfy 5-anonymity. As a result, all data were distorted.

WBC = white blood cell, Hb = hemoglobin, AST = aspartate aminotransferase, SGOT = serum glutamic oxaloacetic transaminase, ALT = alanine aminotransferase, SGPT = serum glutamic pyruvic transaminase.
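The generalization illustrated in Tables 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the guideline's procedure; the cutoff values are hypothetical, chosen only to mimic Table 2's categories:

```python
from collections import Counter

def generalize(record, cutoffs):
    """Replace each numeric value with a coarse category ('Normal' vs 'High')."""
    return tuple("High" if value > cut else "Normal"
                 for value, cut in zip(record, cutoffs))

def is_k_anonymous(records, k):
    """True if every distinct record occurs at least k times."""
    return all(count >= k for count in Counter(records).values())

# Values from Table 1: WBC, Hb, AST, ALT, cholesterol for Patients 1-7.
raw = [
    (5.6, 17.0, 39, 64, 199),
    (5.4, 17.5, 44, 67, 173),
    (4.2, 16.4, 28, 58, 179),
    (4.7, 16.1, 36, 64, 180),
    (6.1, 18.4, 101, 151, 231),
    (7.5, 15.6, 33, 42, 195),
    (8.2, 17.1, 35, 54, 175),
]
# Hypothetical cutoffs chosen only to roughly reproduce Table 2's categories.
cutoffs = (11.0, 17.2, 40, 40, 200)

generalized = [generalize(r, cutoffs) for r in raw]
print(is_k_anonymous(raw, 2))              # False: every raw row is unique
print(max(Counter(generalized).values()))  # 5: Patients 1, 3, 4, 6, 7 coincide
```

Note that even after this heavy generalization, only the largest equivalence class reaches 5 identical records; Patients 2 and 5 remain distinguishable, which illustrates how hard strict k-anonymity is to achieve without discarding most of the data's detail.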

Second, there is debate over the definition of personal information in the regulations. The Korean Personal Information Protection Act defines personal information as "information that pertains to a living person, including the full name, resident registration number, images, etc., by which the individual in question can be identified, (including information by which the individual in question cannot be identified but can be identified through simple combination with other information)." However, laypersons think all private information, including height and weight, should be protected. Under the above definition, height and weight cannot identify a specific individual in most cases; therefore, they are not personal information. Most personal information is not personally identifiable information. To resolve this debate, the term should be clarified as "personally identifiable information" instead of "personal information."

Third, the definition of re-identification is also unclear. Usually, re-identification implies that de-identified data are matched to a specific individual. However, there is no clear definition in the Korean regulations. Some argue that re-identification means finding the individual's identity, for example, a resident registration number, name, or phone number, while others argue that it includes finding the same entity across different databases even when the identity is not confirmed. Technically, the second interpretation should not be considered re-identification for healthcare research. For big data research, a researcher should be able to combine patient data from the electronic medical records of hospital A with those of hospital B. In this case, there must exist a key (the information used to find the same patient) to link the different databases. If linking two different databases counted as re-identification, big data research could not be performed without written consent.
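The kind of key-based linkage described above can be sketched as follows. This is a toy example with entirely hypothetical data; the pseudonymous keys, codes, and values are invented:

```python
# Linking de-identified records from two hospitals via a shared pseudonymous
# key, without learning who the patient is.
hospital_a = {  # pseudonym -> diagnosis codes (hypothetical)
    "p001": ["I10"],   # hypertension
    "p002": ["E11"],   # type 2 diabetes
}
hospital_b = {  # pseudonym -> lab results (hypothetical)
    "p001": {"HbA1c": 5.4},
    "p003": {"HbA1c": 8.1},
}

def link_records(a, b):
    """Combine records that share the same pseudonymous key."""
    return {key: (a[key], b[key]) for key in a.keys() & b.keys()}

linked = link_records(hospital_a, hospital_b)
print(linked)  # {'p001': (['I10'], {'HbA1c': 5.4})}
```

The key lets researchers follow the same (unnamed) patient across databases; whether producing such a linkage already counts as "re-identification" is exactly the ambiguity at issue here.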
In this regard, the regulations should be revised to include a precise definition of re-identification, for example, pointing out the exact identity of an individual by discovering the individual's name or phone number. Additionally, the concept of a "motivated intruder," used by the Information Commissioner's Office (ICO) of the United Kingdom, should be introduced.11 A motivated intruder is a layperson who can access resources such as the internet, libraries, and all public documents, but is not assumed to have any specialist knowledge such as computer hacking skills or domain expertise. This concept is important in clinical research since physicians can easily recognize their own patients despite de-identification.

Fourth, there is no list of personal health identifiers. Unfortunately, the definition of personal information includes the phrase "(including information by which the individual in question cannot be identified but can be identified through simple combination with other information)," which implies that all candidate identifiers need to be protected. Technically, personal identifiers can be categorized into direct and indirect identifiers. A direct identifier can uniquely identify the individual, for example, a name or address. An indirect identifier, or quasi-identifier, is the above-quoted kind of information: it cannot immediately identify an individual but may do so when linked with other identifiers. For example, we may guess an individual's identity by combining diverse indirect identifiers such as occupation, race, place of birth, and education. The problem is that indirect identifiers alone cannot identify an individual; whether a combination of them distinguishes an individual heavily depends on background knowledge about the target.
For example, the famous re-identification of Governor William Weld's medical information in 1997 was possible because he was a public figure with a highly publicized hospitalization.12 Therefore, a full list of personal health identifiers is indispensable for practical de-identification. The current Korean de-identification guideline suggests that each organization review and decide on the identifiers at its own risk. In contrast, the Health Insurance Portability and Accountability Act (HIPAA) in the US defines 18 categories of protected health information (PHI),5 an approach that can reduce the burden of the de-identification process.

Last, advances in technology have introduced new types of potential direct identifiers that have not yet been considered, e.g., facial images artificially reconstructed from skull CT images13 and genetic/genomic data.14 There is no clear answer as to whether we should treat artificially reconstructed facial images as normal full-face images. Furthermore, there is debate over whether genetic/genomic data are personally identifiable information. The re-identification of personal genome project data was possible,15 but a genetic genealogy database was used for the identification. This implies that DNA itself cannot reveal an individual's identity; to identify the individual, there must be a reference database such as a criminal DNA database. We need to discuss these emerging candidate identifiers in relation to the above issues.

De-identification is indispensable for big data research and open data, which can strengthen academic research and commercial solution development. To accelerate the development of healthcare AI, the aforementioned equivocal issues should be clearly resolved by continuously revising the guideline and regulations. Research on de-identification methods that comply with the regulations should then follow.
Unfortunately, healthcare data de-identification research is not popular in Korea, since most IT engineers and researchers cannot access clinical data in hospitals. However, privacy-preserving data mining and data de-identification methods have been widely applied in other areas, and promising methods for healthcare data de-identification include differential privacy16 and homomorphic encryption.17 To advance the research, as well as to comply with the regulations, ongoing efforts to revise the governmental regulations and to develop methods for privacy protection should proceed together through multidisciplinary collaboration among jurists, bioethicists, doctors, basic science researchers, and IT engineers.
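One of the promising methods mentioned above, differential privacy, can be illustrated with the classic Laplace mechanism for counting queries. This is a minimal sketch, not a method from this paper; the cohort values, threshold, and epsilon are made up:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(values, predicate, epsilon, rng):
    """Release a count under epsilon-differential privacy.
    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
ages = [34, 51, 47, 29, 63, 58, 41]  # toy cohort, hypothetical values
noisy = dp_count(ages, lambda a: a >= 50, epsilon=1.0, rng=rng)
print(noisy)  # close to the true count of 3, but randomized
```

Unlike k-anonymity, which distorts the records themselves, this approach perturbs only query answers, so the analytic detail of the raw data is preserved while any single patient's presence is statistically masked.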
References (11 in total; 10 shown)

1. Wilkinson C. Facial reconstruction: anatomical art or artistic anatomy? J Anat. 2010.
2. El Emam K, Dankar FK. Protecting privacy using k-anonymity. J Am Med Inform Assoc. 2008.
3. Ashley EA. The precision medicine initiative: a new national effort. JAMA. 2015.
4. Monteiro E, Costa C, Oliveira JL. A de-identification pipeline for ultrasound medical images in DICOM format. J Med Syst. 2017.
5. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013.
6. Chen JH, Asch SM. Machine learning and prediction in medicine: beyond the peak of inflated expectations. N Engl J Med. 2017.
7. Shin SY, Park YR, Shin Y, Choi HJ, Park J, Lyu Y, Lee MS, Choi CM, Kim WS, Lee JH. A de-identification method for bilingual clinical texts of various note types. J Korean Med Sci. 2014.
8. McGraw D. Building public trust in uses of Health Insurance Portability and Accountability Act de-identified data. J Am Med Inform Assoc. 2012.
9. Erlich Y, Williams JB, Glazer D, Yocum K, Farahany N, Olson M, Narayanan A, Stein LD, Witkowski JA, Kain RC. Redefining genomic privacy: trust and empowerment. PLoS Biol. 2014.
10. Park RW. Sharing clinical big data while protecting confidentiality and security: Observational Health Data Sciences and Informatics. Healthc Inform Res. 2017.
