Sunyong Yoo, Moonshik Shin, Doheon Lee.
Abstract
Electronic Health Records (EHRs) enable the sharing of patients' medical data. Since EHRs include patients' private data, access by researchers is restricted. Therefore, k-anonymity is necessary to keep patients' private data safe without damaging useful medical information. However, k-anonymity cannot prevent sensitive attribute disclosure. An alternative, l-diversity, has been proposed as a solution to this problem and is defined as follows: each Q-block (ie, each set of rows corresponding to the same value for identifiers) contains at least l well-represented values for each sensitive attribute. While l-diversity protects against sensitive attribute disclosure, it is limited in that it focuses only on diversifying sensitive attributes. The aim of the study is to develop a k-anonymity method that not only minimizes information loss but also achieves diversity of the sensitive attribute. This paper proposes a new privacy protection method that uses conditional entropy and mutual information. This method considers both information loss and the diversity of sensitive attributes. Conditional entropy measures the information loss caused by generalization, and mutual information is used to achieve the diversity of sensitive attributes. This method can offer appropriate Q-blocks for generalization. We used the adult database from the UCI Machine Learning Repository and found that the proposed method can greatly reduce information loss compared with a recent l-diversity study. It can also achieve the diversity of sensitive attributes by counting the number of Q-blocks that have leaks of diversity. This study provides a privacy protection method that can improve data utility and protect against sensitive attribute disclosure. The method is viable and should be of interest for further privacy protection in EHR applications.
Keywords: Conditional entropy; Information loss; Mutual information; k-anonymity; l-diversity
Year: 2012; PMID: 23612074; PMCID: PMC3626125; DOI: 10.2196/ijmr.2140
Source DB: PubMed; Journal: Interact J Med Res; ISSN: 1929-073X
An example of an original data table.
| Index | Work (QI) | Country (QI) | Disease (sensitive) |
| --- | --- | --- | --- |
| 1 | Private | USA | Heart Disease |
| 2 | State-gov | Mexico | Cancer |
| 3 | Local-gov | Brazil | Cancer |
| 4 | Federal-gov | USA | Flu |
| 5 | Private | Canada | Heart Disease |
| 6 | Self-emp-not-inc | Canada | Heart Disease |
| 7 | Self-emp-inc | USA | Flu |
| 8 | Private | USA | Heart Disease |
| 9 | State-gov | Mexico | Flu |
An example of a 3-anonymous data table after generalization.
| Index | Work (QI) | Country (QI) | Disease (sensitive) |
| --- | --- | --- | --- |
| 1 | Private | North | Heart Disease |
| 5 | Private | North | Heart Disease |
| 8 | Private | North | Heart Disease |
| 2 | Government | South | Cancer |
| 3 | Government | South | Cancer |
| 9 | Government | South | Flu |
| 4 | Workclass | North | Flu |
| 6 | Workclass | North | Heart Disease |
| 7 | Workclass | North | Flu |
Therefore, k-anonymity is defined as: Let D denote the original data table and D* denote a release candidate of D produced by the generalization. Given a set of QI attributes Q_1,…,Q_d, release candidate D* is said to be k-anonymous with respect to Q_1,…,Q_d if each unique tuple in the projection of D* on Q_1,…,Q_d occurs at least k times.
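The definition above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it hard-codes the 3-anonymous example table, and the names `TABLE`, `is_k_anonymous`, and `distinct_sensitive_per_block` are hypothetical helpers introduced only for illustration. It also counts the distinct sensitive values in each Q-block, which is one simple way to see why k-anonymity alone does not prevent sensitive attribute disclosure.

```python
from collections import Counter, defaultdict

# Rows of the 3-anonymous example table: (index, work, country, disease).
TABLE = [
    (1, "Private", "North", "Heart Disease"),
    (5, "Private", "North", "Heart Disease"),
    (8, "Private", "North", "Heart Disease"),
    (2, "Government", "South", "Cancer"),
    (3, "Government", "South", "Cancer"),
    (9, "Government", "South", "Flu"),
    (4, "Workclass", "North", "Flu"),
    (6, "Workclass", "North", "Heart Disease"),
    (7, "Workclass", "North", "Flu"),
]


def is_k_anonymous(rows, k):
    """True if every QI tuple (work, country) occurs at least k times."""
    counts = Counter((work, country) for _, work, country, _ in rows)
    return all(count >= k for count in counts.values())


def distinct_sensitive_per_block(rows):
    """Map each Q-block (QI tuple) to its number of distinct diseases."""
    blocks = defaultdict(set)
    for _, work, country, disease in rows:
        blocks[(work, country)].add(disease)
    return {qi: len(diseases) for qi, diseases in blocks.items()}


if __name__ == "__main__":
    print("3-anonymous:", is_k_anonymous(TABLE, 3))
    for qi, n_values in distinct_sensitive_per_block(TABLE).items():
        print(qi, "distinct sensitive values:", n_values)
```

In this table the block (Private, North) contains a single disease value, so anyone known to be in that block is revealed to have heart disease even though the table is 3-anonymous.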
An example of a 3-diverse data table.
| Index | Work (QI) | Country (QI) | Disease (sensitive) |
| --- | --- | --- | --- |
| 1 | Workclass | America | Heart Disease |
| 3 | Workclass | America | Cancer |
| 7 | Workclass | America | Flu |
| 2 | Workclass | America | Cancer |
| 8 | Workclass | America | Heart Disease |
| 9 | Workclass | America | Flu |
| 4 | Workclass | North | Flu |
| 5 | Workclass | North | Heart Disease |
| 6 | Workclass | North | Heart Disease |
Figure 1. Equations (1) to (8).
Figure 2. Individual conditional entropies and mutual information for a pair of correlated subsystems.
Data table showing the generalized QI attributes and sensitive attributes for the first and second instances, used to explain conditional entropy and mutual information.
| Index | Original QI: Work | Original QI: Country | Generalized QI: Work | Generalized QI: Country | Disease (sensitive) |
| --- | --- | --- | --- | --- | --- |
| 1 | Private | USA | Workclass | America | Heart Disease |
| 2 | State-gov | Mexico | Workclass | America | Cancer |
| 3 | Local-gov | Brazil | Local-gov | Brazil | Cancer |
| 4 | Federal-gov | USA | Federal-gov | USA | Flu |
| 5 | Private | Canada | Private | Canada | Heart Disease |
| 6 | Self-emp-not-inc | Canada | Self-emp-not-inc | Canada | Heart Disease |
| 7 | Self-emp-inc | USA | Self-emp-inc | USA | Flu |
| 8 | Private | USA | Private | USA | Heart Disease |
| 9 | State-gov | Mexico | State-gov | Mexico | Flu |
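The two quantities can be illustrated on the table above. The paper's own formulas are in Figure 1 and are not reproduced here, so this sketch assumes the standard Shannon definitions: information loss is taken as H(original QI | generalized QI), and the link between QI and sensitive attribute as I(disease; generalized QI) = H(disease) - H(disease | generalized QI). Both choices, and the names `ROWS`, `entropy`, `conditional_entropy`, and `mutual_information`, are illustrative assumptions and may differ from the authors' exact formulation.

```python
from collections import Counter, defaultdict
from math import log2

# Rows of the table above: (original QI, generalized QI, disease).
ROWS = [
    (("Private", "USA"), ("Workclass", "America"), "Heart Disease"),
    (("State-gov", "Mexico"), ("Workclass", "America"), "Cancer"),
    (("Local-gov", "Brazil"), ("Local-gov", "Brazil"), "Cancer"),
    (("Federal-gov", "USA"), ("Federal-gov", "USA"), "Flu"),
    (("Private", "Canada"), ("Private", "Canada"), "Heart Disease"),
    (("Self-emp-not-inc", "Canada"), ("Self-emp-not-inc", "Canada"), "Heart Disease"),
    (("Self-emp-inc", "USA"), ("Self-emp-inc", "USA"), "Flu"),
    (("Private", "USA"), ("Private", "USA"), "Heart Disease"),
    (("State-gov", "Mexico"), ("State-gov", "Mexico"), "Flu"),
]


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(xs, ys):
    """H(X | Y): entropy of X within each group of equal Y, weighted by group size."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n = len(xs)
    return sum(len(group) / n * entropy(group) for group in groups.values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)


original = [row[0] for row in ROWS]
generalized = [row[1] for row in ROWS]
disease = [row[2] for row in ROWS]

# Assumed information-loss measure: uncertainty about the original QI
# values that remains once only the generalized values are known.
info_loss = conditional_entropy(original, generalized)

# Assumed diversity measure: how much the generalized QI reveals about
# the sensitive attribute (lower means more diverse Q-blocks).
qi_disease_mi = mutual_information(disease, generalized)

if __name__ == "__main__":
    print(f"H(original | generalized) = {info_loss:.4f} bits")
    print(f"I(disease; generalized)   = {qi_disease_mi:.4f} bits")
```

Only the first two instances share a generalized QI tuple, so they are the only rows that contribute to the conditional entropy: that block holds two equally likely original tuples (1 bit) and covers 2 of the 9 rows, giving 2/9 bits.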
Figure 3. Simplified concept of the proposed method.
Figure 4. Comparison of total information loss with respect to the number of instances.
Figure 5. Comparison of the number of Q-blocks with l=1 (homogeneity attack), l=2 (background knowledge attack), and l=3 (safe), used to measure diversity (the Q-block size is set to 3).
Figure 6. Comparison of execution time with respect to the number of instances.