Yancheng Shi, Zhenjiang Zhang, Han-Chieh Chao, Bo Shen.
Abstract
With the rapid development of information technology, large-scale personal data, including data collected by sensors or IoT devices, is stored in the cloud or in data centers. In some cases, the owners of the cloud or data centers need to publish the data. How to make the best use of the data while limiting the risk of personal information leakage has therefore become a popular research topic. The most common method of data privacy protection is data anonymization, which has two main problems: (1) the availability of information is reduced after clustering and cannot be flexibly adjusted; (2) most methods are static, so releasing the data multiple times can leak personal privacy. To address these problems, this article makes two contributions. The first is a new clustering method based on micro-aggregation, in which data availability and privacy protection can be adjusted flexibly by considering distance and information entropy. The second is a dynamic update mechanism that guarantees that individual privacy is not compromised after the data has been released multiple times, while minimizing the loss of information. Finally, the algorithm is simulated on real data sets. The availability and advantages of the method are demonstrated by measuring the run time, the average information loss, and the amount of forged data.
Keywords: dynamic update; micro aggregation; privacy protection; sensitive attributes
Year: 2018 PMID: 30013012 PMCID: PMC6068819 DOI: 10.3390/s18072307
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Diagnosis records and anonymous generalization result.
| Name | Age | Zip | Disease | GID | Age (generalized) | Zip (generalized) | Disease |
|---|---|---|---|---|---|---|---|
| Bob | 21 | 12,000 | dyspepsia | 1 | (21,22) | 12–14 k | dyspepsia |
| Alice | 22 | 14,000 | bronchitis | 1 | (21,22) | 12–14 k | bronchitis |
| Andy | 24 | 18,000 | flu | 2 | (23,24) | 18–25 k | flu |
| David | 23 | 25,000 | gastritis | 2 | (23,24) | 18–25 k | gastritis |
| Gary | 41 | 20,000 | flu | 3 | (36,41) | 20–27 k | flu |
| Helen | 36 | 27,000 | gastritis | 3 | (36,41) | 20–27 k | gastritis |
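The generalization step shown in the table above can be sketched as follows. This is a minimal illustration, not the paper's micro-aggregation clustering: the grouping is assumed to be given (consecutive rows in table order, with a hypothetical group size `k = 2`), and the function name `generalize` is introduced here only for the example. Each group's age and zip values are replaced by their min-max range and the name is dropped.

```python
# Records from the table above: (name, age, zip, disease).
records = [
    ("Bob", 21, 12000, "dyspepsia"),
    ("Alice", 22, 14000, "bronchitis"),
    ("Andy", 24, 18000, "flu"),
    ("David", 23, 25000, "gastritis"),
    ("Gary", 41, 20000, "flu"),
    ("Helen", 36, 27000, "gastritis"),
]


def generalize(records, k=2):
    """Assumed grouping: consecutive chunks of k records (hypothetical).

    Replaces age and zip with the group's min-max range and drops the name.
    """
    out = []
    for gid, start in enumerate(range(0, len(records), k), start=1):
        group = records[start:start + k]
        ages = [r[1] for r in group]
        zips = [r[2] for r in group]
        age_range = (min(ages), max(ages))
        # Zip ranges are written in thousands, as in the table.
        zip_range = f"{min(zips) // 1000}–{max(zips) // 1000} k"
        for _, _, _, disease in group:
            out.append((gid, age_range, zip_range, disease))
    return out


generalized = generalize(records)
for row in generalized:
    print(row)
```

With the six records above, the rows produced match the right-hand half of the table, for example group 2 becomes ages `(23, 24)` and zip range `18–25 k`.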
Nominal attributes matrix.
| Religion | | | |
|---|---|---|---|
| Buddhism | | 0 | 0 |
| Catholicism | 0 | | 0 |
| Islam | 0 | 0 | |
The original equivalent group and the cache table W.
| GID | Age | Zip | Disease | Cache GID | Cache Age | Cache Zip | Cache Disease |
|---|---|---|---|---|---|---|---|
| 1 | (21,22) | 12–14 k | dyspepsia | 1 | x1 | xx000 | dyspepsia |
| 1 | (21,22) | 12–14 k | bronchitis | 1 | x2 | xx000 | bronchitis |
| 2 | (23,24) | 18–25 k | flu | … | … | … | … |
| 2 | (23,24) | 18–25 k | gastritis | | | | |
The equivalent group and the cache table W (when t1 is added).
| GID | Age | Zip | Disease | Cache GID | Cache Age | Cache Zip | Cache Disease |
|---|---|---|---|---|---|---|---|
| 1 | (21,22) | 12–14 k | dyspepsia | 1 | x1 | xx000 | dyspepsia |
| 1 | (21,22) | 12–14 k | bronchitis | 1 | x2 | xx000 | bronchitis |
| 2 | (23,25) | 18–27 k | flu | 1 | 25 | 27,000 | flu |
| 2 | (23,25) | 18–27 k | gastritis | … | … | … | … |
| 2 | (23,25) | 18–27 k | flu | | | | |
The equivalent group and the cache table W (when t2 is added).
| GID | Age | Zip | Disease | Cache GID | Cache Age | Cache Zip | Cache Disease |
|---|---|---|---|---|---|---|---|
| 1 | (21,22) | 12–14 k | dyspepsia | 1 | x1 | xx000 | dyspepsia |
| 1 | (21,22) | 12–14 k | bronchitis | 1 | x2 | xx000 | bronchitis |
| 2 | (23,24) | 18–25 k | flu | 2 | 25 | 27,000 | flu |
| 2 | (23,24) | 18–25 k | gastritis | 2 | 26 | 29,000 | headache |
| 3 | (25,26) | 27–29 k | flu | … | … | … | … |
| 3 | (25,26) | 27–29 k | headache | | | | |
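The tables for t1 and t2 do not spell out the update rule, so the following is only one possible reading of them, written as a small sketch. It assumes that a newly arrived record is temporarily published inside its nearest existing group (widening that group's ranges) and that, once `K` such records have accumulated, they are split off into a new group. The names `K`, `pending`, `nearest_group`, `add_record`, and `publish`, as well as the age-based distance, are hypothetical, and `pending` is a simplified stand-in for the role played by the cache table W.

```python
K = 2  # assumed minimum group size (hypothetical parameter)


def span(values):
    return (min(values), max(values))


def nearest_group(groups, rec):
    """Assumed distance: gap between the record's age and the group's mean age."""
    def dist(gid):
        members = groups[gid]
        return abs(sum(r[0] for r in members) / len(members) - rec[0])
    return min(groups, key=dist)


def add_record(groups, pending, rec):
    """Hold rec with its host group; split off a new group once K are held."""
    pending.append((nearest_group(groups, rec), rec))
    if len(pending) >= K:
        groups[max(groups) + 1] = [r for _, r in pending]
        pending.clear()


def publish(groups, pending):
    """Published rows; held records appear inside their host group."""
    rows = []
    for gid, members in sorted(groups.items()):
        shown = members + [r for host, r in pending if host == gid]
        ages = span([r[0] for r in shown])
        zips = span([r[1] for r in shown])
        rows += [(gid, ages, zips, r[2]) for r in shown]
    return rows


# Records are (age, zip, disease), taken from the tables above.
groups = {
    1: [(21, 12000, "dyspepsia"), (22, 14000, "bronchitis")],
    2: [(24, 18000, "flu"), (23, 25000, "gastritis")],
}
pending = []

add_record(groups, pending, (25, 27000, "flu"))       # t1
after_t1 = publish(groups, pending)
add_record(groups, pending, (26, 29000, "headache"))  # t2
after_t2 = publish(groups, pending)
```

Under these assumptions, `after_t1` widens group 2 to ages (23, 25) and zips 18,000 to 27,000, and `after_t2` restores group 2 and adds a group 3 with ages (25, 26), which agrees with the equivalent-group columns of the two tables. Unlike this sketch, the cache table W in the tables keeps its entries after t2.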
Quasi identifier attributes and the result set.
| GID | Age | Zip | GID | Disease | Percentage |
|---|---|---|---|---|---|
| 1 | (21,22) | 12–14 k | 1 | dyspepsia | a% |
| 1 | (21,22) | 12–14 k | 1 | bronchitis | b% |
| 2 | (23,24) | 18–25 k | 2 | flu | c% |
| 2 | (23,24) | 18–25 k | … | … | … |
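The table gives the percentages only as placeholders (a%, b%, c%). As a hedged illustration, the sketch below assumes that "Percentage" means the share of each disease within its group; the function name `result_set` and the sample rows (the equivalent group after t1 is added) are chosen here only for the example.

```python
from collections import Counter

# (GID, disease) pairs; illustrative rows from the equivalent group after t1.
rows = [
    (1, "dyspepsia"),
    (1, "bronchitis"),
    (2, "flu"),
    (2, "gastritis"),
    (2, "flu"),
]


def result_set(rows):
    """Return (GID, disease, percentage) with the assumed per-group share."""
    by_group = {}
    for gid, disease in rows:
        by_group.setdefault(gid, Counter())[disease] += 1
    out = []
    for gid, counts in sorted(by_group.items()):
        total = sum(counts.values())
        for disease, n in counts.items():
            out.append((gid, disease, 100.0 * n / total))
    return out


shares = result_set(rows)
for gid, disease, pct in shares:
    print(gid, disease, f"{pct:.1f}%")
```

Publishing the quasi-identifier ranges and the per-group disease shares separately means a reader sees only the group-level distribution rather than which row carries which disease.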
Figure 1. Clustering run time.
Figure 2. Average information loss.
Figure 3. Forged data comparison.