Youzhe Heng, Frederik Armknecht, Yanling Chen, Rainer Schnell.
Abstract
Linking several databases containing information on the same person is an essential step of many data workflows. Due to the potential sensitivity of the data, the identity of the persons should be kept private. Privacy-Preserving Record-Linkage (PPRL) techniques have been developed to link persons without violating their privacy, despite errors in the identifiers used to link the databases. The basic approach is to base the linkage decision on encoded quasi-identifiers instead of plain quasi-identifiers. Ideally, the encoded quasi-identifiers should prevent re-identification but still allow for good linkage quality. While several PPRL techniques have been proposed so far, Bloom filter-based PPRL schemes (BF-PPRL) are among the most popular due to their scalability. However, a recently proposed attack on BF-PPRL based on graph similarities seems to allow the re-identification of individuals from encoded quasi-identifiers. The graph matching attack is therefore widely considered a serious threat to many PPRL approaches, and BF-PPRL schemes have consequently been rejected as insecure. In this work, we argue that this view is not fully justified. We show by experiments that the success of graph matching attacks requires a high overlap between the encoded and plain records used for the attack. As soon as this condition is not fulfilled, the success rate decreases sharply, rendering the attacks hardly effective. This necessary condition severely limits the applicability of these attacks in practice and also allows for simple but effective countermeasures.
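
As a rough illustration of what "encoded quasi-identifiers" can mean in a Bloom filter-based scheme, the following minimal sketch maps the character bigrams of a name to bit positions and compares two encodings with the Dice coefficient. It is not the encoding used in the paper: the function names, the filter length of 1000, the 10 hash functions, and the unkeyed SHA-1/SHA-256 double hashing are all assumptions chosen for readability (a deployed scheme would typically use secret keys).

```python
import hashlib

def bigrams(value: str) -> set[str]:
    """Split a padded, lower-cased string into character 2-grams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(value: str, length: int = 1000, num_hashes: int = 10) -> set[int]:
    """Return the set of bit positions set to 1 for `value`.

    Illustrative double hashing with unkeyed digests (an assumption,
    not the scheme evaluated in the paper).
    """
    positions: set[int] = set()
    for gram in bigrams(value):
        h1 = int(hashlib.sha1(gram.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha256(gram.encode()).hexdigest(), 16)
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % length)
    return positions

def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Similar names share bigrams, so their encodings stay similar.
similar = dice(bloom_encode("smith"), bloom_encode("smyth"))
different = dice(bloom_encode("smith"), bloom_encode("jones"))
print(similar > different)
```

Because similar plain values yield similar encodings, linkage remains possible despite typing errors; the same property is what similarity-based attacks try to exploit.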
Year: 2022 PMID: 36137086 PMCID: PMC9499274 DOI: 10.1371/journal.pone.0267893
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Privacy-preserving record linkage process.
Fig 2. Graph matching attack description.
Fig 3. Flowchart of graph matching attack.
Features for nodes (CC: Connected Component; Avg: Average; Std: Standard Deviation).
| Node based | Edge based | Structural based |
|---|---|---|
| Frequency | Degree | CC Degree |
| Length | Max. Sim | CC Density |
| | Min. Sim | Betweenness Centrality |
| | Avg. Sim | Degree Centrality |
| | Std. Sim | … |
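
To make the degree and similarity features listed above (Degree, Max./Min./Avg./Std. Sim) more concrete, here is a minimal sketch that derives them for each node of a small similarity graph. The graph representation (a dictionary from node pairs to similarity values), the function name `edge_features`, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

def edge_features(edges: dict[tuple[str, str], float]) -> dict[str, dict[str, float]]:
    """Compute per-node degree and similarity statistics of incident edges."""
    incident: dict[str, list[float]] = {}
    for (u, v), sim in edges.items():
        # An undirected edge contributes its similarity to both endpoints.
        incident.setdefault(u, []).append(sim)
        incident.setdefault(v, []).append(sim)
    return {
        node: {
            "degree": len(sims),
            "max_sim": max(sims),
            "min_sim": min(sims),
            "avg_sim": mean(sims),
            "std_sim": pstdev(sims),  # population std. dev. (assumption)
        }
        for node, sims in incident.items()
    }

# Hypothetical toy graph: three records connected by pairwise similarities.
toy_edges = {("a", "b"): 0.9, ("a", "c"): 0.5, ("b", "c"): 0.7}
features = edge_features(toy_edges)
print(features["a"])
```

An attack of this kind would compute such feature vectors separately for the graph of encoded records and the graph of plain records, then match nodes whose vectors are close.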
Fig 4. Result and evaluation of the attack.
Each pair refers to the same individual in the example.
Fig 5. Accuracy of re-identification depending on the overlap for randomly selected records.
Fig 6. Accuracy of re-identification depending on the overlap for non-randomly selected records.
Accuracy of re-identification for databases which form proper subsets.
| | | | | | | |
|---|---|---|---|---|---|---|
| 5,000 | 10,000 | 66.7% | 0 | 0 | 0 | 0 |
| 3,000 | 10,000 | 46.2% | 0 | 0 | 0 | 0 |
| 1,000 | 10,000 | 18.2% | 0 | 0 | 0 | 0 |
| 10,000 | 5,000 | 66.7% | 0 | 0 | 0 | 0 |
| 10,000 | 3,000 | 46.2% | 0 | 0 | 0 | 0 |
| 10,000 | 1,000 | 18.2% | 0 | 0 | 0 | 0 |
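
The percentages in the third column are consistent with a Dice-style overlap, 2·|A∩B| / (|A| + |B|), when the smaller database is entirely contained in the larger one. The short check below reproduces them under that assumption; the function name `dice_overlap` is hypothetical, and the reading of the first two columns as database sizes is inferred from the table caption rather than from surviving column headers.

```python
def dice_overlap(size_a: int, size_b: int, shared: int) -> float:
    """Dice-style overlap of two record sets given their sizes and shared count."""
    return 2 * shared / (size_a + size_b)

# Proper-subset case: every record of the smaller set is in the larger one.
for small, large in [(5_000, 10_000), (3_000, 10_000), (1_000, 10_000)]:
    overlap = dice_overlap(small, large, shared=small)
    print(f"{small:>6,} in {large:,}: {overlap:.1%}")
# prints 66.7%, 46.2% and 18.2% for the three size pairs
```

Under this reading, even a fully contained database yields an overlap well below 100%, because the measure also counts the records that only the larger database holds.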
Fig 7. Accuracy of re-identification before and after 0.25 fake injections: the success rate drops from 100% to 0%.