A P Brown, S M Randall, J H Boyd, A M Ferrante.
Abstract
INTRODUCTION: The need for increased privacy protection in data linkage has driven the development of privacy-preserving record linkage (PPRL) techniques. A popular technique based on Bloom filters has been the focus of much research in this area, including cryptographic analyses, modifications, and hashing variations to optimise privacy. Because Bloom filters have seldom been applied within a probabilistic framework, there is limited information on whether approximate matches between Bloom filtered fields can improve linkage quality.
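To make the idea of an approximate match between Bloom filtered fields concrete, the following is a minimal sketch and not the authors' implementation. It assumes bigram tokenisation, a keyed HMAC-SHA256 hash, a 256-bit filter, and four hash functions; the names `bloom_encode`, `dice`, `jaccard`, `hamming`, and the key are hypothetical. Each filter is represented as the set of bit positions that are switched on.

```python
import hashlib
import hmac

SECRET_KEY = b"shared-secret"  # hypothetical key agreed between data custodians


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=256, num_hashes=4):
    """Return the set of bit positions set to 1 for this field value."""
    bits = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            message = f"{seed}:{gram}".encode("utf-8")
            digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    """Sorensen-Dice similarity: 1.0 means identical filters."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard similarity: shared bits over all bits set in either filter."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hamming(a, b, size=256):
    """Normalised Hamming distance: 0.0 means identical filters."""
    return len(a ^ b) / size


smith, smyth, jones = (bloom_encode(n) for n in ("Smith", "Smyth", "Jones"))
print(dice(smith, smyth), jaccard(smith, smyth), hamming(smith, smyth))
print(dice(smith, jones), jaccard(smith, jones), hamming(smith, jones))
```

In this sketch, Dice and Jaccard are similarities (higher is closer) while Hamming is a distance (lower is closer), so a cut-off has to be read in the direction of the chosen measure.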
Year: 2019 PMID: 32935029 PMCID: PMC7482522 DOI: 10.23889/ijpds.v4i1.1095
Source DB: PubMed Journal: Int J Popul Data Sci ISSN: 2399-4908
Figure 1: Estimated field and dataset weight curves. Weight proportion represents the proportion of a field match comparison weight (0 = full disagreement, 1 = full agreement).
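One way to read the weight proportion is as an interpolation between a field's disagreement and agreement weights. The sketch below assumes the standard log-ratio weights of probabilistic linkage computed from m- and u-probabilities; the function names and the example probabilities are hypothetical and are not taken from the study.

```python
import math


def field_weights(m, u):
    """Agreement and disagreement weights from m- and u-probabilities."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


def partial_weight(proportion, agree, disagree):
    """Scale between full disagreement (0) and full agreement (1)."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return disagree + proportion * (agree - disagree)


# Hypothetical field: m = 0.95, u = 0.01
agree, disagree = field_weights(0.95, 0.01)
for p in (0.0, 0.5, 1.0):
    print(p, partial_weight(p, agree, disagree))
```

A weight curve would then supply the proportion for a given similarity score, so that a near match contributes part of the agreement weight rather than the full disagreement weight.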
FP = false positives, FN = false negatives. Cut-off values are shown in parentheses.
| Method | 1% Error FP | 1% Error FN | 1% Error Total | 5% Error FP | 5% Error FN | 5% Error Total | 10% Error FP | 10% Error FN | 10% Error Total | 20% Error FP | 20% Error FN | 20% Error Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Exact | 395 | 1,674 | 2,069 | 2,442 | 18,099 | 20,541 | 131,199 | 16,399 | 147,598 | 110,763 | 372,572 | 483,335 | |
| Field Level | |||||||||||||
| Jaro-Winkler | 92 | 1,781 | 1,873 | 881 | 2,904 | 3,785 | 4,641 | 13,612 | 18,253 | 44,503 | 81,321 | 125,824 | |
| Sørensen–Dice | 125 | 1,713 | 1,838 | 1,054 | 2,517 | 3,571 | 2,978 | 16,736 | 19,714 | 40,436 | 105,024 | 145,460 | |
| Jaccard | 99 | 1,719 | 1,818 | 827 | 2,703 | 3,530 | 1,276 | 20,439 | 21,715 | 34,869 | 109,274 | 144,143 | |
| Hamming | 132 | 1,732 | 1,864 | 830 | 2,691 | 3,521 | 5,033 | 10,526 | 15,559 | 39,301 | 76,619 | 115,920 | |
| Dataset Level | |||||||||||||
| Jaro-Winkler | 74 | 1,752 | 1,862 | 1,034 | 2,799 | 3,840 | 3,427 | 15,199 | 17,343 | 47,449 | 84,134 | 135,452 | |
| Sørensen–Dice | 109 | 1,742 | 1,848 | 1,401 | 3,652 | 4,612 | 3,540 | 25,521 | 27,343 | 53,702 | 120,408 | 166,761 | |
| Jaccard | 83 | 1,744 | 1,819 | 1,205 | 3,708 | 4,563 | 10,691 | 19,047 | 28,179 | 66,948 | 109,909 | 169,002 | |
| Hamming | 72 | 1,753 | 1,871 | 962 | 2,774 | 3,848 | 3,349 | 13,537 | 16,762 | 29,584 | 101,440 | 129,008 | |
| Cut-off value | |||||||||||||
| Jaro-Winkler | 191 | 1,798 | 1,989 | 2,366 | 3,447 | 5,813 | 5,639 | 16,815 | 22,454 | 120,523 | 64,166 | 184,689 | |
| | (0.85) | | | (0.90) | | | (0.85) | | | (0.85) | | | |
| Sørensen–Dice | 263 | 1,739 | 2,002 | 2,123 | 4,218 | 6,341 | 17,563 | 25,301 | 42,864 | 90,544 | 109,127 | 199,671 | |
| | (0.90) | | | (0.85) | | | (0.80) | | | (0.80) | | | |
| Jaccard | 233 | 1,756 | 1,989 | 1,500 | 6,035 | 13,363 | 7,286 | 38,324 | 45,610 | 142,297 | 48,699 | 190,996 | |
| | (0.80) | | | (0.75) | | | (0.70) | | | (0.70) | | | |
| Hamming | 155 | 1,806 | 1,961 | 1,710 | 3,677 | 5,387 | 6,799 | 15,687 | 22,486 | 25,428 | 158,455 | 183,883 | |
| | (0.15) | | | (0.15) | | | (0.20) | | | (0.20) | | | |
Figure 2: Precision-recall for each comparison (synthetic datasets). WC = weight curve.
Actual cut-off values are shown in parentheses
| Method | False Positives | False Negatives | Total | |
|---|---|---|---|---|
| Exact | 29,040 | 233,405 | 262,445 | |
| Field Level | ||||
| Jaro-Winkler | 33,729 | 170,188 | 203,917 | |
| Sørensen–Dice | 34,876 | 170,801 | 205,677 | |
| Jaccard | 44,576 | 162,931 | 207,507 | |
| Hamming | 46,905 | 166,138 | 213,043 | |
| Dataset Level | ||||
| Jaro-Winkler | 34,513 | 170,298 | 204,811 | |
| Sørensen–Dice | 41,929 | 176,513 | 218,442 | |
| Jaccard | 35,172 | 181,066 | 216,238 | |
| Hamming | 38,082 | 170,185 | 208,267 | |
| Cut-off value | ||||
| Jaro-Winkler (0.85) | 44,038 | 169,633 | 213,671 | |
| Sørensen–Dice (0.75) | 39,848 | 192,193 | 232,041 | |
| Jaccard (0.65) | 42,598 | 193,117 | 235,715 | |
| Hamming (0.20) | 39,750 | 191,080 | 230,830 | |
| Synthetic Dataset Level | ||||
| Jaro-Winkler | 31,900 | 172,363 | 204,263 | |
| Sørensen–Dice | 52,073 | 165,345 | 217,418 | |
| Jaccard | 50,586 | 163,769 | 214,355 | |
| Hamming | 40,112 | 170,642 | 210,754 | |
Figure 3: Precision-recall for each comparison (NSW Emergency dataset). WC = weight curve.