Adrian P Brown, Sean M Randall, Anna M Ferrante, James B Semmens, James H Boyd.
Abstract
BACKGROUND: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters.
Keywords: Data quality; Linkage quality; Privacy; Probabilistic; Record linkage
Year: 2017 PMID: 28693507 PMCID: PMC5504757 DOI: 10.1186/s12874-017-0370-0
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
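The abstract refers to privacy-preserved linkage using Bloom filters without showing how a field is encoded. The following is a minimal sketch of one common approach (character bigrams hashed into a bit array, compared with the Dice coefficient), not the authors' implementation. The filter size, number of hash functions, secret key, and the function names `bigrams`, `bloom_encode`, and `dice` are all assumptions chosen for illustration.

```python
import hashlib


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=1000, num_hashes=20, key="secret"):
    """Return the set of bit positions set to 1 for this value.

    Uses double hashing (h1 + i * h2) with two keyed digests; the
    parameters are illustrative assumptions, not values from the paper.
    """
    positions = set()
    for gram in bigrams(value):
        data = (key + gram).encode("utf-8")
        h1 = int.from_bytes(hashlib.sha256(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % size)
    return positions


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


same = dice(bloom_encode("smith"), bloom_encode("smith"))
close = dice(bloom_encode("smith"), bloom_encode("smyth"))
far = dice(bloom_encode("smith"), bloom_encode("jones"))
```

In a sketch like this, a field pair would typically be treated as agreeing when the Dice score exceeds some similarity cut-off; that cut-off is a separate design choice from the linkage threshold discussed in the tables.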
Fig. 1 Record comparison example
Field state combinations
| First Name | Last Name | Sex | Year of Birth | Count |
|---|---|---|---|---|
| Agree | Agree | Agree | Agree | 1502 |
| Agree | Agree | Missing | Disagree | 2142 |
| Agree | Disagree | Disagree | Missing | 28,644 |
| … | … | … | … | … |
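The table above tallies how often each combination of per-field comparison states occurs across record pairs. A minimal sketch of how such a tally could be produced is given below. It compares plain-text toy records with exact equality for readability, whereas the privacy-preserved setting would compare encoded fields; the field names, the toy records, and the helper names `field_state` and `pattern_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names mirroring the table columns.
FIELDS = ("first_name", "last_name", "sex", "year_of_birth")


def field_state(a, b):
    """Classify one field comparison as Agree, Disagree, or Missing."""
    if a is None or b is None:
        return "Missing"
    return "Agree" if a == b else "Disagree"


def pattern_counts(records):
    """Count each field-state combination over all record pairs."""
    counts = Counter()
    for left, right in combinations(records, 2):
        pattern = tuple(field_state(left.get(f), right.get(f)) for f in FIELDS)
        counts[pattern] += 1
    return counts


# Toy data, invented for illustration only.
records = [
    {"first_name": "ann", "last_name": "lee", "sex": "F", "year_of_birth": 1980},
    {"first_name": "ann", "last_name": "lee", "sex": None, "year_of_birth": 1981},
    {"first_name": "ann", "last_name": "ray", "sex": "M", "year_of_birth": None},
]
counts = pattern_counts(records)
```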
Synthetic dataset characteristics
| Field | 0% Error: Unique Values | 0% Error: Discriminating Power | 1% Error: Unique Values | 1% Error: Discriminating Power | 5% Error: Unique Values | 5% Error: Discriminating Power | 10% Error: Unique Values | 10% Error: Discriminating Power | 20% Error: Unique Values | 20% Error: Discriminating Power |
|---|---|---|---|---|---|---|---|---|---|---|
| First Name | 31,183 | 8.91 | 34,595 | 8.92 | 45,914 | 8.99 | 58,046 | 9.08 | 78,256 | 9.29 |
| Middle Name | 25,002 | 7.33 | 28,224 | 7.35 | 38,285 | 7.45 | 48,973 | 7.59 | 67,160 | 7.95 |
| Last Name | 56,507 | 10.87 | 61,198 | 10.88 | 77,088 | 10.96 | 94,925 | 11.07 | 125,483 | 11.35 |
| Dob Year | 112 | 6.49 | 114 | 6.49 | 116 | 6.50 | 117 | 6.51 | 119 | 6.53 |
| Dob Month | 12 | 3.58 | 12 | 3.58 | 12 | 3.58 | 12 | 3.58 | 12 | 3.58 |
| Dob Day | 31 | 4.94 | 31 | 4.94 | 31 | 4.94 | 31 | 4.94 | 31 | 4.93 |
| Sex | 2 | 1.00 | 2 | 1.00 | 2 | 1.00 | 2 | 1.00 | 2 | 1.00 |
| Address | 171,088 | 12.89 | 178,583 | 12.92 | 207,909 | 13.04 | 241,966 | 13.21 | 304,353 | 13.66 |
| Suburb | 1962 | 8.33 | 7390 | 8.36 | 19,664 | 8.48 | 31,054 | 8.65 | 49,929 | 9.10 |
| Postcode | 379 | 6.77 | 1755 | 6.80 | 2579 | 6.91 | 2981 | 7.06 | 3395 | 7.45 |
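This excerpt does not define "Discriminating Power". The reported values look consistent with Shannon entropy in bits: Sex, with two roughly balanced values, scores 1.00, and Dob Month, with 12 values, scores 3.58, which is close to log2(12). Under that assumption, which is an inference rather than something stated here, the measure for a field could be sketched as follows; `shannon_entropy_bits` is a hypothetical helper name.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two equally frequent values give exactly 1 bit.
sex_entropy = shannon_entropy_bits(["M", "F"] * 50)

# Twelve equally frequent months give log2(12) bits.
month_entropy = shannon_entropy_bits(range(1, 13))
```

A skewed field scores lower than a uniform one with the same number of unique values, which would explain why Dob Day (31 values) sits slightly below log2(31).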
Synthetic dataset linkage quality - estimated vs. calculated
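| Data Error Rate | Calculated Probabilities, Highest: Threshold | Calculated Probabilities, Highest: F-measure | Calculated Probabilities, Estimated: Threshold | Calculated Probabilities, Estimated: F-measure | EM m-probs and Estimated u-probs, Highest: Threshold | EM m-probs and Estimated u-probs, Highest: F-measure | EM m-probs and Estimated u-probs, Estimated: Threshold | EM m-probs and Estimated u-probs, Estimated: F-measure |
|---|---|---|---|---|---|---|---|---|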
| 0% | 49 | 1.0000 | 8 | 0.9999 | 49 | 1.0000 | 8 | 0.9999 |
| 1% | 9 | 0.9979 | 16 | 0.9978 | 13 | 0.9979 | 11 | 0.9979 |
| 5% | 8 | 0.9549 | 16 | 0.9541 | 12 | 0.9549 | 11 | 0.9549 |
| 10% | 8 | 0.8443 | 16 | 0.8399 | 12 | 0.8439 | 11 | 0.8436 |
| 20% | 8 | 0.5217 | 16 | 0.4938 | 12 | 0.4999 | 11 | 0.4917 |
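The table compares the F-measure at the best-performing threshold with the F-measure at an estimated threshold. The excerpt does not give the formula, so the sketch below assumes the usual definition (harmonic mean of precision and recall) and shows how a "highest" threshold could be found by sweeping candidate values over scored pairs with known match status. The function names and the toy pairs are hypothetical.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs, thresholds):
    """Return (threshold, F-measure) maximising F over the candidates.

    scored_pairs is an iterable of (score, is_true_match); a pair is
    accepted as a link when score >= threshold. Ties keep the lowest
    threshold encountered first.
    """
    pairs = list(scored_pairs)
    total_matches = sum(1 for _, is_match in pairs if is_match)
    best = (None, -1.0)
    for t in thresholds:
        tp = sum(1 for s, m in pairs if s >= t and m)
        fp = sum(1 for s, m in pairs if s >= t and not m)
        f = f_measure(tp, fp, total_matches - tp)
        if f > best[1]:
            best = (t, f)
    return best


# Toy scores, invented for illustration only.
toy_pairs = [(20, True), (15, True), (9, True), (10, False), (3, False)]
threshold, score = best_threshold(toy_pairs, range(0, 25))
```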
Administrative dataset characteristics (NSW: 13,534,177 records; SA: 2,509,914 records; WA: 6,772,949 records)
| Field | NSW: Unique Values | NSW: Missing % | NSW: Discriminating Power | SA: Unique Values | SA: Missing % | SA: Discriminating Power | WA: Unique Values | WA: Missing % | WA: Discriminating Power |
|---|---|---|---|---|---|---|---|---|---|
| First Name | 168,766 | 2.9% | 8.61 | 124,849 | 5.5% | 9.18 | 78,992 | 0.3% | 8.54 |
| Middle Name | 114,686 | 54.2% | 6.96 | 22,180 | 75.4% | 7.19 | 61,241 | 40.8% | 7.13 |
| Last Name | 291,595 | 0% | 10.92 | 81,431 | 5.3% | 10.81 | 123,481 | 0% | 10.73 |
| Dob Year | 123 | 0% | 6.47 | 115 | 0% | 6.45 | 118 | 0% | 6.39 |
| Dob Month | 12 | 0% | 3.58 | 12 | 0% | 3.58 | 12 | 0% | 3.58 |
| Dob Day | 31 | 0% | 4.94 | 31 | 0% | 4.94 | 31 | 0% | 4.94 |
| Sex | 2 | 0% | 1.00 | 2 | 0% | 1.00 | 2 | 0% | 0.99 |
| Address | 3,084,889 | 1.5% | 16.96 | 690,615 | 8.1% | 14.92 | 1,350,796 | 0.2% | 16.05 |
| Suburb | 49,843 | 0.5% | 9.30 | 10,729 | 6.9% | 7.85 | 5542 | 0.1% | 7.73 |
| Postcode | 3947 | 0.8% | 8.17 | 2238 | 8.5% | 6.90 | 2319 | 0.2% | 6.58 |
Estimated probabilities
| Field | NSW: EM m-prob | NSW: Est. u-prob | SA: EM m-prob | SA: Est. u-prob | WA: EM m-prob | WA: Est. u-prob |
|---|---|---|---|---|---|---|
| First Name | 0.9817 | 0.0024 | 0.8707 | 0.0015 | 0.9732 | 0.0027 |
| Middle Name | 0.4686 | 0.0017 | 0.1846 | 0.0004 | 0.4385 | 0.0025 |
| Last Name | 0.9916 | 0.0005 | 0.8931 | 0.0005 | 0.9823 | 0.0006 |
| Dob Year | 0.9973 | 0.0113 | 0.9997 | 0.0114 | 0.9935 | 0.0119 |
| Dob Month | 0.9987 | 0.0834 | 0.9988 | 0.0834 | 0.9949 | 0.0835 |
| Dob Day | 0.9965 | 0.0325 | 0.9988 | 0.0325 | 0.9963 | 0.0326 |
| Sex | 0.9999 | 0.5008 | 1.0000 | 0.5010 | 0.9998 | 0.5018 |
| Address | 0.8325 | 7.99E-06 | 0.6486 | 2.8E-05 | 0.7338 | 1.7E-05 |
| Suburb | 0.9303 | 0.0016 | 0.7462 | 0.0038 | 0.8402 | 0.0047 |
| Postcode | 0.9540 | 0.0034 | 0.7574 | 0.0070 | 0.8640 | 0.0104 |
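To illustrate how m- and u-probabilities like these feed into a pair score, the sketch below assumes the standard Fellegi-Sunter base-2 log weights; the excerpt does not spell out the weighting formula, so this is an assumption. It uses three NSW rows from the table above. Treating a missing comparison as contributing zero is also an assumption, and the function names are hypothetical.

```python
import math


def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when a field disagrees: log2((1 - m) / (1 - u)).

    Undefined when m == 1.0 (as reported for Sex in SA), so a real
    implementation would need to cap m below 1.
    """
    return math.log2((1 - m) / (1 - u))


# (m, u) pairs copied from the NSW columns of the table.
NSW = {
    "First Name": (0.9817, 0.0024),
    "Last Name": (0.9916, 0.0005),
    "Sex": (0.9999, 0.5008),
}


def pair_score(states, probs):
    """Sum field weights for a dict of field -> Agree/Disagree/Missing."""
    score = 0.0
    for field, state in states.items():
        m, u = probs[field]
        if state == "Agree":
            score += agreement_weight(m, u)
        elif state == "Disagree":
            score += disagreement_weight(m, u)
        # "Missing" contributes nothing (assumption for this sketch).
    return score


all_agree = pair_score({f: "Agree" for f in NSW}, NSW)
name_clash = pair_score(
    {"First Name": "Agree", "Last Name": "Disagree", "Sex": "Missing"}, NSW
)
```

Under these assumptions, a highly discriminating field such as Last Name contributes far more on agreement than Sex, whose u-probability is close to 0.5.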
Linkage quality (max F-measure) – EM vs. calculated
| Dataset | Probabilities | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | RMSE |
|---|---|---|---|---|---|---|---|
| NSW | Calculated | 0.9941 | 0.9943 | 0.9942 | 0.9941 | 0.9940 | |
| NSW | EM | 0.9961 | 0.9965 | 0.9963 | 0.9963 | 0.9961 | 0.0021 |
| SA | Calculated | 0.9532 | 0.9521 | 0.9529 | 0.9553 | 0.9532 | |
| SA | EM | 0.9590 | 0.9567 | 0.9563 | 0.9582 | 0.9589 | 0.0046 |
| WA | Calculated | 0.9907 | 0.9904 | 0.9910 | 0.9905 | 0.9906 | |
| WA | EM | 0.9920 | 0.9916 | 0.9921 | 0.9917 | 0.9918 | 0.0012 |
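The RMSE column can be reproduced from the sample values in the table, assuming it is the root-mean-square difference between the Calculated and EM F-measures across the five samples. The sketch below uses only figures copied from the table; the helper name `rmse` is an assumption.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Max F-measure per sample, copied from the table.
calculated = {
    "NSW": [0.9941, 0.9943, 0.9942, 0.9941, 0.9940],
    "SA": [0.9532, 0.9521, 0.9529, 0.9553, 0.9532],
    "WA": [0.9907, 0.9904, 0.9910, 0.9905, 0.9906],
}
em = {
    "NSW": [0.9961, 0.9965, 0.9963, 0.9963, 0.9961],
    "SA": [0.9590, 0.9567, 0.9563, 0.9582, 0.9589],
    "WA": [0.9920, 0.9916, 0.9921, 0.9917, 0.9918],
}

rmse_by_dataset = {
    name: round(rmse(calculated[name], em[name]), 4) for name in calculated
}
# rmse_by_dataset -> NSW 0.0021, SA 0.0046, WA 0.0012
```

Rounded to four decimals, these match the RMSE values reported for each dataset.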
Linkage quality – max F-measure vs. F-measure at threshold estimate
| Dataset | Threshold Type | Threshold | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | RMSE |
|---|---|---|---|---|---|---|---|---|
| NSW | Best | 14 | 0.9961 | 0.9965 | 0.9963 | 0.9963 | 0.9961 | |
| NSW | Estimated | 12 | 0.9943 | 0.9946 | 0.9945 | 0.9944 | 0.9942 | 0.0019 |
| SA | Best | 13 | 0.9590 | 0.9567 | 0.9563 | 0.9582 | 0.9589 | |
| SA | Estimated | 12 | 0.9589 | 0.9566 | 0.9563 | 0.9581 | 0.9588 | 0.0001 |
| WA | Best | 13 | 0.9920 | 0.9916 | 0.9921 | 0.9917 | 0.9918 | |
| WA | Estimated | 11 | 0.9871 | 0.9870 | 0.9873 | 0.9871 | 0.9875 | 0.0046 |