J C Doidge, K Harron.
Abstract
Many of the distinctions made between probabilistic and deterministic linkage are misleading. While these two approaches to record linkage operate in different ways and can produce different outputs, the distinctions between them are more a result of how they are implemented than because of any intrinsic differences. In the way they are generally applied, probabilistic and deterministic procedures can be little more than alternative means to similar ends, or they can arrive at very different ends depending on choices that are made during implementation. Misconceptions about probabilistic linkage contribute to reluctance to implement it and mistrust of its outputs. We aim to explain how the outputs of either approach can be tailored to suit the intended application, but also to highlight the ways in which probabilistic linkage is generally more flexible, more powerful and more informed by the data. This is accomplished by examining common misconceptions about probabilistic linkage and its difference from deterministic linkage, highlighting the potential impact of design choices on the outputs of either approach. We hope that better understanding of linkage designs will help to allay concerns about probabilistic linkage, and help data linkers to select and tailor procedures to produce outputs that are appropriate for their intended use.
Keywords: data linkage; data matching; deterministic linkage; electronic health records; medical record linkage; probabilistic linkage; record linkage
Year: 2018 PMID: 30533534 PMCID: PMC6281162 DOI: 10.23889/ijpds.v3i1.410
Source DB: PubMed Journal: Int J Popul Data Sci ISSN: 2399-4908
Figure 1: Example distributions of match weights and thresholds. Curves illustrating the expected distribution of probabilistic match weights for matches (green) and non-matches (red). In practice, only a single distribution is visualised, representing the match weights for all pairwise comparisons. Classification of links (assumed matches) then generally involves the specification of thresholds (blue). A: Poor discrimination between matches and non-matches (high potential for linkage error); B: Good discrimination between matches and non-matches (low potential for linkage error); C: Two thresholds with manual review region (potential errors subject to review); D: Single threshold and no manual review region (linkage errors accepted).
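The threshold designs in panels C and D can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' procedure: the function name `classify`, the labels it returns, and the example weights and threshold values are all assumptions made here for demonstration.

```python
from typing import Optional


def classify(weight: float, upper: float, lower: Optional[float] = None) -> str:
    """Classify one record pair from its match weight.

    With only `upper` given there is a single threshold and no review
    region (panel D). With a distinct `lower` threshold, weights falling
    between the two are sent to manual review (panel C).
    """
    if lower is None:
        lower = upper  # single-threshold design: review region is empty
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "manual review"


# Hypothetical match weights for four record pairs.
weights = [-4.2, 1.5, 3.0, 9.7]

# Single threshold: every pair is either a link or a non-link.
single = [classify(w, upper=3.0) for w in weights]

# Two thresholds: pairs in the middle region are set aside for review.
double = [classify(w, upper=6.0, lower=0.0) for w in weights]
```

Moving the single threshold up trades false matches for missed matches, and moving it down does the reverse; widening the gap between two thresholds increases the amount of manual review.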
| Myth | Truth |
|---|---|
| ‘Probabilistic linkage… | |
| … and deterministic linkage are completely distinct methods.’ | Each pattern of agreement over matching variables corresponds to a potential decision rule in deterministic linkage and a match weight in probabilistic linkage. For any match weight threshold that can be set, there is generally an equivalent set of deterministic rules that can be specified. |
| … is based on the probability that record pairs are a match.’ | It is based on a score that, under certain assumptions, correlates with the likelihood that record pairs are a match. |
| … is intrinsically imperfect or imprecise.’ | The effectiveness of any linkage procedure depends on the quality of the matching variables. Probabilistic and deterministic linkage can be equivalent when the same matching variables are used, but it is easier to incorporate poor-quality matching variables in probabilistic linkage. |
| … produces more false matches.’ | There is always a trade-off between false matches and missed matches. In probabilistic linkage, this trade-off can be tuned in either direction by adjusting the match weight threshold. |
| … requires manual review.’ | The use and amount of manual review depend entirely on how the thresholds are chosen and the degree of certainty acceptable in results. With a single threshold, no manual review is required. |
| … allows for disagreement on matching variables.’ | Deterministic linkage also allows for disagreement on matching variables. |
| … can accommodate partial agreement.’ | Deterministic linkage can also accommodate partial agreement. |
| … reflects uncertainty in linkage.’ | In their usual forms, neither probabilistic nor deterministic linkage accounts for uncertainty in linkage (this is the task for the analysis, not the linkage). Both probabilistic match weights and deterministic rule steps are crude indicators of uncertainty in a link. |
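
The first row of the table, that each agreement pattern corresponds to both a match weight and a potential deterministic rule, can be illustrated with a small Python sketch. It assumes the standard Fellegi-Sunter form of the match weight (log2(m/u) for agreement and log2((1-m)/(1-u)) for disagreement, summed over matching variables), which the text above does not spell out. The variable names, the m- and u-probabilities and the threshold are hypothetical values chosen for illustration.

```python
import math
from itertools import product

# Hypothetical (m, u) pairs: m = P(agree | match), u = P(agree | non-match).
M_U = {
    "surname": (0.95, 0.01),
    "dob": (0.98, 0.003),
    "postcode": (0.80, 0.05),
}


def match_weight(pattern: dict) -> float:
    """Sum per-variable agreement or disagreement weights for one pattern."""
    total = 0.0
    for var, agrees in pattern.items():
        m, u = M_U[var]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def equivalent_rules(threshold: float) -> list:
    """List the agreement patterns whose weight meets the threshold.

    Each returned frozenset names the variables that must agree (all
    others disagree), so it can be read as one deterministic rule.
    """
    variables = list(M_U)
    rules = []
    for combo in product([True, False], repeat=len(variables)):
        pattern = dict(zip(variables, combo))
        if match_weight(pattern) >= threshold:
            rules.append(frozenset(v for v, a in pattern.items() if a))
    return rules


# With these assumed values, a threshold of 6.0 admits three patterns.
rules = equivalent_rules(6.0)
```

Under these assumed values the threshold is reproduced by three rules: agreement on all three variables, on surname and date of birth only, or on date of birth and postcode only. Changing the threshold changes which patterns are admitted, in the same way that adding or removing steps changes a deterministic procedure.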