James C Doidge, Katie L Harron.
Abstract
Linked data are increasingly being used for epidemiological research, to enhance primary research, and in planning, monitoring and evaluating public policy and services. Linkage error (missed links between records that relate to the same person or false links between unrelated records) can manifest in many ways: as missing data, measurement error and misclassification, unrepresentative sampling, or as a special combination of these that is specific to analysis of linked data, namely the merging and splitting of people. For example, two hospital admission records are counted as one person admitted twice if linked, and as two people admitted once if not. Through these mechanisms, linkage error can ultimately lead to information bias and selection bias, so identifying the relevant mechanisms is key in quantitative bias analysis. In this article we introduce five key concepts and a study classification system for identifying which mechanisms are relevant to any given analysis. We provide examples and discuss options for estimating parameters for bias analysis. This conceptual framework provides the 'links' between linkage error, information bias and selection bias, and lays the groundwork for quantitative bias analysis for linkage error.
Keywords: Linkage error; bias; bias analysis; data linkage; information bias; missing data; quantitative bias analysis; record linkage; selection bias; sensitivity analysis
Year: 2019 PMID: 31633184 PMCID: PMC7020770 DOI: 10.1093/ije/dyz203
Source DB: PubMed Journal: Int J Epidemiol ISSN: 0300-5771 Impact factor: 7.196
Figure 1. 2 × 2 table representing accuracy in record linkage. As with screening tests, linkage accuracy can be represented in a 2 × 2 table where sensitivity (or recall) = a/(a + c), specificity = d/(b + d), positive predictive value (or precision) = a/(a + b) and negative predictive value = d/(c + d).
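The four accuracy measures in the caption can be computed directly from the cell counts. The sketch below is illustrative only: it assumes the usual screening-test layout implied by the predictive value formulas (a = true links, b = false links, c = missed links, d = true non-links, so that specificity is d/(b + d)), and the function name and the counts are hypothetical.

```python
def linkage_accuracy(a: int, b: int, c: int, d: int) -> dict:
    """Accuracy measures from a 2 x 2 linkage table.

    Assumed cell layout (not stated in the caption itself):
      a = true links    (linked, true match)
      b = false links   (linked, not a true match)
      c = missed links  (not linked, true match)
      d = true non-links (not linked, not a true match)
    """
    return {
        "sensitivity": a / (a + c),  # recall
        "specificity": d / (b + d),
        "ppv": a / (a + b),          # precision
        "npv": d / (c + d),
    }


# Hypothetical counts, for illustration only.
metrics = linkage_accuracy(a=90, b=5, c=10, d=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice d is often very large relative to the other cells, because most candidate record pairs are true non-matches, so specificity and negative predictive value tend to sit close to 1.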
Eleven ‘linkage structures’ for classifying analysis of two linked sets of records
| Linkage structure | Venn diagram | Example | Is linkage meaningfully interpreted? | Is splitting or merging possible? | Is selection dependent on linkage? | What are the implications of a missed link? | What are the implications of a false link? |
|---|---|---|---|---|---|---|---|
| ‘Master’ | (diagram not shown) | Analysis of mortality risk through linkage to a register of deaths | Yes, with respect to a variable of interest | No | No | False-negative misclassification | Potential false-positive misclassification |
| ‘Intersection’ | (diagram not shown) | Analysis of health in aeroplane passengers through linkage of health care service data to passenger manifests | Yes, with respect to the inclusion criteria | No | Yes | Erroneous exclusion | Potential erroneous inclusion |
| ‘Union’ | (diagram not shown) | Analysis of pooled data from two providers of a comparable service | Only with respect to variables based on inclusion in both datasets | Yes | Only with respect to potential merging and splitting | Splitting | Merging |
| ‘Disjunctive union’ | (diagram not shown) | When comparing two services, an analyst may wish to exclude people who used both | Yes, with respect to a variable that is both a criterion for inclusion and a variable of interest | Yes | Yes, with potential for erroneous inclusion of split entities and exclusion of merged entities | Splitting and erroneous inclusion in both subgroups | Merging and erroneous exclusion |
| ‘Set difference’ | (diagram not shown) | When evaluating one service, an analyst may wish to exclude people who also used an alternative service | Yes, with respect to exclusion criteria | No | Yes | Erroneous inclusion | Potential erroneous exclusion |
| ‘Perfect overlap’ | (diagram not shown) | Analysis of data from two services that independently cover the same population, such as one for mothers and one for babies (if every baby record has a corresponding maternal record) | No | No | Only with ‘complete case’ approaches to missing data | Missing data | Potential misclassification or measurement error |
| ‘Nested’ | (diagram not shown) | Analysis of birthweight for participants in a cohort study through linkage with birth registrations | No | No | Only with ‘complete case’ approaches to missing data | Missing data | Potential misclassification or measurement error |
| ‘Nested subset’ | (diagram not shown) | A special case of the nested structure, in which the larger auxiliary file provides information about inclusion or exclusion criteria, e.g. linkage of a cohort to a birth register, to define a substudy of cohort members with low birthweight | No | No | Yes | Missing data in the selection criteria (which may mean exclusion) | Potential erroneous inclusion or exclusion |
| ‘Nest’ | (diagram not shown) | Comparison of outcomes between admitted patients with and without linked test results | Yes, with respect to a variable of interest | No | No | False-negative misclassification | Potential false-positive misclassification |
| ‘Nested set difference’ | (diagram not shown) | Analysis excluding people who used a service that is only provided to a subset of the population covered by the primary file, e.g. exclusion of patients who received a treatment that was recorded separately | Yes, with respect to criterion for exclusion | No | Yes | Erroneous inclusion | Potential erroneous exclusion |
| ‘Imperfect nest’ | (diagram not shown) | A special case of a nested structure, in which the larger auxiliary file has less than full coverage of the primary file, e.g. linkage to birth records for a cohort that includes some people born overseas | No | No | Only if a complete case analysis approach is taken to missing data | Missing data | Potential misclassification or measurement error |
Circles represent the population covered by two sets of records, ignoring linkage within either set. Shading represents the region from which the analysis sample is derived (the sampling frame). The size of the circles is irrelevant. Linkage within either set (‘internal linkage’) can have implications that must also be considered (see text).
In our experience, these structures are unusual in practice; if the decision tree leads to one of them, revisit the questions and check that the responses are appropriate.
A complete case approach to missing data in an ‘imperfect nest’ structure becomes equivalent to an ‘intersection’ structure.
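The last two columns of the table above can be transcribed into a small lookup, which may be convenient when documenting which error mechanisms a given analysis needs to consider. This is only a sketch: the dictionary and function names are hypothetical, and the entries simply restate the table.

```python
# Implications of linkage error by linkage structure, transcribed from the
# table above as (missed link, false link). Names are illustrative only.
IMPLICATIONS = {
    "master": ("False-negative misclassification",
               "Potential false-positive misclassification"),
    "intersection": ("Erroneous exclusion", "Potential erroneous inclusion"),
    "union": ("Splitting", "Merging"),
    "disjunctive union": ("Splitting and erroneous inclusion in both subgroups",
                          "Merging and erroneous exclusion"),
    "set difference": ("Erroneous inclusion", "Potential erroneous exclusion"),
    "perfect overlap": ("Missing data",
                        "Potential misclassification or measurement error"),
    "nested": ("Missing data",
               "Potential misclassification or measurement error"),
    "nested subset": ("Missing data in the selection criteria (which may mean exclusion)",
                      "Potential erroneous inclusion or exclusion"),
    "nest": ("False-negative misclassification",
             "Potential false-positive misclassification"),
    "nested set difference": ("Erroneous inclusion",
                              "Potential erroneous exclusion"),
    "imperfect nest": ("Missing data",
                       "Potential misclassification or measurement error"),
}


def implications(structure: str) -> dict:
    """Return the implications of missed and false links for a structure."""
    missed_link, false_link = IMPLICATIONS[structure.strip().lower()]
    return {"missed_link": missed_link, "false_link": false_link}


print(implications("Intersection"))
```

A lookup like this does not replace the classification questions; it only records their outcome, and any internal linkage within either set still needs separate consideration.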
Figure 2. Linkage structure classification tree. ‘Entities’ are the unit at which linkage occurs; usually people but potentially families, households, companies etc. ‘Sets’ refers to groups of records being linked; these may be separate data sources, subsets of larger source files (e.g. hospital admissions for disease X) or even subsets of the same source file (e.g. ‘hospital admissions for disease X’ and ‘possible readmissions’, or linkage of mothers to babies in Hospital Episode Statistics [10]). Linkage within either set can have additional implications for how linkage error can manifest, especially with respect to potential for splitting and merging (see text).
Techniques for estimating linkage error bias parameters
| Technique | False links (%) | False links (Δ) | Missed links (%) | Missed links (Δ) | Limitations |
|---|---|---|---|---|---|
| Comparison of linked data with training data or ‘gold standard’ (often a subset), e.g. records with unique identifiers available for linkage | ✓ | ✓ | ✓ | ✓ | Training data that are representative, in terms of both the quality of matching variables and the association of that quality with variables of interest, are rarely available |
| Negative controls (a subset of records that should definitely not link, i.e. a partial gold standard set), e.g. people known to be alive when linking to a death register | ✓ | ✓ | ✗ | ✗ | Negative controls can be easier to source than positive controls but still require representativeness |
| Comparison of linked and unlinked records, e.g. | ✗ | ✗ | ∼ | ✓ | Only useful when expecting ∼100% match rate in one file. No guarantee that linked records are true matches |
| Comparison of linkable and unlinkable records or records with higher quality matching data and records with lower quality matching data, e.g. missing NHS numbers | ✗ | ✗ | ✗ | ✓ | Usually feasible, given access to record-level information about matching variable quality |
| Comparison of plausible and implausible links, e.g. simultaneous admissions to hospital | ∼ | ✓ | ✗ | ✗ | Often feasible but implausible links are often excluded by data linkers during ‘quality assurance’ |
| Analysis of observed versus plausible number of candidate links, across deterministic rules or probabilistic match weight thresholds, e.g. | ✓ | ✗ | ✗ | ✗ | Only feasible in 1:1 or 1:many linkages (where at most one link is expected in one or both directions) |
| Comparison of characteristics of linked data to reference statistics from external data sources, e.g. | ∼ | ∼ | ∼ | ∼ | Requires representativeness and consideration of other possible reasons for differences, such as differences in data collection and quality |
%, can provide evidence about rates of linkage error; Δ, can provide evidence about differences in error rates with respect to variables of interest.
If 100% of records in one file are expected to link and the number of false links can be estimated, then the number of missed links can be derived from the number of non-links and the number of false links (e.g. if there are approximately no false links, then the number of missed links is approximately the number of non-links).
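One way to make this derivation concrete is sketched below. It rests on assumptions that go beyond the note: every record in the file has a true match, each record carries at most one link, and a falsely linked record is counted as having missed its true match. Under those assumptions the missed links are the non-links plus the false links. The function name and the counts are hypothetical.

```python
def estimate_missed_links(n_records: int, n_linked: int, n_false_links: float) -> float:
    """Estimate missed links in a file where every record is expected to link.

    Assumes at most one link per record, so each record is either correctly
    linked, falsely linked, or unlinked. Records in the last two groups have
    both missed their true match.
    """
    if n_linked > n_records:
        raise ValueError("n_linked cannot exceed n_records")
    n_non_links = n_records - n_linked
    return n_non_links + n_false_links


# With approximately no false links, missed links equal the non-links.
print(estimate_missed_links(n_records=1000, n_linked=950, n_false_links=0))
# With an estimated 20 false links, 20 further true matches were missed.
print(estimate_missed_links(n_records=1000, n_linked=950, n_false_links=20))
```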
Implausible links usually represent only the ‘tip of the iceberg’ and hide a larger proportion of plausible false links. For some of these, the proportion of all possible scenarios that would be considered implausible can be calculated and used to inversely weight the observed number of implausible links, to estimate the unobserved total number of false links.
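The inverse weighting described here amounts to dividing the observed number of implausible links by the proportion of false-link scenarios that would look implausible. The sketch below illustrates that arithmetic only; the function name and all numbers are hypothetical, and it assumes that every implausible link is a false link and that the proportion can be calculated for the scenario in question.

```python
def estimate_total_false_links(n_implausible: int, p_implausible: float) -> float:
    """Scale up observed implausible links to an estimated total of false links.

    p_implausible is the proportion of all possible false-link scenarios that
    would be recognised as implausible (an input the analyst must calculate).
    """
    if not 0 < p_implausible <= 1:
        raise ValueError("p_implausible must be in the interval (0, 1]")
    return n_implausible / p_implausible


# Hypothetical example: 30 implausible links observed, and one quarter of
# false-link scenarios would be detectable as implausible.
total_false = estimate_total_false_links(n_implausible=30, p_implausible=0.25)
hidden_false = total_false - 30  # plausible-looking false links
print(total_false, hidden_false)
```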
The extent to which this technique can be used to inform estimation of bias parameters depends heavily on representativeness and the absence of any other reasons for observed differences. Although sometimes useful for quantitative bias analysis, it is perhaps better suited to qualitative validation.
Linkage error in a ‘master’ linkage structure
Is the target sample uniquely identified in the data, prior to linkage? Generally, yes; there exists one record per admission in the PICU data and each infection record can link to at most one PICU record. There is no possibility of splitting or merging of admissions. However, if the PICU file were also internally linked, for example to include only one admission per child, then splitting and merging could be implicated.

Is linkage meaningfully interpreted? Yes; a link is interpreted as meaning that a child in PICU had an infection, which is a variable of interest. Absence of a link is interpreted as implying that they were infection-free.

Is selection dependent on linkage? No; selection into the analysis sample is solely determined by inclusion in a primary file (PICU admissions).
Linkage error in an ‘intersection’ linkage structure
Is the target sample uniquely identified in the data, prior to linkage? Yes. Length of flight is a flight-level characteristic, and the unit of observation must be ‘person-flights’, which would be uniquely identified in the electronic flight records without any possibility of splitting or merging, unless some further restriction based on internal linkage (e.g. limiting the analysis to one person-flight per person) is also being applied.

Is linkage meaningfully interpreted? Yes, with respect to inclusion criteria.

Is selection dependent on linkage? Yes, because linkage is meaningfully interpreted with respect to inclusion criteria.