Katie L Harron, James C Doidge, Hannah E Knight, Ruth E Gilbert, Harvey Goldstein, David A Cromwell, Jan H van der Meulen.
Abstract
Linked datasets are an important resource for epidemiological and clinical studies, but linkage error can lead to biased results. For data security reasons, linkage of personal identifiers is often performed by a third party, making it difficult for researchers to assess the quality of the linked dataset in the context of specific research questions. This is compounded by a lack of guidance on how to determine the potential impact of linkage error. We describe how linkage quality can be evaluated and provide widely applicable guidance for both data providers and researchers. Using an illustrative example of a linked dataset of maternal and baby hospital records, we demonstrate three approaches for evaluating linkage quality: applying the linkage algorithm to a subset of gold standard data to quantify linkage error; comparing characteristics of linked and unlinked data to identify potential sources of bias; and evaluating the sensitivity of results to changes in the linkage procedure. These approaches can inform our understanding of the potential impact of linkage error and provide an opportunity to select the most appropriate linkage procedure for a specific analysis. Evaluating linkage quality in this way will improve the quality and transparency of epidemiological and clinical research using linked data.
Keywords: Record linkage; administrative data; bias; data accuracy; data linkage; hospital records; linkage error; selection bias; sensitivity and specificity
Year: 2017 PMID: 29025131 PMCID: PMC5837697 DOI: 10.1093/ije/dyx177
Source DB: PubMed Journal: Int J Epidemiol ISSN: 0300-5771 Impact factor: 7.196
Box 1. Summary of approaches to evaluating linkage quality
| | Using a gold standard dataset to quantify false matches and missed matches | Comparing characteristics of linked and unlinked data to identify potential sources of bias | Sensitivity analyses to evaluate how sensitive results are to changes in linkage procedure |
|---|---|---|---|
| Purpose | To quantify errors (missed matches and false matches) | To identify subgroups of records that are more prone to linkage error and are potential sources of bias | To assess the extent to which results of interest vary with different levels of error, and the likely direction of bias |
| Strengths | Easily interpretable; allows linkage error to be fully measured | Straightforward to implement and easily interpretable | Straightforward to implement |
| Limitations | Representative gold standard data are rarely available | Cannot be applied if systematic differences are expected between linked and unlinked records (e.g. if linking to a death register) | Results may be difficult to interpret, as false matches and missed matches may affect results in opposing or compounding ways |
| Technical requirements | A representative group of records for which true match status is known; data linker capacity to perform evaluation (researchers rarely have access to gold standard data) | A linkage design where all records in at least one file are expected to link; provision of record-level or aggregate characteristics of unlinked records to researchers | Provision of information on the strength of the match (e.g. deterministic rule or probabilistic match weight) |
Figure 1. Creation of a gold standard dataset for evaluating linkage quality in the HES mother-baby cohort. (1) 83 195 records of births/deliveries from 15 English hospitals, April 2012–March 2013; (2) 672 955 records of births/deliveries from all NHS hospitals in England, April 2012–March 2013; (3) 72 817 records in the HES mother-baby validation cohort (gold standard).
Characteristics of records in the HES mother-baby cohort according to linkage status derived from gold standard data
| Characteristic | Category | True matches, n | True matches, % | False matches, n | False matches, % | St. diff. | Missed matches, n | Missed matches, % | St. diff. |
|---|---|---|---|---|---|---|---|---|---|
| Stillbirth | | 325 | 0.5 | 6 | 0.9 | 0.1 | 9 | 3.0 | 0.2 |
| Survival to postnatal discharge | | 71384 | 99.1 | 627 | 98.1 | 0.1 | 286 | 96.3 | 0.2 |
| Delivery risk factor | | 6738 | 9.4 | 105 | 16.4 | 0.2 | 49 | 16.5 | 0.2 |
| Female infant | | 34967 | 48.7 | 321 | 50.2 | 0.0 | 140 | 47.1 | 0.0 |
| Multiple birth | | 1961 | 2.7 | 126 | 19.7 | 0.6 | 31 | 10.4 | 0.3 |
| Caesarean section | | 18034 | 25.1 | 63 | 9.9 | 0.4 | 10 | 3.4 | 0.7 |
| Pregnancy risk factor | | 7388 | 10.3 | 16 | 2.5 | 0.3 | 1 | 0.3 | 0.5 |
| Neonatal medical condition | | 6281 | 8.7 | 91 | 14.2 | 0.2 | 90 | 30.3 | 0.6 |
| Neonatal ICU | | 8461 | 11.8 | 32 | 5.0 | 0.2 | 33 | 11.1 | 0.0 |
| Parity: nulliparous | | 27125 | 37.7 | 335 | 52.4 | 0.3 | 192 | 64.5 | 0.6 |
| Gestational age group | Full term (39+ wks) | 45611 | 72.3 | 102 | 44.4 | 0.7 | 27 | 44.3 | 0.7 |
| | Early term (37–38 wks) | 12721 | 20.2 | 66 | 28.7 | | 17 | 27.9 | |
| | Late preterm (34–36 wks) | 3280 | 5.2 | 39 | 17.0 | | 6 | 9.8 | |
| | Moderate/very preterm (< 34 wks) | 1494 | 2.4 | 23 | 10.0 | | 11 | 18.0 | |
| | Missing | 8775 | 12.2 | 409 | 64.0 | | 236 | 79.5 | |
| Birthweight (g) | < 1500 | 909 | 1.4 | 14 | 6.1 | 0.7 | 7 | 10.9 | 0.7 |
| | 1500–< 2500 | 1798 | 6.0 | 45 | 19.7 | | 12 | 18.8 | |
| | 2500–< 4000 | 51718 | 82.0 | 160 | 69.9 | | 42 | 65.6 | |
| | 4000+ | 6687 | 10.6 | 10 | 4.4 | | 3 | 4.7 | |
| | Missing | 8769 | 12.2 | 410 | 64.2 | | 233 | 78.5 | |
| Size for gestation | Small (< 10th percentile) | 5274 | 8.4 | 25 | 11.1 | 0.2 | 5 | 8.3 | 0.1 |
| | Normal | 54367 | 81.6 | 187 | 83.1 | | 51 | 85.0 | |
| | Large (> 90th percentile) | 6344 | 10.1 | 13 | 5.8 | | 4 | 6.7 | |
| | Missing | 8896 | 12.4 | 414 | 64.8 | | 237 | 79.8 | |
| Ethnicity | White | 48896 | 68.0 | 408 | 63.9 | 0.3 | 165 | 55.6 | 0.4 |
| | Mixed | 3410 | 4.7 | 24 | 3.8 | | 14 | 4.7 | |
| | Asian | 7367 | 10.3 | 49 | 7.7 | | 20 | 6.7 | |
| | Black | 4866 | 6.8 | 32 | 5.0 | | 25 | 8.4 | |
| | Other | 4508 | 6.3 | 77 | 12.1 | | 38 | 12.8 | |
| | Unknown | 2834 | 3.9 | 49 | 7.7 | | 35 | 11.8 | |
| Newborn length of stay (days) | < 2 | 38329 | 53.3 | 315 | 49.3 | 0.2 | 131 | 44.1 | 0.7 |
| | 2–6 | 28946 | 40.3 | 244 | 38.2 | | 74 | 24.9 | |
| | 7+ | 4599 | 6.4 | 80 | 12.5 | | 92 | 31.0 | |
| Maternal age (years) | < 20 | 2859 | 4.0 | 21 | 3.3 | 0.1 | 13 | 4.4 | 0.2 |
| | 20–24 | 11752 | 16.4 | 88 | 13.8 | | 42 | 14.1 | |
| | 25–29 | 19226 | 26.8 | 155 | 24.3 | | 55 | 18.5 | |
| | 30–34 | 22377 | 31.1 | 220 | 34.4 | | 101 | 34.0 | |
| | 35–39 | 12433 | 17.3 | 125 | 19.6 | | 64 | 21.6 | |
| | 40+ | 3234 | 4.5 | 30 | 4.7 | | 22 | 7.4 | |
| Income/deprivation quintile | Most deprived | 27042 | 37.7 | 206 | 32.3 | 0.1 | 97 | 32.9 | 0.2 |
| | 2 | 16394 | 22.9 | 170 | 26.7 | | 86 | 29.2 | |
| | 3 | 13104 | 18.3 | 129 | 20.3 | | 58 | 19.7 | |
| | 4 | 9040 | 12.6 | 77 | 12.1 | | 37 | 12.5 | |
| | Most affluent | 6146 | 8.6 | 55 | 8.6 | | 17 | 5.8 | |
| | Missing | 155 | 0.2 | 2 | 0.3 | | 2 | 0.7 | |
Standardized differences of 0.2, 0.5 and 0.8 can be considered small, medium and large effect sizes, respectively.
St. diff, standardized differences; ICU, intensive care unit; wks, weeks.
a Delivery risk factors: hypoxia, amniotic fluid embolism, placental-transfusion syndrome, umbilical cord prolapse, chorioamnionitis, fetal haemorrhage, birth trauma, complications of delivery, umbilical cord problem.
b Pregnancy risk factors: eclampsia, gestational hypertension, diabetes, placental abruption or infarction.
c Neonatal medical conditions: congenital anomaly, perinatal infection, neonatal abstinence syndrome, respiratory distress syndrome.
d Quintiles of deprivation were derived from the Index of Multiple Deprivation (IMD) score based on patient postcode in HES.
* Missing: percentage of records with missing data (excluded from other category percentages).
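The standardized differences for binary characteristics can be approximately reproduced from the reported percentages. The sketch below assumes the conventional formula for two proportions, d = |p1 − p2| / sqrt([p1(1 − p1) + p2(1 − p2)] / 2); the source does not state which formula was used, and the function name `standardized_difference` is illustrative only.

```python
import math


def standardized_difference(p_ref: float, p_cmp: float) -> float:
    """Absolute standardized difference between two proportions (0-1 scale).

    Assumed formula (not stated in the source):
    |p_ref - p_cmp| / sqrt((p_ref(1 - p_ref) + p_cmp(1 - p_cmp)) / 2)
    """
    pooled_var = (p_ref * (1 - p_ref) + p_cmp * (1 - p_cmp)) / 2
    if pooled_var == 0:
        # Both groups are all-0 or all-1, so there is no difference to scale.
        return 0.0
    return abs(p_ref - p_cmp) / math.sqrt(pooled_var)


# Caesarean section: 25.1% of true matches vs 3.4% of missed matches.
print(round(standardized_difference(0.251, 0.034), 1))  # 0.7

# Multiple birth: 2.7% of true matches vs 19.7% of false matches.
print(round(standardized_difference(0.027, 0.197), 1))  # 0.6
```

Both values agree with the table after rounding. Because the inputs are themselves rounded percentages, small discrepancies are possible for rare characteristics such as stillbirth.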
Linkage success for a range of linkage criteria
| | Original probabilistic linkage (threshold weight = 20) | High-threshold probabilistic linkage (threshold weight = 45) | Deterministic linkage only |
|---|---|---|---|
| Linked records | 72520/72817 (99.6%) | 65020/72817 (89.3%) | 35324/72817 (48.5%) |
| Missed match rate | 297/72817 (0.4%) | 7797/72817 (10.7%) | 37493/72817 (51.5%) |
| False match rate | 636/72520 (0.9%) | 212/65020 (0.3%) | 22/35324 (0.1%) |
| Positive predictive value | 71884/72520 (99.1%) | 64808/65020 (99.7%) | 35302/35324 (99.9%) |
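The rates in this table follow directly from three counts per linkage procedure. The following is a minimal sketch that mirrors the numerators and denominators shown in the table (missed matches as unlinked records over all gold standard records; false matches and positive predictive value over linked records). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinkageCounts:
    gold_standard: int  # records in the gold standard cohort
    linked: int         # records linked by the procedure
    false_matches: int  # linked records whose link is incorrect


def linkage_metrics(c: LinkageCounts) -> dict[str, float]:
    """Percentages using the same denominators as the table."""
    return {
        "linked_pct": 100 * c.linked / c.gold_standard,
        "missed_match_pct": 100 * (c.gold_standard - c.linked) / c.gold_standard,
        "false_match_pct": 100 * c.false_matches / c.linked,
        "ppv_pct": 100 * (c.linked - c.false_matches) / c.linked,
    }


# Counts taken from the table.
procedures = {
    "original (W = 20)": LinkageCounts(72817, 72520, 636),
    "high threshold (W = 45)": LinkageCounts(72817, 65020, 212),
    "deterministic only": LinkageCounts(72817, 35324, 22),
}

for name, counts in procedures.items():
    rounded = {k: round(v, 1) for k, v in linkage_metrics(counts).items()}
    print(name, rounded)
```

Moving from the original to the high-threshold procedure lowers the false match rate from 0.9% to 0.3% but raises the missed match rate from 0.4% to 10.7%, which is the trade-off the sensitivity analysis is designed to expose.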
Comparison of outcome measures for a range of linkage criteria
| | Gold standard | Original probabilistic linkage | High-threshold probabilistic linkage | Deterministic linkage only |
|---|---|---|---|---|
| % Preterm births (95% CI) | 7.65 (7.45–7.86) | 7.64 (7.43–7.85) | 7.31 (7.11–7.53) | 7.43 (7.16–7.71) |
| % Stillbirths (95% CI) | 0.47 (0.42–0.52) | 0.46 (0.41–0.51) | 0.44 (0.39–0.49) | 0.45 (0.40–0.50) |
| Odds ratio (95% CI) for neonatal survival to discharge: mothers with delivery risk factors vs those without | 0.40 (0.17–0.95) | 0.42 (0.18–0.98) | 0.35 (0.15–0.79) | 0.52 (0.22–1.25) |
| Odds ratio (95% CI) for delivery risk factors: Black ethnicity vs White ethnicity | 0.98 (0.88–1.09) | 0.97 (0.87–1.08) | 0.89 (0.79–1.01) | 0.80 (0.66–0.96) |
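One simple way to read this table is to express each point estimate as a percentage difference from the gold standard value. The sketch below does this for the point estimates only; it is a descriptive summary added for illustration, not an analysis reported in the source, and the dictionary keys and function names are hypothetical.

```python
# Point estimates copied from the table (confidence intervals omitted).
ESTIMATES = {
    "% preterm births": {
        "gold": 7.65, "original": 7.64, "high_threshold": 7.31, "deterministic": 7.43,
    },
    "% stillbirths": {
        "gold": 0.47, "original": 0.46, "high_threshold": 0.44, "deterministic": 0.45,
    },
    "OR survival: delivery risk factors vs none": {
        "gold": 0.40, "original": 0.42, "high_threshold": 0.35, "deterministic": 0.52,
    },
    "OR delivery risk factors: Black vs White": {
        "gold": 0.98, "original": 0.97, "high_threshold": 0.89, "deterministic": 0.80,
    },
}


def relative_difference(estimate: float, gold: float) -> float:
    """Percentage difference of an estimate from the gold standard value."""
    return 100 * (estimate - gold) / gold


def summarise(estimates: dict) -> dict:
    """Relative difference (%) from gold standard for each linkage procedure."""
    summary = {}
    for outcome, values in estimates.items():
        gold = values["gold"]
        summary[outcome] = {
            name: round(relative_difference(value, gold), 1)
            for name, value in values.items()
            if name != "gold"
        }
    return summary


for outcome, diffs in summarise(ESTIMATES).items():
    print(outcome, diffs)
```

On this scale the original probabilistic linkage stays closest to the gold standard for the two prevalence measures and the ethnicity odds ratio, whereas deterministic linkage alone shifts the ethnicity odds ratio by roughly 18%. The confidence intervals in the table should still be consulted before interpreting any single difference.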
Figure 2. Number of linked records and percentage of missed matches and false matches for a range of linkage criteria. W = threshold used to classify links in probabilistic linkage.
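A threshold sweep of the kind summarised in Figure 2 can be sketched as follows. This assumes each gold standard record has one best candidate link with a probabilistic match weight and a known true/false status, and that every record has a true match somewhere in the other file. The weights below are invented toy values, not data from the study, and all names are hypothetical.

```python
from typing import NamedTuple


class CandidateLink(NamedTuple):
    weight: float        # probabilistic match weight (toy values)
    is_true_match: bool  # status known from the gold standard


def sweep_thresholds(links: list[CandidateLink], thresholds: list[float]) -> list[dict]:
    """For each threshold W, accept links with weight >= W and report error rates."""
    n_records = len(links)
    rows = []
    for w in thresholds:
        accepted = [link for link in links if link.weight >= w]
        false_links = sum(not link.is_true_match for link in accepted)
        rows.append({
            "W": w,
            "linked": len(accepted),
            # Unlinked records over all records, as in the linkage success table.
            "missed_match_pct": 100 * (n_records - len(accepted)) / n_records,
            # Incorrect links over accepted links.
            "false_match_pct": 100 * false_links / len(accepted) if accepted else 0.0,
        })
    return rows


# Ten toy records, one best candidate link each.
toy_links = [
    CandidateLink(50, True), CandidateLink(47, True), CandidateLink(41, True),
    CandidateLink(33, True), CandidateLink(28, False), CandidateLink(24, True),
    CandidateLink(21, True), CandidateLink(18, False), CandidateLink(12, True),
    CandidateLink(9, False),
]

for row in sweep_thresholds(toy_links, [20, 45]):
    print(row)
```

With these toy values, raising W from 20 to 45 removes the one false match among accepted links but leaves 8 of 10 records unlinked, the same qualitative pattern as in the figure.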