Katie Harron1, James C Doidge2, Harvey Goldstein1,3.
Abstract
Background: Linkage of administrative data sources provides an efficient means of collecting detailed data on how individuals interact with cross-sectoral services, society, and the environment. These data can be used to supplement conventional cohort studies, or to create population-level electronic cohorts generated solely from administrative data. However, errors occurring during linkage (false matches/missed matches) can lead to bias in results from linked data.
Aim: This paper provides guidance on evaluating linkage quality in cohort studies.
Keywords: Cohort studies; administrative data; data linkage; measurement error; selection bias
Year: 2020 PMID: 32429765 PMCID: PMC7261400 DOI: 10.1080/03014460.2020.1742379
Source DB: PubMed Journal: Ann Hum Biol ISSN: 0301-4460 Impact factor: 1.533
Linkage accuracy.
| Assigned link status | True match status: match (pair from same individual) | True match status: non-match (pair from different individuals) |
|---|---|---|
| Link | True match (a) | False match (b) |
| Non-link | Missed match (c) | True non-match (d) |
Sensitivity (or recall) = a/(a + c); specificity = d/(b + d); positive predictive value (or precision) = a/(a + b); negative predictive value = d/(c + d).
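The four measures above can be computed directly from the cell counts. The following is a minimal sketch; the function name `linkage_accuracy` and the counts are hypothetical and chosen only to illustrate the arithmetic, with a, b, c and d denoting true matches, false matches, missed matches and true non-matches respectively.

```python
def linkage_accuracy(a, b, c, d):
    """Accuracy measures from a 2x2 table of link status by true match status.

    a = true matches, b = false matches,
    c = missed matches, d = true non-matches.
    """
    return {
        "sensitivity": a / (a + c),  # recall
        "specificity": d / (b + d),
        "ppv": a / (a + b),          # precision
        "npv": d / (c + d),
    }


# Hypothetical counts (not taken from the paper).
measures = linkage_accuracy(a=90, b=5, c=10, d=895)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

In practice the number of true non-matches (d) is often very large relative to the other cells, so specificity and negative predictive value tend to be close to 1 and are usually less informative than sensitivity and positive predictive value.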
Common linkage structures for combining data from two sources, one of which is a cohort study.*
| Linkage structure | Example | Purpose | Implications of a missed match | Implications of a false match |
|---|---|---|---|---|
| “Intersection” | The large circle represents a national dataset containing records of school attainment (e.g. the National Pupil Database in England) and the small circle represents a cohort study. The school database will include records for some individuals who are not cohort participants. Not all cohort participants may be captured in the school database (e.g. those who moved out of the country before starting school). Analysis is restricted to cohort participants with a linked school record. | To define the study population. | Exclusion from the study sample and potential selection bias (cohort participants without linked school records are excluded). | Measurement error or misclassification in any school variables obtained through linkage.** |
| “Master” | The large circle represents a cohort study and the small circle represents a disease registry. Linkage with the disease registry will be meaningfully interpreted as a cohort participant having the disease. The shaded area indicates that data from all cohort participants will be analysed. | To define exposure/outcome. | Misclassification of disease status, i.e. a cohort participant is erroneously classified as being disease-free. | Misclassification of disease status (if a cohort participant who does not have the disease is linked with the disease registry).*** |
| “Nested” | The large circle represents birth registration data and the small circle represents a cohort study. All cohort participants are expected to have a birth registration record, but the birth registration data will include some individuals who are not cohort participants. The cohort defines the analysis sample; participants who are linked with birth registration data have further information on variables of interest. | To add further information on variables of interest. | Missing data: no birth registration variables will be available for cohort participants without a linked record. | Measurement error or misclassification in any birth registration variables obtained through linkage.** |
Shaded areas represent the study sample for a particular research question. The relative size of the circles does not matter. We assume that the cohort sample is uniquely identified prior to linkage (i.e. there is only one record per participant) but that the linked data (e.g. administrative data) may contain multiple records per person.
*Although we have used cohort studies as an example here, this is not a general requirement for these linkage structures.
**If a false match is made to a record that (by chance) holds the same values of analysis variables as the true match, measurement error or misclassification would not occur.
***If a cohort participant who does have the disease is linked with the wrong registry record, this could lead to measurement error or misclassification in any other variables captured about the disease (e.g. stage or type of cancer).
Methods for estimating rates and distributions of linkage error.
| Method | Description | Example | Requirements |
|---|---|---|---|
| Manual review | Manual inspection of record pairs is used to make a decision about whether two records belong to the same individual or not, based on similarities of identifiers held in those records. Humans may recognise small differences between identifiers that may not have been fully captured in an automated linkage strategy (e.g. recognising that Beth is a derivative of Elizabeth, or that December 31 1999 is close to January 01 2000). | Manual review is routinely used at the Centre for Health Record Linkage (CHeReL; New South Wales Ministry of Health). (Centre for Health Record Linkage | Access to identifiers |
| Applying a linkage algorithm to a subset of (gold-standard) data | Testing a linkage strategy on a sample of data where the true match status is known can provide estimates of linkage error rates. “Gold-standard” or “training” datasets might come from a subset of data where a unique identifier is available, where manual review can be performed on a sample of data, or where external information is available. If a subsample is used, it should be representative of the quality of the main dataset. | Linkage of admission records for children in intensive care with laboratory records from infection surveillance systems. In this study, 2 of the 22 laboratories were able to provide high quality, complete and unique identifiers that were used to create a gold-standard subsample. (Harron et al. | Access to identifiers within a gold-standard or training dataset |
| Applying a linkage algorithm to “negative controls” | Testing a linkage strategy on a subset of data we are sure should not link (i.e. data from two unrelated populations) can be a convenient way of identifying false match rates. | Linking birth records to hospital records for pregnant women known to have had an abortive outcome (i.e. where no birth record should exist). (Paixão et al. | Access to identifiers |
| Identification of implausible scenarios | False matches can be identified in cases where (non-identifiable) information in the records means it is unlikely that two records belong to the same individual (e.g. a male patient being admitted for a caesarean section, or an admission following a death). In cases where we expect there to be a maximum of one match per record (e.g. 1:1 or many:1 matching), multiple matches per record will indicate one or more false matches. Identifying false matches in these ways can provide a minimal estimate of the false match rate. | Identifying false matches through implausible sequences of events in hospital data, e.g. multiple admissions on the same day in different parts of the country. (Hagger-Johnson et al. | Access to attribute data and knowledge about potential implausible scenarios |
| Comparison of linked and unlinked records | In a “Master” or “Nested” structure where we expect all cohort records to link, the number of missed matches can be estimated as the number of records that failed to link. | Comparing the characteristics of linked and unlinked maternal and baby hospital records in New South Wales. (Ford et al. | Access to unlinked records with attribute data |
| Comparison of records with high versus low quality identifiers | Records with missing or invalid identifiers may be less likely or even impossible to link in many applications, so comparing the distribution of identifier quality with respect to variables of interest can provide information about the minimum number of missed links (those with insufficient data for linkage) and the likely distribution of missed links with respect to variables of interest. | Comparison of records with and without a valid NHS number for linkage of tuberculosis case notifications and a laboratory database of all culture positive isolates from tuberculosis reference laboratories. (Aldridge et al. | Access to record-level or aggregate indicators of identifier quality and attribute data |
| Comparisons with external data sources | In situations where the expected number of links is not known | Comparing the characteristics of a cohort of linked mother-baby records with national published statistics on birth characteristics. (Harron et al. | Access to attribute data only |
More false matches might be present, but unidentified.
Technically, if there are false matches, there will also be additional missed matches (i.e. if a link is made to the wrong record, it will not appear as a missed match, but we will have missed the correct match).
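Two of the methods in the table reduce to simple counting: multiple links per record (where at most one is expected) give a minimal estimate of false matches, and unlinked cohort records (where every record is expected to link) give an estimate of missed matches. The sketch below illustrates this under stated assumptions; the function names, the record identifiers and the representation of links as `(cohort_id, admin_id)` pairs are all hypothetical.

```python
from collections import Counter


def minimal_false_match_rate(links):
    """Lower bound on the proportion of links that are false matches.

    Assumes at most one true link per cohort record, so a cohort record
    with k > 1 links contributes at least k - 1 false matches.
    links: list of (cohort_id, admin_id) pairs produced by the linkage.
    """
    links_per_record = Counter(cohort_id for cohort_id, _ in links)
    excess_links = sum(k - 1 for k in links_per_record.values())
    return excess_links / len(links)


def missed_match_rate(n_cohort, links):
    """Proportion of cohort records with no link at all.

    Assumes every cohort record is expected to link
    (a "Master" or "Nested" structure).
    """
    linked_ids = {cohort_id for cohort_id, _ in links}
    return (n_cohort - len(linked_ids)) / n_cohort


# Hypothetical linkage output for a cohort of 5 records.
links = [("c1", "r10"), ("c2", "r11"), ("c2", "r12"), ("c3", "r13")]
false_rate = minimal_false_match_rate(links)  # 1 excess link out of 4 links
missed_rate = missed_match_rate(5, links)     # 2 of 5 records unlinked
print(false_rate, missed_rate)
```

Both figures are lower bounds. Single links can still be false, and each false match also hides a missed match that the unlinked count does not capture.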
Quantitative bias analysis for linkage error in a cohort linked to a register of deaths.
| Link status | Known vital status: dead (“positive control”) | Known vital status: alive (“negative control”) |
|---|---|---|
| Linked | 275 | 23 |
| Not linked | 36 | 7535 |
| Sensitivity of survival classification | 275/(275 + 36) = 0.884 | |
| Specificity of survival classification | | 7535/(7535 + 23) = 0.997 |
Data reproduced from Moore et al. (2014). Note that sensitivity and specificity of classification are not equivalent to the sensitivity and specificity of linkage, which are unknown. Also note that, because the positive controls and negative controls would not be expected to form a representative sample of the cohort, it is not possible to calculate positive or negative predictive values from this table (i.e. rows cannot be summed).
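As an illustration of how these estimates might feed into a quantitative bias analysis, the sketch below recomputes the sensitivity and specificity from the table and applies a standard misclassification correction to an observed count of deaths. The correction formula is a generic one and is not necessarily the approach used by Moore et al.; the function name `corrected_deaths`, the cohort size and the observed number of linked deaths are hypothetical. It also assumes that the error rates estimated in the control subgroups apply to the whole cohort.

```python
# Counts from the table of subgroups with known vital status.
linked_dead, unlinked_dead = 275, 36
linked_alive, unlinked_alive = 23, 7535

sensitivity = linked_dead / (linked_dead + unlinked_dead)
specificity = unlinked_alive / (unlinked_alive + linked_alive)


def corrected_deaths(observed_deaths, n, se, sp):
    """Estimate the true number of deaths from the observed (linked) number.

    Solves observed = se * true + (1 - sp) * (n - true) for `true`.
    """
    return (observed_deaths - (1 - sp) * n) / (se + sp - 1)


# Hypothetical cohort: 10,000 participants, 500 linked to a death record.
estimated = corrected_deaths(500, 10_000, sensitivity, specificity)
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
print(f"bias-adjusted deaths: {estimated:.0f}")
```

Because sensitivity is well below 1 while specificity is close to 1, the adjusted count exceeds the observed count: missed matches dominate the error in this example.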
Imputation-based approaches for handling linkage error in analysis: an example based on 5 cohort records linked with cancer registry records with varying levels of certainty.
| Record | Sex | Age | SES | Cancer | Linkage certainty |
|---|---|---|---|---|---|
| 1 | Male | 55 | Low | Yes | Certain link |
| 2 | Male | 45 | High | Yes | Certain link |
| 3 | Female | 46 | High | No | Certain non-link |
| 4 | Male | 48 | Low | ? | Match weight = 15 |
| 5 | Female | 52 | High | ? | Match weight = 2 |
Records 1–3 are considered to have complete data; records 4 and 5 are considered to have missing or partially observed data. In multiple imputation, missing data for records 4 and 5 would be imputed based on the observed characteristics (sex, age, SES) and the relationship between these characteristics and the outcome (Cancer) in the complete records. In prior-informed imputation, the posterior distribution for the imputation would be informed, in addition, by the match weights in the candidate linking records (i.e. match weight = 15 for record 4 would provide more evidence of a match than match weight = 2 for record 5).
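To make concrete why a match weight of 15 carries more evidence than a match weight of 2, the sketch below converts match weights into match probabilities. This covers only the prior part of prior-informed imputation, not the imputation model itself, and it rests on two assumptions that are not stated in the source: that the match weight is a base-2 log likelihood ratio (as in Fellegi-Sunter style probabilistic linkage), and that the prior odds of a match for a candidate pair are 1 in 1000. The function name `match_probability` is hypothetical.

```python
def match_probability(match_weight, prior_odds):
    """Convert a match weight to a posterior match probability.

    Assumption: match_weight is a log2 likelihood ratio, so
    posterior odds = prior odds * 2 ** match_weight.
    """
    posterior_odds = prior_odds * 2 ** match_weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical prior odds of a match for any candidate record pair.
PRIOR_ODDS = 1 / 1000

p_record_4 = match_probability(15, PRIOR_ODDS)  # match weight = 15
p_record_5 = match_probability(2, PRIOR_ODDS)   # match weight = 2
print(f"record 4: {p_record_4:.3f}, record 5: {p_record_5:.3f}")
```

Under these assumptions the candidate link for record 4 is very likely to be a true match, whereas the candidate link for record 5 is very unlikely to be one, so the imputed cancer status for record 5 would be driven mainly by the relationship between sex, age, SES and cancer in the complete records.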