Katie Harron, Chris Dibben, James Boyd, Anders Hjern, Mahmoud Azimaee, Mauricio L Barreto, Harvey Goldstein.
Abstract
Linkage of population-based administrative data is a valuable tool for combining detailed individual-level information from different sources for research. While not a substitute for classical studies based on primary data collection, analyses of linked administrative data can answer questions that require large sample sizes or detailed data on hard-to-reach populations, and generate evidence with a high level of external validity and applicability for policy making. There are unique challenges in the appropriate research use of linked administrative data, for example with respect to bias from linkage errors where records cannot be linked or are linked together incorrectly. For confidentiality and other reasons, the separation of data linkage processes and analysis of linked data is generally regarded as best practice. However, the 'black box' of data linkage can make it difficult for researchers to judge the reliability of the resulting linked data for their required purposes. This article aims to provide an overview of challenges in linking administrative data for research. We aim to increase understanding of the implications of (i) the data linkage environment and privacy preservation; (ii) the linkage process itself (including data preparation, and deterministic and probabilistic linkage methods) and (iii) linkage quality and potential bias in linked data. We draw on examples from a number of countries to illustrate a range of approaches for data linkage in different contexts.
Keywords: Data linkage; data accuracy; administrative data; epidemiological studies; measurement error; record linkage; selection bias
Year: 2017 PMID: 30381794 PMCID: PMC6187070 DOI: 10.1177/2053951717745678
Source DB: PubMed Journal: Big Data Soc ISSN: 2053-9517
Considerations for safe data linkage environments.
| Context | Key points |
|---|---|
| Data access approvals | • Comprehensive approvals processes typically check that: |
| | ◦ There is a legal basis for data access |
| | ◦ There are appropriate security arrangements |
| | ◦ Data are used only for a specified purpose, are kept only for a specified length of time, and are not further disclosed |
| | ◦ The requesting institution has appropriate credentials |
| | ◦ The ethics of the proposed study have been properly scrutinised |
| Researcher requirements | • Researchers have a responsibility, often laid out in terms of use, to use data for bona fide purposes only |
| | • Researchers should receive regular training in information governance |
| | • Legal sanctions are in place where data are used inappropriately or without due care |
| Physical or virtual setting | • Secure physical, or virtual, locations established for the processing and linkage of personal or potentially identifiable data, characterised by: |
| | ◦ Strict access arrangements |
| | ◦ Secure data transfer processes |
| | ◦ Restricted network and/or internet access |
| | • Tight disclosure control procedures |
| | ◦ For example, aggregate data only, suppression of small cell sizes (e.g. < 5), k-anonymity |
| | • Help protect against outsider attacks or coercion |
| | • Provide tangible reassurance on data security to the public |
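
The disclosure control examples in the table (suppression of small cell sizes and k-anonymity) can be illustrated with a minimal Python sketch. The function names, the record layout (a list of dictionaries) and the default threshold of 5 are assumptions made for illustration; they are not taken from the article, and real disclosure control rules are set by the data provider.

```python
from collections import Counter


def suppress_small_cells(values, threshold=5):
    """Tabulate values, masking any cell whose count is below `threshold`.

    Hypothetical helper: suppressed cells are returned as None so that
    small counts are never released.
    """
    counts = Counter(values)
    return {key: (n if n >= threshold else None) for key, n in counts.items()}


def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values that occurs
    in `records` is shared by at least `k` records.

    `records` is assumed to be a list of dicts; `quasi_identifiers` is a
    list of keys (for example age band and postcode district).
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return all(n >= k for n in groups.values())
```

For instance, tabulating seven records in category `"A"` and three in category `"B"` with the default threshold would release the count for `"A"` and mask the count for `"B"`.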
Linkage error.
| Link status | Match status: match (pair from same individual) | Match status: non-match (pair from different individuals) |
|---|---|---|
| Link | Identified match | False-match |
| Non-link | Missed-match | Identified non-match |
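
As a rough illustration of how the cells of this table could be counted when the true match status is known, the sketch below compares a set of links with a set of true matches. The pair representation (tuples of record identifiers) and the function names are hypothetical. The rate definitions are also an assumption: here the false-match rate is the share of links that are false-matches, and the missed-match rate is the share of true matches that were not linked; other definitions are in use.

```python
def classify_pairs(links, true_matches):
    """Count three cells of the linkage error table.

    `links` and `true_matches` are iterables of (id_a, id_b) pairs.
    Identified non-matches are omitted because counting them requires
    the full set of compared pairs.
    """
    links, true_matches = set(links), set(true_matches)
    return {
        "identified_match": len(links & true_matches),
        "false_match": len(links - true_matches),
        "missed_match": len(true_matches - links),
    }


def error_rates(cells):
    """Derive error rates from the cell counts (definitions assumed above).

    Returns None for a rate whose denominator is zero.
    """
    n_links = cells["identified_match"] + cells["false_match"]
    n_matches = cells["identified_match"] + cells["missed_match"]
    return {
        "false_match_rate": cells["false_match"] / n_links if n_links else None,
        "missed_match_rate": cells["missed_match"] / n_matches if n_matches else None,
    }
```

Keeping the counting step separate from the rate step means the same cell counts can be reused with a different rate definition if a study requires one.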
Evaluating linkage quality.
| Approach | Key points |
|---|---|
| ‘Gold standard’ or reference data | • Data where the true match status is known, used to test linkage algorithms and estimate rates of linkage error. |
| | • Typically based on a subsample of records that have been manually reviewed, an additional data source with complete identifiers, a representative synthetic dataset, or external reference rates for the population of interest |
| | ◦ For example, comparison of mortality rates based on linkage of death registrations versus national figures |
| Post-linkage data validation | • Used to estimate minimum false-match rates by identifying implausible scenarios within the data. |
| | ◦ For example, linkage of a hospital admission record following a known date of death could indicate a false-match; as could linkage of multiple death records to a single census record |
| Sensitivity analyses | • Used to assess the extent to which results vary according to different linkage criteria. |
| | • Could involve changing the linkage algorithm or changing the threshold within probabilistic linkage, and re-running analyses to evaluate any impact on results |
| | ◦ For example, comparing results over a range of match weights could help identify the direction of the effect of linkage errors on outcomes of interest |
| Comparing characteristics of linked and unlinked data | • Used to identify any differences in linkage rates for different subgroups of individuals. |
| | ◦ For example, comparing rates of preterm birth in linked and unlinked maternity records |
| | • Where not all records are expected to match, distributions of variables in the linked data can be compared to external sources (e.g. age and/or ethnic group distributions from national census data) to explore any evidence of selection bias |
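
Three of the approaches in the table lend themselves to a short sketch: post-linkage validation, sensitivity analysis over match-weight thresholds, and comparison of linkage rates across subgroups. The code below is illustrative only. All field names (`admission_date`, `death_date`, `weight`, `outcome`, `linked`) and function names are hypothetical, and the outcome is assumed to be coded 0 or 1.

```python
from collections import defaultdict


def minimum_false_match_rate(links):
    """Post-linkage validation: share of links in which a hospital
    admission is dated after the linked date of death.

    Each link is a dict with `admission_date` and `death_date`
    (None if no death is recorded). Returns None if there are no links.
    """
    if not links:
        return None
    implausible = sum(
        1
        for link in links
        if link["death_date"] is not None
        and link["admission_date"] > link["death_date"]
    )
    return implausible / len(links)


def outcome_rate_by_threshold(candidate_pairs, thresholds):
    """Sensitivity analysis: re-estimate an outcome proportion when only
    pairs with a match weight at or above each threshold are accepted."""
    results = {}
    for threshold in thresholds:
        accepted = [p for p in candidate_pairs if p["weight"] >= threshold]
        results[threshold] = (
            sum(p["outcome"] for p in accepted) / len(accepted) if accepted else None
        )
    return results


def linkage_rate_by_group(records, group_key):
    """Proportion of records that linked, within each subgroup."""
    totals, linked = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record[group_key]] += 1
        linked[record[group_key]] += bool(record["linked"])
    return {group: linked[group] / totals[group] for group in totals}
```

A marked change in the estimated outcome as the threshold moves, or a lower linkage rate in one subgroup (for example preterm births), would be a prompt to investigate possible bias rather than proof of it.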