Ruth Gilbert, Rosemary Lafferty, Gareth Hagger-Johnson, Katie Harron, Li-Chun Zhang, Peter Smith, Chris Dibben, Harvey Goldstein.
Abstract
Record linkage of administrative and survey data is increasingly used to generate evidence to inform policy and services. Although linkage is a powerful and efficient way of generating new information from existing data sets, errors related to data processing before, during and after linkage can bias results. However, researchers and users of linked data rarely have access to information that can be used to assess these biases or take them into account in analyses. As linked administrative data are increasingly used to provide evidence to guide policy and services, linkage error, which disproportionately affects disadvantaged groups, can undermine evidence for public health. We convened a group of researchers and experts from government data providers to develop guidance about the information that needs to be made available about the data linkage process, by data providers, data linkers, analysts and the researchers who write reports. The guidance goes beyond recommendations for information to be included in research reports. Our aim is to raise awareness of information that may be required at each step of the linkage pathway to improve the transparency, reproducibility, and accuracy of linkage processes, and the validity of analyses and interpretation of results.
Year: 2018; PMID: 28369581; PMCID: PMC5896589; DOI: 10.1093/pubmed/fdx037
Source DB: PubMed; Journal: J Public Health (Oxf); ISSN: 1741-3842; Impact factor: 2.341
Fig. 1. Steps in the data linkage pathway.
GUILD guidance information to be shared before, during and after data linkage
| Item | Concept | Guidance |
|---|---|---|
| Step 1 | Data provision | |
| 1a | Population included in the data set | Data providers should give details of the population included in the data set (e.g. everyone registered with a GP), the geographic coverage of the data (e.g. England and Wales), the number of records in each source data set and how any ‘opt-outs’ were dealt with |
| 1b | Linkability of the data set | Details should be shared about how the data were generated (e.g. face-to-face), processed (e.g. a self-entered form or entered by an administrator) and quality controlled (e.g. manually checked), including how identifying characteristics were: |
| 1b(i) | – Collected and allocated | |
| 1b(ii) | – Updated as further personal data were collected, and dates of most recent updates | |
| 1b(iii) | – Checked and cleaned, including any validation rules | |
| 1b(iv) | – Replaced with artificial identifiers to reduce disclosure before being released for linkage | |
| Step 2 | Data linkage | |
| 2a | Descriptions of linkage processes | Data linkers should provide descriptions of how the linkage was done including: |
| 2a(i) | – A clear description of the data sources and identifying characteristics used for linkage, details of how identifiers were cleaned and validated before linkage, patterns of missingness, the expected range of values after cleaning, and how any de-duplication was performed. | |
| 2a(ii) | – Details of any transformation or replacement with artificial identifiers before linkage | |
| 2a(iii) | – A detailed description of the method (or algorithm) used for linkage, whether it was rule-based (e.g. deterministic) or score-based (e.g. probabilistic linkage), and how multiple linkages were handled | |
| 2a(iv) | – A detailed description of any new derived variables that were introduced during the linkage process (e.g. confidence level or probability of linkage or link score) | |
| 2a(v) | – Details of any blocking or grouping methods used for score-based linkage and how match scores were derived | |
| 2b | Record-level indicators of the linkage process | Data linkers should provide analysts with record-level indicators of the data linkage process to enable adjustments for linkage error in the analyses. Indicators could include the pass-ID (the step in a rule-based linkage process when a pair of records linked), or match scores (e.g. match weights used in probabilistic linkage) |
| 2c | Aggregate linkage results | Data linkers should make available descriptions, tables and flow diagrams depicting linkage accuracy for each linkage undertaken. These should include: |
| 2c(i) | – A description of the number of records that were linked and unlinked in each of the source files | |
| 2c(ii) | – A table comparing the aggregate characteristics of individuals in the linked and unlinked records for each source data set (defined by the analyst in agreement with the data linker) | |
| 2c(iii) | – A description of the ‘representativeness’ of the linked data set to each source data set, for example, including weights that can be applied to allow grossing up the linked data set to better represent the source data sets | |
| 2c(iv) | – A flow diagram to represent the steps in linkage and numbers involved at each step | |
| 2d | Generic reports of linkage accuracy | The data linker should report generic information about the quality of linkage carried out. This should include: |
| 2d(i) | – Estimates of linkage error rates based on regular quality monitoring of linkage accuracy. For example, measures of the sensitivity and specificity for the algorithm used | |
| 2d(ii) | – Details of how error rates were estimated, for example, by comparing linked records with a reference data set | |
| 2e | Descriptions of disclosure controls | Data linkers should describe any statistical disclosure controls used to reduce identifiability of linked data prior to release to data analysts |
| 2f | Overview of data linkage | Data linkers should establish systems to improve the quality of linkage studies, for example, by publishing a database detailing the data linkages undertaken with links to publications. The advisory and approvals structure for data linkage should include experts who can scrutinize the impact of linkage processes on results of analyses |
| Step 3 | Data analyses | Data analysts should assess and report on the quality of the linked data used for analyses |
| 3a | Account for linkage error | Analysts should report how analyses took into account linkage error, including: |
| 3a(i) | – How record-level indicators of the linkage process or aggregate measures reflecting linkage quality were used for adjustments, including underlying assumptions and methods used | |
| 3a(ii) | – Uncertainty analyses of the effects of linkage errors | |
| 3a(iii) | – Sensitivity analyses to determine the impact of assumptions used in the analyses | |
| Step 4 | Reporting study findings | Reports of linkage studies should, where possible, include items in Steps 1–3, building on the RECORD statement for research reports ( |
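
Items 2c(i) and 2d(i)–(ii) in the table above can be illustrated with a minimal sketch. It assumes a small reference ("gold standard") set of true matches is available, as item 2d(ii) suggests, and computes linked and unlinked counts per source file together with the sensitivity and specificity of the linkage. The function names, record IDs and toy data are hypothetical and are not taken from the guidance itself.

```python
from itertools import product


def link_counts(ids, linked_pairs, position):
    """Count linked and unlinked records in one source file (item 2c(i)).

    position is 0 for the first file in each pair, 1 for the second.
    """
    linked_ids = {pair[position] for pair in linked_pairs}
    n_linked = len(linked_ids & set(ids))
    return {"linked": n_linked, "unlinked": len(set(ids)) - n_linked}


def linkage_error_rates(ids_a, ids_b, linked_pairs, true_pairs):
    """Sensitivity and specificity against a reference set (item 2d).

    Every possible (a, b) pair is classified as a true/false link or
    a true/false non-link by comparing the linkage with the reference.
    """
    linked = set(linked_pairs)
    truth = set(true_pairs)
    all_pairs = set(product(ids_a, ids_b))

    tp = len(linked & truth)              # true matches that were linked
    fn = len(truth - linked)              # true matches that were missed
    fp = len(linked - truth)              # links that are not true matches
    tn = len(all_pairs - linked - truth)  # non-matches correctly left unlinked

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "sensitivity": sensitivity, "specificity": specificity}


# Hypothetical toy data: two source files and a reference set of true matches.
ids_a = ["a1", "a2", "a3", "a4"]
ids_b = ["b1", "b2", "b3"]
true_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
linked_pairs = [("a1", "b1"), ("a2", "b2"), ("a4", "b3")]  # one false link, one missed match

counts_a = link_counts(ids_a, linked_pairs, position=0)
counts_b = link_counts(ids_b, linked_pairs, position=1)
rates = linkage_error_rates(ids_a, ids_b, linked_pairs, true_pairs)

print(counts_a, counts_b)
print(rates)
```

Enumerating every possible pair is practical only for toy data like this. For real source files, which are usually far too large for that, error rates would more plausibly be estimated from a sample or a reference subset.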