Boris P Hejblum, Griffin M Weber, Katherine P Liao, Nathan P Palmer, Susanne Churchill, Nancy A Shadick, Peter Szolovits, Shawn N Murphy, Isaac S Kohane, Tianxi Cai.
Abstract
We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies and no personal health identifiers such as name or date of birth are available. It relies on Bayesian modelling of binarized diagnosis codes, and provides a posterior probability of matching for each patient pair, while considering all the data at once. Both in our simulation study (using an administrative claims dataset for data generation) and in two real use-cases linking patient electronic health records from a large tertiary care network, our method exhibits good performance and compares favourably to the standard baseline Fellegi-Sunter algorithm. We propose a scalable, fast and efficient open-source implementation in the ludic R package available on CRAN, which also includes the anonymized diagnosis code data from our real use-case. This work suggests it is possible to link de-identified research databases stripped of any personal health identifiers using only diagnosis codes, provided sufficient information is shared between the data sources.
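To make the idea of a posterior match probability concrete, the sketch below scores a single pair of patients from binarized diagnosis codes. It is a deliberately simplified illustration and not the ludic implementation: it treats codes as independent and scores each pair in isolation rather than considering all the data at once. The function names (`binarize`, `pair_log_lr`, `posterior_match`), the toy code vocabulary, the prevalences, the discrepancy rate `eps` and the prior are all assumptions made for the example.

```python
import math

def binarize(codes, vocabulary):
    """Turn a patient's list of diagnosis codes into a 0/1 vector over the vocabulary."""
    present = set(codes)
    return [1 if code in present else 0 for code in vocabulary]

def pair_log_lr(a, b, prevalence, eps):
    """Log-likelihood ratio of 'same patient' versus 'different patients'.

    Assumed toy model (not the model from the paper): each code has marginal
    prevalence p in both datasets; for the same patient, a code present in one
    record is absent from the other with probability eps.
    Requires p * (1 + eps) < 1 for every code.
    """
    total = 0.0
    for x, y, p in zip(a, b, prevalence):
        if x == 1 and y == 1:        # code present in both records
            match, non_match = p * (1 - eps), p * p
        elif x == 0 and y == 0:      # code absent from both records
            match, non_match = 1 - p * (1 + eps), (1 - p) ** 2
        else:                        # code present in only one record
            match, non_match = p * eps, p * (1 - p)
        total += math.log(match / non_match)
    return total

def posterior_match(a, b, prevalence, eps, prior):
    """Posterior match probability via Bayes' rule on the log-odds scale."""
    log_odds = math.log(prior / (1 - prior)) + pair_log_lr(a, b, prevalence, eps)
    if log_odds >= 0:                # numerically stable logistic function
        return 1 / (1 + math.exp(-log_odds))
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# Hypothetical vocabulary and prevalences, chosen only for illustration.
vocabulary = ["714.0", "250.00", "401.9", "724.2"]
prevalence = [0.02, 0.10, 0.30, 0.05]

patient_a = binarize(["714.0", "250.00", "724.2"], vocabulary)
patient_b = binarize(["724.2", "714.0", "250.00"], vocabulary)
patient_c = binarize(["401.9"], vocabulary)

p_same = posterior_match(patient_a, patient_b, prevalence, eps=0.05, prior=0.001)
p_diff = posterior_match(patient_a, patient_c, prevalence, eps=0.05, prior=0.001)
print(f"similar profiles:   {p_same:.3f}")
print(f"different profiles: {p_diff:.3g}")
```

In this toy model, agreement on a rare code raises the posterior far more than agreement on a common one, and each discordant code lowers it. A cutoff on the posterior can then be used to declare a pair a match.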
Year: 2019 PMID: 30620344 PMCID: PMC6326114 DOI: 10.1038/sdata.2018.298
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Matching accuracy on simulated data under various settings.
(a) Impact of the discordance between the two datasets. (b) Impact of rare codes. (c) Impact of the proportion of overlapping patients between the two datasets. The figure shows matching performance across the simulation scenarios in terms of True Positive Rate (TPR) and Positive Predictive Value (PPV). F-S refers to the Fellegi-Sunter method, while ludic denotes our proposed Bayesian approach.
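The two metrics named in this caption can be computed from a set of declared links and a set of reference links. The following is a minimal sketch; the helper name `tpr_ppv` and the toy patient identifiers are hypothetical and only serve the example.

```python
def tpr_ppv(predicted, truth):
    """Return (TPR, PPV) for predicted links against reference links.

    Each link is a pair (id_in_dataset_A, id_in_dataset_B).
    TPR = true positives / number of true links
    PPV = true positives / number of predicted links
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    tpr = true_positives / len(truth) if truth else float("nan")
    ppv = true_positives / len(predicted) if predicted else float("nan")
    return tpr, ppv

# Toy data: four true links, three declared links, two of them correct.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}

tpr, ppv = tpr_ppv(predicted, truth)
print(f"TPR = {tpr:.2f}, PPV = {ppv:.2f}")
```

A method that declares very many links can reach a high TPR while its PPV collapses, which is why the two are reported together.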
Figure 2. Matching accuracy according to ε+.
The figure shows matching performance for various values of the hyper-parameter ε+, which models the probability of discrepancy between the two datasets. The top panel displays results when the true value of ε+ is 3%, while the bottom panel is for a true value of 20%.
Characteristics of the real datasets.
| Dataset | Time span (a) | Number of patients with at least 1 diagnosis code | Number of diagnosis codes recorded at least once | Average number of diagnoses per patient | Average number of diagnoses per patient among silver standard matches |
|---|---|---|---|---|---|
| RA1 | 6 years | 26,681 | 7,868 | 30.2 | 33.6 |
| RA2 | 6 years | 5,707 | 4,981 | 29.0 | 33.3 |
| RA2 | 11 years | 6,394 | 6,086 | 44.2 | 54.0 |

(a) The 6-year time span includes codes from 1/1/2002 through 12/31/2007, while the 11-year time span includes codes from 1/1/2002 through 12/31/2012.
Figure 3. Histogram of diagnosis code prevalence in the RA datasets.
Matching performance on the two real use-case datasets.
| Data time span | Matching method | Number of matches | TPR (a) | PPV (a) | Computing time |
|---|---|---|---|---|---|
| 6 years | 0.5 cutoff | 4,369 | 0.93 | 0.81 | 96 s (b) |
| 6 years | 0.9 cutoff | 4,179 | 0.91 | 0.84 | 96 s (b) |
| 6 years | F-S blocked | 2,594,443 | 0.81 | <0.01 | 49 min (b) |
| 6 years | F-S blocked 1-1 | 5,696 | 0.38 | 0.26 | 49 min (b) |
| 6 years | F-S | n/a | n/a | n/a | > 4 days (c) |
| 6 years | F-S 1-1 | n/a | n/a | n/a | > 4 days (c) |
| 11 years | 0.5 cutoff | 4,043 | 0.84 | 0.80 | 96 s (b) |
| 11 years | 0.9 cutoff | 3,625 | 0.80 | 0.84 | 96 s (b) |
| 11 years | F-S blocked | 2,898,367 | 0.80 | <0.01 | 62 min (b) |
| 11 years | F-S blocked 1-1 | 6,356 | 0.29 | 0.17 | 62 min (b) |
| 11 years | F-S | n/a | n/a | n/a | > 4 days (c) |
| 11 years | F-S 1-1 | n/a | n/a | n/a | > 4 days (c) |

(a) Based on the 3,831 silver standard true matches.
(b) Using a 3.5 GHz Intel Core i7 processor with 32 GB of memory available.
(c) Using a 3.6 GHz Intel Xeon 5600 series processor with 96 GB of memory available.
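
The columns of this table are tied together by simple arithmetic: if TPR and PPV are both computed against the same 3,831 silver standard matches, the number of true positives is roughly TPR × 3,831 and the PPV is that count divided by the number of declared matches. The sketch below recomputes the implied PPV for each completed row as a sanity check. The variable and function names are hypothetical, and because the reported TPR is rounded to two decimals the implied values are only approximate.

```python
SILVER_STANDARD_MATCHES = 3831  # number of silver standard true matches (table footnote)

# (time span, method, number of matches, reported TPR, reported PPV)
# A reported PPV of None stands for the "<0.01" entries.
rows = [
    ("6 years", "0.5 cutoff", 4369, 0.93, 0.81),
    ("6 years", "0.9 cutoff", 4179, 0.91, 0.84),
    ("6 years", "F-S blocked", 2594443, 0.81, None),
    ("6 years", "F-S blocked 1-1", 5696, 0.38, 0.26),
    ("11 years", "0.5 cutoff", 4043, 0.84, 0.80),
    ("11 years", "0.9 cutoff", 3625, 0.80, 0.84),
    ("11 years", "F-S blocked", 2898367, 0.80, None),
    ("11 years", "F-S blocked 1-1", 6356, 0.29, 0.17),
]

def implied_ppv(n_matches, tpr, n_true=SILVER_STANDARD_MATCHES):
    """PPV implied by a reported TPR: estimated true positives / declared matches."""
    true_positives = tpr * n_true
    return true_positives / n_matches

for span, method, n_matches, tpr, ppv in rows:
    estimate = implied_ppv(n_matches, tpr)
    reported = "<0.01" if ppv is None else f"{ppv:.2f}"
    print(f"{span:>8}  {method:<16} implied PPV {estimate:.3f}  reported {reported}")
```

Under these assumptions the implied values agree with the reported PPV to within about 0.01. The very low PPV of the blocked F-S rows follows directly from the millions of declared matches set against fewer than four thousand true ones.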
Figure 4. Workflow of the linkage algorithm.