Alan F Karr, Matthew T Taylor, Suzanne L West, Soko Setoguchi, Tzuyung D Kou, Tobias Gerhard, Daniel B Horton.
Abstract
Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used exact and inexact matching methods. We compared the weights assigned to record pairs and evaluated how matching weights corresponded to a gold standard, medical record number. Deduplicated datasets contained 69,523 inpatient and 176,154 outpatient records, respectively. Linkage runs blocking on DOB produced weights ranging in number from 8 for exact matching to 64,273 for inexact matching. Linkage runs blocking on YOB produced 8 to 916,806 weights. Exact matching matched record pairs with identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group, but algorithms differentially prioritized certain variables. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04-93.36%) and positive predictive value (PPV) (range 86.67-97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV of highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. In summary, we found few differences in the rankings of record pairs with the highest matching weights across 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching. 
Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms.
Year: 2019 PMID: 31550255 PMCID: PMC6759179 DOI: 10.1371/journal.pone.0221459
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Demographic characteristics by dataset.
| | Inpatient | Outpatient |
|---|---|---|
| Age, N (%) | ||
| <1 | 8,185 (11.8%) | 6,127 (3.5%) |
| 1‒9 | 3,123 (4.5%) | 21,747 (12.3%) |
| 10‒19 | 3,502 (5.0%) | 24,105 (13.7%) |
| 20‒29 | 6,004 (8.6%) | 17,197 (9.8%) |
| 30‒39 | 6,900 (9.9%) | 19,992 (11.3%) |
| 40‒49 | 6,058 (8.7%) | 20,306 (11.5%) |
| 50‒59 | 8,968 (12.9%) | 24,825 (14.1%) |
| 60‒69 | 9,861 (14.2%) | 21,505 (12.2%) |
| 70‒79 | 8,412 (12.1%) | 13,063 (7.4%) |
| 80‒89 | 6,654 (9.6%) | 6,241 (3.5%) |
| 90+ | 1,856 (2.7%) | 1,046 (0.6%) |
| Gender, N (%) | ||
| Female | 36,753 (52.9%) | 100,238 (56.9%) |
| Male | 32,770 (47.1%) | 75,911 (43.1%) |
| NA | 0 | 5 (0.003%) |
| Race, N (%) | ||
| Asian | 6,533 (9.4%) | 15,397 (8.7%) |
| Black | 9,506 (13.7%) | 22,607 (12.8%) |
| Other | 14,242 (20.5%) | 1,149 (0.7%) |
| White | 38,619 (55.5%) | 95,372 (54.2%) |
| NA | 623 (0.9%) | 41,629 (23.6%) |
| First name, unique values | 13,221 | 27,232 |
| Last name, unique values | 31,103 | 60,014 |
| Inpatient medical record number, unique values | 69,091 | 138,156 |
| Inpatient medical record number, missing values | 0 | 37,975 |
Summary of weights produced by record linkage using DOB as the blocking variable.
| Linkage Run Name | String Matching | Weight Determination | Number of Weights | Minimum Weight | Maximum Weight | Pairs with Highest Weight | Pairs with Second Highest Weight | Pairs with Lowest Weight | Pairs with Second Lowest Weight |
|---|---|---|---|---|---|---|---|---|---|
| R/EX/FS | Exact | Prob-FS | 8 | -12.3808 | 32.35027 | 30,536 | 24 | 176,066 | 189,273 |
| R/EX/EM | Exact | Prob-EM | 8 | -17.7754 | 25.82314 | 30,536 | 24 | 176,066 | 189,273 |
| R/EX/EPI | Exact | Prob-EPI | 8 | 0 | 1 | 30,536 | 24 | 176,066 | 189,273 |
| MTB/EX/FS | Exact | Prob-FS | 9 | -12.3808 | 32.35027 | 30,536 | 24 | 176,056 | 10 |
| MTB/EX/EM | Exact | Prob-EM | 9 | -17.7681 | 25.82333 | 30,536 | 24 | 176,056 | 10 |
| MTB/EX/D | Exact | Det | 4 | 0 | 3 | 30,536 | 4,445 | 176,066 | 189,443 |
| CU/EX/FS | Exact | Prob-FS | 9 | -12.3808 | 32.35027 | 30,536 | 24 | 176,048 | 10 |
| LP/EX/FS | Exact | Prob-FS | 9 | -7.53877 | 12.78385 | 30,536 | 24 | 176,056 | 10 |
| LP/EX/EM | Exact | Prob-EM | 9 | -7.53877 | 12.78385 | 30,536 | 24 | 176,056 | 10 |
| R/INEX/FS | Inexact | Prob-FS | 8 | -12.3808 | 32.35027 | 31,619 | 25 | 176,018 | 189,095 |
| R/INEX/EM | Inexact | Prob-EM | 121 | -18.5957 | 22.80771 | 30,536 | 3 | 176,018 | 189,095 |
| MTB/INEX/FS | Inexact | Prob-FS | 64,273 | -12.3808 | 32.35027 | 30,536 | 3 | 5,691 | 1 |
| MTB/INEX/EM | Inexact | Prob-EM | 64,273 | -17.7789 | 22.76115 | 30,536 | 3 | 5,691 | 1 |
| MTB/INEX/D | Inexact | Det | 24,603 | 0 | 3 | 30,536 | 3 | 5,692 | 1 |
| CU/INEX/FS | Inexact | Prob-FS | 1,492 | -12.3808 | 32.35027 | 30,554 | 3 | 173,105 | 2 |
| LP/INEX/FS | Inexact | Prob-FS | 37 | 2.1 | 15.8 | 15 | 3 | a | a |
| LP/INEX/EM | Inexact | Prob-EM | 37 | 2.1 | 15.8 | 15 | 3 | a | a |
DOB, date of birth; R, R package; MTB, Merge ToolBox; CU, Curtin University Probabilistic Linkage Engine; LP, Link Plus; Prob-FS, probabilistic, Fellegi-Sunter; Prob-EM, probabilistic, expectation-maximization; Prob-EPI, probabilistic, EpiLink; Det, deterministic.
a We were unable to recover negative weights for Link Plus with inexact string matching.
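The weight ranges in the table above differ by run (for example, roughly -12 to 32 for Fellegi-Sunter, 0 to 1 for EpiLink, and 0 to 3 for the deterministic run), so weights have to be placed on a common scale before they can be averaged into an ensemble. The abstract does not say how the weights were scaled. The sketch below assumes min-max scaling to [0, 1] followed by a plain mean; the record pair identifiers and the middle weight values are hypothetical.

```python
from statistics import mean

def min_max_scale(weights):
    """Rescale a {pair: weight} mapping to the [0, 1] interval."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:  # degenerate run with a single distinct weight
        return {pair: 0.0 for pair in weights}
    return {pair: (w - lo) / (hi - lo) for pair, w in weights.items()}

def ensemble_weights(runs):
    """Average the scaled weights over the pairs scored by every run."""
    scaled = [min_max_scale(w) for w in runs.values()]
    shared = set.intersection(*(set(s) for s in scaled))
    return {pair: mean(s[pair] for s in scaled) for pair in shared}

# Hypothetical record pairs (inpatient id, outpatient id).
A, B, C = ("in1", "out1"), ("in1", "out2"), ("in2", "out2")

# Minimum and maximum weights are taken from the table; the weights
# assigned to pair C are made up for illustration.
runs = {
    "R/EX/FS": {A: 32.35027, B: -12.3808, C: 8.5},
    "R/EX/EPI": {A: 1.0, B: 0.0, C: 0.6},
    "MTB/EX/D": {A: 3, B: 0, C: 2},
}

combined = ensemble_weights(runs)
for pair, w in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(w, 3))
```

Min-max scaling preserves each run's ranking of pairs, so the ensemble changes the order only where the runs disagree with one another.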
Agreement on matching variables for runs with exact string matching, blocking on DOB.
| Weight Rank | R/EX/FS | R/EX/EM | R/EX/EPI | MTB/EX/FS | MTB/EX/EM | CU/EX/FS |
|---|---|---|---|---|---|---|
| 1 | First, Last, Gender | First, Last, Gender | First, Last, Gender | First, Last, Gender | First, Last, Gender | First, Last, Gender |
| 2 | First, Last | First, Last | First, Last | First, Last | First, Last | First, Last |
| 3 | Last, Gender | First, Gender | Last, Gender | Last, Gender | First, Gender | Last, Gender |
| 4 | First, Gender | Last, Gender | First, Gender | First, Gender | Last, Gender | First, Gender |
| 5 | Last | First | Last | Last | First | Last |
| 6 | First | Last | First | First | Last | First |
| 7 | Gender | Gender | Gender | Gender | Gender | Gender |
| 8 | None | None | None | None, gender missing | None, gender missing | None, gender missing |
| 9 | N/A | N/A | N/A | None, no matching variables missing | None, no matching variables missing | None, no matching variables missing |
DOB, date of birth; R, R package; FS, probabilistic, Fellegi-Sunter; EM, probabilistic, expectation-maximization; EPI, probabilistic, EpiLink; MTB, Merge ToolBox; CU, Curtin University Probabilistic Linkage Engine
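With blocking on DOB and exact comparison of first name, last name, and gender, each candidate pair falls into one of 2^3 = 8 agreement patterns, which is why the exact runs in the table above have 8 base ranks. The following sketch shows how Fellegi-Sunter log-ratio weights order those patterns. The m- and u-probabilities are hypothetical values chosen for illustration, not estimates from the study.

```python
from itertools import product
from math import log2

# Hypothetical probabilities per field:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
M_U = {
    "first": (0.95, 0.01),
    "last": (0.96, 0.005),
    "gender": (0.99, 0.5),
}

def fs_weight(pattern):
    """Sum the agreement or disagreement log-ratio for each field."""
    total = 0.0
    for field, agrees in pattern.items():
        m, u = M_U[field]
        total += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return total

# All 8 agree/disagree combinations of the three matching variables.
patterns = [
    dict(zip(M_U, bits)) for bits in product([True, False], repeat=len(M_U))
]
ranked = sorted(patterns, key=fs_weight, reverse=True)

for rank, pattern in enumerate(ranked, start=1):
    agreeing = [f for f, a in pattern.items() if a] or ["none"]
    print(rank, ", ".join(agreeing), round(fs_weight(pattern), 2))
```

Under these assumed values, last name is more discriminating than first name, and the resulting order matches the R/EX/FS column. Swapping the values for first and last name would swap ranks 3 and 4 and ranks 5 and 6, giving the order shown in the EM columns.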
Fig 1. Correlation matrix for the 17 sets of weights.
EX, exact string matching; INEX, inexact string matching; R, R package; MTB, Merge ToolBox; CU, Curtin University Probabilistic Linkage Engine; LP, Link Plus; FS, probabilistic, Fellegi-Sunter; EM, probabilistic, expectation-maximization; EPI, probabilistic, EpiLink; D, deterministic. The rows and columns are ordered so that runs using exact methods are at the top and left.
Agreement with gold standard among records with the highest weights, blocking on DOB.
| Linkage Run Name | String Matching | Weight Determination | Pairs with Highest Weight, No Agreement with Inpatient MRN, N (%) | Pairs with Highest Weight, Agreement with Inpatient MRN, N (%) | Pairs with First or Second Highest Weight, No Agreement with Inpatient MRN, N (%) | Pairs with First or Second Highest Weight, Agreement with Inpatient MRN, N (%) |
|---|---|---|---|---|---|---|
| R/EX/FS | Exact | Prob-FS | 809 (2.6) | 29,727 (97.4) | 814 (2.7) | 29,746 (97.3) |
| R/EX/EM | Exact | Prob-EM | 809 (2.6) | 29,727 (97.4) | 814 (2.7) | 29,746 (97.3) |
| R/EX/EPI | Exact | Prob-EPI | 809 (2.6) | 29,727 (97.4) | 814 (2.7) | 29,746 (97.3) |
| MTB/EX/FS | Exact | Prob-FS | 809 (2.6) | 29,727 (97.4) | 814 (2.7) | 29,746 (97.3) |
| MTB/EX/EM | Exact | Prob-EM | 809 (2.6) | 29,727 (97.4) | 814 (2.7) | 29,746 (97.3) |
| MTB/EX/D | Exact | Det | 809 (2.6) | 29,727 (97.4) | 2,597 (7.4) | 32,384 (92.6) |
| CU/EX/FS | Exact | Prob-FS | 809 (2.6) | 29,727 (97.4) | 814 (2.7) | 29,746 (97.3) |
| LP/EX/FS | Exact | Prob-FS | 809 (2.6) | 29,727 (97.4) | 814 (2.7) | 29,746 (97.3) |
| LP/EX/EM | Exact | Prob-EM | 809 (2.6) | 29,727 (97.4) | 814 (2.7) | 29,746 (97.3) |
| R/INEX/FS | Inexact | Prob-FS | 945 (3.0) | 30,674 (97.0) | 951 (3.0) | 30,693 (97.0) |
| R/INEX/EM | Inexact | Prob-EM | 809 (2.6) | 29,727 (97.4) | 809 (2.6) | 29,730 (97.4) |
| MTB/INEX/FS | Inexact | Prob-FS | 809 (2.6) | 29,727 (97.4) | 809 (2.6) | 29,730 (97.4) |
| MTB/INEX/EM | Inexact | Prob-EM | 809 (2.6) | 29,727 (97.4) | 809 (2.6) | 29,730 (97.4) |
| MTB/INEX/D | Inexact | Det | 809 (2.6) | 29,727 (97.4) | 809 (2.6) | 29,730 (97.4) |
| CU/INEX/FS | Inexact | Prob-FS | 816 (2.7) | 29,738 (97.3) | 816 (2.7) | 29,741 (97.3) |
| LP/INEX/FS | Inexact | Prob-FS | 2 (13.3) | 13 (86.7) | 3 (16.7) | 15 (83.3) |
| LP/INEX/EM | Inexact | Prob-EM | 2 (13.3) | 13 (86.7) | 3 (16.7) | 15 (83.3) |
DOB, date of birth; MRN, medical record number; R, R package; MTB, Merge ToolBox; CU, Curtin University Probabilistic Linkage Engine; LP, Link Plus; Prob-FS, probabilistic, Fellegi-Sunter; Prob-EM, probabilistic, expectation-maximization; Prob-EPI, probabilistic, EpiLink; Det, deterministic.
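The positive predictive values quoted in the abstract for the most highly ranked pairs can be recomputed from the counts in this table as agreeing pairs divided by all pairs with the highest weight. The short sketch below does that arithmetic for a few rows. The function and variable names are illustrative.

```python
def ppv(agree, disagree):
    """Positive predictive value: share of top-weight pairs that agree with the MRN."""
    return agree / (agree + disagree)

# (agree with inpatient MRN, disagree) among pairs with the highest weight,
# copied from the table above.
top_weight_counts = {
    "exact runs": (29_727, 809),
    "R/INEX/FS": (30_674, 945),
    "CU/INEX/FS": (29_738, 816),
    "LP/INEX/FS": (13, 2),
}

for run, (agree, disagree) in top_weight_counts.items():
    print(f"{run}: PPV = {100 * ppv(agree, disagree):.2f}%")
```

The exact runs give 97.35% and Link Plus with inexact matching gives 86.67%, the two ends of the PPV range reported in the abstract. Sensitivity cannot be recomputed the same way, because the total number of true matches is not listed in these tables.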
Agreement with gold standard among records with the lowest weights, blocking on DOB.
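| Linkage Run Name | String Matching | Weight Determination | Pairs with Lowest Weight, No Agreement with Inpatient MRN, N (%) | Pairs with Lowest Weight, Agreement with Inpatient MRN, N (%) | Pairs with First or Second Lowest Weight, No Agreement with Inpatient MRN, N (%) | Pairs with First or Second Lowest Weight, Agreement with Inpatient MRN, N (%) |
|---|---|---|---|---|---|---|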
| R/EX/FS | Exact | Prob-FS | 176,066 (100) | 0 (0) | 364,871 (99.9) | 468 (0.1) |
| R/EX/EM | Exact | Prob-EM | 176,066 (100) | 0 (0) | 364,871 (99.9) | 468 (0.1) |
| R/EX/EPI | Exact | Prob-EPI | 176,066 (100) | 0 (0) | 364,871 (99.9) | 468 (0.1) |
| MTB/EX/FS | Exact | Prob-FS | 176,056 (100) | 0 (0) | 176,066 (100) | 0 (0) |
| MTB/EX/EM | Exact | Prob-EM | 176,056 (100) | 0 (0) | 176,066 (100) | 0 (0) |
| MTB/EX/D | Exact | Det | 176,066 (100) | 0 (0) | 365,038 (99.9) | 471 (0.1) |
| CU/EX/FS | Exact | Prob-FS | 176,048 (100) | 0 (0) | 176,058 (100) | 0 (0) |
| LP/EX/FS | Exact | Prob-FS | 176,056 (100) | 0 (0) | 176,066 (100) | 0 (0) |
| LP/EX/EM | Exact | Prob-EM | 176,056 (100) | 0 (0) | 176,066 (100) | 0 (0) |
| R/INEX/FS | Inexact | Prob-FS | 176,018 (100) | 0 (0) | 364,693 (99.9) | 420 (0.1) |
| R/INEX/EM | Inexact | Prob-EM | 176,018 (100) | 0 (0) | 364,693 (99.9) | 420 (0.1) |
| MTB/INEX/FS | Inexact | Prob-FS | 5,691 (100) | 0 (0) | 5,692 (100) | 0 (0) |
| MTB/INEX/EM | Inexact | Prob-EM | 5,691 (100) | 0 (0) | 5,692 (100) | 0 (0) |
| MTB/INEX/D | Inexact | Det | 5,692 (100) | 0 (0) | 5,693 (100) | 0 (0) |
| CU/INEX/FS | Inexact | Prob-FS | 173,105 (100) | 0 (0) | 173,107 (100) | 0 (0) |
| LP/INEX/FS | Inexact | Prob-FS | 84 (98.8) | 1 (1.2) | ||
| LP/INEX/EM | Inexact | Prob-EM | 84 (98.8) | 1 (1.2) | ||
DOB, date of birth; MRN, medical record number; R, R package; MTB, Merge ToolBox; CU, Curtin University Probabilistic Linkage Engine; LP, Link Plus; Prob-FS, probabilistic, Fellegi-Sunter; Prob-EM, probabilistic, expectation-maximization; Prob-EPI, probabilistic, EpiLink; Det, deterministic.
a Low weights were unrecoverable in Link Plus using inexact string matching.
Fig 2. Receiver operating characteristic curves for all linkage runs.
EX, exact string matching; INEX, inexact string matching; R, R package; MTB, Merge ToolBox; CU, Curtin University Probabilistic Linkage Engine; LP, Link Plus; FS, probabilistic, Fellegi-Sunter; EM, probabilistic, expectation-maximization; EPI, probabilistic, EpiLink; D, deterministic. Curves are limited to values of sensitivity equal to or exceeding 0.95 for clarity. Full ROC curves are presented in Fig D in S1 File.