Tigran Avoundjian, Julia C Dombrowski, Matthew R Golden, James P Hughes, Brandon L Guthrie, Janet Baseman, Mauricio Sadinle.
Abstract
BACKGROUND: Many public health departments use record linkage between surveillance data and external data sources to inform public health interventions. However, little guidance is available to inform these activities, and many health departments rely on deterministic algorithms that may miss many true matches. In the context of public health action, these missed matches lead to missed opportunities to deliver interventions and may exacerbate existing health inequities.
Keywords: data management; medical record linkage; public health practice; public health surveillance
Year: 2020; PMID: 32352389; PMCID: PMC7226047; DOI: 10.2196/15917
Source DB: PubMed; Journal: JMIR Public Health Surveill; ISSN: 2369-2960
Record linkage algorithms.
| Algorithm | Match criteria | Source |
| --- | --- | --- |
| Exact match | Exact match on first name, last name, AND year of birth | Not applicable |
| Stenger | Best record pairs with a score of 50+ based on the following criteria: +20 points: first 3 letters of the last name and 2 letters of the first name; +15 points: exact match on the full name; +15 points: match on birth year (±2 years); +5 points: exact match on the year of birth; +10 points: exact match on the month of birth; +5 points: exact match on the day of birth | Public Health Seattle King County and Avoundjian et al [ |
| Ocampo 1 | Record pairs that met the following criteria: Exactᵃ: last name, first name, date of birth, race, genderᵇ, AND SSNᶜ; OR Very highᵃ: (last name, first name, date of birth, AND genderᵇ) OR SSN; OR High: last name, first name, date of birth, AND (genderᵇ OR race) | Ocampo et al [ |
| Ocampo 2 | Record pairs that matched in Ocampo 1 OR met the following criteria: Medium high: last name, first name (Soundex), date of birth, AND genderᵇ | Ocampo et al [ |
| Bosh | Records that met any of the following matching keys: (1) full last name + first 6 letters of first name + full date of birth; (2) first letter of the last name + letters 3 to 10 of the last name + letters 2 to 9 of the first name + full date of birth; (3) letters 2 to 7 of the last name + first 6 letters of the last name + full date of birth; (4) first 2 letters of the last name + first 3 letters of the first name + full SSN + full date of birthᵈ; (5) full last name + first 3 letters of the first name + full date of birth; (6) letters 3 to 5 of the last name + first 3 letters of the first name + full date of birth; (7) first 4 letters of the last name + first 4 letters of the first name + full date of birth; (8) first letter of the last name + letters 3 to 10 of the last name + letters 2 to 9 of the first name + month and year of birthᵉ; (9) first letter of the last name + letters 3 to 10 of the last name + letters 2 to 9 of the first name + day and year of birthᵉ; (10) full SSNᵈ,ᵉ; (11) first 5 letters of the last name + first 4 letters of the first name + month and year of birthᵉ; (12) first letter of the last name + letters 3 to 10 of the last name + letters 2 to 9 of the first name + (day OR month of birth) + year of birth, switching the first and last names in 1 datasetᵉ; (13) first 5 letters of the last name + first 4 letters of the first name + month and year of birth, switching the first and last names in 1 datasetᵉ | Bosh et al [ |
| fastLink (Fellegi-Sunter) | Calculates match/nonmatch weights using an expectation maximization algorithm and computes a match probability for each record pair. Pairs are classified as a match if their match probability is above 0.85. The following fields are used to estimate the match probability: (1) first name and last name: partial match using Jaro-Winkler string distance, with 3 agreement levelsᶠ; (2) year of birth, month of birth, day of birth, gender, and race: exact match | Enamorado et al [ |
| Beta Record Linkage | Uses a Gibbs sampler to sample plausible matching configurations and uses a loss function to identify the optimal set of matching pairs. The following fields are used by the algorithm: (1) first name and last name: partial match using Levenshtein string distance, with 4 agreement levelsᵍ; (2) year of birth, month of birth, day of birth, gender, and race: exact match | Sadinle [ |
ᵃ We omitted social security number from the exact and very high match tiers because of lack of social security number data.
ᵇ Original algorithm used birth sex instead of gender.
ᶜ SSN: social security number.
ᵈ Key was not implemented because of lack of social security number data.
ᵉ These keys require the following additional criteria to be met to be considered a match: exact match on gender OR full date of birth AND first name in the HIV dataset not among the 20 most common names in the HIV dataset AND last name in the HIV dataset not among the 20 most common names in the HIV dataset. Note: the original algorithm used birth sex instead of gender in these criteria. In addition, the original criteria also required a match on digits 1 to 4 and 6 to 9 of social security number, which was not implemented because of lack of social security number data.
ᶠ fastLink's default agreement levels for partially matched fields: 0 to 0.87: not a match, 0.88 to 0.91: partial match, and 0.92+: exact match.
ᵍ Beta Record Linkage's default agreement levels for partially matched fields: 0 to 0.49: not a match, 0.5 to 0.74: probable nonmatch, 0.75 to 0.98: probable match, and 0.99+: exact match.
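The Stenger row of the table describes a point-based deterministic rule, which can be sketched in a few lines of Python. This is an illustrative reading of the table only, not the authors' or Public Health Seattle King County's implementation. The `Person` record, the function names, and the case-insensitive comparison are hypothetical. Two further assumptions: "2 letters of the first name" is taken to mean the first 2 letters, and "best record pairs" is taken to mean the highest-scoring candidate for each record, with ties going to the earliest candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical record layout; field names are not from the source."""
    first: str
    last: str
    year: int
    month: int
    day: int


def stenger_score(a: Person, b: Person) -> int:
    """Score one record pair with the point scheme listed in the Stenger row."""
    af, al = a.first.strip().lower(), a.last.strip().lower()
    bf, bl = b.first.strip().lower(), b.last.strip().lower()
    score = 0
    if al[:3] == bl[:3] and af[:2] == bf[:2]:
        score += 20  # name prefixes agree (assumed: FIRST 2 letters of first name)
    if (af, al) == (bf, bl):
        score += 15  # exact match on the full name
    if abs(a.year - b.year) <= 2:
        score += 15  # birth year within 2 years
    if a.year == b.year:
        score += 5   # exact year of birth
    if a.month == b.month:
        score += 10  # exact month of birth
    if a.day == b.day:
        score += 5   # exact day of birth
    return score


def best_matches(left, right, threshold=50):
    """For each left record, keep the top-scoring right record if it reaches the threshold."""
    matches = {}
    for i, a in enumerate(left):
        scored = [(stenger_score(a, b), j) for j, b in enumerate(right)]
        if not scored:
            continue
        best_score, best_j = max(scored, key=lambda t: t[0])  # first maximum wins ties
        if best_score >= threshold:
            matches[i] = (best_j, best_score)
    return matches


left = [Person("Jonathan", "Smith", 1980, 5, 17)]
right = [
    Person("Jon", "Smith", 1981, 5, 17),    # nickname and off-by-one birth year
    Person("Maria", "Lopez", 1980, 5, 17),  # same birthday, different person
]
print(best_matches(left, right))
```

In this toy example the nickname pair scores exactly 50 (20 for the name prefixes, 15 for the birth year within 2 years, 10 for the month, 5 for the day), while the pair that shares only a date of birth scores 35 and is rejected. An exact match on first name, last name, and year of birth would have missed the first pair.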
Figure 1. Simulations: record linkage algorithm recall/precision.
Figure 2. Record linkage algorithm matching computational performance. Average computation time over 20 replications in a scenario where the overlap (50%) and the number of erroneous fields per record (1) were fixed and the size of the second dataset was varied (10%, 25%, 50%, and 75% of the first dataset [N=2000]).
Figure 3. Real-world matching scenario: record linkage algorithm recall and precision. PPV: positive predictive value.
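The figure captions report recall and precision (PPV) for each algorithm. As a reminder of the standard definitions, the sketch below computes both from a set of predicted links and a set of true links. The function name and the pair representation are hypothetical, and the text here does not state how the authors computed these metrics.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for predicted links against true links.

    Each link is a hashable pair such as (id_in_dataset_1, id_in_dataset_2).
    Precision (PPV) = true positives / predicted links.
    Recall = true positives / true links.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else float("nan")
    recall = true_positives / len(truth) if truth else float("nan")
    return precision, recall


truth = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
predicted = {(1, "a"), (2, "b"), (5, "e")}  # two correct links, one false link
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predicted links are correct (precision about 0.67) and two of the four true links are found (recall 0.50). A missed match lowers recall, which is the failure mode the abstract attributes to deterministic algorithms.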