Xiaochun Li, Huiping Xu, Shaun Grannis.
Abstract
BACKGROUND: Quality patient care requires comprehensive health care data from a broad set of sources. However, missing data in medical records and matching field selection are 2 real-world challenges in patient-record linkage.
Keywords: Fellegi-Sunter model; latent class model; matching field selection; missing at random; record linkage
Year: 2022 PMID: 36173664 PMCID: PMC9562057 DOI: 10.2196/33775
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 7.076
Summary of four use cases, Indiana Network for Patient Care (INPC), newborn screening (NBS), Social Security Administration (SSA), and Marion County Health Department (MCHD), with information on the number of records in each use case, blocking schemes, and the numbers of record pairs in blocking schemes.
| Use case and block | Pairs |
| INPC | |
| SSN^a | 53,054,690 |
| FN-TEL^b | 41,729,402 |
| DB-MB-YB-ZIP^c | 133,553,036 |
| FN-LN-YB^d | 193,865,283 |
| DB-LN-MB-YB^e | 191,181,498 |
| NBS | |
| MRN^f | 4,147,098 |
| TEL^g | 2,644,454 |
| MB-DB-ZIP | 8,083,396 |
| LN-FN^h | 3,005,368 |
| NK_LN-NK_FN^i | 1,217,736 |
| SSA | |
| SSN | 805,331 |
| FN-LN-ZIP | 18,103 |
| FN-LN-MI-YB | 1,395,395 |
| FN-LN-MI-DB-MB | 547,376 |
| FN-LN-DB-MB-YB | 722,167 |
| MCHD | |
| SSN | 869,454 |
| TEL | 28,238 |
| DB-MB-YB-ZIP | 5,083,429 |
| FN-LN-YB | 3,378,017 |
| DB-LN-MB-YB | 3,701,460 |
^a SSN: Social Security number.
^b FN-TEL: first name and telephone number.
^c DB-MB-YB-ZIP: day, month, and year of birth and zip code.
^d FN-LN-YB: first name, last name, and year of birth.
^e DB-LN-MB-YB: day, month, and year of birth and last name.
^f MRN: medical record number.
^g TEL: telephone number.
^h LN-FN: last name, first name.
^i NK_LN, NK_FN: next of kin last name and first name.
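The pair counts above come from blocking: only record pairs that agree exactly on a blocking key are kept as candidates. The following is a minimal sketch of that idea, assuming linkage between two files, toy records, and a hypothetical `id` field; the helper names `block_key` and `candidate_pairs` are illustrative and not taken from the source.

```python
from collections import defaultdict

def block_key(record, fields):
    """Return the tuple of blocking-field values, or None if any is missing."""
    values = tuple(record.get(f) for f in fields)
    return None if any(v in (None, "") for v in values) else values

def candidate_pairs(file_a, file_b, fields):
    """Pair each record in file_a with every record in file_b sharing its key."""
    index = defaultdict(list)
    for rec in file_b:
        key = block_key(rec, fields)
        if key is not None:
            index[key].append(rec)
    pairs = []
    for rec in file_a:
        key = block_key(rec, fields)
        if key is not None:
            pairs.extend((rec["id"], other["id"]) for other in index.get(key, ()))
    return pairs

# Toy data (hypothetical), blocked on first name, last name, and year of birth.
file_a = [
    {"id": "a1", "FN": "ANN", "LN": "LEE", "YB": 1980},
    {"id": "a2", "FN": "BOB", "LN": "RAY", "YB": None},  # missing key field: no pairs
]
file_b = [
    {"id": "b1", "FN": "ANN", "LN": "LEE", "YB": 1980},
    {"id": "b2", "FN": "ANN", "LN": "LEE", "YB": 1980},
    {"id": "b3", "FN": "BOB", "LN": "RAY", "YB": 1975},
]
pairs = candidate_pairs(file_a, file_b, ["FN", "LN", "YB"])
print(pairs)
```

In this sketch, records with a missing blocking field generate no pairs in that block, which is one reason several blocking schemes are typically run side by side.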
Manual review results for the 4 use cases.
| Use case | Number of pairs^a | Number of pairs deemed as matches | Number of pairs deemed as nonmatches | Match prevalence^b |
| INPC^c | 15,000 | 7840 | 7160 | 0.523 |
| SSA^d | 16,500 | 5950 | 10,550 | 0.361 |
| NBS^e | 15,000 | 7967 | 7033 | 0.531 |
| MCHD^f | 15,500 | 5927 | 9573 | 0.382 |
^a Number of pairs is the total number of pairs sampled for manual review, which classifies each pair as either a match or a nonmatch.
^b Match prevalence is the ratio of the number of pairs deemed as matches to the total number of pairs manually reviewed for each use case.
^c INPC: Indiana Network for Patient Care.
^d SSA: Social Security Administration.
^e NBS: newborn screening.
^f MCHD: Marion County Health Department.
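As a quick check of the prevalence definition, the ratio can be recomputed from the match and nonmatch counts in the table. This is a small sketch; the dictionary layout and the function name `match_prevalence` are illustrative choices.

```python
# (matches, nonmatches) from manual review, per use case
review = {
    "INPC": (7840, 7160),
    "SSA": (5950, 10550),
    "NBS": (7967, 7033),
    "MCHD": (5927, 9573),
}

def match_prevalence(matches, nonmatches):
    """Share of manually reviewed pairs that were deemed matches."""
    return matches / (matches + nonmatches)

prevalence = {
    name: round(match_prevalence(m, n), 3) for name, (m, n) in review.items()
}
print(prevalence)
# {'INPC': 0.523, 'SSA': 0.361, 'NBS': 0.531, 'MCHD': 0.382}
```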
Summary of modeling information by data use case and by blocking scheme.
| Data and block | Expert-specified fields^a | Data-driven fields^a |
| INPC | | |
| DB-LN-MB-YB^b | MRN^c FN^d SEX^e TEL^f ADR^g ZIP^h SSN^i | MRN FN SEX TEL ADR ZIP SSN |
| DB-MB-YB-ZIP | MRN LN FN SEX TEL ADR SSN | MRN LN FN SEX TEL ADR SSN |
| FN-LN-YB | MRN SEX DB MB TEL ADR ZIP SSN | MRN SEX DB MB TEL ADR ZIP SSN |
| FN-TEL | MRN LN SEX DB MB YB ADR ZIP SSN | MRN LN SEX DB MB YB ADR ZIP SSN |
| SSN | MRN LN FN SEX DB MB YB TEL ADR ZIP | MRN LN FN SEX DB MB YB TEL ADR ZIP |
| SSA^k | | |
| FN-LN-DB-MB-YB | SSN MI ZIP | SSN MI ZIP |
| FN-LN-MI-DB-MB | ZIP YB SSN | ZIP YB SSN |
| FN-LN-MI-YB | DB MB ZIP SSN | DB MB ZIP SSN |
| FN-LN-ZIP | MI DB MB YB SSN | MI DB MB YB SSN |
| SSN | LN FN MI DB MB YB ZIP | LN FN MI DB MB YB ZIP |
| NBS^l | | |
| LN-FN | MRN SEX DB MB YB TEL ADR ZIP | MRN SEX^m DB MB YB^m TEL ADR ZIP |
| MB-DB-ZIP | MRN LN FN SEX YB TEL ADR | MRN LN FN SEX YB TEL ADR |
| MRN | LN FN SEX DB MB YB TEL ADR ZIP | LN FN SEX^m DB MB YB TEL^m ADR ZIP |
| NK_LN-NK_FN | MRN LN FN SEX DB MB YB TEL ADR ZIP | MRN LN^m FN SEX DB MB YB TEL ADR ZIP |
| TEL | MRN LN FN SEX DB MB YB ADR ZIP | MRN LN FN SEX DB MB YB^m ADR ZIP |
| MCHD^n | | |
| LN-FN | MRN SEX DB MB YB TEL ADR ZIP | MRN SEX^m DB MB YB^m TEL ADR ZIP |
| MB-DB-ZIP | MRN LN FN SEX YB TEL ADR | MRN LN FN SEX YB TEL ADR |
| MRN | LN FN SEX DB MB YB TEL ADR ZIP | LN FN SEX^m DB MB YB TEL^m ADR ZIP |
| NK_LN-NK_FN | MRN LN FN SEX DB MB YB TEL ADR ZIP | MRN LN^m FN SEX DB MB YB TEL ADR ZIP |
| TEL | MRN LN FN SEX DB MB YB ADR ZIP | MRN LN FN SEX DB MB YB^m ADR ZIP |
^a Columns "Expert-specified fields" and "Data-driven fields" display the fields used in the Fellegi-Sunter (FS) model.
^b DB-LN-MB-YB: day, month, and year of birth and last name.
^c MRN: medical record number.
^d FN: first name.
^e SEX: sex.
^f TEL: telephone number.
^g ADR: address.
^h ZIP: zip code.
^i SSN: Social Security number.
^j Fields (italicized) selected only by data-driven methods.
^k SSA: Social Security Administration.
^l NBS: newborn screening.
^m Fields not selected by the data-driven method but specified by experts.
^n MCHD: Marion County Health Department.
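To illustrate how a list of matching fields enters a Fellegi-Sunter score, the sketch below sums per-field log-likelihood ratios for one record pair and shows two ways a missing field can be handled: missing as disagreement (MAD) counts it as a disagreement, while missing at random (MAR) leaves it out of the sum. The m- and u-probabilities here are hypothetical placeholders, and the function `fs_weight` is an illustrative assumption, not the authors' implementation.

```python
import math

def fs_weight(agreement, m, u, missing="MAR"):
    """Sum per-field log-likelihood ratios for one record pair.

    agreement maps field -> True (agree), False (disagree), or None (missing).
    m[field] and u[field] are the agreement probabilities among true matches
    and true nonmatches, respectively.
    """
    total = 0.0
    for field, agrees in agreement.items():
        if agrees is None:
            if missing == "MAR":
                continue          # MAR: the missing field contributes nothing
            agrees = False        # MAD: treat the missing field as a disagreement
        if agrees:
            total += math.log(m[field] / u[field])
        else:
            total += math.log((1 - m[field]) / (1 - u[field]))
    return total

# Hypothetical probabilities for the fields of the SSA block FN-LN-MI-YB.
m = {"DB": 0.95, "MB": 0.95, "ZIP": 0.85, "SSN": 0.98}
u = {"DB": 0.03, "MB": 0.08, "ZIP": 0.02, "SSN": 0.001}

# One pair: day and month of birth agree, zip code disagrees, SSN is missing.
pair = {"DB": True, "MB": True, "ZIP": False, "SSN": None}

mar_weight = fs_weight(pair, m, u, missing="MAR")
mad_weight = fs_weight(pair, m, u, missing="MAD")
print(mar_weight, mad_weight)
```

With a highly discriminating field such as SSN, counting a missing value as a disagreement pulls the score down sharply, which is consistent with the lower sensitivity that MAD can show when many values are missing.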
Matching results of the four use cases evaluated on their respective ground truth sets of randomly selected and manually reviewed record pairs.
| Data | Value, N | Sensitivity (95% CI) | Specificity (95% CI) | Positive predictive value (95% CI) | Negative predictive value (95% CI) | F1 score (95% CI) |
| | | | | | | |
| INPC^a | | | | | | |
| MAD^b | 15,000 | 0.962 (0.958-0.967) | 0.990 (0.987-0.992) | 0.990 (0.988-0.992) | 0.960 (0.955-0.964) | 0.976 (0.974-0.978) |
| MAR^c | 15,000 | 0.970 (0.966-0.974) | 0.988 (0.986-0.991) | 0.989 (0.987-0.991) | 0.968 (0.964-0.972) | 0.980 (0.977-0.982) |
| SSA^d | | | | | | |
| MAD | 16,500 | 0.781 (0.770-0.792) | 0.995 (0.994-0.996) | 0.989 (0.986-0.992) | 0.890 (0.884-0.895) | 0.873 (0.866-0.879) |
| MAR | 16,500 | 0.785 (0.775-0.796) | 0.995 (0.993-0.996) | 0.989 (0.985-0.991) | 0.892 (0.886-0.897) | 0.875 (0.869-0.882) |
| NBS^e | | | | | | |
| MAD | 15,000 | 0.795 (0.786-0.804) | 0.881 (0.874-0.889) | 0.883 (0.876-0.891) | 0.791 (0.782-0.801) | 0.837 (0.830-0.843) |
| MAR | 15,000 | 0.860 (0.852-0.868) | 0.873 (0.865-0.881) | 0.885 (0.877-0.892) | 0.846 (0.838-0.855) | 0.872 (0.866-0.878) |
| MCHD^f | | | | | | |
| MAD | 15,500 | 0.944 (0.937-0.949) | 0.989 (0.987-0.991) | 0.982 (0.979-0.986) | 0.966 (0.962-0.969) | 0.963 (0.959-0.966) |
| MAR | 15,500 | 0.946 (0.940-0.952) | 0.988 (0.986-0.990) | 0.980 (0.976-0.983) | 0.967 (0.964-0.971) | 0.963 (0.959-0.966) |
| | | | | | | |
| INPC | | | | | | |
| MAD | 15,000 | 0.579 (0.568-0.590) | 0.988 (0.986-0.991) | 0.982 (0.978-0.985) | 0.682 (0.672-0.690) | 0.729 (0.719-0.737) |
| MAR | 15,000 | 0.970 (0.966-0.974) | 0.987 (0.984-0.989) | 0.988 (0.985-0.990) | 0.968 (0.964-0.972) | 0.979 (0.976-0.981) |
| SSA | | | | | | |
| MAD | 16,500 | 0.781 (0.770-0.792) | 0.995 (0.994-0.996) | 0.989 (0.986-0.992) | 0.890 (0.884-0.895) | 0.873 (0.866-0.879) |
| MAR | 16,500 | 0.785 (0.775-0.796) | 0.995 (0.993-0.996) | 0.989 (0.985-0.991) | 0.892 (0.886-0.897) | 0.875 (0.869-0.882) |
| NBS | | | | | | |
| MAD | 15,000 | 0.813 (0.805-0.822) | 0.875 (0.867-0.883) | 0.880 (0.873-0.888) | 0.805 (0.796-0.814) | 0.845 (0.839-0.852) |
| MAR | 15,000 | 0.865 (0.858-0.873) | 0.870 (0.863-0.878) | 0.883 (0.876-0.890) | 0.851 (0.842-0.859) | 0.874 (0.868-0.880) |
| MCHD | | | | | | |
| MAD | 15,500 | 0.635 (0.622-0.648) | 0.970 (0.967-0.974) | 0.929 (0.921-0.937) | 0.811 (0.804-0.818) | 0.754 (0.745-0.764) |
| MAR | 15,500 | 0.954 (0.948-0.959) | 0.988 (0.985-0.990) | 0.979 (0.976-0.983) | 0.972 (0.968-0.975) | 0.967 (0.963-0.970) |
^a INPC: Indiana Network for Patient Care.
^b MAD: missing as disagreement.
^c MAR: missing at random.
^d SSA: Social Security Administration.
^e NBS: newborn screening.
^f MCHD: Marion County Health Department.
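Each estimate above is reported with a 95% CI. The text here does not state how those intervals were obtained, so the following is only an illustrative sketch of one common choice, the normal-approximation (Wald) interval for a proportion. It is applied to the SSA sensitivity under MAD (0.781), using the 5950 manually reviewed matches as the denominator; the function name `wald_interval` is an assumption.

```python
import math

def wald_interval(p, n, z=1.959964):
    """Normal-approximation interval for a proportion p estimated from n trials."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# SSA sensitivity under MAD, estimated on the 5950 pairs deemed matches.
low, high = wald_interval(0.781, 5950)
print(f"{low:.4f} to {high:.4f}")
```

Other interval methods (for example, exact binomial or bootstrap) would give slightly different endpoints, particularly for proportions close to 0 or 1.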
Cross-tabulation of ground truth and classification results by the Fellegi-Sunter model under missing as disagreement (MAD) and missing at random (MAR) for the Social Security Administration use case.
| MAD classification | MAR: Nonmatch | MAR: Match | Value, N |
| Ground truth: match | | | |
| Nonmatch | 1277 | 26 | 1303 |
| Match | 0 | 4647 | 4647 |
| Value, N | 1277 | 4673 | 5950 |
| Ground truth: nonmatch | | | |
| Nonmatch | 10,495 | 3 | 10,498 |
| Match | 1 | 51 | 52 |
| Value, N | 10,496 | 54 | 10,550 |
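The marginal totals of this cross-tabulation are enough to recompute the accuracy measures for each approach. The sketch below assumes the first block holds the 5950 pairs deemed matches in manual review and the second block the 10,550 pairs deemed nonmatches, with row totals giving the MAD classification and column totals the MAR classification. The function name `metrics` is illustrative.

```python
def metrics(tp, fn, fp, tn):
    """Standard binary classification measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# MAD: row totals (true matches: 4647 Match, 1303 Nonmatch;
#                  true nonmatches: 52 Match, 10,498 Nonmatch)
mad = metrics(tp=4647, fn=1303, fp=52, tn=10498)
# MAR: column totals (true matches: 4673 Match, 1277 Nonmatch;
#                     true nonmatches: 54 Match, 10,496 Nonmatch)
mar = metrics(tp=4673, fn=1277, fp=54, tn=10496)

for name, result in (("MAD", mad), ("MAR", mar)):
    print(name, {k: round(v, 3) for k, v in result.items()})
# MAD {'sensitivity': 0.781, 'specificity': 0.995, 'ppv': 0.989, 'npv': 0.89, 'f1': 0.873}
# MAR {'sensitivity': 0.785, 'specificity': 0.995, 'ppv': 0.989, 'npv': 0.892, 'f1': 0.875}
```

The rounded values agree with the SSA rows of the matching-results table. The off-diagonal cells show where the gain comes from: 26 true matches are recovered by MAR but missed by MAD, at the cost of 3 additional false positives (and 1 fewer).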