Elmer V Bernstam, Reuben Joseph Applegate, Alvin Yu, Deepa Chaudhari, Tian Liu, Alex Coda, Jonah Leshin.
Abstract
OBJECTIVE: Our objective was to evaluate tokens commonly used by clinical research consortia to aggregate clinical data across institutions.
Year: 2022 PMID: 35896508 PMCID: PMC9474266 DOI: 10.1055/a-1910-4154
Source DB: PubMed Journal: Appl Clin Inform ISSN: 1869-0327 Impact factor: 2.762
Match algorithms used in this evaluation

A. Token descriptions

| Name | Token description |
|---|---|
| Token 1 | Last name + 1st initial of first name + gender + DOB |
| Token 2 | Last name (soundex) + first name (soundex) + gender + DOB |
| Token 3 | Last name + first name + DOB + Zip 3 (three-digit zip code) |
| Token 4 | Last name + first name + gender + DOB |
| Token 5 | SSN + gender + DOB |
| Token 7 | Last name + 1st three characters of first name + gender + DOB |
| Token 9 | First name + address |
| Token 16 | SSN + first name |
| Token 22 | Cell phone number (United States) |

B. Match algorithm descriptions

| Name | Tokens | Description | Token requirement |
|---|---|---|---|
| Single token match | 1 or 2 or 3 or 4 or 5 or 16 | Two records match if they share at least one of these tokens. | At least one of tokens 1, 2, 3, 4, 5, and 16 is present |
| Demographic | 1 and 2 | Two records match if both of these tokens match, indicating that the records have the same name, age, and gender. | Tokens 1 and 2 are present |
| Net tokens | Any subset of 1, 2, 4, 5, 7, 9, 16 | Two records match if more tokens match than do not. | At least 3 of tokens 1, 2, 4, 5, 7, 9, and 16 are present |
| SSN | 5 or 16 | Tokens 5 and 16 use SSN (United States). Two records match if either token 5 or token 16 matches. | Token 5 or 16 is present |
Abbreviations: DOB, date of birth; SSN, social security number.
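The four match rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the `Tokens` mapping, the helper names, and the choice to return `False` when the token requirement is not met are all hypothetical.

```python
from typing import Dict, Optional

# Assumed representation: token number -> token value, or None when the
# token could not be generated for that record (missing source field).
Tokens = Dict[int, Optional[str]]


def _both_present(a: Tokens, b: Tokens, n: int) -> bool:
    """True when token n is available on both records."""
    return a.get(n) is not None and b.get(n) is not None


def _agree(a: Tokens, b: Tokens, n: int) -> bool:
    """True when token n is available on both records and has the same value."""
    return _both_present(a, b, n) and a[n] == b[n]


def single_token_match(a: Tokens, b: Tokens) -> bool:
    # Match if the records share at least one of tokens 1, 2, 3, 4, 5, 16.
    return any(_agree(a, b, n) for n in (1, 2, 3, 4, 5, 16))


def demographic_match(a: Tokens, b: Tokens) -> bool:
    # Match only if both token 1 and token 2 agree.
    return _agree(a, b, 1) and _agree(a, b, 2)


def net_tokens_match(a: Tokens, b: Tokens) -> bool:
    # Match if more comparable tokens agree than disagree.
    # Assumption: pairs with fewer than 3 comparable tokens are not matched.
    comparable = [n for n in (1, 2, 4, 5, 7, 9, 16) if _both_present(a, b, n)]
    if len(comparable) < 3:
        return False
    agreeing = sum(a[n] == b[n] for n in comparable)
    return agreeing > len(comparable) - agreeing


def ssn_match(a: Tokens, b: Tokens) -> bool:
    # Match if either SSN-based token (5 or 16) agrees.
    return _agree(a, b, 5) or _agree(a, b, 16)
```

For example, two records that agree on tokens 1, 2, and 7, disagree on tokens 4 and 9, and have no SSN would match under the single token, demographic, and net tokens rules, but not under the SSN rule.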
Fig. 1 Token generation.
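As a rough illustration of how the name-based tokens in the token description table are composed, the sketch below builds tokens 1, 4, and 7 as plain concatenated strings. The `Patient` class, the `|` separator, the ISO date format, and the upper-casing step are assumptions made for this example; any hashing or encryption applied to real tokens is not shown.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record layout, for illustration only.
    first_name: str
    last_name: str
    gender: str
    dob: str  # assumed ISO format, e.g. "1980-02-29"


def _norm(value: str) -> str:
    """Assumed normalization: trim whitespace and upper-case."""
    return value.strip().upper()


def _join(*parts: str) -> str:
    return "|".join(parts)


def token_1(p: Patient) -> str:
    # Last name + 1st initial of first name + gender + DOB
    return _join(_norm(p.last_name), _norm(p.first_name)[:1], _norm(p.gender), p.dob)


def token_4(p: Patient) -> str:
    # Last name + first name + gender + DOB
    return _join(_norm(p.last_name), _norm(p.first_name), _norm(p.gender), p.dob)


def token_7(p: Patient) -> str:
    # Last name + 1st three characters of first name + gender + DOB
    return _join(_norm(p.last_name), _norm(p.first_name)[:3], _norm(p.gender), p.dob)


a = Patient("Jonathan", "Smith", "M", "1980-02-29")
b = Patient("Jon", "Smith", "M", "1980-02-29")
print(token_1(a) == token_1(b))  # True: same initial
print(token_7(a) == token_7(b))  # True: same first three characters
print(token_4(a) == token_4(b))  # False: full first names differ
```

In this toy case a shortened first name breaks token 4 but not tokens 1 or 7, which is one plausible way a stricter token can miss true matches.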
Precision, recall, F1, and fill rates for the eight token types and algorithms tested in this evaluation
| Token or algorithm | True positives | False negatives | False positives | Precision | Recall | F1 | Valid pairs | Pair fill rate |
|---|---|---|---|---|---|---|---|---|
| Token 1 | 1,098 | 118 | 24 | 97.9% | 90.3% | 94% | 20,002 | 100.00% |
| Token 2 | 955 | 259 | 14 | 98.6% | 78.7% | 88% | 20,000 | 99.99% |
| Token 4 | 787 | 427 | 1 | 99.9% | 64.8% | 79% | 20,000 | 99.99% |
| Token 5 | 355 | 50 | 1 | 99.7% | 87.7% | 93% | 779 | 3.89% |
| Token 7 | 1,076 | 138 | 16 | 98.5% | 88.6% | 93% | 20,000 | 99.99% |
| Token 9 | 271 | 888 | 2 | 99.3% | 23.4% | 38% | 18,163 | 90.81% |
| Token 16 | 247 | 157 | 1 | 99.6% | 61.1% | 76% | 778 | 3.89% |
| Token 22 | 476 | 437 | 22 | 95.6% | 52.1% | 67% | 13,603 | 68.01% |
| Single Token Match | 1,161 | 55 | 36 | 97.0% | 95.5% | 96% | 20,002 | 100.00% |
| Demographic | 925 | 289 | 4 | 99.6% | 76.2% | 86% | 20,000 | 99.99% |
| Net Tokens | 910 | 304 | 1 | 99.9% | 75.0% | 86% | 20,000 | 99.99% |
| SSN | 368 | 37 | 2 | 99.5% | 90.9% | 95% | 779 | 3.89% |
Abbreviations: SSN, social security number.
Recall = TP/(TP + FN).
Precision = TP/(TP + FP).
F1 = 2 × (precision × recall)/(precision + recall).
Note: Token 3 is not listed because zip code was not included in the manual review data; therefore, the fill rate was 0%.
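The definitions above can be checked against a table row with a short script. The sketch below uses the Token 1 counts and the Token 5 valid-pair count from the table. The function names are illustrative, and treating the pair fill rate as valid pairs divided by 20,002 total pairs is an assumption inferred from the rows that report 100.00%.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


def f1(p: float, r: float) -> float:
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    return 2 * p * r / (p + r)


def pair_fill_rate(valid_pairs: int, total_pairs: int) -> float:
    """Assumed definition: share of all pairs for which the token is usable."""
    return valid_pairs / total_pairs


# Token 1 row: TP = 1,098, FN = 118, FP = 24
p = precision(1098, 24)
r = recall(1098, 118)
print(f"precision={p:.1%} recall={r:.1%} F1={f1(p, r):.0%}")
# prints: precision=97.9% recall=90.3% F1=94%

# Token 5 row: 779 valid pairs out of an assumed 20,002 total pairs
print(f"fill rate={pair_fill_rate(779, 20002):.2%}")
# prints: fill rate=3.89%
```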
Study population and dataset (n = 40,004; categories as listed in the dataset)
| Field | Value/Range | % | Fill rate (%) |
|---|---|---|---|
| Age | | | 99.5 |
| | 0–10 | 11.05 | |
| | 11–20 | 10.33 | |
| | 21–30 | 16.15 | |
| | 31–40 | 21.32 | |
| | 41–50 | 16.35 | |
| | 51–60 | 12.09 | |
| | 61–70 | 7.00 | |
| | 71–80 | 3.32 | |
| | 81–90 | 1.59 | |
| | 91–100 | 0.29 | |
| | 101–110 | 0.03 | |
| Gender | | | 100 |
| | M | 44.5 | |
| | F | 55.5 | |
| | Other | 0.1 | |
| Race | | | 58.4 |
| | African American | 5.09 | |
| | All other | 12.9 | |
| | American Indian, Esk[i]mo, or Aleut | 0.08 | |
| | Asian or Pacific Islander | 0.25 | |
| | Caucasian | 7.43 | |
| | Hispanic or Latino | 1.29 | |
| | Latin American | 23.29 | |
| | Other | 7.72 | |
| | Other race | 0.39 | |
| Ethnicity | | | 59.8 |
| | Hispanic | 17.4 | |
| | Non-Hispanic | 41.6 | |
| First name | | | 100 |
| Middle initial | | | 19.9 |
| Last name | | | 100 |
| Date of birth | | | 100 |
| Phone number (United States) | | | 94.6 |
| Address first line (United States) | | | 97.5 |
| Zip (three digit) | | | 0 |
| Social security number | | | 37.2 |
Precision, recall, and fill rates for the token types and algorithms by ethnicity
| Token or algorithm | Ethnicity | TP | FN | FP | Valid pairs | Pair fill rate | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Token 1 | Not Hispanic | 1,029 | 110 | 23 | 13,890 | 69.44% | 97.81% | 90.34% | 94% |
| Token 1 | Hispanic | 69 | 8 | 1 | 6,112 | 30.56% | 98.57% | 89.61% | 94% |
| Token 2 | Not Hispanic | 901 | 236 | 13 | 13,888 | 69.43% | 98.58% | 79.24% | 88% |
| Token 2 | Hispanic | 54 | 23 | 1 | 6,112 | 30.56% | 98.18% | 70.13% | 82% |
| Token 4 | Not Hispanic | 744 | 393 | 1 | 13,888 | 69.43% | 99.87% | 65.44% | 79% |
| Token 4 | Hispanic | | 34 | 0 | 6,112 | 30.56% | | | |
| Token 5 | Not Hispanic | 334 | 48 | 1 | 673 | 3.36% | 99.70% | 87.43% | 93% |
| Token 5 | Hispanic | | 2 | 0 | 106 | 0.53% | | | |
| Token 7 | Not Hispanic | 1,007 | 130 | 15 | 13,888 | 69.43% | 98.53% | 88.57% | 93% |
| Token 7 | Hispanic | 69 | 8 | 1 | 6,112 | 30.56% | 98.57% | 89.61% | 94% |
| Token 9 | Not Hispanic | 259 | 827 | 0 | 12,428 | 62.13% | 100.00% | 23.85% | 39% |
| Token 9 | Hispanic | | 61 | 2 | 5,735 | 28.67% | | | |
| Token 16 | Not Hispanic | 233 | 148 | 1 | 672 | 3.36% | 99.57% | 61.15% | 76% |
| Token 16 | Hispanic | | 9 | 0 | 106 | 0.53% | | | |
| Token 22 | Not Hispanic | 449 | 411 | 18 | 9,334 | 46.67% | 96.15% | 52.21% | 68% |
| Token 22 | Hispanic | | 26 | 4 | 4,269 | 21.34% | | | |
| Single token match | Not Hispanic | 1,086 | 53 | 34 | 13,888 | 69.43% | 96.96% | 95.35% | 96% |
| Single token match | Hispanic | 75 | 2 | 2 | 6,112 | 30.56% | 97.40% | 97.40% | 97% |
| Demographic | Not Hispanic | 874 | 263 | 4 | 13,888 | 69.43% | 99.54% | 76.87% | 87% |
| Demographic | Hispanic | 51 | 26 | 0 | 6,112 | 30.56% | 100.00% | 66.23% | 80% |
| Net tokens | Not Hispanic | 859 | 278 | 1 | 673 | 3.36% | 99.88% | 75.55% | 86% |
| Net tokens | Hispanic | 51 | 26 | 0 | 106 | 0.53% | 100.00% | 66.23% | 80% |
| SSN | Not Hispanic | 345 | 37 | 2 | 13,890 | 69.44% | 99.42% | 90.31% | 95% |
| SSN | Hispanic | | | | 6,112 | 30.56% | | | |
Abbreviations: FN, false negative; FP, false positive; SSN, social security number; TP, true positive.
Note: Token 3 is not listed because zip code was not included in the manual review data; therefore, the fill rate was 0%.
Fig. 2 Precision and recall of different matching strategies.