George C G Barbosa1, M Sanni Ali2,3,4, Bruno Araujo2, Sandra Reis2, Samila Sena2, Maria Y T Ichihara2, Julia Pescarini2, Rosemeire L Fiaccone2,5, Leila D Amorim2,5, Robespierre Pita2, Marcos E Barreto2,6,7, Liam Smeeth3, Mauricio L Barreto2,8.
Abstract
BACKGROUND: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. Many open-source and commercial data linkage tools exist, but the volume and complexity of the datasets currently available for linkage pose a major challenge; an efficient linkage tool with reasonable accuracy and scalability is therefore required.
Keywords: Accuracy; Data linkage; Entity resolution; Indexing; Information retrieval techniques; Scalability; Scoring Search
Year: 2020 PMID: 33167998 PMCID: PMC7654019 DOI: 10.1186/s12911-020-01285-w
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 CIDACS-RL architecture. In the first step, dataset A is indexed. During the query step, each record in dataset B is used to retrieve the N most similar records in the index and store them in a logical dataset (candidates). Finally, in the query step, a custom comparison method is applied to classify the candidate pairs
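The index-then-query flow described in the Fig. 1 caption can be illustrated with a minimal Python sketch. It is not the CIDACS-RL implementation: the tokenizer, the token-overlap ranking, the field names, and the sample records are all hypothetical and chosen only to show how indexing dataset A limits each record in dataset B to N candidates.

```python
from collections import Counter, defaultdict

def tokens(record):
    # Hypothetical tokenizer: lower-cased whitespace tokens from every field.
    return {tok for value in record.values() for tok in str(value).lower().split()}

def build_index(dataset_a):
    # Indexing step: inverted index mapping each token to ids of records in A.
    index = defaultdict(set)
    for rec_id, record in dataset_a.items():
        for tok in tokens(record):
            index[tok].add(rec_id)
    return index

def top_n_candidates(index, record_b, n):
    # Query step: rank records in A by the number of tokens shared with the
    # record from B (an assumed ranking) and keep the N best as candidates.
    hits = Counter()
    for tok in tokens(record_b):
        for rec_id in index.get(tok, ()):
            hits[rec_id] += 1
    return [rec_id for rec_id, _ in hits.most_common(n)]

# Made-up records for illustration only.
dataset_a = {
    "a1": {"name": "maria silva", "city": "salvador"},
    "a2": {"name": "joao souza", "city": "recife"},
    "a3": {"name": "maria souza", "city": "salvador"},
}
index = build_index(dataset_a)
candidates = top_n_candidates(index, {"name": "maria silva", "city": "salvador"}, n=2)
```

Each candidate pair would then be passed to a comparison method that assigns a score and classifies the pair, so the expensive comparison runs on N records per query rather than on all of dataset A.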
Fig. 2 Example of a comparison between two records with four attributes. Each attribute is compared based on its type, generating a score between zero and one. Then, a set of weights defined empirically by the researcher is used to average the scores from the four attributes into a single final score
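The weighted averaging described in the Fig. 2 caption can be sketched as follows. The four attribute names, the per-type comparison functions, and the weights are assumptions made for illustration; the source does not list the actual attributes, similarity measures, or weights.

```python
from difflib import SequenceMatcher

def compare_text(a, b):
    # Assumed text comparator: similarity ratio between 0 and 1.
    return SequenceMatcher(None, a, b).ratio()

def compare_exact(a, b):
    # Assumed categorical/date comparator: 1 on exact agreement, else 0.
    return 1.0 if a == b else 0.0

# Hypothetical attributes, comparators chosen by attribute type, and weights.
COMPARATORS = {
    "name": compare_text,
    "mother_name": compare_text,
    "birth_date": compare_exact,
    "sex": compare_exact,
}
WEIGHTS = {"name": 4.0, "mother_name": 3.0, "birth_date": 2.0, "sex": 1.0}

def final_score(rec_a, rec_b):
    # Weighted average of the per-attribute scores; the result stays in [0, 1].
    total_weight = sum(WEIGHTS.values())
    weighted = sum(
        WEIGHTS[field] * COMPARATORS[field](rec_a[field], rec_b[field])
        for field in WEIGHTS
    )
    return weighted / total_weight

rec_a = {"name": "maria silva", "mother_name": "ana silva",
         "birth_date": "1990-01-31", "sex": "F"}
rec_b = dict(rec_a, name="maria sliva")  # same person with a typo in the name

score_same = final_score(rec_a, rec_a)
score_typo = final_score(rec_a, rec_b)
```

Because the weights are normalised by their sum, a pair that agrees on every attribute scores exactly 1, and a disagreement on a heavily weighted attribute lowers the final score more than one on a lightly weighted attribute.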
Fig. 3 Construction of the gold standard dataset
Fig. 4 ROC curves for each record linkage tool over the gold standard dataset
Threshold analysis for each record linkage tool
| Method* | Threshold (TH) | Pairs above TH (%) | Sensitivity (%) | Specificity (%) | FPs above TH (%) | FNs below TH (%) | PPV (%) |
|---|---|---|---|---|---|---|---|
| CIDACS-RL | 0.8827056 | 3026 (46.86) | 99.87 | 99.94 | 2 (0.07) | 4 (0.13) | 99.93 |
| AtyImo | 8777 | 3005 (46.54) | 98.91 | 99.39 | 21 (0.70) | 33 (1.09) | 99.30 |
| RecLink | 0.8075590 | 2243 (34.74) | 73.75 | 99.71 | 10 (0.45) | 795 (26.25) | 99.55 |
| Febrl | 3722604 | 2832 (43.86) | 90.58 | 97.40 | 89 (3.14) | 285 (9.41) | 96.86 |
| FRIL | 48 | 2351 (36.41) | 74.66 | 97.36 | 90 (3.83) | 767 (25.33) | 96.17 |
*Execution time (in minutes): CIDACS-RL < 1, AtyImo = 28, RecLink < 1, FRIL = 7, and Febrl = 130
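As a worked check of how the columns in the threshold table relate, the sketch below recomputes sensitivity and PPV from the counts in the CIDACS-RL row. It assumes the standard definitions (true positives are the pairs above the threshold minus the false positives); the function name is hypothetical.

```python
def sensitivity_and_ppv(pairs_above, false_positives, false_negatives):
    # Pairs above the threshold that are correct links.
    true_positives = pairs_above - false_positives
    # Sensitivity: share of all true matches that ended up above the threshold.
    sensitivity = 100 * true_positives / (true_positives + false_negatives)
    # PPV: share of the pairs above the threshold that are true matches.
    ppv = 100 * true_positives / pairs_above
    return round(sensitivity, 2), round(ppv, 2)

# CIDACS-RL row: 3026 pairs above TH, 2 false positives, 4 false negatives.
cidacs_rl = sensitivity_and_ppv(3026, 2, 4)
```

For this row the result is (99.87, 99.93), which agrees with the Sensitivity and PPV columns of the table.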
Case study (CadUnico x SINAN-TB) dataset: linkage analysis
| Cut-off | Specificity | Sensitivity | Matches (%) | True matches (%) | False matches (%) | Missed true matches (%) |
|---|---|---|---|---|---|---|
| 0.860 | 75.0 | 97.1 | 16,443 (55.15) | 12,100 (73.59) | 4343 (26.41) | 361 (2.90) |
| 0.870 | 82.2 | 95.5 | 14,984 (50.25) | 11,901 (79.42) | 3083 (20.58) | 560 (4.49) |
| 0.880 | 87.7 | 94.5 | 13,901 (46.62) | 11,770 (84.67) | 2131 (15.33) | 691 (5.55) |
| 0.890 | 91.8 | 93.3 | 13,046 (43.76) | 11,621 (89.08) | 1425 (10.92) | 840 (6.74) |
| 0.896 | 93.5 | 92.5 | 12,661 (42.46) | 11,532 (91.08) | 1129 (8.92) | 929 (7.46) |
| 0.900 | 94.2 | 91.7 | 12,423 (41.67) | 11,424 (91.96) | 999 (8.04) | 1037 (8.32) |
| 0.910 | 95.8 | 89.8 | 11,931 (40.02) | 11,194 (93.82) | 737 (6.18) | 1267 (10.17) |
| 0.920 | 96.7 | 88.1 | 11,546 (38.72) | 10,972 (95.03) | 574 (4.97) | 1489 (11.95) |
| 0.930 | 98.0 | 85.4 | 10,984 (36.84) | 10,636 (96.83) | 348 (3.17) | 1825 (14.65) |
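The case-study table shows the usual trade-off: raising the cut-off increases specificity and lowers sensitivity. The sketch below ranks the tabulated cut-offs by Youden's J (sensitivity + specificity − 100). This criterion is an assumption used here for illustration; the text above does not state how a cut-off should be chosen.

```python
# (cut-off, specificity %, sensitivity %) copied from the case-study table.
rows = [
    (0.860, 75.0, 97.1),
    (0.870, 82.2, 95.5),
    (0.880, 87.7, 94.5),
    (0.890, 91.8, 93.3),
    (0.896, 93.5, 92.5),
    (0.900, 94.2, 91.7),
    (0.910, 95.8, 89.8),
    (0.920, 96.7, 88.1),
    (0.930, 98.0, 85.4),
]

def youden_j(specificity, sensitivity):
    # Youden's J on a percentage scale; higher means a better balance.
    return sensitivity + specificity - 100

# Cut-off with the highest J among the tabulated values.
best_cutoff, best_spec, best_sens = max(rows, key=lambda r: youden_j(r[1], r[2]))
```

Under this criterion the selected cut-off is 0.896, the row where sensitivity and specificity are closest to each other. A study that penalises false matches more heavily than missed matches would favour a higher cut-off.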
Fig. 5 Scalability tests using two different hardware setups on pseudo-distributed mode of Spark