Robespierre Pita, Clicia Pinto, Samila Sena, Rosemeire Fiaccone, Leila Amorim, Sandra Reis, Mauricio L Barreto, Spiros Denaxas, Marcos Ennes Barreto.
Abstract
Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes as well as similarity metrics and cutoff values to decide if a given pair of records matches or not and for assessing the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the challenges associated with manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive data sets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and scaling up to 20 million records in less than 12 s over heterogeneous (CPU+GPU) architectures.
Year: 2018 PMID: 29505402 PMCID: PMC7198121 DOI: 10.1109/JBHI.2018.2796941
Source DB: PubMed Journal: IEEE J Biomed Health Inform ISSN: 2168-2194 Impact factor: 5.772
Fig. 1 Blocking predicate implemented by AtyImo.
Fig. 2 Example of a Bloom filter encoding hashed bigrams.
Fig. 3 Full probabilistic linkage approach comparing Bloom filters directly.
Fig. 4 Hybrid linkage approach based on bespoke rules.
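The idea behind Figs. 2 and 3 (attribute values encoded as Bloom filters of hashed bigrams, then compared directly with a similarity score) can be sketched as follows. This is a minimal illustration, not AtyImo's implementation: the filter length (`size=128`), the number of hash functions (`num_hashes=2`), the double-hashing scheme built on SHA-1 and SHA-256, the underscore padding, and all function names are assumptions chosen for the example. The Dice coefficient is used for the comparison because it is the similarity measure reported in the tables of this record.

```python
import hashlib


def bigrams(text):
    """Split a string into overlapping two-character substrings."""
    # Assumption: pad with "_" so the first and last characters form their own bigrams.
    padded = f"_{text.strip().lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_encode(text, size=128, num_hashes=2):
    """Return the set of bit positions switched on for the bigrams of `text`."""
    bits = set()
    for gram in bigrams(text):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        # Double hashing: derive `num_hashes` positions from two base hashes.
        for k in range(num_hashes):
            bits.add((h1 + k * h2) % size)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient between two filters: 2 * |A & B| / (|A| + |B|)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Hypothetical names, used only to show how similar and dissimilar values score.
a = bloom_encode("maria silva")
b = bloom_encode("maria sylva")
c = bloom_encode("joao pereira")
print(f"identical: {dice(a, a):.2f}")
print(f"one typo:  {dice(a, b):.2f}")
print(f"different: {dice(a, c):.2f}")
```

Because only bit positions are compared, two parties can score record pairs without exchanging the underlying names. A pair is then accepted or rejected by comparing its score against a cutoff, and choosing that cutoff is the decision the accuracy tables in this record evaluate.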
Variability of Best Dice Coefficients
| Samples | SIH Dice | SIH Sens. | SIH PPV | SINAN Dice | SINAN Sens. | SINAN PPV |
|---|---|---|---|---|---|---|
| SE | 9400 | 95.6% | 95.0% | 9300 | 96.7% | 95.9% |
| SC | 9100 | 99.0% | 96.0% | 9100 | 97.7% | 97.4% |
| BA | 9100 | 98.5% | 97.9% | 9200 | 95.7% | 95.5% |
| RO | 9300 | 94.1% | 94.2% | 9400 | 87.9% | 91.0% |
Governmental Databases
| Databases | Coverage |
|---|---|
| CADU (socioeconomic data) | 2007 to 2015 |
| PBF (cash benefits payments) | 2007 to 2015 |
| SIH (hospitalizations) | 1998 to 2011 |
| SIM (mortality) | 2000 to 2012 |
| SINAN (notifiable diseases) | 2000 to 2010 |
| SINASC (live births) | 2001 to 2012 |
AtyImo-Spark Code Organization
| Module | Purpose |
|---|---|
| | Data cleansing and standardization; blocking (record grouping) |
| | Creation of Bloom filters; pairwise comparison and matching; generation of research datasets |
| | Data and Spark configuration |
Comparative Analysis–AtyImo × FRIL × Febrl
| | FRIL | FRIL blocking | Febrl | Febrl blocking | AtyImo | AtyImo blocking |
|---|---|---|---|---|---|---|
| TP | 486 | 484 | 480 | 479 | 486 | 486 |
| TN | 0 | 0 | 0 | 0 | 0 | 0 |
| FP | 1 | 0 | 1 | 0 | 0 | 0 |
| FN | 0 | 2 | 6 | 7 | 0 | 0 |
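The counts in this table can be turned into the sensitivity and positive predictive value (PPV) figures used elsewhere in this record. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). The counts are copied from the table above, while the function and variable names are illustrative assumptions.

```python
def sensitivity(tp, fn):
    """Share of true matches that the tool actually found."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Share of reported matches that are true matches."""
    return tp / (tp + fp)


# (TP, FP, FN) per configuration, copied from the comparative table.
counts = {
    "FRIL": (486, 1, 0),
    "FRIL blocking": (484, 0, 2),
    "Febrl": (480, 1, 6),
    "Febrl blocking": (479, 0, 7),
    "AtyImo": (486, 0, 0),
    "AtyImo blocking": (486, 0, 0),
}

for tool, (tp, fp, fn) in counts.items():
    print(f"{tool}: sensitivity={sensitivity(tp, fn):.4f}, PPV={ppv(tp, fp):.4f}")
```

In this sample every configuration has TN = 0, so specificity is undefined for all of them, and sensitivity and PPV are the informative measures.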
Linkage Results (sample: CADU Tuberculosis 2011)
| Databases (number of records) | Matched pairs, Full | Matched pairs, Hybrid | TPs (%), Full | TPs (%), Hybrid |
|---|---|---|---|---|
| CADU 2011 × SIH SE (1 447 512) × (49) | 40 | 24 | 23 (57.5%) | 23 (95.8%) |
| CADU 2011 × SIH SC (1 988 599) × (330) | 140 | 95 | 83 (59.2%) | 86 (90.5%) |
| CADU 2011 × SINAN SE (1 447 512) × (624) | 398 | 311 | 309 (77.6%) | 299 (96.1%) |
| CADU 2011 × SINAN SC (1 988 599) × (2049) | 661 | 500 | 551 (83.3%) | 462 (92.4%) |
Fig. 5 Best coefficient and related results (CADU cohort × SIM, RO).
Fig. 7 Best coefficient and related results (CADU cohort × SIM, SC).
Algorithm 1 AtyImo code using OpenMP and CUDA.
Fig. 8 Execution time (a) and speed up (b) of AtyImo hybrid.