José Deney Araujo1, Juan Carlo Santos-E-Silva1, André Guilherme Costa-Martins1,2, Vanderson Sampaio3,4, Daniel Barros de Castro5, Robson F de Souza6, Jeevan Giddaluru1, Pablo Ivan P Ramos7, Robespierre Pita7, Mauricio L Barreto7, Manoel Barral-Netto7, Helder I Nakaya1,2,4,8.
Abstract
Background: Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge.
Keywords: BLAST; DNA-encoded; Epidemiology; Genomic tools; Record linkage
Year: 2022 PMID: 35846888 PMCID: PMC9281601 DOI: 10.7717/peerj.13507
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 3.061
Figure 1. Tucuxi-BLAST workflow and data organization scheme.
Four variables shared by the two datasets are selected and encoded as DNA sequences. The encoded sequences are submitted to the BLAST algorithm and, finally, machine learning (ML) is applied to classify the candidate record linkages (A). Codon wheel used for DNA encoding (B), BLAST results for RL (C), and the Tucuxi-BW module for data deduplication (D).
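The DNA-coding step described in the caption can be illustrated with a minimal sketch. The codon table below is hypothetical: it simply assigns three-nucleotide codons to characters in alphabetical order and is not the codon wheel shown in panel B. The function names and the sample field values are also assumptions made for illustration only.

```python
from itertools import product

# Hypothetical character set: letters, digits, and a space (37 symbols).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "

# All 64 three-nucleotide codons; more than enough for 37 symbols.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

# Illustrative mapping only, NOT the codon wheel from the figure.
CODON_TABLE = dict(zip(ALPHABET, CODONS))


def encode_field(value: str) -> str:
    """Encode one field as a DNA string, skipping unmapped characters."""
    return "".join(CODON_TABLE[ch] for ch in value.upper() if ch in CODON_TABLE)


def encode_record(fields) -> str:
    """Concatenate the encoded fields of one record into a single sequence."""
    return "".join(encode_field(f) for f in fields)


if __name__ == "__main__":
    # Made-up example record with four fields.
    record = ["MARIA SOUZA", "ANA SOUZA", "19900101", "F"]
    print(encode_record(record))
```

Sequences produced this way could then be written to a FASTA file and aligned with BLAST, so that typing errors in a record show up as mismatches or gaps in the alignment.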
Figure 2. Competence in handling big data.
Tucuxi-Curumim was used to generate all simulated data from data obtained from IBGE (Instituto Brasileiro de Geografia e Estatística) (A). The execution time and RAM usage of each RL simulation were evaluated (B and C, respectively). All simulations were performed on a Linux workstation with an Intel Core i7-8700 processor and 32 GB of RAM.
Figure 3. Exploration of databases.
Number of records with errors and the variables in which the errors occur (A). Total error rates, counting any type of error, in true-positive linked records between the SINAN databases and the SIM mortality databases (B). Networks showing the substitution rates between digits (C) and between letters (D). The substitution rate between alphanumeric characters was calculated using records showing only mismatches in the BLAST results, i.e., records whose corresponding fields have the same length. In the networks, nodes represent characters and edges represent the frequency of substitutions between them.
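The substitution counts behind networks such as those in panels C and D can be sketched as follows. This is a simplified illustration that compares plain-text field pairs directly rather than parsing BLAST output. The function name and the sample pairs are hypothetical. As in the caption, only pairs of equal length are considered, so every difference is a substitution rather than an insertion or deletion.

```python
from collections import Counter


def substitution_counts(pairs):
    """Count character substitutions across equal-length string pairs.

    Each key is (original_char, observed_char); pairs of unequal
    length are skipped because they imply insertions or deletions.
    """
    counts = Counter()
    for original, observed in pairs:
        if len(original) != len(observed):
            continue
        for a, b in zip(original, observed):
            if a != b:
                counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    # Made-up field pairs from two hypothetical linked records.
    pairs = [
        ("MARIA", "MARIA"),
        ("SOUZA", "SOUSA"),
        ("1990", "1980"),
        ("ANA", "ANNA"),  # skipped: lengths differ
    ]
    for (a, b), n in substitution_counts(pairs).items():
        print(f"{a} -> {b}: {n}")
```

Each counted pair corresponds to an edge between two character nodes, and the count (or a normalized rate derived from it) gives the edge weight.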
Figure 4. Benchmark for the main record linkage tools.
ROC curves for the linkage runs on real data from the meningitis (MEN), HIV, and tuberculosis (TB) disease databases, using logistic regression (LR) and random forest (RF) (A). Performance metrics of the ML approach for each database (B). Accuracy (%) for each disease database linked against the death database, for each method in the benchmark (C). Execution time (in log10 s) (D). RAM consumption in GB (E). The RecordLinkage R package could not be run with non-blocking methods on the TB database using the workstation described in the methods.