Claudia Medina Coeli, Valeria Saraceni, Paulo Mota Medeiros, Helena Pereira da Silva Santos, Luis Carlos Torres Guillen, Luís Guilherme Santos Buteri Alves, Thomas Hone, Christopher Millett, Anete Trajman, Betina Durovni.
Abstract
BACKGROUND: Linking Brazilian databases demands the development of algorithms and processes to deal with various challenges, including the large size of the databases, the low number and poor quality of personal identifiers available to be compared (national security number not mandatory), and some characteristics of Brazilian names that make the linkage process prone to errors. This study aims to describe and evaluate the quality of the processes used to create an individual-linked database for data-intensive research on the impacts on health indicators of the expansion of primary care in Rio de Janeiro City, Brazil.
Keywords: Brazil; Data accuracy; Medical record linkage; Primary healthcare
Year: 2021 PMID: 34130670 PMCID: PMC8204416 DOI: 10.1186/s12911-021-01550-6
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Overview of data sources
| Database | Social Benefits National Registry (Cadastro Único—CadU) | Family Health Registry (Sistema de Cadastro da Estratégia de Saúde da Família—FHR) | Electronic Medical Registry (Prontuário Eletrônico de Pacientes—EMR) | National Hospital Admission System (Sistema de Informações Hospitalares—SIH) | National Mortality Information System (Sistema de Informações sobre Mortalidade—SIM)a |
|---|---|---|---|---|---|
| Time period | 2008–2014 | 2009–2016 | 2011–2017 | 2011–2016 | 1999–2016 |
| Database size | 1,680,700 registrations—1,679,320 individuals | 3,732,688 registrations—3,594,623 individuals | 17,764,475 consultations—16,808,685 single consultations | 1,787,601 hospitalizations | 2,263,964 deaths |
| Personal identifiers | Name; Date of birth; Mother’s name; Address; Social Security Number (Cadastro de pessoa física—CPF); National register for social benefit (Número de inscrição social—NIS) | Name; Date of birth; Mother’s name; Address; Social Security Number (Cadastro de pessoa física—CPF); National register for social benefit (Número de inscrição social—NIS) | Name; Date of birth; Social Security Number (Cadastro de pessoa física—CPF) | Name; Date of birth; Mother’s name; Address | Name; Date of birth; Mother’s name; Address |
| Content variables | Demographic characteristics, detailed socioeconomic, and housing data | Demographic characteristics, detailed socioeconomic, housing data, self-reported chronic diseases, a summary of health care used, vital status | Patient data (e.g., demographic characteristics), the reason for encounter, laboratory and test requests, treatment plans | Patient data (e.g., demographic characteristics); hospital data (e.g., public or private); hospitalization data (e.g., admission and discharge dates, intermediary unit use, diagnosis upon discharge) | Demographic characteristics, socioeconomic data, date of death, cause of death, data on mother in the case of fetal death, and death of children under 1 year |
a Database from Rio de Janeiro State; all others from Rio de Janeiro City
Fig. 1 Flow diagram of the record linkage process
Rules applied in the deterministic approach to classifying pairs as matches
| Rules | |
|---|---|
| (1) Exact agreement on the deterministic linkage key | |
| (2) Exact agreement on the social security number (CPF) | |
| (3) Exact agreement on the National register for social benefit (NIS) | |
| (4) Exact agreement on date of birth | |
| (5) The Levenshtein distance of the individual’s name < 3 | |
| (6) The Levenshtein distance of the mother’s name < 3 | |
| (7) Exact agreement on the individual’s name | |
| Linkage processes’ criteria | |
| CadU versus FHR | (1, 5 and 6) OR (2, 5 and 6) OR (3, 5 and 6) OR (2 and 4) OR (3 and 4) |
| CadU versus SIH | (1, 5 and 6) |
| CadU versus SIM | (1, 5 and 6) |
| FHR versus EMR | (1 and 5) OR (2 and 5) OR (4 and 7) |
The Levenshtein edit distance measures the minimum number of edits (insertions, deletions, or substitutions) required to change one name string into the other [18]
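As a rough illustration of how the rules above combine, the following Python sketch implements the Levenshtein distance and the CadU versus FHR criterion from the table. The record field names (`linkage_key`, `cpf`, `nis`, `birth_date`, `name`, `mother_name`) are hypothetical, and treating a missing identifier as a non-agreement is an assumption rather than something the table states.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def same(a: dict, b: dict, field: str) -> bool:
    # Assumption: a missing identifier never counts as an exact agreement.
    return a.get(field) is not None and a.get(field) == b.get(field)


def is_match_cadu_fhr(a: dict, b: dict) -> bool:
    """(1, 5 and 6) OR (2, 5 and 6) OR (3, 5 and 6) OR (2 and 4) OR (3 and 4)."""
    r1 = same(a, b, "linkage_key")  # rule 1 (hypothetical field name)
    r2 = same(a, b, "cpf")          # rule 2
    r3 = same(a, b, "nis")          # rule 3
    r4 = same(a, b, "birth_date")   # rule 4
    r5 = levenshtein(a["name"], b["name"]) < 3                 # rule 5
    r6 = levenshtein(a["mother_name"], b["mother_name"]) < 3   # rule 6
    return ((r1 or r2 or r3) and r5 and r6) or ((r2 or r3) and r4)


cadu = {"cpf": "123", "nis": None, "birth_date": "1980-05-17",
        "name": "MARIA SOUZA", "mother_name": "ANA SOUZA"}
fhr = {"cpf": "123", "nis": None, "birth_date": "1980-05-17",
       "name": "MARIA SOUSA", "mother_name": "ANA SOUSA"}
print(is_match_cadu_fhr(cadu, fhr))
```

In this made-up pair the CPF agrees and both names differ by a single substitution, so the pair satisfies (2, 5 and 6) as well as (2 and 4).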
Description of the probabilistic linkage blocking passes
| Blocking pass | Indexing key | Comparison | Calculated score range according to estimated weights | Score cutoff value: CadU × FHR | Score cutoff value: CadU × SIH | Score cutoff value: CadU × SIM |
|---|---|---|---|---|---|---|
| 1 | Soundex first name + Soundex last name + sex + birth year | Individual name + mother’s name + birth date | −38.71 to 34.99 | 28.60 | 33.97 | 21.90 |
| 2 | Soundex first name + sex + birth year | Individual name + mother’s name + birth date | −38.71 to 34.99 | 31.67 | 32.88 | 31.91 |
| 3 | Soundex last name + sex + birth year | Individual name + mother’s name + birth date | −38.71 to 34.99 | 34.00 | 34.55 | 34.11 |
| 4 | Soundex first name + Soundex last name + sex | Individual name + mother’s name + birth date | −38.71 to 34.99 | 32.73 | 33.00 | 31.70 |
| 5 | Soundex first name + Soundex last name + birth year | Individual name + mother’s name + birth date | −38.71 to 34.99 | 32.21 | 33.12 | 34.44 |
| 6 | Soundex first name + Soundex last name + sex + birth year | First individual name + last individual name + mother’s name + birth date | −53.51 to 45.38 | 41.30 | 43.86 | 44.23 |
| 7 | Soundex first name + Soundex last name + sex + birth year | Individual name + birth date | −32.57 to 21.90 | 17.20 | 17.68 | 17.32 |
Social Benefits National Registry (Cadastro Único—CadU); Family Health Registry (Sistema de Cadastro da Estratégia de Saúde da Família—FHR); National Hospital Admission System (Sistema de Informações Hospitalares—SIH); National Mortality Information System (Sistema de Informações sobre Mortalidade—SIM)
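To make the indexing keys in the table more concrete, here is a minimal Python sketch of a blocking key in the style of pass 1 (Soundex first name + Soundex last name + sex + birth year). It uses the standard American Soundex rules with accents stripped; whether the study used this exact variant is not stated here, so treat the function names, the ISO date format, and the accent handling as assumptions.

```python
import unicodedata

# Standard American Soundex letter groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex code; accents are stripped first (an assumption)."""
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii").upper())
    letters = [c for c in ascii_name if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(out) + "000")[:4]


def blocking_key(full_name: str, sex: str, birth_date: str) -> str:
    """Pass-1 style key; birth_date is assumed to be an ISO 'YYYY-MM-DD' string."""
    parts = full_name.split()
    return soundex(parts[0]) + soundex(parts[-1]) + sex + birth_date[:4]


print(blocking_key("Maria Souza", "F", "1980-05-17"))
print(blocking_key("María Sousa", "F", "1980-11-02"))
```

Both calls produce the same key, so the two records would fall into the same block and be compared on individual name, mother's name, and birth date. Only pairs whose score reaches the cutoff for that pass would then be kept as matches.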
Fig. 2 General rules for record pair classification
Completeness of personal identifiers available for record linkage
| Database | Social Benefits National Registry (Cadastro Único—CadU) | Family Health Registry (Sistema de Cadastro da Estratégia de Saúde da Família—FHR) | Electronic Medical Registry (Prontuário Eletrônico de Pacientes—EMR) | National Hospital Admission System (Sistema de Informações Hospitalares—SIH) | National Mortality Information System (Sistema de Informações sobre Mortalidade—SIM)a |
|---|---|---|---|---|---|
| | N = 1,680,700 | N = 3,732,688 | N = 17,764,475 | N = 1,787,601 | N = 2,263,964 |
| Identifiers | % | % | % | % | % |
| Name | 100 | 100 | 99.2 | 96.6 | 99.8 |
| Mother’s name | 99.6 | 100 | (–) | 90.2 | 96.6 |
| Date of birth | 100 | 100 | 100 | 100 | 97.2 |
| Sex | 100 | 100 | 100 | 100 | 100 |
| Address | 82.3 | 98.8 | 80.0 | 98.6 | 88.6 |
| Social security number | 56.5 | 82.7 | 84.0 | (–) | (–) |
| NIS | 99.6 | 7.3 | 7.3 | (–) | (–) |
Social security number (Cadastro de Pessoa Física—CPF)
National register for social benefit (Número de Inscrição Social—NIS)
(–) Attribute not available in the database
a Excluding fetal deaths and deaths of children under 1 year old
Fig. 3 Overview of matches identified in each approach. “Total of matches” excludes record pairs identified as wrongly assigned as matches in the procedure of preparing datasets for analysis
False match proportion according to the linkage approach
| Measures | CadU vs. FHR | CadU vs. SIH | CadU vs. SIM | FHR vs. EMR |
|---|---|---|---|---|
| Deterministic (N = 744) | 1.07 (0.47–2.11) | 0 | 0.13 (0.00–0.07) | 0.67 (0.22–1.56) |
| Probabilistic (N = 744) | 0.27 (0.03–0.97) | 0 | 0.67 (0.22–1.56) | (–) |
| Clerical review (N = 744) | 0.40 (0.08–1.17) | 3.89 (2.63–5.55) | 2.55 (1.54–3.95) | (–) |
(–) Linkage approach not executed
False match: records from different individuals that are linked. Missed match: records from the same individual that are not linked
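The arithmetic behind a false match proportion with an interval can be sketched as follows. The table does not say how its intervals were computed; this example assumes an exact (Clopper-Pearson) 95% binomial interval, and the count of 8 false matches in a sample of 744 is an illustrative value chosen because it gives 1.07%. The function names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def _solve(k: int, n: int, target: float) -> float:
    # binom_cdf(k, n, p) decreases as p grows, so bisect on p.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x: int, n: int, alpha: float = 0.05) -> tuple:
    """Clopper-Pearson interval for x events in n trials (assumed method)."""
    lower = 0.0 if x == 0 else _solve(x - 1, n, 1 - alpha / 2)
    upper = 1.0 if x == n else _solve(x, n, alpha / 2)
    return lower, upper


false_matches, sample = 8, 744  # illustrative count, not taken from the table
low, high = exact_ci(false_matches, sample)
print(f"{100 * false_matches / sample:.2f}% "
      f"({100 * low:.2f}-{100 * high:.2f})")
```

Bisection keeps the sketch dependency-free; a statistics library with a beta quantile function would give the same bounds more directly.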