Ana Margarida Pereira, Joao A Fonseca, Cristina Costa-Santos, Ana Luisa Neves, Ricardo Correia, Paulo Santos, Matilde Monteiro-Soares, Alberto Freitas, Ines Ribeiro-Vaz, Teresa S Henriques, Pedro Pereira Rodrigues, Altamiro Costa-Pereira.
Abstract
OBJECTIVES: High-quality data are crucial for guiding decision-making and practising evidence-based healthcare, especially if previous knowledge is lacking. Nevertheless, data quality frailties have been exposed worldwide during the current COVID-19 pandemic. Focusing on a major Portuguese epidemiological surveillance dataset, our study aims to assess COVID-19 data quality issues and suggest possible solutions.
SETTINGS: On 27 April 2020, the Portuguese Directorate-General of Health (DGS) made available a dataset (DGSApril) for researchers, upon request. On 4 August, an updated dataset (DGSAugust) was also obtained.
PARTICIPANTS: All COVID-19-confirmed cases notified through the medical component of the National System for Epidemiological Surveillance until the end of June.
PRIMARY AND SECONDARY OUTCOME MEASURES: Data completeness and consistency.
Keywords: COVID-19; epidemiology; health informatics; information management; public health; statistics & research methods
Year: 2021 PMID: 34872992 PMCID: PMC8649880 DOI: 10.1136/bmjopen-2020-047623
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 2.692
Figure 1. Example of one possible information flow from the moment the data are entered until the dataset is made available to researchers. The ⊗ symbol means that data are not sent and are therefore not present in the research database (DB). The dashed line represents a cumbersome manual process that is often carried out by public health professionals and is highly susceptible to errors. DGS, Directorate-General of Health.
Figure 2. Number of unique case identifiers present in the datasets for COVID-19 cases diagnosed from the start of the pandemic until 27 April (the date when the first database was made available) and after 27 April. DGS, Directorate-General of Health.
Data completeness (number and percentage of missing information) of each variable available in the DGSApril and DGSAugust datasets with COVID-19 cases provided by DGS
| Variable | DGSApril (n=20 293) | | DGSAugust (n=38 545) | |
| | System missing | Coded as unknown | System missing | Coded as unknown |
| Unique case identifier (RecordID) | 0 | 0 | 0 | 0 |
| RecordID of the linked cases | * | * | * | * |
| Age | 0 | 0 | 0 | 0 |
| Probable place of infection | 0 | 0 | 0 | 0 |
| Gender | 0 | 0 | 0 | 0 |
| Hospitalisation | 0 | 1623 (8) | 3 (0) | 3425 (9) |
| Outcome | 0 | 23 (0) | † | † |
| Patient has underlying condition | 0 | 2 (0) | 15 407 (40) | 2495 (6) |
| Date of first positive laboratory result | 19 268 (95) | 0 | 34 667 (90) | 0 |
| Date of diagnosis | ‡ | ‡ | 7 (0) | 0 |
| Date of disease onset | 4815 (24) | 0 | 15 045 (39) | 0 |
| Date of death | ‡ | ‡ | 37 390 (97) | 0 |
| Date of recovery | ‡ | ‡ | 21 499 (56) | 0 |
| Hospitalised cases only | n=2973 | | n=4327 | |
| Date of hospitalisation | 386 (13) | 0 | 860 (20) | 0 |
| Case required care in an intensive care unit | 0 | 2712 (91) | 1122 (26) | 0 |
| Level of respiratory support given to patient | 0 | 1573 (53) | 1364 (31) | 172 (4) |
| Date of hospital discharge | ‡ | ‡ | 3975 (92) | 0 |
*Variable neither available in DGSApril dataset nor in DGSAugust dataset but described in the metadata file provided by DGS.
†Variable available in DGSApril dataset and described in the metadata file provided by DGS but not provided in DGSAugust dataset.
‡Variable neither available in DGSApril dataset nor described in the metadata file provided by DGS but provided in DGSAugust dataset.
DGS, Directorate-General of Health.
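The two kinds of missingness tallied in the table above can be counted along the following lines. This is a minimal sketch, not the authors' procedure: the record layout (a list of dictionaries), the function name `completeness` and the set `UNKNOWN_CODES` are assumptions made for illustration, and the actual coding used by DGS is not described in the source.

```python
# Hypothetical codes that a registry might use for "unknown" (assumption).
UNKNOWN_CODES = {"unknown", "unk"}


def completeness(records, variables):
    """Count system-missing and coded-as-unknown values for each variable.

    A value is treated as system missing when the key is absent, None, or an
    empty string; it is treated as coded unknown when it matches one of the
    UNKNOWN_CODES (case-insensitive).
    """
    n = len(records)
    summary = {}
    for var in variables:
        missing = 0
        unknown = 0
        for record in records:
            value = record.get(var)
            if value is None or value == "":
                missing += 1
            elif isinstance(value, str) and value.strip().lower() in UNKNOWN_CODES:
                unknown += 1
        summary[var] = {
            "system_missing": missing,
            "system_missing_pct": round(100 * missing / n) if n else 0,
            "coded_unknown": unknown,
            "coded_unknown_pct": round(100 * unknown / n) if n else 0,
        }
    return summary
```

Keeping the two counts separate matters because a blank cell and an explicit "unknown" answer call for different remedies at data entry.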
Number and percentage of COVID-19 cases present in both datasets (n=16 218) whose information did not match, for each variable
| Healthcare data inconsistencies | |
| Patient has underlying condition | 8902 (55) |
| Age* | 8326 (51) |
| Hospitalisation | 253 (16) |
| Date of disease onset | 2008 (12) |
| Date of first positive laboratory result | 962 (6) |
| Probable place of infection | 46 (0) |
| Gender of the reported case | 1 (0) |
*The definition of ‘age’ differed between the two datasets: in DGSApril, it is the age at the time of COVID-19 onset, and in DGSAugust, the age at the time of COVID-19 notification.
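A consistency check of this kind can be sketched as a comparison of the cases that share an identifier in both datasets. The function below is an illustrative assumption rather than the authors' code: it takes each dataset as a list of dictionaries and uses `RecordID` (the unique case identifier named in the completeness table) as the linking key.

```python
def count_mismatches(first, second, variables, key="RecordID"):
    """Compare records that share the same identifier in two datasets.

    Returns the number of linked cases and, for each variable, how many of
    those cases carry different values in the two datasets.
    """
    second_by_id = {record[key]: record for record in second}
    linked = [
        (record, second_by_id[record[key]])
        for record in first
        if record[key] in second_by_id
    ]
    mismatches = {
        var: sum(1 for a, b in linked if a.get(var) != b.get(var))
        for var in variables
    }
    return len(linked), mismatches
```

A variable whose definition changed between releases, as noted for age, will show up here as a high mismatch count even when no entry error occurred.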
Number of COVID-19 cases and deaths due to COVID-19 reported by the DGSAugust dataset and by the daily public report
| Month | COVID-19 cases reported by: | | | Deaths due to COVID-19 reported by: | | |
| | DGSAugust | Daily public report | Difference | DGSAugust | Daily public report | Difference |
| March | 8920 | 8251 | +669 | 192 | 187 | +5 |
| April | 13 838 | 16 736 | −2898 | 750 | 820 | −70 |
| May | 7113 | 7713 | −600 | 213 | 417 | −204 |
| June | 8649 | 9823 | −1174 | 0 | 155 | −155 |
DGS, Directorate-General of Health.
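The Difference columns above are the DGSAugust count minus the daily public report count. A short sketch that recomputes them from the tabulated monthly figures follows; the names `MONTHLY` and `differences` are illustrative only.

```python
# Monthly counts copied from the table above, in the order:
# (cases in DGSAugust, cases in daily report,
#  deaths in DGSAugust, deaths in daily report)
MONTHLY = {
    "March": (8920, 8251, 192, 187),
    "April": (13838, 16736, 750, 820),
    "May": (7113, 7713, 213, 417),
    "June": (8649, 9823, 0, 155),
}


def differences(monthly):
    """Return (case difference, death difference) per month, dataset minus report."""
    return {
        month: (cases_ds - cases_rep, deaths_ds - deaths_rep)
        for month, (cases_ds, cases_rep, deaths_ds, deaths_rep) in monthly.items()
    }
```

A negative value means the research dataset holds fewer cases or deaths than were publicly reported for that month.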
Prevalence estimation for each precondition by DGSApril (used in the study by Nogueira and colleagues [19]) and by the updated dataset
| Precondition | Nogueira and colleagues’ study | Updated dataset |
| Asthma | 1.36 (1.20 to 1.53) | 4.74 (4.44 to 5.08) |
| Cancer | 3.01 (2.78 to 3.26) | 5.45 (5.12 to 5.81) |
| Cardiac disease | 0.27 (0.20 to 0.35) | – |
| Haematological disorder | 1.08 (0.09 to 1.24) | 2.00 (1.79 to 2.22) |
| Diabetes | 5.64 (5.33 to 5.97) | 12.3 (11.8 to 12.8) |
| HIV/other immune deficiencies | 0.53 (0.43 to 0.64) | 1.35 (1.18 to 1.54) |
| Kidney disorder | 1.98 (1.79 to 2.18) | 4.33 (4.02 to 4.65) |
| Liver disorder | 0.53 (0.43 to 0.64) | 1.27 (1.11 to 1.46) |
| Lung disorder | 3.39 (3.15 to 3.65) | 4.50 (4.19 to 4.82) |
| Neuromuscular disorder | 3.92 (3.66 to 4.19) | 3.50 (3.23 to 3.79) |
| At least one precondition | 16.6 (16.1 to 17.1) | 40.3 (39.7 to 41.0) |
DGS, Directorate-General of Health.
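Estimates of this form, a prevalence with an interval, can be reproduced from a count of affected cases and a denominator. The source does not state which interval method was used, so the Wilson score interval below is an assumption chosen for illustration; the function name and parameters are likewise hypothetical.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Return (proportion, lower, upper) using the Wilson score interval.

    events: number of cases with the precondition
    n:      number of cases with known precondition status
    z:      normal quantile (1.96 corresponds to a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= events <= n:
        raise ValueError("events must lie between 0 and n")
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half_width, centre + half_width
```

Multiplying the three returned values by 100 gives percentages in the same layout as the table. The choice of denominator (all cases, or only cases with known status) changes the estimate markedly when many values are missing.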
Most frequent data quality issues and possible solutions
| Issues | Solutions |
| ‘Missing’ versus ‘absent’ variable coding | Automatically code blank cells as system missing |
| Differences in cases included | Guarantee the same unique case identifier by recording it in the registry database |
| Data (in)completeness | Determine a core of mandatory variables |
| Data (in)consistency | Maintain the same variables (and respective definitions) over time |
| Data entry errors | Improve information system (by determining possible values and limits) |
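The last solution, constraining possible values and limits at data entry, can be sketched as a small validator. Everything in it is hypothetical: the allowed categories, the age limits and the field names are assumptions chosen for illustration and do not describe the actual surveillance system.

```python
# Hypothetical allowed values per categorical variable (assumption).
ALLOWED_VALUES = {
    "Gender": {"female", "male"},
    "Hospitalisation": {"yes", "no", "unknown"},
}

# Hypothetical plausible limits for age in years (assumption).
AGE_LIMITS = (0, 120)


def validate(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for var, allowed in ALLOWED_VALUES.items():
        if record.get(var) not in allowed:
            problems.append(f"{var}: value {record.get(var)!r} is not allowed")
    age = record.get("Age")
    low, high = AGE_LIMITS
    if not isinstance(age, int) or not low <= age <= high:
        problems.append(f"Age: value {age!r} is outside {low} to {high}")
    return problems
```

Rejecting or flagging a record at entry time, rather than during later analysis, keeps implausible values from reaching the research database.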