Carlos Sáez, Nekane Romero, J Alberto Conejero, Juan M García-Gómez.
Abstract
OBJECTIVE: The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning.
Keywords: COVID-19; biases; data quality; data sharing; dataset shift; distributed research networks; heterogeneity; machine learning; multi-site data; variability
Year: 2021 PMID: 33027509 PMCID: PMC7797735 DOI: 10.1093/jamia/ocaa258
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. COVID-19 (coronavirus disease 2019) patient subgroups in the nCov2019 dataset, in which cases from the 2 most prevalent countries, China and the Philippines, separated into distinct subgroups with different severity manifestations. (A) Scatterplot of subgroups embedded by multiple correspondence analysis on 3 dimensions from symptoms and comorbidities. (B) The same scatterplot but labeled by the country of the case. Subgroups 2, 3, 5, and 6 consist of data from the Philippines. Subgroups 4 and 1 mostly represent data from China. Subgroup 1 comprised young patients with mild disease (acute nasopharyngitis) and no comorbidities. Subgroup 2 comprised elderly patients with severe pulmonary disease (pneumonia, acute respiratory distress syndrome) and comorbidities (hypertension, diabetes mellitus, chronic kidney disease). Subgroup 3 comprised middle-aged patients with severe pulmonary disease (pneumonia, acute respiratory distress syndrome), similar to subgroup 2 but with no remarkable comorbidities. Subgroup 4 comprised elderly patients with mild disease (acute nasopharyngitis) and no comorbidities. The negative outcome within this subgroup might be explained by either poor in-hospital evolution or unreported comorbidities, in which case the lack of complete patient data might lie at the root of potential bias. Subgroup 5 comprised elderly patients with severe systemic disease (septic shock, acute kidney injury) and comorbidities. Subgroup 6 comprised elderly patients with severe pulmonary disease (pneumonia) and heart failure due to acute coronary syndrome (most likely diagnosed on admission). For further details, see http://covid19sdetool.upv.es/?tab=ncov2019.
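The embedding step named in the caption, multiple correspondence analysis (MCA) on 3 dimensions from binary symptom and comorbidity indicators, can be illustrated with a minimal sketch. This is a generic textbook formulation (correspondence analysis of a complete indicator matrix via singular value decomposition), not necessarily the procedure the authors used. The function name mca_embed, the four column names, and the six toy patients are hypothetical and invented for illustration only; they are not taken from the nCov2019 dataset.

```python
import numpy as np


def mca_embed(indicator, n_components=3):
    """Principal row coordinates from correspondence analysis of a 0/1 indicator matrix."""
    Z = np.asarray(indicator, dtype=float)
    Z = Z[:, Z.sum(axis=0) > 0]          # drop categories that never occur
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals from the independence model r c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    # Scale left singular vectors into principal coordinates
    return (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]


# Hypothetical toy data: rows are patients, columns are binary findings
# (pneumonia, ARDS, hypertension, diabetes). Values are invented.
X = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
])

# Complete disjunctive coding: one column for "present", one for "absent"
Z = np.hstack([X, 1 - X])

coords = mca_embed(Z, n_components=3)    # one 3-dimensional point per patient
```

Each row of coords is a point that could be drawn in a 3-dimensional scatterplot like the one described in panel (A). Patients with identical indicator profiles map to the same point, and any subgroup discovery (for example, clustering) would be a separate step applied to these coordinates.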