Khaled El Emam, Lucy Mosquera, Elizabeth Jonker, Harpreet Sood.
Abstract
BACKGROUND: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy-protective manner.
Keywords: data access; data sharing; data synthesis; synthetic data
Year: 2021 PMID: 33709065 PMCID: PMC7936723 DOI: 10.1093/jamiaopen/ooab012
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Fields in the Canadian COVID-19 case dataset used for our study
| Variables | Definitions |
|---|---|
| Date reported | Number of days since January 1, 2020 |
| Health region | 34 unique regions |
| Age group | Decades from 20 to 80+ (ordinal) |
| Gender | |
| Exposure | Close contact, outbreak, travel, not reported |
| Case status | Recovered, deceased, active |
Fields included on the health region community
| Variables | Definitions |
|---|---|
| Proportion living in rural areas | Rural areas are defined as all territory lying outside population centers (population centers have a population of at least 1000 and a density of 400 or more persons per square kilometer) |
| Proportion of immigrants | An immigrant is a person who is, or who has ever been, a landed immigrant or permanent resident. Such a person has been granted the right to live in Canada permanently by immigration authorities. Immigrants who have obtained Canadian citizenship by naturalization are included in this group. |
| Proportion of aboriginal population | Aboriginal identity is based on whether the person identified with the Aboriginal peoples of Canada. This includes those who are First Nations, Métis or Inuk (Inuit) and/or those who are registered or treaty Indians (i.e. registered under the Indian Act of Canada) and/or those who have membership in a First Nation or Indian band. |
| Prevalence of diabetes | Population age 12 and older who reported having been diagnosed by a health professional as having type 1 or type 2 diabetes; includes females age 15 and older who reported having been diagnosed with gestational diabetes. |
| Prevalence of COPD | Population age 35 and older who reported being diagnosed by a health professional with chronic bronchitis, emphysema or chronic obstructive pulmonary disease (COPD). |
| Prevalence of high blood pressure | Population age 12 and older who reported that they have been diagnosed by a health professional as having high blood pressure. |
| Family medicine physicians per 100 000 population | The number of family medicine physicians per 100 000 population. |
| Proportion reporting Moderate-to-severe Food Insecurity | Food security is commonly understood to exist in a household when all people, at all times, have access to sufficient safe and nutritious food for an active and healthy life. Conversely, food insecurity occurs when food quality and/or quantity are compromised and is typically associated with limited financial resources. |
The definitions are taken from the source document [51].
Figure 1. Process diagram for the analysis method. The diagram shows the steps for each iteration of the bootstrap sampling. The testing data is the out-of-sample subset in each bootstrap iteration. cPDP stands for conditional partial dependency plot.
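The resampling step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `bootstrap_splits` and `percentile_ci` are hypothetical, and the percentile method for the confidence interval is an assumption, since the caption does not say how the bootstrap intervals were formed.

```python
import random

def bootstrap_splits(n_rows, n_iterations, seed=0):
    """Yield (train_idx, test_idx) pairs, one per bootstrap iteration."""
    rng = random.Random(seed)
    for _ in range(n_iterations):
        # Draw n_rows indices with replacement to form the training sample.
        train_idx = [rng.randrange(n_rows) for _ in range(n_rows)]
        in_sample = set(train_idx)
        # Rows never drawn in this iteration form the out-of-sample testing data.
        test_idx = [i for i in range(n_rows) if i not in in_sample]
        yield train_idx, test_idx

def percentile_ci(values, level=0.95):
    """Percentile interval over bootstrap estimates (assumed method, nearest rank)."""
    ordered = sorted(values)
    alpha = (1.0 - level) / 2.0
    k = int(alpha * (len(ordered) - 1))  # number of ranks trimmed from each tail
    return ordered[k], ordered[len(ordered) - 1 - k]
```

In use, a model would be fitted on the rows in `train_idx`, scored on the rows in `test_idx`, and the per-iteration scores passed to `percentile_ci`.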
Summary statistics on the variables analyzed (n = 90 514 Ontario cases)
| Variable | Mean (SD) | Proportion |
|---|---|---|
| Date reported (days since January 1, 2020) | 214.43 (82.66) | |
| Gender | | |
| Male | | 48.5% |
| Age group | | |
| <20 | | 11.2% |
| [20–29] | | 20.8% |
| [30–39] | | 15.5% |
| [40–49] | | 13.8% |
| [50–59] | | 14.7% |
| [60–69] | | 9.4% |
| [70–79] | | 5.3% |
| 80+ | | 9.3% |
| Exposure | | |
| Travel related | | 3.4% |
| Close contact | | 40% |
| Outbreak | | 24.6% |
| Not reported | | 32% |
| % living in rural areas | 6.98 (12) | |
| % of immigrants | 37.04 (14.59) | |
| % of aboriginal population | 1.64 (2.04) | |
| Prevalence of diabetes | 7.73 (1.45) | |
| Prevalence of COPD | 3.26 (1.38) | |
| Prevalence of high blood pressure | 17.29 (2.2) | |
| Family medicine physicians per 100 000 population | 112.57 (102.5) | |
| Proportion reporting moderate-to-severe food insecurity | 7.99 (1.81) | |
Mean model accuracy results for the real and synthetic datasets with the 95% bootstrap confidence interval
| Accuracy metric | Real data | Synthetic data | CI overlap |
|---|---|---|---|
| AUROC | 0.945 (0.941–0.948) | 0.940 (0.936–0.945) | 45.50% |
| AUPRC | 0.340 (0.314–0.368) | 0.313 (0.286–0.342) | 52.02% |
The confidence interval overlap between the real and synthetic CIs is also shown in the last column.
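The record does not give the formula behind the CI overlap column. One commonly used definition averages the shared length of the two intervals as a fraction of each interval's width; the sketch below assumes that definition, and the function name `ci_overlap` is hypothetical. Because the table reports rounded interval endpoints, feeding them into this function should not be expected to reproduce the reported percentages exactly.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fraction of each interval covered by their intersection.

    Assumed definition, not confirmed by the source. Identical intervals give
    1.0; disjoint intervals give a negative value under this formula.
    """
    (real_lo, real_hi), (synth_lo, synth_hi) = real_ci, synth_ci
    # Length of the intersection (negative when the intervals do not meet).
    shared = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    return 0.5 * (shared / (real_hi - real_lo) + shared / (synth_hi - synth_lo))
```

For example, `ci_overlap((0.0, 2.0), (1.0, 3.0))` evaluates to 0.5, since the shared segment covers half of each interval.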
Figure 2. Variable importance using the permutation method with AUROC as the accuracy metric and the 95% bootstrap confidence interval. The values on the side are the confidence interval overlap values between the real and synthetic datasets.
Figure 3. Variable importance using the permutation method with AUPRC as the accuracy metric and the 95% bootstrap confidence interval. The values on the side are the confidence interval overlap values between the real and synthetic datasets.
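Permutation importance, as named in the captions for Figures 2 and 3, measures how much an accuracy metric drops when one variable's values are shuffled in the test data. The following is a minimal sketch under stated assumptions: rows are plain dictionaries, `predict` is any hypothetical scoring function, and the pairwise AUROC helper is a simple stand-in for whatever implementation the authors used. Another metric, such as AUPRC, could be passed through the `metric` parameter.

```python
import random

def auroc(labels, scores):
    """AUROC as the chance a positive case outscores a negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(predict, rows, labels, column, metric=auroc, seed=0):
    """Drop in the metric after shuffling one column of the test rows."""
    rng = random.Random(seed)
    baseline = metric(labels, [predict(r) for r in rows])
    # Shuffle only the chosen column, leaving every other field untouched.
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [{**r, column: v} for r, v in zip(rows, shuffled)]
    return baseline - metric(labels, [predict(r) for r in permuted])
```

A variable the model ignores yields an importance of exactly zero, while a variable the model relies on yields a positive drop in most shuffles.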
Figure 4. Conditional partial dependence plot for date reported with bootstrap confidence intervals on the real and synthetic datasets. The date reported is measured as the number of days since January 1, 2020.
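To read the date-reported axis as calendar dates, the day counts can be converted back with a small helper. This sketch assumes January 1, 2020 corresponds to day 0, which the source does not state explicitly; the function names are hypothetical.

```python
from datetime import date, timedelta

ORIGIN = date(2020, 1, 1)  # assumed to correspond to day 0

def days_to_date(days_since_origin):
    """Convert a day count since January 1, 2020 into a calendar date."""
    return ORIGIN + timedelta(days=days_since_origin)

def date_to_days(d):
    """Convert a calendar date into days since January 1, 2020."""
    return (d - ORIGIN).days
```

Under that assumption, day 214 falls on August 2, 2020.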
Figure 5. Conditional partial dependence plot for age with 95% bootstrap confidence intervals on the real and synthetic data. Confidence interval overlap is annotated at the top of the plot for each age group.