Randi Foraker, Aixia Guo, Jason Thomas, Noa Zamstein, Philip Ro Payne, Adam Wilcox.
Abstract
BACKGROUND: Computationally derived ("synthetic") data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record data. Synthetic data can support data sharing to answer critical research questions to address the COVID-19 pandemic.
Keywords: COVID-19; data analysis; electronic health records and systems; protected health information; synthetic data
Year: 2021 PMID: 34559671 PMCID: PMC8491642 DOI: 10.2196/30697
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Comparison of patient characteristics of available demographic and clinical variables: original vs synthetic data.
| Characteristic | Original data (n=230,703) | Synthetic data (n=230,650) |
| Age (years), mean (SD) | 41.6 (20.4) | 41.6 (20.4) |
| Gender (male), n (%) | 108,194 (46.9) | 107,892 (46.8) |
| Race, n (%) | | |
| White | 121,706 (52.8) | 121,564 (52.7) |
| Black | 40,930 (17.7) | 40,824 (17.7) |
| Asian | 5203 (2.3) | 5117 (2.2) |
| Other/unknown | 62,864 (27.2) | 62,733 (27.2) |
| | | |
| 1 | 29,875 (12.9) | 28,617 (12.4) |
| 2 | 21,191 (9.2) | 20,671 (9.0) |
| 3 | 21,045 (9.1) | 20,319 (9.0) |
| 4 | 18,006 (7.8) | 16,998 (7.4) |
| 5 | 14,391 (6.2) | 13,840 (6.0) |
| | | |
| 1 | 33,413 (14.5) | 32,743 (14.2) |
| 2 | 24,533 (10.6) | 23,986 (10.4) |
| 3 | 15,578 (6.8) | 15,065 (6.5) |
| 4 | 11,870 (5.1) | 11,255 (4.9) |
| 5 | 11,354 (4.9) | 10,850 (4.7) |
| Household income (US $), median (IQR) | 56,738 (45,214, 71,250) | 56,662 (45,223, 71,029) |
| BMI, mean (SD) | 30.3 (8.4) | 30.3 (8.2) |
| Admission start date (days from reference), mean (SD) | 2.1 (3.3) | 2.0 (3.2) |
| Minimum oxygen saturation, mean (SD) | 90.9 (10.1) | 91.0 (9.7) |
| Diabetes, n (%) | 31,942 (13.8) | 31,929 (13.8) |
| Dyspnea, n (%) | 20,867 (9.0) | 20,826 (9.0) |
| Chronic kidney disease, n (%) | 11,225 (4.9) | 11,194 (4.9) |
| Fever, n (%) | 30,210 (13.1) | 30,200 (13.1) |
| Cough, n (%) | 39,703 (17.2) | 39,689 (17.2) |
| Deceased, n (%) | 1133 (0.5) | 1008 (0.4) |
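The n (%) cells are counts divided by the cohort size given in the column header. A minimal sketch of that arithmetic, using the male-gender row; the helper name `percent_of_cohort` is illustrative and does not come from the source:

```python
# Cohort sizes taken from the column headers of the comparison table.
ORIGINAL_N = 230_703
SYNTHETIC_N = 230_650


def percent_of_cohort(count: int, total: int, digits: int = 1) -> float:
    """Return count as a percentage of total, rounded to `digits` places."""
    return round(100 * count / total, digits)


# Male gender row: 108,194 original vs 107,892 synthetic patients.
original_pct = percent_of_cohort(108_194, ORIGINAL_N)
synthetic_pct = percent_of_cohort(107_892, SYNTHETIC_N)
print(original_pct, synthetic_pct)  # 46.9 46.8
```

Comparing percentages rather than raw counts matters here because the two cohorts differ slightly in size.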
Logistic regression for admission: original vs synthetic data.
| Variable | Univariate LR^a, OR^b (95% CI): original data | Univariate LR, OR (95% CI): synthetic data | Multivariable LR, OR (95% CI): original data | Multivariable LR, OR (95% CI): synthetic data |
| Age | 1.04 (1.04-1.04) | 1.04 (1.04-1.04) | 1.00 (1.00-1.00) | 1.00 (1.00-1.00) |
| Male gender | 1.20 (1.16-1.24) | 1.14 (1.10-1.17) | 1.11 (0.99-1.23) | 1.03 (0.93-1.15) |
| Black race | 2.15 (2.07-2.22) | 2.09 (2.02-2.17) | 0.99 (0.87-1.12) | 0.93 (0.82-1.06) |
| Median household income | 1.00 (1.00-1.00) | 1.00 (1.00-1.00) | 1.00 (1.00-1.00) | 1.00 (1.00-1.00) |
| BMI | 1.02 (1.01-1.02) | 1.02 (1.01-1.02) | 0.97 (0.97-0.98) | 1.01 (1.00-1.02) |
| Minimum oxygen saturation | 0.97 (0.96-0.97) | 0.97 (0.96-0.97) | 0.97 (0.97-0.98) | 0.97 (0.97-0.98) |
| Diabetes | 6.14 (5.94-6.34) | 6.15 (5.95-6.36) | 1.45 (1.29-1.62) | 1.46 (1.30-1.63) |
| Dyspnea | 4.79 (4.62-4.97) | 4.79 (4.61-4.97) | 1.23 (1.09-1.38) | 1.25 (1.11-1.41) |
| Chronic kidney disease | 7.20 (6.89-7.52) | 7.17 (6.87-7.49) | 1.23 (1.07-1.42) | 1.26 (1.09-1.45) |
| Fever | 2.62 (2.52-2.71) | 2.62 (2.53-2.72) | 1.44 (1.29-1.61) | 1.45 (1.30-1.62) |
| Cough | 1.38 (1.33-1.43) | 1.38 (1.32-1.43) | 1.50 (1.32-1.70) | 1.45 (1.28-1.65) |
^a LR: logistic regression.
^b OR: odds ratio.
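For readers less familiar with how an odds ratio and its 95% CI relate to a fitted logistic regression, the usual conversion is to exponentiate the coefficient and its Wald interval. The sketch below assumes Wald intervals, which the source does not confirm; the function name and the example coefficient and standard error are illustrative values, not study results:

```python
import math
from typing import Tuple


def odds_ratio_ci(beta: float, se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Convert a log-odds coefficient and its standard error into
    (odds ratio, lower bound, upper bound) using a Wald interval."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# Hypothetical coefficient and standard error, for illustration only.
or_value, lower, upper = odds_ratio_ci(beta=0.37, se=0.06)
print(f"{or_value:.2f} ({lower:.2f}-{upper:.2f})")
```

An interval that spans 1.00, as in several of the multivariable cells, means the association is not distinguishable from no effect at the 5% level.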
Figure 1. Prediction performance for the two models by receiver operating characteristic curves (A, C) and precision-recall curves (B, D) by using original and synthetic data. Results for the RF model are in the first row (A, B); the second row (C, D) is for LR. AUC: area under the curve; LR: logistic regression; RF: random forest.
Figure 2. Model performance metrics from original (green) and synthetic (gold) data by accuracy, specificity, precision, sensitivity, and F1-score: RF model (A) and LR model (B). LR: logistic regression; RF: random forest.
Figure 3. Feature importance for the 11 variables in RF (a) and LR (b) models: original vs synthetic data. CKD: chronic kidney disease; LR: logistic regression; RF: random forest.
Figure 4. Original data (light blue) and synthetic data (light red), with their overlap (purple).
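The metrics named in Figures 1 and 2 have standard definitions. As a minimal sketch (not the authors' code), they can be computed from a confusion matrix and from predicted scores as follows; the function names and the toy inputs are assumptions for illustration, and the code assumes both classes are present so that no denominator is zero:

```python
from typing import Dict, List


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, specificity, precision, sensitivity, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "specificity": tn / (tn + fp),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive scores higher than a randomly chosen negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy inputs, not study data.
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Running the same functions on predictions from a model trained on original data and one trained on synthetic data gives the side-by-side comparison the figures depict.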
Paired statistical tests on epidemic curves of aggregate cases, comparing original with synthetic data.
| Metric | Date range | Wilcoxon result | Wilcoxon P value | | |
| Counts | 335 | 26,288 | .50 | –0.002 | >.99 |
| 7-day moving average | 329 | 26,005 | .78 | –0.006 | >.99 |
| 7-day slope | 329 | 25,788.5 | .90 | –0.002 | >.99 |
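The date range drops from 335 for raw counts to 329 for the two windowed metrics, which is what a 7-day window produces (335 - 7 + 1 = 329). The sketch below shows one way such series could be derived from daily counts. The source does not define the 7-day slope, so the least-squares slope used here is an assumption, and the function names are illustrative:

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 7) -> List[float]:
    """Mean of each full window; returns len(values) - window + 1 points."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def rolling_slope(values: Sequence[float], window: int = 7) -> List[float]:
    """Least-squares slope of each full window against day index 0..window-1.
    Assumed definition; the source does not specify how the slope was computed."""
    xs = range(window)
    x_mean = (window - 1) / 2
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slopes = []
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]
        y_mean = sum(chunk) / window
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, chunk))
        slopes.append(sxy / sxx)
    return slopes


# Placeholder series of 335 daily counts, not study data.
daily_counts = list(range(335))
print(len(moving_average(daily_counts)))  # 329
print(len(rolling_slope(daily_counts)))   # 329
```

Applying both functions to the original and synthetic daily counts yields paired series of equal length, which is the input a paired test requires.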
SDOH values for zip codes that were uncensored (n=5819) compared to censored (n=11,222) zip codes.
| SDOH^a and censored status | Mean | SD | Median | IQR | % missing |
| | | | | | |
| Uncensored | 63,536 | 26,755 | 57,352 | 28,692 | 3.28 |
| Censored | 60,544 | 26,549 | 54,358 | 27,067 | 10.98 |
| Difference (%) | +2992 (4.9) | +206 (0.8) | +2994 (5.5) | +1625 (6.0) | –7.70 (70.1) |
| | | | | | |
| Uncensored | 12.89 | 8.74 | 10.80 | 10.40 | 2.92 |
| Censored | 13.87 | 10.15 | 11.60 | 11.50 | 9.12 |
| Difference (%) | –0.98 (7.1) | –1.41 (13.9) | –0.80 (6.9) | –1.10 (9.6) | –6.20 (68.0) |
| | | | | | |
| Uncensored | 8.52 | 5.09 | 7.50 | 6.50 | 2.84 |
| Censored | 9.65 | 7.09 | 8.10 | 8.00 | 9.00 |
| Difference (%) | –1.13 (11.7) | –2.00 (28.2) | –0.60 (7.4) | –1.50 (18.8) | –6.16 (68.4) |
| | | | | | |
| Uncensored | 17,363 | 16,128 | 12,263 | 23,172 | 2.73 |
| Censored | 14,540 | 17,317 | 7048 | 21,436 | 8.69 |
| Difference (%) | +2823 (19.4) | –1189 (6.9) | +5215 (74.0) | +1736 (8.1) | –5.96 (68.6) |
^a SDOH: social determinants of health.
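The Difference (%) rows are consistent with the uncensored value minus the censored value, with the percentage expressed as the unsigned difference relative to the censored value. That convention is inferred from the reported numbers rather than stated in the source. A minimal sketch, with an illustrative function name:

```python
from typing import Tuple


def difference_with_percent(uncensored: float, censored: float) -> Tuple[float, float]:
    """Return (uncensored - censored, unsigned difference as % of censored).
    The convention is inferred from the table values, not stated in the source."""
    diff = uncensored - censored
    pct = abs(diff) / censored * 100
    return round(diff, 2), round(pct, 1)


# First group, mean column: 63,536 uncensored vs 60,544 censored.
print(difference_with_percent(63_536, 60_544))  # (2992, 4.9)
# Second group, mean column: 12.89 uncensored vs 13.87 censored.
print(difference_with_percent(12.89, 13.87))    # (-0.98, 7.1)
```

Because the denominator is the censored value, a small censored median, such as 7048 in the last group, produces a large relative difference (74.0%) even when the absolute gap is moderate.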