Graciela Muniz-Terrera, Ofer Mendelevitch, Rodrigo Barnes, Michael D Lesh.
Abstract
When attempting to answer questions of interest, scientists often encounter hurdles that may stem from limited access to adequate existing datasets, a consequence of poor data sharing practices or constraining administrative practices. Further, when attempting to integrate data, differences between existing datasets impose challenges that limit opportunities for data integration. As a result, the pace of scientific advancement is suboptimal. Synthetic data and virtual cohorts generated using innovative computational techniques represent an opportunity to overcome some of these limitations and, consequently, to advance scientific developments. In this paper, we demonstrate the use of virtual cohort techniques to generate a synthetic dataset that mirrors a deeply phenotyped sample of preclinical dementia research participants.
Keywords: AI/ML; cohorts; dementia; synthetic data; virtual cohort
Year: 2021 PMID: 34079930 PMCID: PMC8165312 DOI: 10.3389/frai.2021.613956
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Descriptive statistics of the EPAD V1500 sample used and the synthetic data generated.
| Variable | Level | Real: missing | Real (N = 1,498): mean (SD) or n (%) | Synthetic: missing | Synthetic (N = 1,498): mean (SD) or n (%) | p-value |
|---|---|---|---|---|---|---|
| Height | | 41 | 166.7 (9.3) | 38 | 167.0 (8.5) | 0.32 |
| Weight | | 37 | 73.4 (14.5) | 38 | 74.5 (14.0) | 0.42 |
| BMI | | 41 | 26.3 (4.5) | 38 | 26.6 (4.5) | 0.07 |
| P-tau | | 236 | 19.0 (10.2) | 320 | 19.6 (10.1) | 0.11 |
| T-tau | | 236 | 219.6 (93.1) | 320 | 227.1 (95.5) | 0.06 |
| ABeta_1_42 | | 235 | 1247.4 (420.8) | 320 | 1279 (421.1) | 0.94 |
| Abeta_calc | | 1,130 | 2276.5 (633.0) | 1,130 | 2283.4 (634.7) | 0.88 |
| Rad_pct | | 106 | 5681.2 (1055.3) | 51 | 5800.7 (1131.7) | 0.11 |
| Age | | 0 | 65.6 (7.2) | 0 | 65.5 (7.4) | 0.78 |
| Edu_years | | 8 | 14.5 (3.7) | 1 | 14.3 (3.8) | 0.14 |
| Rbans_total | | 62 | 103.4 (13.7) | 38 | 104.5 (13.0) | 0.21 |
| Sex | Female | 0 | 852 (56.9%) | 0 | 969 (64.7%) | 0.06 |
| | Male | | 646 (43.1%) | | 529 (35.3%) | 0.06 |
| Ethnicity | Asian | 359 | 2 (0.2%) | 322 | 1 (0.1%) | 0.56 |
| | Black | | 2 (0.2%) | | 2 (0.2%) | 0.99 |
| | Caucasian/White | | 1,128 (99%) | | 1,165 (99.1%) | 0.43 |
| | Other | | 1 (0.1%) | | 2 (0.2%) | 0.90 |
| | Hispanic | | 1 (0.1%) | | 3 (0.3%) | 0.31 |
| | Latin American | | 1 (0.1%) | | 2 (0.2%) | 0.56 |
| | Mauricienne | | 1 (0.1%) | | 0 (0%) | 0.31 |
| | Moroccan | | 1 (0.1%) | | 1 (0.1%) | 0.90 |
| | South East Asian | | 1 (0.1%) | | 0 (0%) | 0.31 |
| Family history | No | 0 | 575 (38.4%) | 0 | 635 (42.4%) | 0.08 |
| | Yes | | 923 (61.6%) | | 863 (57.6%) | 0.15 |
| ABeta_1_42 < 1,000 | No | 235 | 855 (67.7%) | 320 | 830 (70.5%) | 0.56 |
| | Yes | | 408 (32.3%) | | 348 (29.5%) | 0.14 |
| ApoE | e2/e2 | 178 | 3 (0.2%) | 123 | 5 (0.4%) | 0.34 |
| | e2/e3 | | 110 (8.3%) | | 106 (7.7%) | 0.56 |
| | e2/e4 | | 43 (3.3%) | | 58 (4.2%) | 0.21 |
| | e3/e3 | | 709 (53.7%) | | 620 (45.1%) | 0.05 |
| | e3/e4 | | 404 (30.6%) | | 526 (38.3%) | 0.05 |
| | e4/e4 | | 51 (3.9%) | | 60 (4.4%) | 0.51 |
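The rightmost column of the table above reads like a per-variable comparison of the real and synthetic samples, but this record does not say which test produced it. As a rough illustration only, the sketch below runs a large-sample two-sided test for a difference in means using nothing but the summary statistics in the table. It assumes the integer preceding each mean (SD) is the number of missing values, so the usable sample size is 1,498 minus that count. The function name `welch_z_test` and the normal approximation are choices made here, not the authors' method, so the result need not match the tabulated value.

```python
import math


def welch_z_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Large-sample two-sided test for a difference in means.

    Works from summary statistics only and uses a normal approximation,
    which is reasonable when both groups have well over a thousand rows.
    """
    # Standard error of the difference, allowing unequal variances.
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    # erfc(|z| / sqrt(2)) equals the two-sided normal tail probability.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Height row: mean (SD) for each dataset; sample sizes are 1,498 minus the
# counts assumed to be missing values (41 and 38).
z, p = welch_z_test(166.7, 9.3, 1498 - 41, 167.0, 8.5, 1498 - 38)
print(f"z = {z:.2f}, p = {p:.2f}")
```

A categorical row such as Sex or ApoE would need a different test, for example a chi-squared test on the counts.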
Figure 1. Univariate distributions of Age, BMI, t-tau, ApoE gene, Sex, and family history.
Measures of statistical fidelity of the synthetic data with the real data.
| Continuous variable | Fidelity measure | Categorical variable | Fidelity measure |
|---|---|---|---|
| ABeta_1_42 | 0.4175 | Abeta_1_42 < 1,000 | 0.0344 |
| Abeta_calc | 0.9999 | Apoe4_result | 0.0345 |
| Age | 0.9999 | Ethnicity | 0.0031 |
| Body mass index | 0.9999 | Family history | 0.0005 |
| Education (years) | 0.9999 | Sex | 0.0175 |
| Height (cm) | 0.9999 | | |
| Weight (kg) | 0.9999 | | |
| p-tau_result | 0.9991 | | |
| t-tau_result | 0.9980 | | |
| rad_pct | 0.9999 | | |
| Rbans score_total | 0.9870 | | |
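This record does not name the measures behind the values in the table above. Purely as an illustration of how per-variable fidelity can be quantified, the sketch below implements two measures that are often paired for this purpose: the two-sample Kolmogorov-Smirnov statistic for continuous variables and the Kullback-Leibler divergence for categorical ones. Both the choice of measures and the function names are assumptions made here, and the outputs are not expected to reproduce the tabulated numbers. The Sex counts come from the descriptive statistics table; the numeric samples are made up.

```python
import math
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The supremum is attained at one of the observed sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def kl_divergence(real_counts, synthetic_counts, eps=1e-9):
    """KL(real || synthetic) in nats, computed from category counts."""
    n_real = sum(real_counts.values())
    n_synth = sum(synthetic_counts.values())
    total = 0.0
    for category in set(real_counts) | set(synthetic_counts):
        p = real_counts.get(category, 0) / n_real
        # Floor q so a category absent from the synthetic data does not
        # cause a division by zero.
        q = max(synthetic_counts.get(category, 0) / n_synth, eps)
        if p > 0:
            total += p * math.log(p / q)
    return total


# Sex counts for the real and synthetic samples.
sex_real = {"Female": 852, "Male": 646}
sex_synth = {"Female": 969, "Male": 529}
kl_sex = kl_divergence(sex_real, sex_synth)
print(f"KL divergence for Sex: {kl_sex:.4f}")

# Made-up numeric samples: identical samples have a KS statistic of 0.
ks_same = ks_statistic([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
ks_disjoint = ks_statistic([1.0, 2.0], [3.0, 4.0])
print(ks_same, ks_disjoint)
```

Lower values of either measure indicate closer agreement between the two distributions.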
Figure 2. Pairwise distributions.
Figure 3. Receiver operating characteristic (ROC) curves for real vs. synthetic (single run).
Figure 4. Feature importance using SHAP values compared.
Figure 5. UMAP visualization of real (left) vs. synthetic (right) data points in 2D.
Figure 6. DCR: real on the left (blue), synthetic on the right (green).
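The last caption does not expand DCR. Assuming it denotes distance to closest record, a common privacy check for synthetic data, the following minimal sketch shows how such distances might be computed. The function name, the Euclidean metric, and the toy (age, BMI) rows are all hypothetical. In practice the features would typically be scaled first so that no single variable dominates the distance.

```python
import math


def distance_to_closest_record(queries, reference, skip_self=False):
    """For each query row, the Euclidean distance to its nearest reference row.

    Set skip_self=True when queries and reference are the same list, so a
    row is not matched with itself.
    """
    distances = []
    for i, query in enumerate(queries):
        best = math.inf
        for j, candidate in enumerate(reference):
            if skip_self and i == j:
                continue
            best = min(best, math.dist(query, candidate))
        distances.append(best)
    return distances


# Made-up (age, BMI) rows, for illustration only.
real = [(65.0, 26.3), (70.0, 24.1), (58.0, 29.0)]
synthetic = [(65.0, 26.3), (61.0, 27.5)]

# Synthetic-to-real distances: a value of 0 flags an exact copy of a real row.
dcr_synth = distance_to_closest_record(synthetic, real)
# Real-to-real distances give a baseline for how close records naturally sit.
dcr_real = distance_to_closest_record(real, real, skip_self=True)
print(dcr_synth, dcr_real)
```

Comparing the two distributions side by side shows whether synthetic records sit closer to real ones than real records sit to each other, which would suggest memorized rows.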