Abstract
Clinical trials generate a large amount of data that have been underutilized due to obstacles that prevent data sharing, including risks to patient privacy, data misrepresentation, and invalid secondary analyses. To address these obstacles, we developed a novel data sharing method which ensures patient privacy while also protecting the interests of clinical trial investigators. Our flexible and robust approach involves two components: (1) an advanced cloud-based querying language that allows users to test hypotheses without direct access to the real clinical trial data and (2) corresponding synthetic data for the query of interest that allows for exploratory research and model development. Both components can be modified by the clinical trial investigator depending on factors such as the type of trial or the number of patients enrolled. To test the effectiveness of our system, we first implement a simple and robust permutation-based synthetic data generator. We then use the synthetic data generator, coupled with our querying language, to identify significant relationships among variables in a realistic clinical trial dataset.
Year: 2020 PMID: 31797635 PMCID: PMC6954005
Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928
Figure 1. Graphical overview of the data sharing model. (A) Real data are stored securely on the cloud and can be accessed by user queries. Synthetic data matching the specified query are also generated and returned to the user for downstream analysis. (B) Synthetic data for the model presented in panel A were obtained by first performing PCA to reduce the dimensionality of the real data (a). One sample from the projected data was randomly selected, and the k nearest neighbors of that sample were determined using a PC-weighted distance metric (b). Aggregating data from the sample and its nearest neighbors was used to construct a synthetic sample (c). This process was repeated to generate 773 synthetic patients.
Querying language examples
| operation | example |
|---|---|
| view available features | df.describe() |
| filtering | df.filter(feature A >10) |
| grouping | df.groupby(feature B) |
| aggregation | df.groupby(feature B).aggregate(feature C).summarize(max, min, mean, median, sd) |
| correlation | df.pearson_corr(feature A, feature B) |
| statistical tests | df.t_test(feature A, feature B) |
| visualization | df.hist(feature A) |
| combination of above | df.filter(feature A>10).groupby(feature B).t_test(feature C, feature D) |
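The chained call style in the examples above can be sketched as a thin wrapper around a pandas DataFrame. This is a hypothetical illustration, not the paper's implementation: the actual cloud service, its parser, and its method signatures are not specified beyond the table, so the `Query` class, its constructor, and the `filter("feature A", ">", 10)` calling convention (in place of the paper's `df.filter(feature A > 10)` syntax) are all assumptions.

```python
import pandas as pd
from scipy import stats


class Query:
    """Minimal sketch of a chainable querying interface (hypothetical;
    the paper's real querying language runs server-side on the cloud)."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def filter(self, feature: str, op: str, value):
        # Return a new Query restricted to rows satisfying the condition,
        # so further operations can be chained.
        masks = {
            ">": self._df[feature] > value,
            "<": self._df[feature] < value,
            "==": self._df[feature] == value,
        }
        return Query(self._df[masks[op]])

    def pearson_corr(self, feature_a: str, feature_b: str) -> float:
        # Pearson correlation between two feature columns.
        return self._df[feature_a].corr(self._df[feature_b])

    def t_test(self, feature_a: str, feature_b: str):
        # Welch's t-test between two feature columns, ignoring missing values.
        return stats.ttest_ind(self._df[feature_a].dropna(),
                               self._df[feature_b].dropna(),
                               equal_var=False)


# Toy usage mirroring the "combination of above" example.
df = pd.DataFrame({"feature A": [5, 12, 15, 20, 8],
                   "feature B": [1.0, 2.0, 2.5, 3.0, 1.5],
                   "feature C": [0.9, 2.1, 2.4, 3.2, 1.4]})
q = Query(df).filter("feature A", ">", 10)
r = q.pearson_corr("feature B", "feature C")
```

In the paper's model the user would only ever see results like `r` (or synthetic rows), never the underlying real DataFrame.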
| Algorithm 1 Synthetic Data Generation |
|---|
| Use PCA to project the real data into a lower-dimensional space; randomly select a projected sample; find and store its k nearest neighbors under a PC-weighted distance; aggregate the sample and its neighbors into a synthetic sample; repeat until the desired number of synthetic patients is generated. (Full listing not recovered from the source; steps summarized from Figure 1B.) |
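The generation procedure outlined in Figure 1B (PCA projection, PC-weighted nearest-neighbor lookup, aggregation) can be sketched in NumPy as below. The exact PC weighting and aggregation rules are assumptions for illustration: here the weights are the singular values' relative magnitudes and the aggregation is a plain mean over the sample and its k neighbors.

```python
import numpy as np


def generate_synthetic(real: np.ndarray, n_synthetic: int, k: int = 5,
                       n_components: int = 3, rng=None) -> np.ndarray:
    """Sketch of the Figure 1B generator (weighting/aggregation assumed)."""
    rng = np.random.default_rng(rng)

    # (a) PCA via SVD on the centered data to reduce dimensionality.
    centered = real - real.mean(axis=0)
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ Vt[:n_components].T            # projected real data
    weights = s[:n_components] / s[:n_components].sum()  # assumed PC weights

    synthetic = np.empty((n_synthetic, real.shape[1]))
    for i in range(n_synthetic):
        # (b) draw one projected sample and find its k nearest neighbors
        #     under a PC-weighted Euclidean distance.
        idx = rng.integers(len(proj))
        dist = np.sqrt((((proj - proj[idx]) ** 2) * weights).sum(axis=1))
        neighbors = np.argsort(dist)[:k + 1]         # includes the sample itself
        # (c) aggregate the sample and its neighbors (here: the mean)
        #     into one synthetic patient.
        synthetic[i] = real[neighbors].mean(axis=0)
    return synthetic


# Toy usage: 773 synthetic patients, matching the count in Figure 1.
real = np.random.default_rng(0).normal(size=(300, 6))
syn = generate_synthetic(real, 773, rng=1)
```

Because each synthetic row is an aggregate over several real patients, no single real record is reproduced verbatim, which is the privacy rationale described in the abstract.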
Figure 2. Comparing the real clinical trial dataset and the corresponding synthetically generated dataset. (A) Real data distribution (orange) and synthetic data distribution (green) of six clinical features. Distributions were obtained from the empirical data via kernel density estimation. (B) Comparison of the fraction of positively labeled data between the real and synthetic datasets, for 45 binary features with more than 100 non-missing values.
Figure 3. Runtime comparison between our cloud-based model and running the computation locally. The runtime is measured as the number of seconds required for aggregating the data and creating a contingency table.
Contingency table for real and synthetic data: comparing the recurrence rate for the drug-treated and placebo populations

| | synthetic data | | real data | |
|---|---|---|---|---|
| | recurrence | no recurrence | recurrence | no recurrence |
| drug treated | 27 | 246 | 30 | 267 |
| placebo | 56 | 275 | 47 | 255 |
| (p-value, OR) | (0.01, 0.54) | | (0.05, 0.61) | |
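The (p-value, OR) row can be checked directly from the 2×2 counts. The paper does not state which test produced its p-values, so this sketch uses Fisher's exact test from SciPy as one reasonable choice: the odds ratio is determined by the counts alone and matches the table, while the p-value may differ slightly from the reported one depending on the test used.

```python
from scipy.stats import fisher_exact

# Recurrence table for the synthetic data (counts from the table above):
# rows = drug treated / placebo, columns = recurrence / no recurrence.
synthetic_counts = [[27, 246],
                    [56, 275]]

odds_ratio, p_value = fisher_exact(synthetic_counts)
print(round(odds_ratio, 2))  # 0.54, matching the (p-value, OR) row
```

An odds ratio below 1 with a small p-value indicates the drug-treated group had significantly lower recurrence odds than placebo, and the synthetic data preserve that conclusion.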
Contingency table for real and synthetic data: comparing the adverse event rate for the drug-treated and placebo populations

| | synthetic data | | real data | |
|---|---|---|---|---|
| | adverse event | no adverse event | adverse event | no adverse event |
| drug treated | 30 | 215 | 34 | 252 |
| placebo | 29 | 274 | 40 | 258 |
| (p-value, OR) | (0.33, 1.32) | | (0.62, 0.87) | |