Khaled El Emam, Lucy Mosquera, Chaoyi Zheng.
Abstract
OBJECTIVE: With the growing demand for sharing clinical trial data, scalable methods to enable privacy-protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data depends on the variable order. No assessments of the impact of variable order on synthesized clinical trial data have been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as the variable order is randomly shuffled, and to implement an optimization algorithm to find a good order if the variability is too high.
Keywords: clinical trial transparency; data sharing; data synthesis; privacy enhancing technologies; secondary use
Year: 2021 PMID: 33186440 PMCID: PMC7810457 DOI: 10.1093/jamia/ocaa249
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Summary of the 6 oncology trials used in the analysis with the National Clinical Trial number and the primary sponsor indicated, as well as the number of patients and variables used in the synthesis
| Dataset | Individuals | Variables: total | Variables: binary/categorical | Variables: discrete/continuous |
|---|---|---|---|---|
| Tests if postsurgery receipt of imatinib could reduce the recurrence of GISTs. Imatinib is a Food and Drug Administration-approved protein-tyrosine kinase inhibitor for treating certain cancers of the blood cells. This drug is hypothesized to be effective against GIST as imatinib inhibits the kinase which experiences gain-of-function mutations in up to 90% of GIST patients. | 773 | 129 | 71 (55) | 58 (45) |
| Most pancreatic cancer patients have advanced inoperable disease and potentially metastases. At the time of this trial the first-line therapy for patients with inoperable disease was gemcitabine monotherapy. One transporter (hENT1: human equilibrative nucleoside transporter-1) has been identified as a potential predictor of successful treatment via gemcitabine. This trial compares standard gemcitabine therapy to a novel fatty acid derivative of gemcitabine. This is hypothesized to be superior to gemcitabine in metastatic pancreatic adenocarcinoma patients with low hENT1 activity as it exhibits anticancer activity independent of nucleoside transporters like hENT1, while gemcitabine seems to require nucleoside transporters for anticancer activity. | 367 | 88 | 24 (27.2) | 64 (72.3) |
| This phase 3 trial compares adjuvant anthracycline chemotherapy (fluorouracil, doxorubicin, and cyclophosphamide) with anthracycline taxane chemotherapy (docetaxel, doxorubicin, and cyclophosphamide) in women with lymph node-positive early breast cancer. | 746 | 239 | 148 (61.9) | 91 (38.1) |
| This was a randomized phase 3 trial examining whether panitumumab, when combined with best supportive care, improves progression-free survival among patients with metastatic colorectal cancer, compared with those receiving best supportive care alone. | 463 (sponsor only provided 370 in the dataset) | 59 | 22 (37.2) | 37 (62.8) |
| This was also a randomized phase 3 trial on panitumumab but among patients with metastatic and/or recurrent squamous cell carcinoma of the head and neck. The treatment group received panitumumab in addition to other chemotherapy (cisplatin and fluorouracil), while the control group received cisplatin and fluorouracil as first-line therapy. | 657 (sponsor only provided 520 in the dataset) | 401 | 162 (40.3) | 239 (59.6) |
| This was a randomized and blinded phase 3 trial aimed at evaluating whether “increasing or maintaining hemoglobin concentrations with darbepoetin alfa” improves survival among patients with previously untreated extensive-stage small cell lung cancer. The treatment group received darbepoetin alfa with platinum-containing chemotherapy, whereas the control group received placebo instead of darbepoetin alfa. | 600 (sponsor only provided 479 in the dataset) | 382 | 82 (21.4) | 300 (78.6) |
Values are n (%). Variables are classified as either binary/categorical or ordered discrete/continuous. Dates are converted to relative days and therefore are considered continuous.
GIST: gastrointestinal stromal tumor.
Figure 1. A description of the sequential data synthesis process using classification and regression trees, although any set of classification and regression methods can be used in principle.
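The sequential process described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: in place of classification and regression trees it uses a much simpler stand-in model (the empirical distribution of each variable conditional on the immediately preceding variable only), and the function name `synthesize_sequentially` and the record layout (a list of dictionaries with categorical values) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def synthesize_sequentially(records, variable_order, n_synthetic, seed=0):
    """Generate synthetic rows one variable at a time, following variable_order.

    Stand-in model (assumption): each variable is sampled conditional on the
    previous variable only, instead of fitting a tree on all earlier variables.
    """
    rng = random.Random(seed)

    # Step 1: sample the first variable from its observed marginal distribution.
    first = variable_order[0]
    first_values = [r[first] for r in records]
    synthetic = [{first: rng.choice(first_values)} for _ in range(n_synthetic)]

    # Step 2: for every later variable, "fit" on the real data and then
    # "generate" using the values that were already synthesized.
    for prev, current in zip(variable_order, variable_order[1:]):
        conditional = defaultdict(list)
        for r in records:
            conditional[r[prev]].append(r[current])
        for row in synthetic:
            # row[prev] was drawn from real values, so its group always exists.
            row[current] = rng.choice(conditional[row[prev]])

    return synthetic
```

Because each variable is generated from the variables that precede it, changing `variable_order` changes which relationships the model can capture, which is why the order may matter for utility.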
Figure 2. Median Hellinger distance and 95% confidence intervals for the 6 clinical trial datasets. The Hellinger distance takes values between 0 and 1, with lower values indicating that the univariate distributions of the real and synthetic variables are similar. In general, values in the lowest decile (≤0.1) would be indicative of reasonable similarity.
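For reference, the Hellinger distance between two discrete distributions $p$ and $q$ is $\sqrt{\tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$. The sketch below computes it from two samples of a categorical variable; the function name is hypothetical, and a continuous variable would first have to be binned (the binning scheme is not specified here).

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between the empirical distributions of two samples."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)

    # Compare probabilities over every category seen in either sample;
    # Counter returns 0 for categories missing from one side.
    categories = set(real_counts) | set(synth_counts)
    squared_sum = sum(
        (math.sqrt(real_counts[c] / n_real) - math.sqrt(synth_counts[c] / n_synth)) ** 2
        for c in categories
    )
    return math.sqrt(squared_sum / 2)
```

Identical samples give 0, and samples with no categories in common give 1.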
Figure 3. Median relative absolute difference in area under the receiver-operating characteristic curve (AUROC) and 95% confidence intervals for the 6 clinical trial datasets. This difference takes values between 0 and 1, with lower values indicating that the multivariate models built using the real and synthetic datasets are similar. In general, values in the lowest decile (≤0.1) would be indicative of reasonable similarity.
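A minimal sketch of this comparison follows. It assumes that the difference is normalized by the AUROC of the model built on real data, which the caption does not state explicitly; both function names are hypothetical, and the AUROC itself is computed with the standard pairwise (Mann-Whitney) identity.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUROC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def relative_auroc_difference(auroc_real, auroc_synth):
    """Absolute AUROC difference relative to the real-data AUROC (assumed normalization)."""
    return abs(auroc_real - auroc_synth) / auroc_real
```

For example, a real-data AUROC of 0.80 and a synthetic-data AUROC of 0.76 give a relative absolute difference of about 0.05.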
Figure 4. Median distinguishability score and 95% confidence intervals for the 6 clinical trial datasets. The score takes values between 0 and 0.25, with lower values indicating that the overall real and synthetic datasets are not distinguishable from each other by a discriminative model. In general, values in the lowest quintile (≤0.05) would be indicative of reasonable nondistinguishability.
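One formulation consistent with the 0 to 0.25 range is the mean squared deviation of propensity scores from 0.5. The sketch below assumes that formulation, and assumes that the real and synthetic datasets are pooled in equal numbers so that 0.5 is the chance level. The propensity scores would come from a discriminative model trained to separate real from synthetic records; that model is not shown, and the function name is hypothetical.

```python
def distinguishability_score(propensity_scores):
    """Mean squared deviation of propensity scores from 0.5.

    propensity_scores: predicted probability, for each pooled record, that the
    record is synthetic. Returns 0 when the model cannot tell the datasets
    apart and 0.25 when it separates them perfectly.
    """
    if not propensity_scores:
        raise ValueError("at least one propensity score is required")
    return sum((p - 0.5) ** 2 for p in propensity_scores) / len(propensity_scores)
```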
Utility results for the curriculum learning variable order
| Trial | Distinguishability | Hellinger | AUROC |
|---|---|---|---|
| | 0.114 | 0.0147 | 0.002 |
| | 0.064 | 0.026 | 0.001 |
| | 0.2 | 0.011 | 0.003 |
| | 0.034 | 0.03 | 0.059 |
| | 0.101 | 0.019 | 0.012 |
| | 0.232 | 0.023 | 0.008 |
AUROC: area under the receiver-operating characteristic curve.
Utility results after the optimal variable order was selected with optimization on distinguishability
| Trial | Distinguishability | Hellinger | AUROC |
|---|---|---|---|
| | 0.0113 | 0.0118 | 0.0019 |
| | 0.033 | 0.027 | 0.001 |
| | 0.049 | 0.017 | 0.0026 |
| | 0.02 | 0.0204 | 0.0584 |
| | 0.044 | 0.0135 | 0.0118 |
| | 0.0388 | 0.0277 | 0.009 |
AUROC: area under the receiver-operating characteristic curve.
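The idea of searching for a variable order that minimizes distinguishability can be sketched as follows. The optimization algorithm is not named in this record, so the sketch uses plain random search over permutations as a stand-in; `random_search_order` and the `score_order` callback (which would synthesize data with a given order and return its distinguishability score) are hypothetical names.

```python
import random


def random_search_order(variables, score_order, n_trials=100, seed=0):
    """Return the lowest-scoring variable order found by random shuffling.

    score_order: callable that takes a list of variable names and returns a
    score to minimize (for example, a distinguishability score).
    """
    rng = random.Random(seed)

    # Start from the given order so the result is never worse than the baseline.
    best_order = list(variables)
    best_score = score_order(best_order)

    for _ in range(n_trials):
        candidate = list(variables)
        rng.shuffle(candidate)
        score = score_order(candidate)
        if score < best_score:
            best_order, best_score = candidate, score

    return best_order, best_score
```

Each call to `score_order` implies a full synthesis and evaluation, so the number of trials is the main cost of this kind of search.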