Hooman H Rashidi, Imran H Khan, Luke T Dang, Samer Albahra, Ujjwal Ratan, Nihir Chadderwala, Wilson To, Prathima Srinivas, Jeffery Wajda, Nam K Tran.
Abstract
High-quality medical data is critical to the development and implementation of machine learning (ML) algorithms in healthcare; however, security and privacy concerns continue to limit access. We sought to determine the utility of "synthetic data" in training ML algorithms for the detection of tuberculosis (TB) from inflammatory biomarker profiles. A retrospective dataset (A) comprising 278 patients was used to generate synthetic datasets (B, C, and D) for training models prior to secondary validation on a generalization dataset. ML models trained and validated on Dataset A (real data) demonstrated an accuracy of 90%, a sensitivity of 89% (95% CI, 83-94%), and a specificity of 100% (95% CI, 81-100%). Models trained using the optimal synthetic dataset, B, showed an accuracy of 91%, a sensitivity of 93% (95% CI, 87-96%), and a specificity of 77% (95% CI, 50-93%). Synthetic datasets C and D displayed diminished performance (respective accuracies of 71% and 54%). This pilot study highlights the promise of synthetic data as an expedited means of ML algorithm development.
Keywords: Artificial intelligence; biomarkers; data accessibility; electronic medical record; privacy; simulation
Year: 2022 PMID: 35136677 PMCID: PMC8794034 DOI: 10.4103/jpi.jpi_75_21
Source DB: PubMed Journal: J Pathol Inform
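The abstract describes a two-step design: fit a generator to the 278-patient real dataset (A), sample synthetic datasets (B, C, D) at increasing multipliers, train models on the synthetic data, and validate on a real generalization set. The paper's record here does not specify the synthetic-data generator, so the sketch below substitutes a simple per-class multivariate Gaussian fit; the file names, the `tb_positive` label column, and all sizes are illustrative assumptions, not details from the study.

```python
# Minimal sketch of the study's train-on-synthetic / validate-on-real workflow.
# The per-class multivariate Gaussian is a stand-in generator; the paper's
# actual synthesis method is not stated in this record.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def synthesize(real_df, label_col="tb_positive", multiplier=1):
    """Sample a synthetic dataset by fitting a multivariate Gaussian to the
    biomarker columns of each class (assumed stand-in generator)."""
    features = [c for c in real_df.columns if c != label_col]
    parts = []
    for cls, grp in real_df.groupby(label_col):
        mean = grp[features].mean().to_numpy()
        cov = np.cov(grp[features].to_numpy(), rowvar=False)
        fake = rng.multivariate_normal(mean, cov, size=len(grp) * multiplier)
        part = pd.DataFrame(fake, columns=features)
        part[label_col] = cls
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

dataset_a = pd.read_csv("dataset_a.csv")          # real data (hypothetical file)
dataset_b = synthesize(dataset_a, multiplier=1)   # "synthetic x1", analogous to Dataset B

X_cols = [c for c in dataset_a.columns if c != "tb_positive"]
model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(dataset_b[X_cols], dataset_b["tb_positive"])

# Secondary validation on a real generalization set (hypothetical file)
holdout = pd.read_csv("generalization_set.csv")
probs = model.predict_proba(holdout[X_cols])[:, 1]
print("ROC-AUC on real hold-out:", roc_auc_score(holdout["tb_positive"], probs))
```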
Figure 1: Study design
Figure 2: Distribution of Dataset A vs. Dataset B
Figure 3: Overview of the MILO workflow
Figure 4: Q-Q (quantile-quantile) plot of Dataset A vs. Dataset B. The figure shows a Q-Q plot for each attribute in the original dataset and the synthetic dataset; the distribution of each attribute is similar across the two datasets.
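A per-attribute Q-Q comparison like Figure 4 can be produced by plotting the quantiles of a synthetic column against the quantiles of the corresponding real column. The sketch below assumes the two tables share biomarker columns; the file names and the `crp` column are illustrative, not taken from the paper.

```python
# Q-Q comparison of one attribute across real and synthetic datasets.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def qq_plot(real, synthetic, column, ax):
    """Plot synthetic-vs-real quantiles for one attribute; points close to
    the y = x reference line indicate similar marginal distributions."""
    q = np.linspace(0.01, 0.99, 99)
    rq = np.quantile(real[column].dropna(), q)
    sq = np.quantile(synthetic[column].dropna(), q)
    ax.scatter(rq, sq, s=10)
    lims = [min(rq.min(), sq.min()), max(rq.max(), sq.max())]
    ax.plot(lims, lims, "k--", linewidth=1)  # y = x reference line
    ax.set_xlabel(f"{column} (real quantiles)")
    ax.set_ylabel(f"{column} (synthetic quantiles)")

dataset_a = pd.read_csv("dataset_a.csv")  # real data (hypothetical file)
dataset_b = pd.read_csv("dataset_b.csv")  # synthetic x1 (hypothetical file)
fig, ax = plt.subplots()
qq_plot(dataset_a, dataset_b, "crp", ax)  # "crp" is a placeholder biomarker
plt.show()
```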
Descriptive Statistics for Biomarkers in Dataset A (real data) vs. Dataset B (synthetic ×1)
Performance comparison of models trained on real data versus synthetic data
| Model performance on the “real” secondary dataset | Trained on Dataset A (real data) | Trained on Dataset B (synthetic ×1) | Trained on Dataset C (synthetic ×2) | Trained on Dataset D (synthetic ×5) |
|---|---|---|---|---|
| MILO’s best models | MILO GBM | MILO SVM | MILO DNN | MILO DNN |
| ROC-AUC (95% CI) | 0.95 (0.87-1) | 0.83 (0.63-1) | 0.91 (0.8-1) | 0.55 (0.48-0.62) |
| Accuracy, % (95% CI) | 90 (84-95) | 91 (85-95) | 71 (63-78) | 54 (46-62) |
| Sensitivity, % (95% CI) | 89 (83-94) | 93 (87-96) | 67 (59-75) | 49 (40-58) |
| Specificity, % (95% CI) | 100 (81-100) | 77 (50-93) | 100 (81-100) | 94 (71-99) |
| MILO’s best RF models | MILO RF | MILO RF | MILO RF | MILO RF |
| ROC-AUC (95% CI) | 0.96 (0.82-1) | 0.77 (0.67-0.87) | 0.87 (0.77-0.97) | 0.66 (0.52-0.8) |
| Accuracy, % (95% CI) | 89 (83-93) | 71 (63-78) | 74 (66-81) | 56 (48-64) |
| Sensitivity, % (95% CI) | 88 (81-93) | 69 (60-76) | 72 (64-80) | 53 (44-61) |
| Specificity, % (95% CI) | 100 (81-100) | 88 (64-99) | 88 (64-99) | 82 (57-96) |
| Non-MILO RF models | Non-MILO RF | Non-MILO RF | Non-MILO RF | Non-MILO RF |
| ROC-AUC (95% CI) | 0.97 (0.94-1) | 0.73 (0.60-0.88) | 0.83 (0.71-0.92) | 0.68 (0.57-0.82) |
| Accuracy, % (95% CI) | 77 (70-84) | 62 (54-69) | 64 (56-72) | 39 (31-47) |
| Sensitivity, % (95% CI) | 75 (66-82) | 61 (52-69) | 64 (55-72) | 40 (32-49) |
| Specificity, % (95% CI) | 100 (81-100) | 71 (44-90) | 71 (44-90) | 29 (10-56) |
DNN = deep neural network, GBM = gradient boosting machine, RF = random forest, SVM = support vector machine
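The accuracy, sensitivity, and specificity figures in the table above are binomial proportions with 95% confidence intervals. The record does not state which interval method was used, so the sketch below assumes the exact (Clopper-Pearson) interval; the helper function and counts are illustrative.

```python
# Sketch of computing the table's metrics with 95% CIs from confusion-matrix
# counts, assuming Clopper-Pearson ("beta") intervals.
from sklearn.metrics import confusion_matrix
from statsmodels.stats.proportion import proportion_confint

def metrics_with_ci(y_true, y_pred):
    """Return accuracy/sensitivity/specificity as (%, CI low %, CI high %)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    out = {}
    for name, num, den in [("accuracy", tp + tn, tp + tn + fp + fn),
                           ("sensitivity", tp, tp + fn),
                           ("specificity", tn, tn + fp)]:
        lo, hi = proportion_confint(num, den, alpha=0.05, method="beta")
        out[name] = (100 * num / den, 100 * lo, 100 * hi)
    return out

# Sanity check: 17/17 correct negatives yields roughly the 81-100% interval
# reported for 100% specificity above (17 is an inferred count, not stated).
print(proportion_confint(17, 17, method="beta"))  # ~ (0.805, 1.0)
```

Under this assumed method, a 100% specificity with a lower bound near 81% is consistent with a small number of negative cases in the generalization set, which is why the specificity intervals are much wider than the sensitivity intervals.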
Figure 5: Paradigm for AI/ML development in healthcare. Synthetic data may help improve access to clinical data if it is shown to reduce regulatory hurdles.