Jethro C C Kwong, Lauren Erdman, Adree Khondker, Marta Skreta, Anna Goldenberg, Melissa D McCradden, Armando J Lorenzo, Mandy Rickard.
Abstract
As more artificial intelligence (AI) applications are integrated into healthcare, there is an urgent need for standardization and quality-control measures to ensure a safe and successful transition of these novel tools into clinical practice. We describe the role of the silent trial, which evaluates an AI model on prospective patients in real time while the end-users (i.e., clinicians) are blinded to its predictions, so that the predictions do not influence clinical decision-making. We present our experience in evaluating a previously developed AI model to predict obstructive hydronephrosis in infants using the silent trial. Although the initial model performed poorly on the silent trial dataset (AUC dropped from 0.90 to 0.50), the model was refined by exploring issues related to dataset drift, bias, feasibility, and stakeholder attitudes. Specifically, we found a shift in the distribution of age, a shift in the laterality of obstructed kidneys, and a change in imaging format. After correction of these issues, model performance improved and remained robust across two independent silent trial datasets (AUC 0.85-0.91). Furthermore, a gap in patient knowledge on how the AI model would be used to augment their care was identified. These concerns helped inform the patient-centered design of the user interface for the final AI model. Overall, the silent trial serves as an essential bridge between initial model development and clinical trial assessment, allowing the safety, reliability, and feasibility of the AI model to be evaluated in a minimal-risk environment. Future clinical AI applications should make efforts to incorporate this important step prior to embarking on a full-scale clinical trial.
Keywords: artificial intelligence; bias; dataset drift; feasibility; stakeholder attitudes
Year: 2022 PMID: 36052317 PMCID: PMC9424628 DOI: 10.3389/fdgth.2022.929508
Source DB: PubMed Journal: Front Digit Health ISSN: 2673-253X
Major themes to explore during the silent trial before transitioning to the clinical trial phase. Each theme is associated with a suggested list of questions that should be considered.
| Themes | Key questions |
|---|---|
| Dataset drift | Are there any changes as to how data are defined and collected? Are there any changes to patient demographics, clinical settings, or unexpected events (e.g., COVID-19) that would impact the patient population in which the model is applied? Are there any changes in clinical practice, such as indication, standard of care, or patient preference, that would influence the data being collected? |
| Bias | Which subsets of patients benefit from the model? Which subsets of patients are harmed by the model? |
| Feasibility | How much time does it take for the end-user (i.e., clinician) to input the necessary variables to generate a prediction? How is the clinical workflow or duration of a clinic visit impacted by the use of the AI intervention? Importantly, does it slow down clinical workflow without a clear benefit? Is the user interface simple enough to be used at point of care with minimal or no training? Are the model predictions easy to understand? Are the model explanations easy to interpret? What computing resources or infrastructure are required to maintain the AI model at scale? |
| Stakeholder attitudes | Does the AI intervention facilitate patient counseling, decision-making, or treatment planning? Are patients comfortable with the use of AI interventions to support their care? What are the patient’s priorities or goals of care regarding their condition and are they addressed by the AI intervention? |
Based on Finlayson et al. (13).
Figure 1. Silent trial workflow for model development. Initially, the model was trained on the initial dataset and tested on a random 20% split of it. Following successful generalization in this random split, the model was evaluated on new patients using prospectively collected data (Silent Trial 1). From this dataset, we identified the weaknesses that prevented our model from generalizing successfully and adapted our initial model to overcome these limitations. Once the model generalized to this new set, it was re-trained on both the initial and Silent Trial 1 datasets. This updated model was then tested on another prospectively collected dataset (Silent Trial 2).
Baseline characteristics of each dataset.
| Variable | Non-obstructed: Training | Non-obstructed: Silent Trial 1 | Non-obstructed: Silent Trial 2 | Obstructed: Training | Obstructed: Silent Trial 1 | Obstructed: Silent Trial 2 | Total: Training | Total: Silent Trial 1 | Total: Silent Trial 2 |
|---|---|---|---|---|---|---|---|---|---|
| Sex | |||||||||
| Male | 981 | 326 | 530 | 138 | 104 | 69 | 1,119 | 430 | 599 |
| Female | 247 | 61 | 106 | 42 | 32 | 6 | 289 | 93 | 112 |
| Age groups | |||||||||
| <2 years | 1,025 | 359 | 561 | 171 | 128 | 71 | 1,196 | 487 | 632 |
| 2–5 years | 143 | 28 | 72 | 9 | 6 | 0 | 152 | 34 | 72 |
| >5 years | 60 | 0 | 3 | 0 | 1 | 3 | 60 | 1 | 6 |
| Ultrasound number | |||||||||
| 1 | 403 | 127 | 214 | 69 | 46 | 28 | 472 | 173 | 242 |
| 2 | 316 | 110 | 184 | 50 | 39 | 24 | 366 | 149 | 208 |
| 3 | 248 | 74 | 130 | 34 | 24 | 11 | 282 | 98 | 141 |
| 4 | 161 | 39 | 63 | 19 | 12 | 8 | 180 | 51 | 71 |
| 5 | 112 | 16 | 32 | 8 | 5 | 2 | 120 | 21 | 34 |
| 6 | 84 | 11 | 10 | 3 | 3 | 1 | 87 | 14 | 11 |
| 7 | 63 | 6 | 2 | 4 | 3 | 1 | 67 | 9 | 3 |
| 8 | 37 | 3 | 1 | 0 | 2 | 0 | 37 | 5 | 1 |
| 9 | 18 | 1 | 0 | 0 | 1 | 0 | 18 | 2 | 0 |
| 10 | 13 | 0 | 0 | 0 | 1 | 0 | 13 | 1 | 0 |
| 11 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Ultrasound Machine | |||||||||
| Philips | 891 | 88 | 155 | 101 | 33 | 23 | 992 | 121 | 178 |
| Samsung | 34 | 59 | 125 | 2 | 21 | 17 | 36 | 78 | 125 |
| Toshiba | 448 | 229 | 347 | 69 | 48 | 32 | 517 | 277 | 379 |
| GE | 37 | 1 | 0 | 8 | 9 | 1 | 45 | 10 | 1 |
| Acuson | 23 | 0 | 0 | 2 | 0 | 0 | 25 | 0 | 0 |
| ATL | 17 | 0 | 0 | 5 | 0 | 0 | 22 | 0 | 0 |
| Siemens | 4 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Outside | 0 | 10 | 9 | 0 | 25 | 2 | 0 | 35 | 11 |
| APD Group | |||||||||
| <6 mm | 113 | 157 | 284 | 6 | 1 | 0 | 119 | 158 | 284 |
| 6–9 mm | 119 | 92 | 150 | 6 | 10 | 5 | 125 | 102 | 155 |
| 9–14 mm | 190 | 69 | 131 | 29 | 34 | 7 | 219 | 103 | 138 |
| >14 mm | 187 | 64 | 70 | 139 | 90 | 62 | 326 | 154 | 132 |
| Not measured | 847 | 5 | 1 | 7 | 1 | 1 | 854 | 6 | 2 |
| Kidney view side | |||||||||
| Right | 737 | 192 | 222 | 68 | 57 | 14 | 805 | 249 | 236 |
| Left | 719 | 195 | 414 | 119 | 79 | 61 | 838 | 274 | 475 |
| Hydronephrosis side | |||||||||
| Right | 673 | 126 | 83 | 56 | 52 | 13 | 729 | 178 | 96 |
| Left | 635 | 143 | 275 | 106 | 61 | 61 | 741 | 204 | 336 |
| Bilateral | 148 | 118 | 278 | 25 | 23 | 1 | 173 | 141 | 279 |
| Overall observations | 1,456 | 387 | 636 | 187 | 136 | 75 | 1,643 | 523 | 711 |
| Overall unique patients | 240 | 105 | 174 | 54 | 45 | 28 | 294 | 150 | 202 |
APD, anterior-posterior diameter.
Iterative model performance.
| Row | Train | Test | Model | AUROC | AUPRC | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| 1 | Original set | Random 20% from original set | Image only | 0.90 (0.85, 0.95) | 0.58 (0.39, 0.74) | 0.92 (0.81, 1.0) | 0.69 (0.63, 0.74) |
| 2 | Original set | Silent trial 1 | Image only | 0.50 (0.50, 0.50) | 0.26 (0.21, 0.32) | 1.00 (1.00, 1.00) | 0.0 (0.0, 0.0) |
| 3 | Original set | Silent trial 1 | Age and side covariates | 0.51 (0.506, 0.52) | 0.26 (0.22, 0.32) | 1.00 (1.00, 1.00) | 0.0 (0.0, 0.0) |
| 4 | Original set | Silent trial 1 | Age-ablated | 0.57 (0.55, 0.59) | 0.28 (0.24, 0.35) | 1.00 (1.00, 1.00) | 0.005 (0.0, 0.01) |
| 5 | Original set | Silent trial 1 | Side-ablated | 0.54 (0.52, 0.55) | 0.27 (0.22, 0.34) | 1.00 (1.00, 1.00) | 0.005 (0.0, 0.01) |
| 6 | Original set | Silent trial 1 | Revised data prep, with covariates | 0.85 (0.81, 0.88) | 0.67 (0.58, 0.75) | 0.98 (0.95, 1.00) | 0.32 (0.27, 0.36) |
| 7 | Original set | Silent trial 1 | Revised data prep, image only | 0.84 (0.80, 0.88) | 0.65 (0.57, 0.74) | 0.99 (0.96, 1.00) | 0.26 (0.22, 0.31) |
| 8 | Original set + silent trial 1 | Silent trial 2 | Revised data prep, with covariates | 0.91 (0.88, 0.94) | 0.52 (0.41, 0.64) | 0.97 (0.93, 1.00) | 0.54 (0.50, 0.57) |
| 9 | Original set + silent trial 1 | Silent trial 2 | Revised data prep, image only | 0.92 (0.88, 0.95) | 0.52 (0.41, 0.64) | 0.99 (0.95, 1.00) | 0.52 (0.48, 0.56) |
Values reflect performance on the dataset in the Test column. The Model column describes the model formulation used in each iterative experiment performed to rescue silent trial performance. The classification threshold used to compute sensitivity and specificity was set on the validation set to target 90% sensitivity.
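The table note says the operating threshold was chosen on a validation set to target 90% sensitivity. A minimal sketch of one way to do this is shown below. The function names and the toy scores are hypothetical and are not taken from the study; the authors' actual procedure may differ.

```python
import math

def threshold_for_sensitivity(scores, labels, target=0.90):
    """Return the highest cutoff whose sensitivity is at least `target`.

    A case is predicted positive when its score is >= the cutoff.
    """
    positives = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("need at least one positive case")
    # Number of positive cases that must be flagged to reach the target.
    needed = math.ceil(target * len(positives))
    return positives[needed - 1]

def sensitivity_specificity(scores, labels, cutoff):
    """Compute sensitivity and specificity at a fixed cutoff."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in pairs if y == 1 and s < cutoff)
    tn = sum(1 for s, y in pairs if y == 0 and s < cutoff)
    fp = sum(1 for s, y in pairs if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical validation-set scores (illustration only, not study data).
val_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20,
              0.58, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.15, 0.10, 0.05]
val_labels = [1] * 10 + [0] * 10

cutoff = threshold_for_sensitivity(val_scores, val_labels, target=0.90)
sens, spec = sensitivity_specificity(val_scores, val_labels, cutoff)
print(cutoff, sens, spec)
```

Because the cutoff is fixed on the validation set and then applied unchanged to a later test set, the sensitivity and specificity observed there can differ from the 90% target, as the rows of the table show.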
Figure 2. Dataset drift between our original training set and Silent Trial 1. (A) The shift in age toward younger individuals in the Silent Trial 1 dataset. (B) The shift between left- and right-sided kidneys, in which a larger proportion of right-sided obstructed kidneys was found relative to the left in the Silent Trial 1 set. (C) The qualitative shift in images despite the same cropping and normalization procedures for both datasets.
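The age shift described in panel (A) can be illustrated with the age-group counts from the Total columns of the baseline characteristics table. The sketch below compares the age-group proportions and computes a Pearson chi-square statistic. The choice of test is an assumption made for illustration and is not stated in the source; because several ultrasounds can come from the same patient, the observations are not independent and the statistic should be read as descriptive only.

```python
def proportions(counts):
    """Convert a list of counts into fractions of their total."""
    total = sum(counts)
    return [c / total for c in counts]

def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table (rows of counts)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Age-group counts (<2 years, 2-5 years, >5 years) from the Total columns
# of the baseline characteristics table.
training = [1196, 152, 60]
silent_trial_1 = [487, 34, 1]

frac_train = proportions(training)
frac_st1 = proportions(silent_trial_1)
chi2 = chi_square_statistic([training, silent_trial_1])

# Critical value of the chi-square distribution with 2 degrees of freedom
# at the 0.05 level.
CRITICAL_DF2 = 5.991

print("share under 2 years:", round(frac_train[0], 3), "->", round(frac_st1[0], 3))
print("chi-square exceeds critical value:", chi2 > CRITICAL_DF2)
```

The same two functions can be applied to the laterality or ultrasound-machine rows of the table to screen other variables for drift.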
Figure 3. Original and updated models used to overcome dataset drift. (A) The original model used from the initial dataset. (B) Updated model with covariates for age and kidney laterality, with the goal of overcoming the generalization failure observed on the Silent Trial 1 dataset.
Bias assessment of our final AI model.
| Variable | AUROC | AUPRC | Sensitivity | Specificity |
|---|---|---|---|---|
| Sex | ||||
| Male | 0.91 (0.87, 0.94) | 0.52 (0.42, 0.65) | 0.97 (0.93, 1.00) | 0.53 (0.49, 0.57) |
| Female | 0.96 (0.91, 1.00) | 0.38 (0.12, 0.80) | 1.00 (1.00, 1.00) | 0.59 (0.50, 0.68) |
| Side of hydronephrosis | ||||
| Left | 0.88 (0.84, 0.93) | 0.57 (0.45, 0.72) | 0.97 (0.92, 1.00) | 0.48 (0.43, 0.53) |
| Right | 0.96 (0.91, 0.99) | 0.61 (0.39, 0.86) | 1.00 (1.00, 1.00) | 0.60 (0.50, 0.71) |
| Both | 0.98 (0.96, 0.99) | 0.08 (0.05, 0.30) | 1.00 (1.00, 1.00) | 0.58 (0.52, 0.63) |
| Ultrasound machine | ||||
| Philips | 0.89 (0.83, 0.95) | 0.50 (0.31, 0.71) | 0.96 (0.84, 1.00) | 0.53 (0.46, 0.62) |
| Samsung | 0.92 (0.86, 0.96) | 0.50 (0.30, 0.71) | 1.00 (1.00, 1.00) | 0.58 (0.50, 0.66) |
| Toshiba | 0.93 (0.86, 0.97) | 0.53 (0.39, 0.72) | 0.97 (0.90, 1.00) | 0.53 (0.48, 0.58) |
| Postal code | ||||
| K | 1.00 (1.00, 1.00) | 0.86 (0.67, 0.91) | 1.00 (1.00, 1.00) | 0.50 (0.24, 0.82) |
| L | 0.90 (0.85, 0.95) | 0.49 (0.37, 0.65) | 0.95 (0.89, 1.00) | 0.58 (0.52, 0.63) |
| M | 0.91 (0.86, 0.97) | 0.57 (0.35, 0.76) | 1.00 (1.00, 1.00) | 0.50 (0.45, 0.56) |
| N | NA | NA | NA | 0.75 (0.25, 1.00) |
| P | 1.00 (1.00, 1.00) | 0.86 (0.00, 0.91) | 1.00 (1.00, 1.00) | 0.57 (0.24, 0.89) |
Performance of our model was stratified by sex, side of hydronephrosis, ultrasound machine, and postal code in our Silent Trial 2 set.
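A stratified assessment of this kind can be sketched as follows: group the test records by a subgroup variable and compute the AUROC within each group. The record fields (`sex`, `postal`, `score`, `label`) and the toy values are hypothetical. A subgroup that lacks either class has no defined AUROC and is reported as `None` here, which is one possible reason for an NA entry in a table like the one above.

```python
from collections import defaultdict

def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    Ties count as one half. Returns None if either class is absent.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def stratified_auroc(records, key):
    """Compute the AUROC separately for each value of records[i][key]."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return {
        group: auroc([r["score"] for r in rows], [r["label"] for r in rows])
        for group, rows in groups.items()
    }

# Hypothetical records (illustration only, not study data).
records = [
    {"sex": "M", "postal": "K", "score": 0.9, "label": 1},
    {"sex": "M", "postal": "K", "score": 0.4, "label": 0},
    {"sex": "M", "postal": "L", "score": 0.6, "label": 0},
    {"sex": "M", "postal": "L", "score": 0.5, "label": 1},
    {"sex": "F", "postal": "K", "score": 0.8, "label": 1},
    {"sex": "F", "postal": "N", "score": 0.3, "label": 0},
]

by_sex = stratified_auroc(records, "sex")
by_postal = stratified_auroc(records, "postal")
print(by_sex)
print(by_postal)
```

Small subgroups produce unstable estimates, so confidence intervals (for example by bootstrapping) would normally accompany each value, as in the table.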
Figure 4. User interface for the image-only model. A basic user interface was developed to allow clinicians and researchers who are not computer scientists to test the model. (A) The user interface with no data input, in which a user can specify a sagittal and a transverse ultrasound image file of the kidney, along with an option for the program to further crop the images and a location in which to save the output. The lower half of the interface is blank at this point, as it will display the uploaded images. (B) The user interface once the model has run. It displays the probability of surgery, the original input images following the preprocessing procedure, and gradient-based class activation maps overlaid on the images to indicate which parts of each image are most important for the prediction.