Bhavik N Patel1, Louis Rosenberg2, Gregg Willcox2, David Baltaxe2, Mimi Lyons2, Jeremy Irvin3, Pranav Rajpurkar3, Timothy Amrhein4, Rajan Gupta4, Safwan Halabi1, Curtis Langlotz1, Edward Lo1, Joseph Mammarappallil4, A J Mariano1, Geoffrey Riley1, Jayne Seekins1, Luyao Shen1, Evan Zucker1, Matthew Lungren1.
Abstract
Human-in-the-loop (HITL) AI may enable an ideal symbiosis of human experts and AI models, harnessing the advantages of both while at the same time overcoming their respective limitations. The purpose of this study was to investigate a novel collective intelligence technology designed to amplify the diagnostic accuracy of networked human groups by forming real-time systems modeled on biological swarms. Using small groups of radiologists, the swarm-based technology was applied to the diagnosis of pneumonia on chest radiographs and compared against human experts alone, as well as two state-of-the-art deep learning AI models. Our work demonstrates that both the swarm-based technology and deep-learning technology achieved higher diagnostic accuracy than the human experts alone. Our work further demonstrates that when used in combination, the swarm-based technology and deep-learning technology outperformed either method alone. The superior diagnostic accuracy of the combined HITL AI solution compared to radiologists and AI alone has broad implications for the surging clinical deployment of AI and for implementation strategies in future practice.
Keywords: Computer science; Radiography
Year: 2019 PMID: 31754637 PMCID: PMC6861262 DOI: 10.1038/s41746-019-0189-7
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Diagnostic performance parametersᵃ for individual participants, swarm sessions, and AI models.
| Participants | No. of correct (%) | Mean absolute error | Brier score | AUC | F1 score |
|---|---|---|---|---|---|
| Swarm sessions | |||||
| Group A ( | |||||
| Individual average | 37 (75) [35, 40] | 0.269 [0.231, 0.306] | 0.188 [0.152, 0.274] | 0.763* [0.709, 0.817] | 0.687 [0.606, 0.736] |
| Crowd-based majority | 39 (78) [33, 44] | N/A | N/A | N/A | 0.686 [0.533, 0.872] |
| Crowd-based mean probability | 40 (80) [34, 45] | 0.269 [0.198, 0.347] | 0.145 [0.084, 0.213] | 0.838 [0.686, 0.963] | 0.722 [0.533, 0.872] |
| Swarm interpolation | 42 (84) [37, 47] | 0.235 [0.60, 0.324] | 0.139 [0.070, 0.225] | 0.840 [0.691, 0.937] | 0.778 [0.588, 0.905] |
| Group B ( | |||||
| Individual average | 39 (78)¶ [37, 41] | 0.260¶ [0.226, 0.295] | 0.166¶ [0.135, 0.199] | 0.814¶ [0.755, 0.870] | 0.717¶ [0.626, 0.771] |
| Crowd-based majority | 40 (80) [34, 45] | N/A | N/A | N/A | 0.706 [0.483, 0.864] |
| Crowd-based mean probability | 40 (80) [34, 45] | 0.260¶ [0.189, 0.334] | 0.135 [0.079, 0.208] | 0.873 [0.730, 0.969] | 0.706 [0.483, 0.864] |
| Swarm interpolation | 42 (84) [37, 47] | 0.231 [0.163, 0.314] | 0.128 [0.069, 0.202] | 0.883 [0.751, 0.964] | 0.778 [0.600, 0.909] |
| Combined ( | |||||
| Individual average | 38 (76)† [36.5, 40] | 0.266† [0.240, 0.344] | 0.179† [0.154, 0.275] | 0.785† [0.740, 0.957] | 0.698† [0.635, 0.731] |
| Crowd-based majority | 40 (80)Ø [34, 45] | N/A | N/A | N/A | 0.722† [0.529, 0.867] |
| Crowd-based mean probability | 40 (80)Ø [34, 45] | 0.264† [0.196, 0.344] | 0.140 [0.083, 0.221] | 0.853 [0.686, 0.960] | 0.722Ø [0.529, 0.867] |
| Swarm interpolation | 84 (84) [78, 90] | 0.233 [0.177, 0.279] | 0.134 [0.096, 0.175] | 0.868 [0.801, 0.933] | 0.778 [0.685, 0.862] |
| Deep-learning models | |||||
| CheXNet | 35 (70)‡† [29, 41] | 0.397¶† [0.336, 0.461] | 0.210¶† [0.152, 0.274] | 0.685*¶† [0.520, 0.854] | 0.545¶† [0.333, 0.733] |
| CheXMax | 41 (82) [35, 46] | 0.357† [0.249, 0.476] | 0.287† [0.184, 0.389] | 0.938† [0.864, 0.994] | 0.800 [0.667, 0.917] |
| Augmented HITL model (combined swarm and CheXMax) | 91 (91)†ϕ [86, 96] | 0.356† [0.297, 0.418] | 0.287†Ψ [0.211, 0.319] | N/Aᵇ | 0.886†ϕ [0.819, 0.945] |
N/A not applicable
*Indicates a statistically significant difference (p < 0.05) compared to group A swarm interpolation
¶Indicates a statistically significant difference (p < 0.01) compared to group B swarm interpolation
‡Indicates a statistically significant difference (p < 0.05) compared to group B swarm interpolation
†Indicates a statistically significant difference (p < 0.01) compared to combined swarm interpolation
ØIndicates a statistically significant difference (p < 0.05) compared to combined swarm interpolation
ϕIndicates a statistically significant difference (p < 0.01) compared to CheXMax
ΨIndicates a statistically significant difference (p < 0.05) compared to CheXMax
ᵃ Data reported as mean [95% confidence interval] as applicable, unless otherwise specified
ᵇ AUC not applicable here, as the distributions of probabilities for the swarm and CheXMax are centered about different averages (i.e., 50% vs. 4%, respectively)
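As a reading aid for the columns above, the following minimal sketch shows how the number of correct diagnoses, mean absolute error, Brier score, and F1 score are conventionally computed from probabilistic diagnoses and binary ground-truth labels. The function name `diagnostic_metrics` and the default decision threshold of 0.5 are assumptions made for illustration; the source does not state the threshold or the implementation the authors used.

```python
from typing import Dict, Sequence


def diagnostic_metrics(
    probs: Sequence[float],
    labels: Sequence[int],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> Dict[str, float]:
    """Standard definitions of the tabulated metrics for binary diagnoses."""
    n = len(labels)
    # Binarize each probability at the (assumed) threshold.
    preds = [1 if p >= threshold else 0 for p in probs]

    tp = sum(1 for y, d in zip(labels, preds) if y == 1 and d == 1)
    fp = sum(1 for y, d in zip(labels, preds) if y == 0 and d == 1)
    fn = sum(1 for y, d in zip(labels, preds) if y == 1 and d == 0)
    correct = sum(1 for y, d in zip(labels, preds) if y == d)

    # Mean absolute error: average distance between probability and truth.
    mae = sum(abs(p - y) for p, y in zip(probs, labels)) / n
    # Brier score: average squared distance between probability and truth.
    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    # F1: harmonic mean of precision and recall, written in count form.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    return {
        "n_correct": correct,
        "accuracy": correct / n,
        "mae": mae,
        "brier": brier,
        "f1": f1,
    }
```

Lower values are better for mean absolute error and Brier score, while higher values are better for the number of correct diagnoses and F1.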
Sensitivity and specificityᵃ for individual participants, swarm sessions, and AI models.
| Participants | Sensitivity | Specificity |
|---|---|---|
| Swarm sessions | ||
| Group A ( | ||
| Individual average | 0.642 [0.579, 0.709] | 0.819* [0.777, 0.862] |
| Crowd-based majority | 0.650 [0.412, 0.783] | 0.900 [0.800, 0.972] |
| Crowd-based mean probability | 0.650 [0.462, 0.824] | 0.900 [0.806, 1.00] |
| Swarm interpolation | 0.700 [0.526, 0.875] | 0.933 [0.852, 1.00] |
| Group B ( | ||
| Individual average | 0.633 [0.558, 0.704] | 0.883 [0.845, 0.920] |
| Crowd-based majority | 0.600 [0.421, 0.789] | 0.933 [0.846, 1.00] |
| Crowd-based mean probability | 0.600 [0.421, 0.789] | 0.933 [0.846, 1.00] |
| Swarm interpolation | 0.700 [0.500, 0.867] | 0.933 [0.844, 1.00] |
| Combined ( | ||
| Individual average | 0.519ϕ† [0.471, 0.568] | 0.690 [0.654, 0.724] |
| Crowd-based majority | 0.625ϕØ [0.477, 0.721] | 0.917 [0.857, 0.968] |
| Crowd-based mean probability | 0.625ϕ† [0.500, 0.744] | 0.917Ø [0.852, 0.968] |
| Swarm interpolation | 0.700Ψ [0.578, 0.814] | 0.933ϕ [0.855, 0.968] |
| Deep-learning models | ||
| CheXNet | 0.450*‡† [0.326, 0.579] | 0.867 [0.793, 0.932] |
| CheXMax | 0.900^‡† [0.773, 1.00] | 0.767*‡† [0.672, 0.857] |
| Augmented HITL model (combined swarm and CheXMax) | 0.875*‡† [0.783, 0.956] | 0.933ϕ [0.877, 0.983] |
N/A not applicable
^Indicates a statistically significant difference (p < 0.01) compared to group A swarm interpolation
*Indicates a statistically significant difference (p < 0.05) compared to group A swarm interpolation
¶Indicates a statistically significant difference (p < 0.01) compared to group B swarm interpolation
‡Indicates a statistically significant difference (p < 0.05) compared to group B swarm interpolation
†Indicates a statistically significant difference (p < 0.01) compared to combined swarm interpolation
ØIndicates a statistically significant difference (p < 0.05) compared to combined swarm interpolation
ϕIndicates a statistically significant difference (p < 0.01) compared to CheXMax
ΨIndicates a statistically significant difference (p < 0.05) compared to CheXMax
ᵃ Data reported as mean [95% confidence interval] as applicable, unless otherwise specified
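The table reports sensitivity and specificity with 95% confidence intervals. A minimal sketch of one common way to obtain such intervals, a percentile bootstrap over cases, is shown below. The resampling scheme, the number of resamples, the seed, and the function names `sens_spec` and `bootstrap_ci` are assumptions for illustration and are not confirmed by the source.

```python
import math
import random
from typing import List, Sequence, Tuple


def sens_spec(preds: Sequence[int], labels: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for d, y in zip(preds, labels) if d == 1 and y == 1)
    fn = sum(1 for d, y in zip(preds, labels) if d == 0 and y == 1)
    tn = sum(1 for d, y in zip(preds, labels) if d == 0 and y == 0)
    fp = sum(1 for d, y in zip(preds, labels) if d == 1 and y == 0)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def bootstrap_ci(
    preds: Sequence[int],
    labels: Sequence[int],
    n_boot: int = 2000,   # assumed number of resamples
    alpha: float = 0.05,  # 95% interval
    seed: int = 0,
) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Percentile-bootstrap intervals for (sensitivity, specificity)."""
    rng = random.Random(seed)
    n = len(labels)
    sens_samples: List[float] = []
    spec_samples: List[float] = []
    for _ in range(n_boot):
        # Resample cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        s, c = sens_spec([preds[i] for i in idx], [labels[i] for i in idx])
        # Skip resamples in which a metric is undefined (no positives/negatives).
        if not math.isnan(s):
            sens_samples.append(s)
        if not math.isnan(c):
            spec_samples.append(c)

    def interval(samples: List[float]) -> Tuple[float, float]:
        ordered = sorted(samples)
        lo = ordered[int((alpha / 2) * (len(ordered) - 1))]
        hi = ordered[int((1 - alpha / 2) * (len(ordered) - 1))]
        return lo, hi

    return interval(sens_samples), interval(spec_samples)
```

With few cases the intervals become wide, which is consistent with the broad ranges reported for the group-level rows.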
Fig. 1 Bootstrapped average AUC curves. The AUC curves show that the swarms (blue bars) outperform the radiologists (orange bars) of group A (left image), group B (middle image), and the combined group (right image) in diagnosing pneumonia. The swarms also outperform CheXNet (green bars).
Fig. 2 Scatterplot of swarm vs. CheXMax probabilistic diagnoses, with cases colored by ground truth. The scatterplots show that CheXMax and human swarms assign very different probabilities to each case (left image). The gray “Augmented Cases” range shows cases that were sent from CheXMax to the swarm for augmentation. CheXMax has a high incidence of true positives (blue-colored cases below the horizontal CheXMax Threshold line), but when CheXMax gives a weak positive diagnosis (between 0.04008 and 0.055 on the y-axis), it is often incorrect: only 11 of these 15 cases are classified correctly, an accuracy of 73%. Using a human swarm to re-classify these weak positive cases results in correctly labeling 14 of the 15 cases (93%), an accuracy improvement of 20 percentage points. The cases on which the two diagnostic methods disagreed are more clearly visualized in the scatterplot of diagnostic disagreement (right image).
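The routing described in this caption can be sketched as a simple rule: a case whose CheXMax probability falls in the weak-positive band is decided by the swarm, and every other case keeps the CheXMax decision. The two CheXMax bounds come from the caption; the swarm cut-off of 0.5, the inclusive band endpoints, and the function name `augmented_diagnosis` are assumptions for illustration, not the authors' implementation.

```python
CHEXMAX_THRESHOLD = 0.04008   # CheXMax decision threshold (from the caption)
WEAK_POSITIVE_UPPER = 0.055   # upper edge of the weak-positive band (from the caption)
SWARM_THRESHOLD = 0.5         # assumed swarm cut-off, not stated in the caption


def augmented_diagnosis(chexmax_prob: float, swarm_prob: float) -> int:
    """Return 1 (pneumonia) or 0 (no pneumonia) for a single case."""
    if CHEXMAX_THRESHOLD <= chexmax_prob <= WEAK_POSITIVE_UPPER:
        # Weak positive from CheXMax: defer to the human swarm.
        return int(swarm_prob >= SWARM_THRESHOLD)
    # Confident case: keep the CheXMax decision.
    return int(chexmax_prob >= CHEXMAX_THRESHOLD)


# Arithmetic from the caption: 11 of 15 correct before, 14 of 15 after.
accuracy_before = 11 / 15
accuracy_after = 14 / 15
improvement_points = round(100 * (accuracy_after - accuracy_before))  # 20
```

For example, `augmented_diagnosis(0.05, 0.111)` defers to the swarm and returns 0, while `augmented_diagnosis(0.30, 0.111)` keeps the CheXMax positive and returns 1.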
Fig. 3 Case examples. Each of the three rows (a–c) represents a different patient. The grayscale image is on the left, with the corresponding class activation map to its right. The top row (a) shows a patient with pneumonia in the left lung, correctly predicted by CheXMax but incorrectly by the swarm. The middle row (b) shows a patient with metastatic disease but without pneumonia, correctly predicted by the swarm and incorrectly by CheXMax. The bottom row (c) shows an augmented case, where CheXMax provided a low-confidence positive prediction (p = 0.41) but the swarm correctly predicted the case as negative.
Fig. 4 Sensitivity analysis of augmented model accuracy. The shape of the average accuracy line shows a consistent increase in the accuracy of the augmented model when the 0–14% lowest-confidence cases are sent to the swarm, from the 82% accuracy of CheXMax alone (sending 0% of cases) to 90% when the 14% of lowest-confidence positive and negative cases are sent to the swarm. The model performs similarly when 16–32% of cases are sent to the swarm, achieving between 88% and 92% accuracy across this range. If more than 32% of cases are sent to the swarm, the accuracy of the system decreases until the limit of sending all diagnoses to the swarm is reached (100% of cases swarmed), where the accuracy returns to the swarm score of 84%.
Fig. 5 Sensitivity analysis of accuracy increase relative to CheXMax. The sensitivity analysis shows a band between 6% and 34% in which the 90% confidence interval of the accuracy increase lies entirely above 0%. This indicates that when between 6% and 34% of the lowest-confidence cases are sent to the swarm using this method, there is high confidence that the augmented model would diagnose the cases more accurately than CheXMax alone. If the range is limited to between 14% and 28%, the average improvement in accuracy is 7.75 percentage points.
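The sweep behind these two sensitivity analyses can be sketched as follows: rank cases by how close the CheXMax probability lies to its decision threshold, send a chosen fraction of the closest (lowest-confidence) cases to the swarm, and score the combined decisions. Using distance from the threshold as the confidence measure, the swarm cut-off of 0.5, and the function name `augmented_accuracy` are all assumptions for illustration; the source does not specify how confidence was ranked.

```python
from typing import Sequence


def augmented_accuracy(
    chexmax_probs: Sequence[float],
    swarm_probs: Sequence[float],
    labels: Sequence[int],
    fraction: float,
    chexmax_threshold: float = 0.04008,  # threshold quoted in the Fig. 2 caption
    swarm_threshold: float = 0.5,        # assumed swarm cut-off
) -> float:
    """Accuracy when `fraction` of the lowest-confidence cases go to the swarm."""
    n = len(labels)
    k = round(fraction * n)
    # Assumed confidence measure: distance of the CheXMax probability
    # from its decision threshold (smaller distance = lower confidence).
    order = sorted(range(n), key=lambda i: abs(chexmax_probs[i] - chexmax_threshold))
    to_swarm = set(order[:k])

    correct = 0
    for i in range(n):
        if i in to_swarm:
            pred = int(swarm_probs[i] >= swarm_threshold)
        else:
            pred = int(chexmax_probs[i] >= chexmax_threshold)
        correct += int(pred == labels[i])
    return correct / n
```

By construction, `fraction=0.0` reproduces the accuracy of CheXMax alone and `fraction=1.0` reproduces the accuracy of the swarm alone, matching the two end points described in the captions.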
Fig. 6 Bootstrapped average specificity and sensitivity of aggregate diagnostic methods. The bootstrapped specificity histograms show that the swarms in the combined group (blue bars) outperform CheXMax (green bars) in terms of specificity (left image), but CheXMax outperforms the swarms in terms of sensitivity (right image). The HITL Combined model combines the best of both the CheXMax and swarm diagnostic methods, attaining swarm-level specificity and CheXMax-level sensitivity.
Fig. 7 Swarm platform. A system diagram (left image) of the Swarm platform shows the connection of networked human users. A Swarm engine algorithm receives continuous input from the humans as they are making their decision and provides real-time collaborative feedback to them, creating a dynamic feedback loop. The Swarm platform is positioned next to a second screen for viewing the radiograph (middle image). A snapshot (right image) of the real-time swarm of six radiologists (group B) shows small magnets controlled by radiologists pulling on the circular puck in the process of collectively converging towards a probability of pneumonia. To view a video of the above question being answered in the Swarm platform, visit the following link: https://unanimous.ai/wp-content/uploads/2019/05/Radiology-Swarm.gif.
Fig. 8 Support density visualization. In this support density visualization corresponding to the swarm in Fig. 1, the puck’s trajectory is shown as a white dotted line, and the distribution of force over the hex is plotted as a Gaussian kernel density heatmap. Notice that this swarm was split between the “5–25%” and “0–5%” bins, and more force was directed towards the “5–25%” bin. This aggregate behavior is reflected in the swarm’s interpolated diagnosis of 11.1%.