Carlo Reverberi, Tommaso Rigon, Aldo Solari, Cesare Hassan, Paolo Cherubini, Andrea Cherubini.
Abstract
Artificial Intelligence (AI) systems are a valuable support for decision-making, with many applications in the medical domain. The interaction between MDs and AI enjoys renewed interest following the increased possibilities of deep learning devices. However, we still have limited evidence-based knowledge of the context, design, and psychological mechanisms that shape an optimal human-AI collaboration. In this multicentric study, 21 endoscopists reviewed 504 videos of lesions prospectively acquired from real colonoscopies. They were asked to provide an optical diagnosis with and without the assistance of an AI support system. Endoscopists were influenced by the AI ([Formula: see text]), but not erratically: they followed the AI advice more when it was correct ([Formula: see text]) than when it was incorrect ([Formula: see text]). Endoscopists achieved this outcome through a weighted integration of their own and the AI's opinions, based on case-by-case estimates of the two reliabilities. This Bayesian-like rational behavior allowed the human-AI hybrid team to outperform both agents taken alone. We discuss the features of the human-AI interaction that determined this favorable outcome.
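The "weighted integration" described in the abstract can be sketched as a reliability-weighted combination of two opinions in log-odds space. This is an illustrative formalization only, not the authors' model: the function `integrate_opinions` and the weights `w_human`/`w_ai` are our assumptions for how such a Bayesian-like combination could look.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def integrate_opinions(p_human, p_ai, w_human=1.0, w_ai=1.0):
    """Combine two probabilistic opinions that a lesion is an adenoma
    by a reliability-weighted sum of their log-odds (illustrative sketch,
    not the model estimated in the study)."""
    z = w_human * logit(p_human) + w_ai * logit(p_ai)
    return 1.0 / (1.0 + math.exp(-z))
```

With equal weights, two agreeing opinions reinforce each other (e.g. 0.6 and 0.9 combine to roughly 0.93), while perfectly opposed opinions cancel to 0.5; setting a weight to zero reduces the output to the other agent's opinion, mimicking complete under- or over-reliance.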
Year: 2022 PMID: 36056152 PMCID: PMC9440124 DOI: 10.1038/s41598-022-18751-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Left panel: The stimuli used in the experiment were prospectively collected in a real-world clinical setting using an AI medical device supporting MDs for lesion detection (CADe) and categorization (CADx) as adenomatous or non-adenomatous[24]. Right panel: An international group of endoscopists was asked to optically diagnose the same set of lesions, presented as short video clips, in two experimental sessions. In the first session (top-right panel) the AI only highlights the target lesion, while in the second session (bottom-right panel) the AI also dynamically offers an optical diagnosis. For more details on the AI device see Appendix A.1.2.
Figure 2. MD-AI team. An endoscopist subject to under-reliance discounts the added information given by the AI (a). An endoscopist subject to over-reliance uncritically accepts the AI suggestion (b). The optimal use of AI should rest on an in-between, well-calibrated approach in which the endoscopist uses the AI's opinion to coherently revise the confidence in their initial evaluation. In this way, the medical decision-making process benefits from a collaboration between the two intelligences (c).
Measured and transformed variables for each of the 21 subjects and 504 lesions.
| Variable name | Description |
|---|---|
| Histologic evaluation | The ground truth of each lesion. Its possible values were: “Adenoma”, “Non-Adenoma” |
| Human judgment, S1 and S2 | The optical diagnosis of the lesion by an endoscopist in each session, mapped as mentioned above. It takes the values: “Adenoma”, “Non-adenoma”, and “Uncertain” |
| Human confidence, S1 and S2 | The confidence of the previous judgment expressed by the endoscopist. It takes the values: “Very high”, “High”, “Low”, “Very low”, “Uncertain”. We classified the confidence as “Uncertain” whenever the associated lesion evaluation was “Uncertain” |
| Algorithmic judgment | The diagnosis of a given lesion provided by the AI |
| Perceived algorithmic judgment | The endoscopists’ interpretation of the AI’s response |
| Evaluation of algorithmic confidence | The endoscopists’ appreciation of the level of reliability of the AI’s judgment |
| Human correct diagnosis, S1 and S2 | A binary variable that indicates whether each lesion was correctly diagnosed by each endoscopist, in each session |
| AI accuracy | A binary variable indicating whether each lesion is correctly diagnosed by the AI |
| Perceived AI accuracy | A binary variable indicating whether each lesion is correctly diagnosed by the AI as perceived by the endoscopist |
| Confidence score, S1 and S2 | A discrete numerical variable ranging from 1 to 9 that measures the belief of each endoscopist about each lesion in each session. A score of 9 indicates a strong belief that the lesion is an adenoma. At the other extreme, a score of 1 denotes a strong belief that the lesion is not an adenoma. A score of 5 indicates an “Uncertain” diagnosis |
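The 1-9 confidence score folds the categorical judgment and its confidence level into a single ordinal scale. A minimal sketch of one plausible encoding consistent with the description above (the exact mapping is our reading of the table, not code from the study):

```python
# Step sizes for each stated confidence level (assumed ordering).
CONF_STEPS = {"Very low": 1, "Low": 2, "High": 3, "Very high": 4}

def confidence_score(judgment, confidence=None):
    """Map a (judgment, confidence) pair to the 1-9 scale described in the
    variables table: 9 = strong belief in adenoma, 1 = strong belief in
    non-adenoma, 5 = uncertain. Illustrative encoding, not the study's code."""
    if judgment == "Uncertain":
        return 5
    step = CONF_STEPS[confidence]
    if judgment == "Adenoma":
        return 5 + step          # 6..9
    if judgment == "Non-adenoma":
        return 5 - step          # 1..4
    raise ValueError(f"unknown judgment: {judgment!r}")
```

For example, an "Adenoma" call with "Very high" confidence maps to 9, a "Non-adenoma" call with "Very high" confidence to 1, and any "Uncertain" evaluation to 5.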
Odds ratios (OR) for each of the main endpoints. Confidence intervals for the odds ratios are reported in brackets.
| Endpoint | Estimate |
|---|---|
| 1. Influence of the AI | 3.05 [2.76, 3.39] |
| 2. Diagnostic accuracy | 1.39 [1.28, 1.51] |
| 3. Effectiveness | 3.48 [3.07, 3.98] |
| 4. Safety | 0.54 [0.48, 0.62] |
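The endpoints above are reported as odds ratios with confidence intervals. As a generic illustration of what such an estimate means, the sketch below computes an odds ratio and its Wald confidence interval from a 2x2 contingency table; the counts are invented, and the paper's estimates come from its own statistical models, not from raw 2x2 tables like this.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table
    [[a, b], [c, d]] of counts (e.g. correct/incorrect diagnoses
    with vs. without AI assistance). Illustrative only."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, (lower, upper)
```

For instance, `odds_ratio_ci(40, 10, 20, 30)` yields an odds ratio of 6.0 with an interval that brackets it; an OR above 1 (as for endpoints 1-3) indicates increased odds, while an OR below 1 (as for the safety endpoint) indicates decreased odds.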
Figure 3. Influence of the AI: alluvial diagrams representing changes in the endoscopists’ opinions between the two sessions as a function of the perceived AI response.
Proportions and sample sizes of correct human diagnoses in S1 and S2, and of perceived correct AI diagnoses, against different human confidence levels and perceived AI confidence levels, respectively. Following the standard in the field, accuracy does not consider lesions where the AI opinion was perceived as “Uncertain” or was not noticed. Evaluations of the AI’s confidence were asked only when its opinion was “Adenoma” or “Non-Adenoma”.
| Confidence | Very low | Low | High | Very high | Overall |
|---|---|---|---|---|---|
| S1 accuracy | 0.644 (236) | 0.685 (2665) | 0.806 (5263) | 0.853 (2241) | 0.768 (10,584) |
| S2 accuracy | 0.543 (184) | 0.679 (1859) | 0.839 (5235) | 0.882 (3094) | 0.802 (10,584) |
| Perceived AI accuracy | 0.667 (216) | 0.718 (1456) | 0.863 (4608) | 0.909 (2807) | 0.849 (9086) |
Change in agreement between the endoscopists and the AI, measured as the number of times an endoscopist changed their opinion to follow the AI’s suggestion. We report proportions and sample sizes of the change in agreement for different human confidence levels (S1) and perceived AI confidence levels, respectively.
| Confidence | Very low | Low | High | Very high |
|---|---|---|---|---|
| Human conf. | 0.703 (64) | 0.738 (577) | 0.668 (689) | 0.598 (194) |
| Perceived AI conf. | 0.278 (97) | 0.438 (457) | 0.827 (684) | 0.888 (286) |
Odds ratios (OR) for each endpoint, estimated separately for experts and non-experts. Confidence intervals are reported in brackets.
| Endpoint | Expert | Non-expert |
|---|---|---|
| 1. Influence of the AI | 2.88 [2.48, 3.34] | 3.20 [2.80, 3.65] |
| 2. Diagnostic accuracy | 1.15 [1.01, 1.30] | 1.61 [1.44, 1.79] |
| 3. Effectiveness | 3.22 [2.64, 3.93] | 3.65 [3.11, 4.28] |
| 4. Safety | 0.45 [0.37, 0.54] | 0.63 [0.54, 0.75] |