Raelina S Howell1, Helen H Liu1, Aziz A Khan1, Jon S Woods1, Lawrence J Lin2, Mayur Saxena3, Harshit Saxena3, Michael Castellano1,4, Patrizio Petrone1,4, Eric Slone1, Ernest S Chiu2, Brian M Gillette1,5, Scott A Gorenstein1,4.
Abstract
Importance: Accurate assessment of wound area and percentage of granulation tissue (PGT) are important for optimizing wound care and healing outcomes. Artificial intelligence (AI)-based wound assessment tools have the potential to improve the accuracy and consistency of wound area and PGT measurement, while improving efficiency of wound care workflows.

Objective: To develop a quantitative and qualitative method to evaluate AI-based wound assessment tools compared with expert human assessments.

Design, Setting, and Participants: This diagnostic study was performed across 2 independent wound centers using deidentified wound photographs collected for routine care (site 1, 110 photographs taken between May 1 and 31, 2018; site 2, 89 photographs taken between January 1 and December 31, 2019). Digital wound photographs of patients were selected chronologically from the electronic medical records from the general population of patients visiting the wound centers. For inclusion in the study, the complete wound edge and a ruler were required to be visible; circumferential ulcers were specifically excluded. Four wound specialists (2 per site) and an AI-based wound assessment service independently traced wound area and granulation tissue.

Main Outcomes and Measures: The quantitative performance of AI tracings was evaluated by statistically comparing error measure distributions between test AI traces and reference human traces (AI vs human) with error distributions between independent traces by 2 humans (human vs human). Quantitative outcomes included statistically significant differences in error measures of false-negative area (FNA), false-positive area (FPA), and absolute relative error (ARE) between AI vs human and human vs human comparisons of wound area and granulation tissue tracings. Six masked attending physician reviewers (3 per site) viewed randomized area tracings for AI and human annotators and qualitatively assessed them. Qualitative outcomes included statistically significant difference in the absolute difference between AI-based PGT measurements and mean reviewer visual PGT estimates compared with PGT estimate variability measures (ie, range, standard deviation) across reviewers.
Year: 2021 PMID: 34009348 PMCID: PMC8134996 DOI: 10.1001/jamanetworkopen.2021.7234
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Patient Demographic Characteristics
| Characteristic | No. (%) (N = 199) |
|---|---|
| Women | 127 (63.8) |
| Men | 72 (36.2) |
| Age, mean (SD) [range], y | 64 (18) [17-95] |
| Wound types | |
| Venous leg ulcer | 47 (23.6) |
| Pressure ulcer | 41 (20.6) |
| Surgical wound | 32 (16.1) |
| Trauma wound | 25 (12.5) |
| Diabetic foot ulcer | 21 (10.5) |
| Arterial ulcer | 7 (3.5) |
| Abscess | 5 (2.5) |
| Lymphedema | 3 (1.5) |
| Radiation | 3 (1.5) |
| Burn | 2 (1.0) |
| Other | 13 (6.5) |
Figure 1. Overview of Methodology for Evaluation of Artificial Intelligence (AI)–Based Digital Wound Assessment Tools
A, Example images of wound area and granulation tissue tracings by humans and AI for wounds of diverse types, shapes, and sizes. B, Illustration of quantitative method for comparing wound area and granulation tissue tracings between humans and between humans and AI. One human tracing was selected as a reference trace (R), a second tracing (other human or AI) was selected as the test trace (T), and the overlapping area (O) was determined. The error measures (false-negative area [FNA], false-positive area [FPA], and absolute relative error [ARE]) between the reference and test tracings were then calculated.
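The comparison described for panel B can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes tracings are represented as sets of pixel coordinates, and it assumes each error measure is normalized by the reference area, that is FNA = (R - O)/R, FPA = (T - O)/R, and ARE = |T - R|/R; the text above does not state the normalization. The function name `trace_errors` is hypothetical.

```python
def trace_errors(reference, test):
    """Return (FNA, FPA, ARE) for two tracings given as sets of (row, col) pixels.

    Assumption: all three measures are expressed as a fraction of the
    reference area R; the source text does not spell out the normalization.
    """
    r = len(reference)              # reference area R (pixel count)
    if r == 0:
        raise ValueError("reference tracing must not be empty")
    t = len(test)                   # test area T
    o = len(reference & test)       # overlapping area O
    fna = (r - o) / r               # reference area missed by the test trace
    fpa = (t - o) / r               # test area lying outside the reference
    are = abs(t - r) / r            # relative difference in total area
    return fna, fpa, are


# Toy example: a 4x4 reference square and a test square shifted one column right.
reference = {(y, x) for y in range(4) for x in range(4)}
test = {(y, x) for y in range(4) for x in range(1, 5)}
fna, fpa, are = trace_errors(reference, test)
print(fna, fpa, are)
```

In this toy case the two tracings have equal area, so ARE is 0 even though a quarter of the reference is missed. Under these assumed definitions, that is one reason to read FNA and FPA alongside ARE rather than ARE alone.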
Figure 2. Quantitative Evaluation of Human and Artificial Intelligence (AI) Wound Area and Percent Granulation Tissue (PGT) Measurements
A and B, Violin plots showing distributions of wound area (A) and PGT (B) error measures of false-negative area (FNA), false-positive area (FPA), and absolute relative error (ARE) for AI vs human and human vs human comparisons. Dashed lines indicate the median and quartiles of the error measure distributions. Outliers above the 98th percentile are not shown to aid visualization of the distributions but were included in the statistical analysis. C, Scatter plot showing ARE vs the reference wound area for AI vs human and human vs human comparisons.
a P < .05.
Masked Reviewer Survey Responses for the Qualitative Evaluation of Digital Wound Assessments
Values are No./total No. (%) except in the P value rows.

| Question | Annotator | Site 1, R1 | Site 1, R2 | Site 1, R3 | Site 2, R1 | Site 2, R2 | Site 2, R3 |
|---|---|---|---|---|---|---|---|
| 1. Area tracing meets definition? | AI | 42/100 (42.0) | 53/110 (48.2) | 67/110 (60.9) | 65/89 (73.0) | 78/85 (91.8) | 47/89 (52.8) |
| | H1 | 65/100 (65.0) | 59/110 (53.6) | 82/110 (74.5) | 67/89 (75.3) | 79/85 (92.9) | 63/89 (70.8) |
| | H2 | 51/100 (51.0) | 53/110 (48.2) | 72/110 (65.5) | 65/89 (73.0) | 82/85 (96.5) | 55/89 (61.8) |
| | P value | .01 | .73 | .11 | .88 | .41 | .04 |
| 2. Which is AI? | AI | 37/105 (35.2) | 42/109 (38.5) | 42/109 (38.5) | 3/89 (3.4) | 42/85 (49.4) | 24/89 (27.0) |
| | H1 | 39/105 (37.1) | 27/109 (24.8) | 33/109 (30.3) | 36/89 (40.4) | 20/85 (23.5) | 44/89 (49.4) |
| | H2 | 29/105 (27.6) | 40/109 (36.7) | 34/109 (31.2) | 50/89 (56.2) | 23/85 (27.1) | 21/89 (23.6) |
| | P value | .51 | .21 | .48 | <.001 | .004 | .004 |
| 3. Which is most accurate? | AI | 32/91 (35.2) | 39/108 (36.1) | 35/109 (32.1) | 19/89 (21.3) | 25/85 (29.4) | 24/89 (27.0) |
| | H1 | 27/91 (29.7) | 32/108 (29.6) | 42/109 (38.5) | 48/89 (53.9) | 38/85 (44.7) | 44/89 (49.4) |
| | H2 | 32/91 (35.2) | 37/108 (34.3) | 32/109 (29.4) | 22/89 (24.7) | 22/85 (25.9) | 21/89 (23.6) |
| | P value | .78 | .76 | .45 | <.001 | .04 | .004 |
Abbreviations: AI, artificial intelligence; H, human; R, reviewer.
For question 1, P values are from the Fisher exact test; P < .05 indicates a statistically significant difference in the frequency of yes answers between AI and human traces. For the selection questions, P values are from the χ2 test; P < .05 indicates statistically significant bias in the frequency of selection vs random selection.
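One way to read "bias in frequency of selection vs random selection" is as a χ2 goodness-of-fit test against equal expected counts for the three annotators. The sketch below illustrates that reading only. Whether the authors used exactly this formulation is not stated here, so results for individual cells may differ from the published P values. The helper names `chi2_uniform` and `p_value_df2` are hypothetical; the counts are taken from the table (site 2, R1, "Which is AI?").

```python
import math


def chi2_uniform(counts):
    """Chi-square goodness-of-fit statistic against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def p_value_df2(chi2):
    """Upper-tail P value for a chi-square statistic with 2 degrees of freedom.

    For df = 2 the survival function has the closed form exp(-x / 2),
    which applies here because three categories give df = 3 - 1 = 2.
    """
    return math.exp(-chi2 / 2)


# Selections of AI, H1, and H2 by site 2 reviewer R1 for "Which is AI?"
counts = [3, 36, 50]
stat = chi2_uniform(counts)
p = p_value_df2(stat)
print(f"chi2 = {stat:.2f}, P < .001: {p < 0.001}")
```

Under random selection each annotator would be picked about 89/3 times, so the very low count for the AI trace drives the large statistic in this cell.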
Figure 3. Quantitative Evaluation of Blinded Reviewer and Artificial Intelligence (AI) Percent Granulation Tissue (PGT) Assessments
A, Histogram of absolute difference between the mean of the 3 reviewers’ visual PGT estimates and the AI PGT measurement for the subset of photographs with AI granulation tissue tracings at each site. B and C, Histograms of variability measures (range and SD) of the 3 reviewers’ visual PGT estimates for all photographs at each site.
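The per-photograph quantities plotted in these histograms can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `pgt_agreement` and the example numbers are hypothetical, and the sample standard deviation is assumed because the text does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def pgt_agreement(reviewer_estimates, ai_pgt):
    """Summarize one photograph: AI vs mean reviewer PGT, plus reviewer spread.

    reviewer_estimates: visual PGT estimates (percent) from the reviewers.
    ai_pgt: PGT (percent) derived from the AI tracing.
    """
    return {
        # Absolute difference between the AI value and the reviewers' mean
        "abs_diff": abs(ai_pgt - mean(reviewer_estimates)),
        # Variability across reviewers: range and (assumed) sample SD
        "range": max(reviewer_estimates) - min(reviewer_estimates),
        "sd": stdev(reviewer_estimates),
    }


# Hypothetical example: three reviewers estimate 60%, 70%, and 80%; AI reports 75%.
summary = pgt_agreement([60, 70, 80], 75)
print(summary)
```

Comparing `abs_diff` with `range` and `sd` across photographs shows whether the AI measurement falls within the spread already present among human reviewers.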