Philippe Burlina1,2,3, Neil Joshi1, William Paul1, Katia D Pacheco4, Neil M Bressler2.
Abstract
Purpose: This study evaluated generative methods to potentially mitigate artificial intelligence (AI) bias when diagnosing diabetic retinopathy (DR) resulting from training data imbalance or domain generalization, which occurs when deep learning systems (DLSs) face concepts at test/inference time they were not initially trained on.
Year: 2021 PMID: 34003898 PMCID: PMC7884292 DOI: 10.1167/tvst.10.2.13
Source DB: PubMed Journal: Transl Vis Sci Technol ISSN: 2164-2591 Impact factor: 3.283
Characteristic Table Showing the Number of Samples Used for Each Population Including: HL (Healthy Lighter-Skin), RL (Referable DR Lighter-Skin), HD (Healthy Darker-Skin), RD (Referable Darker-Skin), and H (Total Healthy), R (Total Referable DR), LS (Total Lighter-Skin) and DS (Total Darker-Skin), and Broken Down by Rows Corresponding to the Training (Including Training and Validation) for Both Baseline and Debiased DLS, as Well as Test Datasets. Healthy Implies no Diabetic Retinopathy (DR) Warranting Referral to a Health Care Provider.
| | Healthy; Lighter-Skin (HL) | Referable-DR; Lighter-Skin (RL) | Healthy; Darker-Skin (HD) | Referable-DR; Darker-Skin (RD) | Total | Healthy (H) | Referable (R) | Lighter-Skin (LS) | Darker-Skin (DS) |
|---|---|---|---|---|---|---|---|---|---|
| Train (baseline) | 5330 | 10,660 | 5330 | 0 | 21,320 | 10,660 | 10,660 | 15,990 | 5330 |
| Train (debiased) | 10,660 | 10,660 | 10,660 | 10,660 | 42,640 | 21,320 | 21,320 | 21,320 | 21,320 |
| Test | 100 | 100 | 100 | 100 | 400 | 200 | 200 | 200 | 200 |
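The healthy counts in the Train (debiased) row (10,660 each for HL and HD) are oversampled by two to maintain the balance of healthy and diseased factors.
The 10,660 referable darker-skin (RD) images in the Train (debiased) row are synthetic images.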
Characteristic Table Showing the Number of Samples Used for Each Population Broken Down by the Original 5-Class Severity of Diabetic Retinopathy, Which Includes: DR0-4 for Both Lighter-Skin and Darker-Skin Individuals
| | LS DR0 | LS DR1 | LS DR2 | LS DR3 | LS DR4 | DS DR0 | DS DR1 | DS DR2 | DS DR3 | DS DR4 |
|---|---|---|---|---|---|---|---|---|---|---|
| Train (baseline) | 4880 | 450 | 8312 | 1346 | 1002 | 4828 | 502 | 0 | 0 | 0 |
| Test | 90 | 10 | 68 | 18 | 14 | 80 | 20 | 67 | 16 | 17 |
Note that numbers for the debiased dataset are not given because there is no original 5-class grade for synthetic images.
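The binary counts in the first table can be recovered from these 5-class counts. The following is a minimal Python sketch, assuming (as the Figure 1 caption implies) that DR0 and DR1 count as healthy and DR2 to DR4 as referable. The names `baseline_train` and `to_binary` are illustrative and do not come from the study.

```python
# Baseline training counts per DR grade (DR0..DR4), copied from the table above.
baseline_train = {
    "LS": [4880, 450, 8312, 1346, 1002],
    "DS": [4828, 502, 0, 0, 0],
}


def to_binary(grades, referable_from=2):
    """Collapse 5-class counts into (healthy, referable) counts.

    Assumption: grades below `referable_from` (DR0, DR1) are healthy,
    the remaining grades (DR2..DR4) are referable.
    """
    return sum(grades[:referable_from]), sum(grades[referable_from:])


binary = {group: to_binary(grades) for group, grades in baseline_train.items()}
print(binary)  # {'LS': (5330, 10660), 'DS': (5330, 0)}
```

The result matches the Train (baseline) row of the first table: 5330 healthy and 10,660 referable lighter-skin images, and 5330 healthy but no referable darker-skin images.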
Figure 1. Each pair shows a left and a right synthetically created image, demonstrating alterations made automatically by our generative methods, which take a synthetic retinal image as input (left) and generate a new retinal image of a darker-skin individual (as defined in the Methods section) with diabetic retinopathy warranting referral to a health care provider. These generative methods are used in this study to generate the images originally missing from the training set (i.e., referable DR from darker-skin individuals), whose absence creates imbalance and bias. Pairs (a1) to (a3) illustrate how the proposed generative methods take an input retinal image of a darker-skin individual (left) and accentuate the attribute “DR-referable” in the output image (right), leaving unchanged the amount of coloration reflective of the melanin concentration within the uveal melanocytes and all other markers. The first pair (a1) starts from a retina that is not referable but of a darker-skin individual (left image has DR level 0 or 1, i.e., no or mild DR) and converts it into one that is referable (right image is DR level 2, i.e., moderate DR) while minimally changing other attributes of the retina (the right image is also from a darker-skin individual and the vasculature is unchanged). Likewise, the left image of the pair in (a2) is of a non-referable retina from a darker-skin individual (DR level 0 or 1), and our method accentuates the referable attribute to make it referable (right image is DR level 2). The same explanation applies to (a3).
Pairs (b1) to (b3) instead demonstrate our complementary approach: taking as input retinal images that are already referable (left images) and altering them to accentuate the attribute “darker-skin individuals,” while preserving the DR lesions as well as the vasculature, in order to generate output images that are referable and from darker-skin individuals (right images). The left image in (b1), in particular, is already from a referable retina with a higher concentration of melanin within the uveal melanocytes; the method visibly accentuates melanin concentration in the right image, and both input (left) and output (right) have moderate DR (DR level 2). In (b2), the left image is of a lighter-skin individual and already referable, and our method generates a related image of a darker-skin individual while preserving the DR level: both images show visibly unchanged level 2 DR, with potential retinal hemorrhages seen. The pair in (b3) is a similar example, in which the left retinal image, of a lighter-skin individual with referable DR, is turned into the right retinal image of a darker-skin individual without altering the DR level (again, both images apparently have DR level 2 with retinal hemorrhages).
Figure 2. This figure details the flow chart for the debiasing algorithmic and experimental pipeline.
Comparing the Performance of the Baseline DLS (Left Column) and Debiased DLSs (Middle and Right Columns) for Metrics Including Accuracy, Specificity, Sensitivity, and for Darker-Skin Individuals Versus Lighter-Skin Individuals
| | Baseline DLS | Debiased DLS (Retina Appearance Optimized) | Debiased DLS (DR-Status Optimized) |
|---|---|---|---|
| Accuracy (overall) | 66.75 (4.62) [62.13, 71.37] | 74.75 (4.26) [70.49, 79.01] | 71.75 (4.41) [67.34, 76.16] |
| Accuracy (lighter-skin individuals) | 73.0 (6.15) [66.85, 79.15] | 78.5 (5.69) [72.81, 84.19] | 72.0 (6.22) [65.78, 78.22] |
| Accuracy (darker-skin individuals) | 60.5 (6.78) [53.72, 67.28] | 71.0 (6.29) [64.71, 77.29] | 71.5 (6.26) [65.24, 77.76] |
| Delta-parity (signed) value | 12.5 (9.15) [3.35, 21.7] | 7.5 (8.48) [‒1.0, 16.0] | |
| | | | |
| Specificity (lighter-skin individuals) | 61.0 (9.56) [51.44, 70.56] | 83.0 (7.36) [75.64, 90.36] | 66.0 (9.28) [56.72, 75.28] |
| Sensitivity (lighter-skin individuals) | 85.0 (7.0) [78.0, 92.0] | 74.0 (8.6) [65.40, 82.6] | 78.0 (8.12) [69.88, 86.12] |
| Specificity (darker-skin individuals) | 86.0 (6.8) [79.2, 92.8] | 86.0 (6.8) [79.2, 92.8] | 85.0 (7.0) [78.0, 92.0] |
| Sensitivity (darker-skin individuals) | 35.0 (9.35) [25.65, 44.35] | 56.0 (9.73) [46.27, 65.73] | 58.0 (9.67) [48.33, 67.67] |
| | | | |
| Sensitivity (darker-skin individuals) (= accuracy) | 38.48 (1.2) [37.28, 39.68] | 52.63 (1.23) [51.4, 53.86] | 49.75 (1.24) [48.51, 50.99] |
Also shown are the 95% error margins in parentheses and the 95% confidence intervals in brackets. Values are in %.
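This record does not state how the error margins were computed. The reported values are, however, consistent with a normal-approximation (Wald) margin for a proportion, and the delta-parity margin is consistent with combining the two per-group margins in quadrature. The Python sketch below works under that assumption. The function names `margin` and `delta_parity` are illustrative, and the sample sizes are taken from the Test row of the first table.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval


def margin(pct, n):
    """95% error margin, in percentage points, for a proportion given in %.

    Assumption: Wald (normal-approximation) interval for n test samples.
    """
    p = pct / 100.0
    return 100.0 * Z95 * math.sqrt(p * (1.0 - p) / n)


def delta_parity(acc_a, n_a, acc_b, n_b):
    """Signed accuracy gap and its margin (per-group margins added in quadrature)."""
    gap = acc_a - acc_b
    return gap, math.hypot(margin(acc_a, n_a), margin(acc_b, n_b))


# Baseline DLS: 200 lighter-skin and 200 darker-skin test images.
gap, gap_margin = delta_parity(73.0, 200, 60.5, 200)
print(f"{gap:.1f} ({gap_margin:.2f})")  # 12.5 (9.15)
print(f"{margin(66.75, 400):.2f}")      # 4.62
```

Under the same assumption, `delta_parity(78.5, 200, 71.0, 200)` gives approximately 7.5 (8.48), matching the retina-appearance-optimized column.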
Figure 3. The receiver operating characteristic (ROC) curves for each population of lighter-skin and darker-skin individuals for both the baseline and debiased DLS (for both retinal appearance and DR optimized approaches). DS: dark skin, LS: light skin.